Model-Free Robust $\varphi$-Divergence Reinforcement Learning
Using Both Offline and Online Data

Kishan Panaganti,  Adam Wierman,  Eric Mazumdar
Computing + Mathematical Sciences Department, California Institute of Technology
Emails:{kpb, adamw, mazumdar}@caltech.edu
Abstract

The robust $\varphi$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work, to appear in the proceedings of the International Conference on Machine Learning (ICML) 2024, makes two important contributions. First, we propose a model-free algorithm called Robust $\varphi$-regularized fitted Q-iteration (RPQ) for learning an $\varepsilon$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $\varphi$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $\varphi$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $\varphi$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrarily large state spaces.

Keywords: Robust reinforcement learning, model uncertainty, general function approximation

1 Introduction

Online Reinforcement Learning (RL) agents learn through online interactions and exploration in environments and have been shown to perform well in structured domains such as Chess and Go (Silver et al., 2018), fast chip placements in semiconductors (Mirhoseini et al., 2021), fast transform computations in mathematics (Fawzi et al., 2022), and more. However, online RL agents (Botvinick et al., 2019) are known to suffer from sample inefficiency due to complex exploration strategies in sophisticated environments. To overcome this, learning from available historical data has been studied using offline RL protocols (Levine et al., 2020). However, offline RL agents suffer from the out-of-data-distribution issue (Yang et al., 2021; Robey et al., 2020) due to the lack of online exploration. Recent work by Song et al. (2023) proposes another learning setting called hybrid RL that makes the best of both offline and online RL worlds. In particular, hybrid RL agents have access to both offline data (to reduce exploration overhead) and online interaction with the environment (to mitigate the out-of-data-distribution issue).

All three of these approaches (online, offline, and hybrid RL) require training environments (simulators) that closely represent real-world environments. However, time-varying real-world environments (Maraun, 2016), sensor degradations (Chen et al., 1996), and other adversarial disturbances in practice (Pioch et al., 2009) mean that even high-fidelity simulators are not enough (Schmidt et al., 2015; Shah et al., 2018). RL agents are known to fail due to these mismatches between training and testing environments (Sünderhauf et al., 2018; Lesort et al., 2020). As a result, robust RL (Mankowitz et al., 2020; Panaganti and Kalathil, 2021a) has received increasing attention due to its potential to alleviate the issue of mismatches between the simulator and real-world environments.

Robust RL agents are built using the robust Markov Decision Process (RMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005) framework. In this framework, the goal is to find an optimal policy that is robust, i.e., performs uniformly well across a set of models (transition probability functions). This is formulated via a max-min problem, and the set of models is typically constructed around a simulator model (transition probability function) with some notion of divergence or distance function. We refer to the simulator model as any nominal model that is provided to RL agents.

The RMDP framework in RL is identical to the Distributionally Robust Optimization (DRO) framework in supervised learning (Duchi and Namkoong, 2018; Chen et al., 2020). Similar to RMDP, DRO is a min-max problem aiming to minimize a loss function uniformly over the set of distributions constructed around the training distribution of the input space. However, developing model-free algorithms for DRO problems with general $\varphi$-divergences (see Eq. 1) is known to be hard (Namkoong and Duchi, 2016) due to their inherent non-linear and multi-level optimization structure. Additionally, developing model-free robust RL agents is also challenging (Iyengar, 2005; Duchi and Namkoong, 2018) for high-dimensional sequential decision-making systems under general function approximation.

To overcome this issue, in this work, we develop robust RL agents for the RRMDP framework, which is an equivalent alternative form of RMDP. A natural $\varphi$-divergence regularization extension of the RMDP problem gives rise to this new RRMDP framework, introduced in Yang et al. (2023) and Zhang et al. (2023) under different names. It is built upon the penalized DRO problem (Levy et al., 2020; Jin et al., 2021b), that is, the $\varphi$-divergence regularization version of the DRO problem. In particular, we focus on developing an offline robust RL algorithm for a class of $\varphi$-divergences under the RRMDP framework with arbitrarily large state spaces, using only offline data with general function approximation. Towards this, as the first main contribution, we propose the Robust $\varphi$-regularized fitted Q-iteration (RPQ) model-free algorithm and provide its performance guarantee for a class of $\varphi$-divergences with a unified analysis. We refer to algorithms as model-free if they do not explicitly estimate the underlying nominal model. We address the following important (suboptimality and sample complexity) questions: What is the rate of the suboptimality gap achieved between the optimal robust value and the value of the RPQ policy? How many offline data samples from the nominal model are required to learn an $\varepsilon$-optimal robust policy? We discuss challenges and present these results in Section 2.


| Algorithm | Algorithm-type | Data Coverage | Dataset Type | Robust | Suboptimality |
|---|---|---|---|---|---|
| (Panaganti et al., 2022, Alg. 1) | FQI | all-policy | offline | TV | $\frac{V_{\max}^{3}\sqrt{\log(\lvert\mathcal{F}\rvert\lvert\mathcal{G}\rvert)}}{\rho N^{1/2}}$ |
| (Zhang et al., 2023, Alg. 1) | FQI | all-policy | offline | KL | $\frac{\lambda V_{\max}^{2}\sqrt{\log(\lvert\mathcal{F}\rvert)}}{e^{-V_{\max}/\lambda}N^{1/2}}$ |
| (Yang et al., 2023, Alg. 2) | QL | uniform-policy | offline Markov | $\varphi$ | $\frac{V_{\max}^{3}\sqrt{\log(\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert)}}{d_{\min}^{3}c(\lambda)N^{1/3}}$ |
| RPQ (ours: Algorithm 1) | FQI | all-policy | offline | $\varphi$ | $\frac{V_{\max}^{3}\sqrt{\log(\lvert\mathcal{F}\rvert\lvert\mathcal{G}\rvert)}}{c(\lambda)N^{1/2}}$ |
| HyTQ (ours: Algorithm 2) | FQI | single-policy | offline + online non-Markov | TV | $\frac{V_{\max}(\lambda+V_{\max})\log(\lvert\mathcal{F}\rvert\lvert\mathcal{G}\rvert)}{N^{1/2}}$ |
Table 1: Comparison of model-free $\varphi$-divergence robust RL algorithms. In the algorithm-type column, Fitted Q-Iteration (FQI) uses least-squares regression and Q-Learning (QL) uses stochastic approximation updates. In the data coverage column, uniform-policy stipulates a data-generating policy to cover the entire state-action space. all-policy is where the data-generating policy should cover the state-action space covered by all non-stationary policies, and single-policy is where it covers the state-action space covered by the optimal robust policy, on the nominal model. denotes the coverage should include all the models in robust sets designed by the divergences in the robust column. The dataset type column mentions the type of dataset collected with a data-generating policy for training corresponding algorithms where offline indicates i.i.d. historical dataset on the nominal model, offline Markov indicates Markovian dataset induced on the nominal model, and online non-Markov indicates a history dependent dataset as a collection of Markovian datasets induced on the nominal model by a set of learned policies. Finally, the suboptimality column is the statistical upper bound for the difference between the optimal robust value and the robust value achieved by the algorithm. Here $V_{\max}$ is either $H$ or $(1-\gamma)^{-1}$ effective horizon factors. $\rho$ is the robustness radius parameter in RMDPs and $\lambda$ is the robustness penalization parameter in RRMDPs, which are inversely related (Yang et al., 2023, Theorem 3.1). $c(\lambda)$ is some function of $\lambda$ that varies according to different $\varphi$-divergences. $N$ is the dataset size used by algorithms. The bound of HyTQ is not directly comparable with others in terms of $V_{\max}$ since the non-stationary finite-horizon setting requires $H$ multiplicity in dataset size. $d_{\min}$ is the minimal positive value of data generating stationary distribution $d$, i.e. $\min_{s,a}d(s,a)$. $\mathcal{F}$ and $\mathcal{G}$ are two function representations, and $(\mathcal{S},\mathcal{A})$ is the state-action space.

In this work, we also develop and study a novel hybrid robust RL algorithm under the RRMDP framework using both offline data and online interactions with the nominal model. This is our second main contribution, motivated by the fact that hybrid RL overcomes the out-of-data-distribution issue in offline RL. Towards this, we propose the Hybrid robust Total-variation-regularized Q-iteration (HyTQ) algorithm and provide its performance guarantee under improved assumptions. Notably, the offline data-generating distribution need only cover the distribution that the optimal robust policy induces on the nominal model, whereas previously it had to cover any distribution uniformly. This is how online interactions help mitigate the out-of-data-distribution issue of offline RL and offline robust RL. We now address the cumulative suboptimality question in addition to sample complexity: What is the rate of the cumulative suboptimality gap achieved between the optimal robust value and the value of the HyTQ iteration policies? We discuss challenges and present these results in Section 3.

Related Work. Among all the previous works that provide model-free methods, here we only mention the ones closest to ours. We discuss more related works in Appendix A. Panaganti et al. (2022) proposed a Q-iteration offline robust RL algorithm in the RMDP framework only for the total variation $\varphi$-divergence. Bruns-Smith and Zhou (2023) proposed a Q-iteration offline robust RL algorithm in the RMDP framework to solve causal inference under unobserved confounders. Zhou et al. (2023) proposed an actor-critic robust RL algorithm in RMDP for the integral probability metric. Zhang et al. (2023) proposed a Q-iteration offline robust RL algorithm in the RRMDP framework only for the Kullback-Leibler $\varphi$-divergence. Blanchet et al. (2023) proposed specialized robust RL algorithms for the total variation and Kullback-Leibler $\varphi$-divergences offering unified analyses for linear, kernel, and factored function approximation models under the finite state-action setting. Another line of work (Liu et al., 2022; Liang et al., 2023; Wang et al., 2023a; Wang et al., 2023b; Yang et al., 2023) provides model-free robust RL algorithms based on classical Q-learning methods in finite state-action spaces. We provide more insightful comparisons in Table 1. To the best of our knowledge, this is the first work that addresses a wide class of robust RL problems (like the general $\varphi$-divergence) with arbitrarily large state spaces using general function approximation under mild assumptions (like the robust Bellman error transfer coefficient).

Notation. We use the equality sign ($=$) for pointwise equality in vectors and matrices. For any $x\in\mathbb{R}$, let $(x)_{+}=\max\{x,0\}$. For any vector $x$ and positive semidefinite matrix $A$, the squared matrix norm is $\|x\|_{A}^{2}=x^{\top}Ax$. The set of probability distributions over $\mathcal{X}$, with cardinality $|\mathcal{X}|$, is denoted as $\Delta(\mathcal{X})$, and its power set sigma algebra as $\Sigma(\mathcal{X})$. For any function $f$ that takes $(s,a,r,s^{\prime})$ as input, define the expectation w.r.t. the dataset $\mathcal{D}$ (or empirical expectation) as $\mathbb{E}_{\mathcal{D}}[f(s_{i},a_{i},r_{i},s^{\prime}_{i})]=\frac{1}{N}\sum_{(s_{i},a_{i},r_{i},s^{\prime}_{i})\in\mathcal{D}}f(s_{i},a_{i},r_{i},s^{\prime}_{i})$. For any positive integer $H$, the set $[H]$ denotes $\{0,1,\cdots,H-1\}$. Define the $\ell_{2}$ and $\ell_{1}$ norms as $\|x\|_{2,\mu}=\sqrt{\mathbb{E}_{\mu}[x^{2}]}$ and $\|x\|_{1,\mu}=\mathbb{E}_{\mu}[|x|]$. $p\ll q$ denotes that a probability distribution $p$ is absolutely continuous w.r.t. a probability distribution $q$. We use $\mathcal{O}(\cdot)$ to ignore universal constants less than $300$ and $\widetilde{\mathcal{O}}(\cdot)$ to ignore universal constants less than $300$ and the polylog terms depending on problem parameters.

2 Offline Robust $\varphi$-Regularized Reinforcement Learning

We start with preliminaries and the problem formulation.

Infinite-Horizon Markov Decision Process: An infinite-horizon discounted Markov Decision Process ($\gamma$MDP) is a tuple $(\mathcal{S},\mathcal{A},R,P,\gamma,d_{0})$ where $\mathcal{S}$ is a countably large state-space, $\mathcal{A}$ is a finite set of actions, $R:\mathcal{S}\times\mathcal{A}\to[0,1]$ is a known stochastic reward function, $P\in\Delta(\mathcal{S})^{|\mathcal{S}||\mathcal{A}|}$ is a probability transition function describing an environment, $\gamma$ is a discount factor, and $d_{0}$ is the starting state distribution. A stationary (stochastic) policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ specifies a distribution over actions in each state. We denote the transition dynamic distribution at state-action $(s,a)$ as $P_{s,a}\in\Delta(\mathcal{S})$. For convenience, we write $r(s,a)=\mathbb{E}_{r\sim R(s,a)}[r]$ and assume it is deterministic as in RL literature (Agarwal et al., 2019) since the performance guarantee will be identical up to a constant factor.

The value function of a policy $\pi$ is $V^{\pi}_{P,r}(s)=\mathbb{E}_{P,\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\;|\;s_{0}=s]$ starting at state $s_{0}=s$ and $a_{t}\sim\pi(s_{t}),s_{t+1}\sim P_{s_{t},a_{t}}$ for all $t\geq 0$. Similarly, we define an action-value function of a policy $\pi$ as $Q^{\pi}_{P,r}(s,a)=\mathbb{E}_{P,\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\;|\;s_{0}=s,a_{0}=a]$. Each policy $\pi$ induces a discounted occupancy density over state-action pairs $d^{\pi}_{P}:\mathcal{S}\times\mathcal{A}\to[0,1]$ defined as $d^{\pi}_{P}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{t}(s_{t}=s,a_{t}=a;\pi)$, where $P_{t}(s_{t}=s,a_{t}=a;\pi)$ denotes the visitation probability of state-action pair $(s,a)$ at time step $t$, starting at $s_{0}\sim d_{0}(\cdot)$ and following $\pi$ on the model $P$. The optimal policy $\pi^{*}_{P}$ achieves the maximum of the value $V^{\pi}_{P,r}$ over all policies $\pi$.
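For a small finite MDP, the quantities defined above can be computed in closed form. The following is a minimal sketch, assuming a tabular model stored as NumPy arrays; the function name `evaluate_policy` and the array layouts are illustrative assumptions, not part of the paper.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma, d0):
    """Tabular policy evaluation (illustrative helper, not from the paper).

    P:  (S, A, S) transition probabilities P[s, a, s'].
    r:  (S, A) deterministic rewards in [0, 1].
    pi: (S, A) stationary stochastic policy, rows sum to 1.
    d0: (S,) starting state distribution.
    """
    S, A = r.shape
    # State-to-state kernel and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    M = np.eye(S) - gamma * P_pi
    # V^pi solves the linear Bellman equation V = r_pi + gamma * P_pi V.
    V = np.linalg.solve(M, r_pi)
    # Q^pi(s, a) = r(s, a) + gamma * E_{s' ~ P_{s,a}}[V^pi(s')].
    Q = r + gamma * (P @ V)
    # Discounted state occupancy: (1 - gamma) * d0^T (I - gamma P_pi)^{-1}.
    d_state = (1.0 - gamma) * np.linalg.solve(M.T, d0)
    # State-action occupancy d^pi(s, a) = d_state(s) * pi(a | s).
    d = d_state[:, None] * pi
    return V, Q, d

# Toy usage on a random 4-state, 2-action model (values are arbitrary).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = np.full((S, A), 1.0 / A)
d0 = np.full(S, 1.0 / S)
V, Q, d = evaluate_policy(P, r, pi, gamma, d0)
```

A useful sanity check is the identity $\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}[r(s,a)]=(1-\gamma)\,\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi}_{P,r}(s_{0})]$, which follows directly from the two definitions.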

Offline Reinforcement Learning: The goal of offline RL on $\gamma$MDP $(P^{o},r)$ is to learn a good policy $\hat{\pi}$ (a policy with a high $V^{\hat{\pi}}_{P^{o},r}$) based only on the offline dataset. An offline dataset is a historical and fixed dataset of interactions $\mathcal{D}_{P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})\}_{i=1}^{N}$, where $s_{i}^{\prime}\sim P^{o}_{s_{i},a_{i}}$ and the $(s_{i},a_{i})$ pairs are independently and identically generated according to a data distribution $\mu\in\Delta(\mathcal{S}\times\mathcal{A})$. For convenience, $\mu$ also denotes the offline/behavior policy that generates $\mathcal{D}_{P^{o}}$. One classical offline RL algorithm with general function approximation capabilities and provable performance guarantees is Fitted Q-Iteration (FQI) (Szepesvári and Munos, 2005; Chen and Jiang, 2019; Liu et al., 2020). A function class $\mathcal{F}=\{f:\mathcal{S}\times\mathcal{A}\to[0,1/(1-\gamma)]\}$ (e.g., neural networks, kernel functions, linear functions, etc.) represents $Q$-value functions of $\gamma$MDP $(P^{o},r)$. At each iteration, given $f_{k}\in\mathcal{F}$ and $\mathcal{D}_{P^{o}}$, FQI does the following least-squares regression for the approximate squared Bellman error: $f_{k+1}=\operatorname*{arg\,min}_{f\in\mathcal{F}}\mathbb{E}_{\mathcal{D}_{P^{o}}}[(y_{f_{k}}-f)^{2}]$, where $y_{f_{k}}(s,a,s^{\prime})=r(s,a)+\gamma\max_{b}f_{k}(s^{\prime},b)$. In this regression step, FQI aims to find the optimal action-value $Q^{\pi^{*}}_{P^{o},r}$ by approximating the non-robust squared Bellman error ($\|r+\gamma\mathbb{E}_{P^{o}}V^{\pi^{*}}_{P^{o},r}(\cdot)-Q^{\pi^{*}}_{P^{o},r}\|_{2,\mu}^{2}$) using offline data $\mathcal{D}_{P^{o}}$ with function approximation $\mathcal{F}$. Finally, for some starting state $s_{0}\sim d_{0}$, the performance guarantee of an algorithm policy $\hat{\pi}$ is given by bounding the suboptimality quantity $0\leq V^{\pi^{*}}_{P^{o},r}(s_{0})-V^{\hat{\pi}}_{P^{o},r}(s_{0})$.
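The FQI regression step described above can be sketched in a few lines for the non-robust case. This is a minimal illustration, not the authors' implementation: it assumes integer-indexed states and actions, takes $\mathcal{F}$ to be linear functions of one-hot state-action features (a tabular class), and uses clipping to keep iterates in $[0,1/(1-\gamma)]$. The function name `fitted_q_iteration` and its arguments are hypothetical.

```python
import numpy as np

def fitted_q_iteration(data, r, gamma, n_states, n_actions, n_iters=200):
    """Non-robust FQI sketch with one-hot features (illustrative only).

    data: integer array of shape (N, 3) with rows (s, a, s_next).
    r:    (n_states, n_actions) deterministic reward table.
    """
    s, a, s_next = data[:, 0], data[:, 1], data[:, 2]
    # One-hot feature matrix: row i encodes the pair (s_i, a_i).
    Phi = np.zeros((len(data), n_states * n_actions))
    Phi[np.arange(len(data)), s * n_actions + a] = 1.0
    f = np.zeros((n_states, n_actions))  # f_0
    for _ in range(n_iters):
        # Regression targets y_{f_k}(s, a, s') = r(s, a) + gamma * max_b f_k(s', b).
        y = r[s, a] + gamma * f[s_next].max(axis=1)
        # Least-squares fit of the targets over the linear class.
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        # Assumption: clip so the iterate stays in [0, 1 / (1 - gamma)].
        f = np.clip(w.reshape(n_states, n_actions), 0.0, 1.0 / (1.0 - gamma))
    return f

# Toy usage: deterministic 3-state, 2-action model, every (s, a) pair observed once.
next_state = np.array([[0, 1], [0, 2], [2, 2]])
r = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
data = np.array([(s, a, next_state[s, a]) for s in range(3) for a in range(2)])
f_hat = fitted_q_iteration(data, r, gamma=0.9, n_states=3, n_actions=2)
pi_hat = f_hat.argmax(axis=1)  # greedy policy with respect to the final iterate
```

In this toy model the dataset covers every state-action pair and transitions are deterministic, so the iteration coincides with value iteration. With stochastic transitions and limited coverage under $\mu$, the regression error is what the suboptimality analysis has to control.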

Infinite-Horizon Robust $\varphi$-Regularized Markov Decision Process: Let $P^{o}$ be the nominal model, that is, a probability transition function describing a training environment. An infinite-horizon discounted Robust $\varphi$-Regularized Markov Decision Process ($\gamma$RRMDP) is a tuple $(\mathcal{S},\mathcal{A},r,P^{o},\lambda,\gamma,\varphi,d_{0})$ where $\lambda>0$ is a robustness parameter and $\varphi:\mathbb{R}\to\mathbb{R}$ is a convex function. The robust regularized reward function is defined as $r^{\lambda}_{P}(s,a)=r(s,a)+\lambda\gamma D_{\varphi}(P_{s,a},P^{o}_{s,a})$ for any state-action pair $(s,a)$ and any $P$ such that $P_{s,a}\ll P^{o}_{s,a}$. Here $D_{\varphi}$ is the $\varphi$-divergence (Csiszár, 1967) defined as

$$D_{\varphi}(p,q)=\int\varphi\left(\frac{\mathrm{d}p}{\mathrm{d}q}\right)\mathrm{d}q \tag{1}$$

for two probability distributions $p$ and $q$ with $p\ll q$, where $\varphi$ is convex on $\mathbb{R}$ and differentiable on $\mathbb{R}_{+}$, satisfying $\varphi(1)=0$ and $\varphi(t)=+\infty$ for $t<0$. Examples of $\varphi$-divergences include Total Variation (TV), Kullback-Leibler (KL), chi-square, Conditional Value at Risk (CVaR), and more (cf. Proposition 3). The robust regularized value function of a policy $\pi$ is defined as

$$V^{\pi}_{\lambda}=\inf_{P\in\mathcal{P}}V^{\pi}_{P,r^{\lambda}_{P}}, \tag{2}$$

where $\mathcal{P}=\otimes_{s,a}\mathcal{P}_{s,a}$ and $\mathcal{P}_{s,a}=\{P_{s,a}\in\Delta(\mathcal{S}):P_{s,a}\ll P^{o}_{s,a},\forall(s,a)\in\mathcal{S}\times\mathcal{A}\}$. By definition, for any $\pi$, it follows that $V^{\pi}_{\lambda}\leq V^{\pi}_{P^{o},r}\leq 1/(1-\gamma)$.
The optimal robust regularized value function is $V^{*}_{\lambda}=\max_{\pi}V^{\pi}_{\lambda}$ (similarly we can define $Q^{*}_{\lambda}$), and $\pi^{*}$ is the robust regularized optimal policy that achieves this optimal value. For convenience, we denote $V^{*}_{\lambda}$ ($Q^{*}_{\lambda}$) as $V^{*}$ ($Q^{*}$). We note that $\mathcal{P}$ satisfies the $(s,a)$-rectangularity condition (Iyengar, 2005) by definition. This is a sufficient condition for the optimization problem in (2) to be tractable. It also enables the existence of a deterministic policy for $\pi^{*}$ (Yang et al., 2023). We formally mention this in Proposition 5.
For any policy $\pi$, denote $V^{\pi}=\mathbb{E}_{s\sim d_{0}}[V^{\pi}(s)]$ as the expected total reward with $d_{0}$ as the initial state distribution.
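To make the $\varphi$-divergence in (1) concrete, the following is a minimal sketch for discrete distributions, where the integral reduces to $\sum_{x}q(x)\varphi(p(x)/q(x))$. The generator normalizations used here (for example $\varphi(t)=t\log t-t+1$ for KL) and the helper name `phi_divergence` are assumptions made for illustration; the paper's Proposition 3 may normalize them differently.

```python
import numpy as np

def _kl(t):
    # phi(t) = t log t - t + 1, with the convention 0 log 0 = 0 (so phi(0) = 1).
    safe = np.where(t > 0, t, 1.0)
    return t * np.log(safe) - t + 1.0

# Illustrative generators, each satisfying phi(1) = 0.
PHI = {
    "kl": _kl,
    "chi2": lambda t: (t - 1.0) ** 2,
    "tv": lambda t: 0.5 * np.abs(t - 1.0),
}

def phi_divergence(p, q, name):
    """D_phi(p, q) = sum_x q(x) * phi(p(x) / q(x)) for discrete p << q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = q > 0
    if np.any(p[~support] > 0):
        raise ValueError("p must be absolutely continuous w.r.t. q")
    ratio = p[support] / q[support]
    return float(np.sum(q[support] * PHI[name](ratio)))
```

Because every generator vanishes at $t=1$, `phi_divergence(q, q, name)` returns zero for each choice, matching $D_{\varphi}(q,q)=0$.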

Denote the robust regularized Bellman operator $\mathcal{T}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\to\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as

$$(\mathcal{T}Q)(s,a)=r(s,a)+\gamma\inf_{P_{s,a}\in\mathcal{P}_{s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})\big). \tag{3}$$

Since $\mathcal{T}$ is a contraction (Yang et al., 2023), the robust Q-iteration (RQI) $Q_{k+1}=\mathcal{T}Q_{k}$ converges to $Q^{*}$. We get the robust optimal policy as $\pi^{*}(s)=\operatorname*{arg\,max}_{a}Q^{*}(s,a)$.
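As an illustration of the RQI recursion $Q_{k+1}=\mathcal{T}Q_{k}$, here is a small tabular sketch for the KL case. It assumes a known nominal model and uses the standard closed form $\inf_{P}\{\mathbb{E}_{P}[V]+\lambda\,\mathrm{KL}(P\,\|\,P^{o})\}=-\lambda\log\mathbb{E}_{P^{o}}[e^{-V/\lambda}]$ for the inner infimum in (3). The function names and array layout are hypothetical, and this is not the model-free algorithm developed in the paper.

```python
import numpy as np

def kl_robust_backup(V, P_nominal, lam):
    """Closed-form inner infimum of (3) for the KL divergence.

    V: shape (S,), P_nominal: shape (S, A, S). Returns shape (S, A) with
    entries -lam * log E_{s' ~ P^o_{s,a}}[exp(-V(s') / lam)].
    """
    z = -V / lam
    zmax = z.max()  # shift for a numerically stable log-sum-exp
    return -lam * (zmax + np.log(P_nominal @ np.exp(z - zmax)))

def robust_q_iteration(r, P_nominal, lam, gamma, iters=500):
    """Iterate Q <- T Q starting from Q_0 = 0; r has shape (S, A)."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        V = Q.max(axis=1)  # V(s) = max_a Q(s, a)
        Q = r + gamma * kl_robust_backup(V, P_nominal, lam)
    return Q
```

Since choosing $P=P^{o}$ is feasible in the infimum, the robust backup never exceeds the nominal expectation, so the resulting fixed point is bounded above by the non-robust optimal action-value.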

2.1 Problem Conceptualization

In this section, we study the offline infinite-horizon robust $\varphi$-regularized RL ($\gamma$R3L) problem, acquiring useful insights to construct our algorithm (Algorithm 1) in the next section. The goal here is to learn a good robust policy $\hat{\pi}$ (a policy with a high $V^{\hat{\pi}}_{\lambda}$) based on the offline dataset. We start by noting one key challenge in the estimation of the robust regularized Bellman operator $\mathcal{T}$ (3): one may require many offline datasets from each $P\in\mathcal{P}$ to achieve our offline $\gamma$R3L goal. In this work, we use the penalized Distributionally Robust Optimization (DRO) tool (Sinha et al., 2017; Levy et al., 2020; Jin et al., 2021b) to avoid requiring such an unrealistic collection of offline datasets. In particular, as in non-robust offline RL, we rely only on the offline dataset $\mathcal{D}_{P^{o}}$ generated on the nominal model $P^{o}$ by an offline policy $\mu$. This statement is justified via the following proposition.

Proposition 1.

Consider a robust $\varphi$-regularized MDP. For any $Q:\mathcal{S}\times\mathcal{A}\to[0,1/(1-\gamma)]$, the robust regularized Bellman operator $\mathcal{T}$ (3) can be equivalently written as

$$(\mathcal{T}Q)(s,a)=r(s,a)-\gamma\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}((\eta-V(s^{\prime}))/\lambda)]-\eta\big), \tag{4}$$

where $V(s)=\max_{a\in\mathcal{A}}Q(s,a)$ and $\Theta\subset\mathbb{R}$ is a bounded subset of the real line that depends on $\varphi^{*}$.

A proof of this proposition is given in Appendix D and follows from Levy et al. (2020, Section A.1.2). We refer to (4) as the robust regularized Bellman dual operator. Observing the sole dependence on the nominal model $P^{o}$ in (4), one can come up with estimators for data-driven approaches that naturally depend only on the dataset $\mathcal{D}_{P^{o}}$. We remark that we consider a class of $\varphi$-divergences satisfying the conditions in Proposition 3 for all the results in this paper.
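The equivalence between the primal form (3) and the dual form (4) can be checked numerically in a simple case. The sketch below assumes the KL generator $\varphi(t)=t\log t-t+1$, whose convex conjugate is $\varphi^{*}(u)=e^{u}-1$, and uses the interval $[\min V,\max V]$ as a stand-in for $\Theta$; both choices, along with the function names, are illustrative assumptions. It minimizes the one-dimensional convex dual objective by ternary search and compares the result to the closed-form primal value.

```python
import numpy as np

def dual_objective(eta, V, p_nominal, lam):
    # lam * E_{P^o}[phi*((eta - V) / lam)] - eta, with phi*(u) = exp(u) - 1.
    return lam * np.dot(p_nominal, np.exp((eta - V) / lam) - 1.0) - eta

def dual_backup(V, p_nominal, lam, iters=200):
    """-inf_eta dual_objective(eta), found by ternary search (objective is convex)."""
    lo, hi = float(V.min()), float(V.max())
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(m1, V, p_nominal, lam) < dual_objective(m2, V, p_nominal, lam):
            hi = m2
        else:
            lo = m1
    eta_star = 0.5 * (lo + hi)
    return -dual_objective(eta_star, V, p_nominal, lam)

def primal_backup(V, p_nominal, lam):
    """inf_P { E_P[V] + lam * KL(P || P^o) } = -lam * log E_{P^o}[exp(-V / lam)]."""
    return -lam * np.log(np.dot(p_nominal, np.exp(-V / lam)))
```

The dual side only ever takes expectations under the nominal distribution, which is what allows a sample-based estimator built from $\mathcal{D}_{P^{o}}$ alone.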

We now remark on a natural first attempt at performing the squared Bellman error least-square regression, like FQI, on the robust regularized Bellman dual operator (4). Observe that the true Bellman error $\mathbb{E}_{s,a\sim\mu}[|\mathcal{T}Q^{*}(s,a)-Q^{*}(s,a)|]$ involves solving an inner convex minimization problem in $\mathcal{T}Q^{*}(s,a)$ (4) for every $(s,a)$. Since we are in a countably large state space regime, it is infeasible to devise approximations to this true squared Bellman error. In addition, we also have to enable general function architectures for action-values. To alleviate this challenging task, we now turn our attention to the inner convex minimization problem in the robust regularized Bellman dual operator (4). Due to the $(s,a)$-rectangularity assumption, we note that the $\eta$'s are not correlated across all $(s,a)$. With this note, for every $(s,a)$, we can replace $\eta$ in $(\mathcal{T}Q)(s,a)$ (4) with a dual-variable function $g(s,a)$. Thus, intuitively, multiple point-wise minimizations can be replaced by a single dual-variable functional minimization over the function space of $g$. We formalize this intuition using variational functional analysis (Rockafellar and Wets, 2009) for a countably large state space regime in the following.

We denote $L^{1}(\mu)$ as the set of all absolutely integrable functions defined on the probability (measure) space $(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}),\mu)$ with $\mu$, the data generating distribution, as the $\sigma$-finite probability measure. To elucidate, $L^{1}(\mu)$ is the set of all functions $g:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{C}\subset\mathbb{R}$ such that $\|g\|_{1,\mu}$ is finite. We set $\mathcal{C}=\Theta$ considering the inner minimization in (4). Fixing any given function $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$, we define the loss function $L_{\mathrm{dual}}(g;f,\mu)$, for all $g\in L^{1}(\mu)$, as

$$L_{\mathrm{dual}}(g;f,\mu)=\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}[\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)]. \tag{5}$$

We now state the result that formalizes the single dual-variable functional minimization intuition developed in the previous paragraph. We also note that a variant of this result appears in the distributionally robust RL work of Panaganti et al. (2022).

Proposition 2.

Let $L_{\mathrm{dual}}$ be the loss function defined in (5). Then, for any function $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$, we have

$$\inf_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;f,\mu)=\mathbb{E}_{s,a\sim\mu}\Big[\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}((\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)]-\eta\big)\Big]. \tag{6}$$

We provide a proof in Appendix D, which relies on Rockafellar and Wets (2009, Theorem 14.60).

For any given $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, we define an operator $\mathcal{T}_{g}$, for all $g\in L^{1}(\mu)$, as

$$(\mathcal{T}_{g}f)(s,a)=r(s,a)-\gamma\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}((g(s,a)-V(s^{\prime}))/\lambda)]-g(s,a)\big). \tag{7}$$

This operator is useful in view of Propositions 1 and 2. To see this, we first define $g^{*}(Q)\in\operatorname*{arg\,min}_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;Q,\mu)$ for any action-value function $Q$. Now, by taking an expectation w.r.t. the data generating distribution $\mu$ on (4), we observe $\mathcal{T}Q=\mathcal{T}_{g^{*}(Q)}Q$ by utilizing (6). Due to this observation, in the following subsection, we develop an algorithm by approximating both the optimal dual-variable function of the optimal robust value $g^{*}(Q^{*})$ and the robust squared Bellman error ($\|\mathcal{T}_{g^{*}(Q^{*})}Q^{*}-Q^{*}\|_{2,\mu}^{2}$) using offline data $\mathcal{D}_{P^{o}}$. Panaganti et al. (2022) similarly conceptualized their total variation $\varphi$-divergence robust RL algorithm. Here, Proposition 1 enables this conceptualization for general $\varphi$-divergences.
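As a worked special case of the identity $\mathcal{T}Q=\mathcal{T}_{g^{*}(Q)}Q$, consider the KL divergence under the assumed normalization $\varphi(t)=t\log t-t+1$, for which $\varphi^{*}(u)=e^{u}-1$ (the paper's Proposition 3 may use a different but equivalent normalization). The following derivation is only a sketch under that assumption.

```latex
% Pointwise dual objective from (4), with \varphi^{*}(u) = e^{u} - 1:
%   h(\eta) = \lambda \mathbb{E}_{s' \sim P^{o}_{s,a}}[ e^{(\eta - V(s'))/\lambda} ] - \lambda - \eta .
\begin{align*}
h'(\eta) = 0
  &\iff \mathbb{E}_{s' \sim P^{o}_{s,a}}\big[ e^{(\eta - V(s'))/\lambda} \big] = 1 \\
  &\iff g^{*}(s,a) = \eta^{*} = -\lambda \log \mathbb{E}_{s' \sim P^{o}_{s,a}}\big[ e^{-V(s')/\lambda} \big], \\
(\mathcal{T}_{g^{*}} Q)(s,a)
  &= r(s,a) - \gamma\big( \lambda \cdot 1 - \lambda - g^{*}(s,a) \big)
   = r(s,a) + \gamma\, g^{*}(s,a) \\
  &= r(s,a) - \gamma \lambda \log \mathbb{E}_{s' \sim P^{o}_{s,a}}\big[ e^{-V(s')/\lambda} \big].
\end{align*}
```

Here $g^{*}(s,a)$ lies between $\min_{s^{\prime}}V(s^{\prime})$ and $\max_{s^{\prime}}V(s^{\prime})$, so it stays in a bounded set, and the last line coincides with the well-known closed form of $\inf_{P}\{\mathbb{E}_{P}[V]+\lambda\,\mathrm{KL}(P\,\|\,P^{o}_{s,a})\}$ appearing in (3).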

2.2 Robust $\varphi$-regularized fitted Q-iteration

In this section, we formally propose our algorithm based on the tools developed so far. Our proposed algorithm is called Robust $\varphi$-regularized fitted Q-iteration (RPQ) Algorithm and is summarized in Algorithm 1. We first discuss the inputs to our algorithm. As mentioned above, we only use the offline dataset $\mathcal{D}_{P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})\}_{i=1}^{N}$, generated according to a data distribution $\mu$ on the nominal model $P^{o}$. We also consider two general function classes $\mathcal{F}\subset(f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)])$ and $\mathcal{G}\subset(g:\mathcal{S}\times\mathcal{A}\rightarrow\Theta)$ representing action-value functions and dual-variable functions, respectively. We now define useful approximation quantities for $g\in\mathcal{G}$ and $f\in\mathcal{F}$. For given $f$, the empirical loss function of the true loss $L_{\mathrm{dual}}$ (5) on $\mathcal{D}_{P^{o}}$ is

$$\widehat{L}_{\mathrm{dual}}(g;f)=\mathbb{E}_{\mathcal{D}_{P^{o}}}[\lambda\varphi^{*}((g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))/\lambda)-g(s_{i},a_{i})]. \tag{8}$$

For given $f,g$, the empirical squared robust regularized Bellman error on $\mathcal{D}_{P^{o}}$ is

$$\widehat{L}_{\mathrm{robQ}}(Q;f,g)=\mathbb{E}_{\mathcal{D}_{P^{o}}}\big[\big(r(s_{i},a_{i})-\gamma\lambda\varphi^{*}((g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))/\lambda)+\gamma g(s_{i},a_{i})-Q(s_{i},a_{i})\big)^{2}\big]. \tag{9}$$

We start with an initial action-value function $Q_{0}(s,a)=0$ and execute the following two steps for $K$ iterations. At iteration $k$ of the algorithm with input $Q_{k}$, as a first step, we compute a dual-variable function $g_{k}\in\mathcal{G}$ through the empirical risk minimization approach, that is, we solve $\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;Q_{k})$ (Line 4 of Algorithm 1). As a second step, given inputs $Q_{k}$ and $g_{k}$, we compute the next iterate $Q_{k+1}\in\mathcal{F}$ through the least-squares regression method, that is, we solve $\operatorname*{arg\,min}_{f\in\mathcal{F}}\widehat{L}_{\mathrm{robQ}}(f;Q_{k},g_{k})$ (Line 5 of Algorithm 1).
After $K$ iterations, we extract the greedy policy from $Q_{K}$ (Line 7 of Algorithm 1).

Algorithm 1 Robust $\varphi$-regularized fitted Q-iteration (RPQ) Algorithm
1:  Input: Regularization $\varphi$, offline dataset $\mathcal{D}_{P^o}=(s_i,a_i,r_i,s'_i)_{i=1}^{N}$, general function classes $\mathcal{F}$ and $\mathcal{G}$
2:  Initialize: $Q_0\equiv 0\in\mathcal{F}$.
3:  for $k=0,\cdots,K-1$ do
4:     Dual variable function minimization: $g_k=\widehat{g}_{Q_k}=\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;Q_k)$ (cf. (8))
5:     Robust $\varphi$-regularized Q-update: $Q_{k+1}=\operatorname*{arg\,min}_{Q\in\mathcal{F}}\widehat{L}_{\mathrm{robQ}}(Q;Q_k,g_k)$ (cf. (9))
6:  end for
7:  Output: $\pi_K=\operatorname*{arg\,max}_{a}Q_K(s,a)$
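
The alternating structure of Algorithm 1 can be sketched in a few lines of Python. This is only a skeleton under stated assumptions: the function classes are finite lists of candidates, so each arg min is an exhaustive search, and the empirical losses $\widehat{L}_{\mathrm{dual}}$ and $\widehat{L}_{\mathrm{robQ}}$ of (8) and (9) are passed in as callables rather than implemented. The names `rpq`, `dual_loss` and `robq_loss` are hypothetical and do not come from the paper.

```python
import numpy as np

def rpq(F, G, dual_loss, robq_loss, data, K):
    """Skeleton of Algorithm 1 for finite classes F and G (hypothetical helper).

    F: list of candidate Q-tables, each an array of shape (S, A).
    G: list of candidate dual-variable functions (any representation).
    dual_loss(g, Q, data) and robq_loss(f, Q, g, data) are stand-ins for the
    empirical losses of (8) and (9); their exact form is NOT reproduced here.
    """
    Q = np.zeros_like(F[0], dtype=float)                       # Line 2: Q_0 = 0
    for _ in range(K):                                         # Line 3
        g = min(G, key=lambda g_: dual_loss(g_, Q, data))      # Line 4: dual ERM
        Q = min(F, key=lambda f_: robq_loss(f_, Q, g, data))   # Line 5: regression
    return Q, Q.argmax(axis=1)                                 # Line 7: greedy policy
```

With general (for example, neural) function classes the two exhaustive searches would be replaced by an optimizer of the practitioner's choice; the alternating order of the two steps is the only part taken from Algorithm 1.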

2.3 Performance Guarantee: Suboptimality

We now discuss the performance guarantee of our RPQ Algorithm. In particular, we characterize how close the robust regularized value function of our RPQ Algorithm is to the optimal robust regularized value function. We first mention all the assumptions about the data generating distribution $\mu$ and the representation power of $\mathcal{F}$ and $\mathcal{G}$ before we present our main results.

Assumption 1 (Concentrability).

There exists a finite constant $C>0$ such that for any $\nu\in\{d_{\pi,P} \mid \text{any policy } \pi \text{ and } P\in\mathcal{P} \text{ satisfying } D_{\varphi}(P_{s,a},P^o_{s,a})\leq 1/(\lambda(1-\gamma)) \text{ for all } s,a \text{ (both can be non-stationary)}\}\subseteq\Delta(\mathcal{S}\times\mathcal{A})$, we have $\|\nu/\mu\|_{\infty}\leq\sqrt{C}$.

Assumption 1 requires the support set of the data generating distribution $\mu$, i.e. $\{(s,a)\in\mathcal{S}\times\mathcal{A}:\mu(s,a)>0\}$, to cover the union of all support sets of the distributions $\nu$, leading to a robust exploratory behavior. This assumption is widely used in the offline RL literature (Munos, 2003; Agarwal et al., 2019; Chen and Jiang, 2019; Wang et al., 2021; Xie et al., 2021) in different forms. We adapt this assumption from the robust offline RL literature (Panaganti et al., 2022; Zhang et al., 2023).
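
For finite $\mathcal{S}\times\mathcal{A}$, the quantity $\|\nu/\mu\|_{\infty}$ is simply the largest ratio of probabilities, which the following sketch computes. It is an illustration only: Assumption 1 ranges over all occupancy measures induced by admissible $(\pi, P)$ pairs, whereas the hypothetical helper `concentrability` below takes a finite list of distributions $\nu$ supplied by the caller.

```python
import numpy as np

def concentrability(nus, mu):
    """Smallest C with max_nu ||nu/mu||_inf <= sqrt(C) over a finite list of nus.

    nus: iterable of arrays of shape (S, A), each a distribution over (s, a).
    mu:  array of shape (S, A), the data-generating distribution.
    Returns np.inf if some nu puts mass where mu has none (support not covered).
    """
    mu = np.asarray(mu, dtype=float)
    worst = 0.0
    for nu in nus:
        nu = np.asarray(nu, dtype=float)
        if np.any((mu == 0) & (nu > 0)):
            return np.inf
        # Elementwise ratio, defined as 0 where mu == 0 (nu is 0 there too).
        ratio = np.divide(nu, mu, out=np.zeros_like(nu), where=mu > 0)
        worst = max(worst, float(ratio.max()))
    return worst ** 2
```

A uniform $\mu$ over four state-action pairs and a $\nu$ concentrated uniformly on two of them gives a ratio of 2, hence $C=4$; a $\nu$ with mass outside the support of $\mu$ gives $C=\infty$, which is the failure of coverage described above.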

Assumption 2 (Approximate Robust Bellman Completeness).

Let $\varepsilon_{\mathcal{F}}$ be some small positive constant. For any $g\in\mathcal{G}$, we have $\sup_{f\in\mathcal{F}}\inf_{f'\in\mathcal{F}}\|f'-\mathcal{T}_{g}f\|_{2,\mu}^{2}\leq\varepsilon_{\mathcal{F}}$ for the data generating distribution $\mu$.

We note that Assumption 2 holds trivially if $\mathcal{F}$ is closed under $\mathcal{T}_{g}$, that is, if $\mathcal{T}_{g}f\in\mathcal{F}$ for any $f\in\mathcal{F}$ and $g\in\mathcal{G}$, then $\varepsilon_{\mathcal{F}}=0$. This assumption has been widely used in different forms in the non-robust offline RL literature (Agarwal et al., 2019; Wang et al., 2021; Xie et al., 2021) and the robust offline RL literature (Panaganti et al., 2022; Bruns-Smith and Zhou, 2023; Zhang et al., 2023).

Assumption 3 (Approximate Dual Realizability).

For all $f\in\mathcal{F}$, there exists a uniform constant $\varepsilon_{\mathcal{G}}$ such that $\inf_{g\in\mathcal{G}}L_{\mathrm{dual}}(g;f)-\inf_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;f)\leq\varepsilon_{\mathcal{G}}$.

Assumption 3 holds trivially if $g^{*}(f)\in\mathcal{G}$ for any $f\in\mathcal{F}$ (in which case $\varepsilon_{\mathcal{G}}=0$). This assumption has been used in earlier robust offline RL literature (Panaganti et al., 2022; Bruns-Smith and Zhou, 2023).

Now we state our main theoretical result on the performance of the RPQ algorithm. In Appendix D we restate the result including the constant factors.

Theorem 1.

Let Assumptions 1, 2 and 3 hold. Let $c_{\varphi}(\lambda,\gamma)$ be problem-dependent constants for $\varphi$. Let $\pi_K$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have

$$V^{\pi^{*}}-V^{\pi_{K}}\leq\frac{\sqrt{C}\,(\gamma^{K}+\sqrt{6\varepsilon_{\mathcal{F}}}+\gamma\varepsilon_{\mathcal{G}})}{(1-\gamma)^{2}}+\frac{c_{\varphi}(\lambda,\gamma)}{(1-\gamma)^{3}}\,\mathcal{O}\Big(\sqrt{C\log(|\mathcal{F}||\mathcal{G}|/\delta)/N}\Big).$$

Theorem 1 states that the RPQ algorithm is approximately optimal. This theorem also gives the sample complexity guarantee for finding an $\varepsilon$-suboptimal policy w.r.t. the optimal policy $\pi^{*}$. To see this, by neglecting the first term due to inevitable function class approximation errors, for $N\geq\mathcal{O}\big(\frac{(c_{\varphi}(\lambda,\gamma))^{2}}{\varepsilon^{2}(1-\gamma)^{4}}\log\frac{|\mathcal{F}||\mathcal{G}|}{\delta}\big)$ we get $V^{\pi^{*}}-V^{\pi_{K}}\leq\varepsilon/(1-\gamma)$ with probability at least $1-\delta$ for any fixed $\varepsilon,\delta\in(0,1)$.
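
The sample-size condition is a direct rearrangement of the second term of the bound in Theorem 1. Writing $c$ for the absolute constant hidden in the $\mathcal{O}(\cdot)$ (a symbol introduced here only for illustration) and requiring that term to be at most $\varepsilon/(1-\gamma)$ gives:

```latex
\frac{c\,c_{\varphi}(\lambda,\gamma)}{(1-\gamma)^{3}}
\sqrt{\frac{C\log(|\mathcal{F}||\mathcal{G}|/\delta)}{N}}
\;\leq\;\frac{\varepsilon}{1-\gamma}
\quad\Longleftrightarrow\quad
N\;\geq\;\frac{c^{2}\,C\,(c_{\varphi}(\lambda,\gamma))^{2}}{\varepsilon^{2}(1-\gamma)^{4}}
\log\frac{|\mathcal{F}||\mathcal{G}|}{\delta}.
```

Treating the concentrability constant $C$ as fixed and absorbing it, together with $c^{2}$, into the $\mathcal{O}(\cdot)$ recovers the stated requirement on $N$.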

Remark 1.

Note that the guarantee for the TV case in Theorem 1 requires making another assumption on the existence of a fail-state (Panaganti et al., 2022, Lemma 3), i.e., Assumption 8 with $H$ replaced by $1/(1-\gamma)$. However, we specialize Theorem 1 for the TV case by relaxing Assumption 1 to get the same guarantee, which we present in Appendix D. In particular, we relax Assumption 1 to the non-robust offline RL concentrability assumption (Foster et al., 2022), i.e. we only need the distribution $\nu$ to be in the collection of discounted state-action occupancies on the nominal model $P^o$.

3 Hybrid Robust $\varphi$-Regularized Reinforcement Learning

In this section, we provide a hybrid robust $\varphi$-regularized RL protocol to overcome the out-of-data-distribution issue in offline robust RL. As in Song et al. (2023), we reformulate the problem in the finite-horizon setting to use its backward induction feature that enables RPQ iterates to run in each episode. We again start by discussing preliminaries and the problem formulation.

Finite-Horizon Markov Decision Process: A finite-horizon Markov Decision Process ($h$MDP) is $(\mathcal{S},\mathcal{A},P=(P_{h})_{h=0}^{H-1},r=(r_{h})_{h=0}^{H-1},H)$, where $H$ is the horizon length and, for any $h\in[H]$, $r_{h}:\mathcal{S}\times\mathcal{A}\to[0,1]$ is a known deterministic reward function and $P_{h}\in\Delta(\mathcal{S})^{|\mathcal{S}||\mathcal{A}|}$ is the transition probability function at time $h$. A non-stationary (stochastic) policy is $\pi=(\pi_{h})_{h=0}^{H-1}$, where $\pi_{h}:\mathcal{S}\to\Delta(\mathcal{A})$.
We denote the transition dynamic distribution at time $h$ and state-action $(s,a)$ as $P_{h,s,a}\in\Delta(\mathcal{S})$. Given $\pi$, we define the state and action value functions in the usual manner: $V^{h,\pi}_{P,r}(s)=\mathbb{E}[\sum_{t=h}^{H-1}r_{t}(s_{t},a_{t})\mid s_{h}=s]$ starting at state $s_{h}=s$ and $a_{t}\sim\pi_{t}(s_{t}),\,s_{t+1}\sim P_{t+1,s_{t},a_{t}}$, and
$Q^{h,\pi}_{P,r}(s,a)=\mathbb{E}[\sum_{t=h}^{H-1}r_{t}(s_{t},a_{t})\mid s_{h}=s,a_{h}=a]$ starting at state-action $s_{h}=s,a_{h}=a$ and $s_{t+1}\sim P_{t+1,s_{t},a_{t}},\,a_{t+1}\sim\pi_{t+1}(s_{t+1})$.
Given $\pi$, the occupancy measure over state-action pairs is $d^{h,\pi}_{P}(s,a)=P_{h}(s_{h}=s,a_{h}=a;\pi)$. We write $\pi^{*}_{P}=(\pi^{*}_{h})_{h=0}^{H-1}$ to denote an optimal non-stationary deterministic policy, which maximizes $V^{\pi}_{P,r}=(V^{h,\pi}_{P,r})_{h=0}^{H-1}$.
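
For a small tabular $h$MDP, the value functions defined above can be computed by a backward recursion over $h$. The sketch below is illustrative only; the array layout and the name `evaluate_policy` are assumptions made here, and the code adopts the indexing of the Bellman operator in (10), in which the kernel at step $h$, stored as `P[h]`, generates the state at step $h+1$.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Backward recursion for V^{h,pi} and Q^{h,pi} on a tabular finite-horizon MDP.

    P:  array (H, S, A, S); P[h, s, a] is the next-state distribution at step h
        (indexing convention assumed here, see the lead-in).
    r:  array (H, S, A); rewards in [0, 1].
    pi: array (H, S, A); pi[h, s] is a distribution over actions.
    """
    H, S, A = r.shape
    V = np.zeros((H + 1, S))                  # V[H] = 0: nothing is earned after the horizon
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V[h + 1]         # reward now plus expected value at h + 1
        V[h] = (pi[h] * Q[h]).sum(axis=1)     # average over a ~ pi_h(s)
    return V[:H], Q
```

Because rewards lie in $[0,1]$, every entry of the returned `V` is at most $H$, which is the bound used later for the robust regularized value function.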

Hybrid Reinforcement Learning: The goal of hybrid RL on $h$MDP $(P^o,r)$ is to learn a good policy $\hat{\pi}$ based on adaptive datasets consisting of both offline datasets and on-policy datasets. Given timestep $h\in[H]$, the offline dataset $\mathcal{D}^{\mu}_{h,P^o}=\{(s_{i},a_{i},s_{i}')_{i=1}^{m_{\mathrm{off}}}\}$ is generated by $s_{i}'\sim P^o_{h,s_{i},a_{i}}$ with the $(s_{i},a_{i})$ pairs sampled i.i.d. from the offline data distribution $\mu_{h}\in\Delta(\mathcal{S}\times\mathcal{A})$. For convenience, $\mu=(\mu_{h})_{h=0}^{H-1}$ also denotes the offline policy that generates $\mathcal{D}^{\mu}_{P^o}$. Given timestep $h\in[H]$, the on-policy dataset $\mathcal{D}^{\pi}_{h,P^o}=\{(s_{i},a_{i},s_{i}')_{i=1}^{m_{\mathrm{on}}}\}$ is generated by $(s_{i},a_{i})\sim d^{h,\pi}_{P^o}$ and $s_{i}'\sim P^o_{h,s_{i},a_{i}}$ for all policies $\pi$ previously learned by the algorithm. Song et al. (2023) propose the Hybrid Q-learning (HyQ) algorithm with general function approximation capabilities and provable guarantees for hybrid RL. The HyQ algorithm (cf. Song et al. (2023, Algorithm 1)) is quite straightforward: for each iteration $k\in[K]$, do backward induction of the FQI algorithm on timesteps $h\in[H]$ using the adaptive datasets described above.
Finally, for some starting state $s_{0}\sim d_{0}$, the performance guarantee of the algorithm policies $\{\pi_{k}\}_{k\in[K]}$ is given by bounding the cumulative suboptimality quantity $0\leq\sum_{k\in[K]}[V^{0,\pi^{*}}_{P^o,r}(s_{0})-V^{0,\pi_{k}}_{P^o,r}(s_{0})]$. We note the total adaptive dataset size is $N$ to provide comparable results with offline RL.

Finite-Horizon Robust $\varphi$-Regularized Markov Decision Process: Again, let $P^o$ be the nominal model. A finite-horizon discounted Robust $\varphi$-Regularized Markov Decision Process ($h$RRMDP) is a tuple $(\mathcal{S},\mathcal{A},P^o=(P^o_{h})_{h=0}^{H-1},r=(r_{h})_{h=0}^{H-1},\lambda,H,\varphi,d_{0})$ where $\lambda>0$ is a robustness parameter and $\varphi:\mathbb{R}\to\mathbb{R}$ is as before.
For $h\in[H]$, the robust regularized reward function is $r^{\lambda}_{h}(s,a)=r_{h}(s,a)+\lambda D_{\varphi}(P_{h,s,a},P^o_{h,s,a})$. For $h\in[H]$, the robust regularized value function of a policy $\pi$ is defined as $V^{\pi}_{h,\lambda}=\inf_{P\in\mathcal{P}}V^{h,\pi}_{P,r^{\lambda}_{h}}$, where $\mathcal{P}=\otimes_{h,s,a}\mathcal{P}_{h,s,a}$ and $\mathcal{P}_{h,s,a}=\{P_{h,s,a}\in\Delta(\mathcal{S}):P_{h,s,a}\ll P^o_{h,s,a},\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\text{ and }h\in[H]\}$. By definition, for any $\pi$, it follows that $V^{\pi}_{h,\lambda}\leq V^{h,\pi}_{P^o,r}\leq H$. For $h\in[H]$, the optimal robust regularized value function is $V^{*}_{h,\lambda}=\max_{\pi}V^{\pi}_{h,\lambda}$, and $\pi^{*}$ is the robust regularized optimal policy that achieves this optimal value.
For convenience, we denote $V^{*}_{h,\lambda}$ ($Q^{*}_{h,\lambda}$) as $V^{*}_{h}$ ($Q^{*}_{h}$) for all $h\in[H]$. We again note that, for each $h\in[H]$, $\mathcal{P}$ satisfies the $(s,a)$-rectangularity condition (Iyengar, 2005) by definition. It enables the existence of a non-stationary deterministic policy for $\pi^{*}$ (Zhang et al., 2023). We formalize this in Proposition 6. We denote $V^{\pi}=\mathbb{E}_{s\sim d_{0}}[V^{\pi}_{0}(s)]$ as the expected total reward.

For convenience, we let $Q_{H,\lambda}^{\pi}=0$ for any $\pi$. For any $h\in[H]$, denote the robust regularized Bellman operator $\mathcal{T}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\to\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as

$$(\mathcal{T}Q_{h+1})(s,a)=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big(\mathbb{E}_{s'\sim P_{h,s,a}}[\max_{a'}Q_{h+1}(s',a')]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big). \tag{10}$$

As $Q^{*}_{H}=0$, doing backward iteration of $\mathcal{T}$, i.e., the robust dynamic programming $Q^{*}_{h}=\mathcal{T}Q^{*}_{h+1}$, we get $Q^{*}_{h}$ for all $h\in[H]$. For each timestep $h\in[H]$, we also get the robust optimal policy as $\pi^{*}_{h}(s)=\arg\max_{a}Q^{*}_{h}(s,a)$.
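The backward iteration described here can be sketched in a few lines for a small tabular problem. This is only an illustration under stated assumptions: `backward_induction` and `nominal_op` are hypothetical names, and `nominal_op` is a stand-in that applies the ordinary (non-robust) Bellman backup under a nominal kernel, not the robust regularized operator (10). Any callable implementing $\mathcal{T}$ could be passed in its place.

```python
import numpy as np

def backward_induction(bellman_op, n_states, n_actions, H):
    """Compute Q_h = T Q_{h+1} backward from Q_H = 0 and the greedy policy.

    bellman_op(h, Q_next) must return an (n_states, n_actions) array.
    """
    Q = np.zeros((H + 1, n_states, n_actions))  # Q[H] stays zero
    for h in range(H - 1, -1, -1):
        Q[h] = bellman_op(h, Q[h + 1])
    pi = Q[:H].argmax(axis=2)  # pi[h, s] = argmax_a Q[h, s, a]
    return Q, pi

def nominal_op(r, P):
    """Stand-in operator (assumption): non-robust backup r_h + E_{s'~P}[max_a' Q(s', a')].

    r has shape (H, n_states, n_actions); P has shape (n_states, n_actions, n_states).
    """
    def op(h, Q_next):
        return r[h] + P @ Q_next.max(axis=1)
    return op
```

Swapping `nominal_op` for a robust operator leaves the recursion itself unchanged; only the per-step backup differs.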

3.1 Problem Conceptualization

In this section, we study the hybrid finite-horizon robust TV-regularized RL problem, acquiring the necessary insights to construct our algorithm (Algorithm 2) in the next section. We develop the concepts for a general $\varphi$-divergence, but propose our algorithm only for the total variation $\varphi$-divergence. The goal here is to learn a good robust policy $\hat{\pi}$ based on adaptive datasets consisting of both offline datasets and on-policy datasets. We start by noting a direct consequence of Proposition 1, which holds because the infinite-horizon (3) and finite-horizon (10) operators share the same form of inner minimization problem.

Corollary 1.

For any $Q_{h}:\mathcal{S}\times\mathcal{A}\to[0,H]$ and $h\in[H]$, the robust regularized Bellman operator $\mathcal{T}$ (10) can be equivalently written as

$$(\mathcal{T}Q_{h+1})(s,a)=r_{h}(s,a)-\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[\varphi^{*}((\eta-V_{h+1}(s'))/\lambda)]-\eta\big), \tag{11}$$

where $V_{h+1}(s)=\max_{a\in\mathcal{A}}Q_{h+1}(s,a)$ and $\Theta\subset\mathbb{R}$ is some bounded interval of the real line that depends on $\varphi^{*}$.

As in Section 2, this dual reformulation enables us to use the datasets from only the nominal model $P^{o}$ for estimating the robust regularized operator in its primal form (10).

We start by recalling the philosophy of the HyQ algorithm (Song et al., 2023) to use the FQI algorithm for adaptive datasets. We do the same for our hybrid finite-horizon robust $\varphi$-regularized RL problem here. For each $h\in[H]$, we need to estimate the true Bellman error $\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}Q^{*}_{h+1}(s,a)-Q^{*}_{h}(s,a)|]+\sum_{t=0}^{k-1}\mathbb{E}_{s,a\sim d^{\pi_{t}}_{h,P^{o}}}[|\mathcal{T}Q^{*}_{h+1}(s,a)-Q^{*}_{h}(s,a)|]$ using the offline dataset from $\mu_{h}$ and the on-policy dataset from $d^{\pi_{t}}_{h,P^{o}}$ generated by the policies learned by the algorithm. We remark that the out-of-data-distribution issue appears when we only have access to the offline dataset to estimate the summation term above, which depends on $d^{\pi_{t}}_{h,P^{o}}$.

As discussed in Section 2, the true Bellman error itself involves solving an inner convex minimization problem in $\mathcal{T}Q^{*}_{h+1}(s,a)$ (11) for every $(s,a)$ and $h$, which is challenging in the countably large state setting. To alleviate this challenge, we again utilize the functional minimization result (Proposition 2) developed in Section 2. For any $h$, we denote the set of admissible distributions of the nominal model $P^{o}$ as $\mathbb{D}_{h}=\{\mu_{h}\}\cup\{d^{\pi}_{h,P^{o}}\mid\text{for any policy (including non-stationary) }\pi\}$. Now we redefine the dual loss for any $f_{h+1}\in\mathcal{F}_{h+1}$ and $\nu_{h}\in\mathbb{D}_{h}$ as

$$L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})=\mathbb{E}_{s,a\sim\nu_{h},\,s'\sim P^{o}_{h,s,a}}[\lambda\varphi^{*}((g(s,a)-\max_{a'}f_{h+1}(s',a'))/\lambda)-g(s,a)]. \tag{12}$$

We state a direct consequence of Proposition 2 here.

Corollary 2.

Let $L_{\mathrm{dual}}$ be the loss function defined in (12). Fix $h\in[H]$ and consider any policy $\pi$. Then, for any function $f_{h+1}:\mathcal{S}\times\mathcal{A}\to[0,H]$ and any $\nu_{h}\in\mathbb{D}_{h}$, we have

$$\inf_{g\in L^{1}(\nu_{h})}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})=\mathbb{E}_{s,a\sim\nu_{h}}\Big[\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[\varphi^{*}((\eta-\max_{a'}f_{h+1}(s',a'))/\lambda)]-\eta\big)\Big]. \tag{13}$$

For any given $f_{h+1}:\mathcal{S}\times\mathcal{A}\to[0,H]$ and $h$, we redefine the operator $\mathcal{T}_{g}$ for all $g\in\mathcal{G}_{h}$ as

$$(\mathcal{T}_{g}f_{h+1})(s,a)=r_{h}(s,a)-\lambda\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[\varphi^{*}((g(s,a)-\max_{a'}f_{h+1}(s',a'))/\lambda)]+g(s,a). \tag{14}$$

We have all the necessary tools now. In the following subsection, we develop an algorithm that naturally extends our RPQ algorithm using adaptive datasets.

3.2 Hybrid Robust regularized Q-iteration

In this section, we propose our algorithm based on the tools developed so far. Our proposed algorithm is called the Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q) Algorithm, summarized in Algorithm 2. The total variation $D_{\mathrm{TV}}$ $\varphi$-divergence (1) is defined with $\varphi(t)=|t-1|/2$. The inputs to this algorithm are the offline dataset and two general function classes $\mathcal{F}=\otimes_{h\in[H]}\mathcal{F}_{h}$, $\mathcal{G}=\otimes_{h\in[H]}\mathcal{G}_{h}$. For any $h\in[H]$, $\mathcal{F}_{h}\subset\{f:\mathcal{S}\times\mathcal{A}\to[0,H]\}$ and $\mathcal{G}_{h}\subset\{g:\mathcal{S}\times\mathcal{A}\to[0,\lambda]\}$ represent action-value functions and dual-variable functions at $h$, respectively. We redefine, using (17), the empirical dual loss and the empirical squared robust regularized Bellman error for dataset $\mathcal{D}$ as

$$\widehat{L}_{\mathrm{dual}}(g;f,\mathcal{D})=\mathbb{E}_{\mathcal{D}}[(g(s_{i},a_{i})-\max_{a'}f(s_{i}',a'))_{+}-g(s_{i},a_{i})]\quad\text{and} \tag{15}$$
$$\widehat{L}_{\mathrm{robQ}}(Q;f,g,\mathcal{D})=\mathbb{E}_{\mathcal{D}}[(r_{h}(s_{i},a_{i})-(g(s_{i},a_{i})-\max_{a'}f(s_{i}',a'))_{+}+g(s_{i},a_{i})-Q(s_{i},a_{i}))^{2}]. \tag{16}$$
Algorithm 2 HyTQ Algorithm
1:  Input: Offline dataset $\mathcal{D}^{\mu}_{h}\sim\mu_{h}$ of size $m_{\mathrm{off}}=T$ for $h\in[H]$, general function classes $\mathcal{F}$ and $\mathcal{G}$.
2:  Initialize: $Q^{0}_{h}\equiv 0\in\mathcal{F}_{h}$.
3:  for $k=0,\cdots,K-1$ do
4:     Compute $\pi_{k}$ as $\pi_{k,h}(s)=\arg\max_{a}Q^{k}_{h}(s,a)$
5:     $\forall h$, collect $m_{\mathrm{on}}=1$ online dataset $\mathcal{D}^{k}_{h}\sim d^{\pi_{k}}_{h,P^{o}}$
6:     Initialize: $Q^{k+1}_{H}\equiv 0\in\mathcal{F}_{H}$
7:     for $h=H-1,\cdots,0$ do
8:        Aggregate adaptive dataset $\mathcal{D}^{k}_{h}=\mathcal{D}^{\mu}_{h}+\sum_{\tau=0}^{k}\mathcal{D}^{\tau}_{h}$
9:        Dual variable function minimization (cf. (15)): $g^{k+1}_{h}=\arg\min_{g\in\mathcal{G}_{h}}\widehat{L}_{\mathrm{dual}}(g;Q^{k+1}_{h+1},\mathcal{D}^{k}_{h})$
10:        Robust $\varphi$-regularized Q-update (cf. (16)): $Q^{k+1}_{h}=\arg\min_{Q\in\mathcal{F}_{h}}\widehat{L}_{\mathrm{robQ}}(Q;Q^{k+1}_{h+1},g^{k+1}_{h},\mathcal{D}^{k}_{h})$
11:     end for
12:  end for
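To make the loop structure of Algorithm 2 concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes finite state and action spaces, takes $\mathcal{F}_h$ to be all tables with values in $[0,H]$, and restricts $\mathcal{G}_h$ to tables whose entries lie on a uniform grid over $[0,\lambda]$. Under these assumptions both empirical losses (15) and (16) decouple across $(s,a)$ pairs: the dual step is a grid search, and the Q-update is the clipped mean of the regression targets. The names `hytq_tabular`, `make_sampler`, and `n_grid` are hypothetical, and known rewards `rewards[h][s, a]` are assumed.

```python
import numpy as np

def make_sampler(P, d0, rng):
    """Hypothetical helper: draw one (s_h, a_h, s_{h+1}) by rolling pi out on the nominal kernel P."""
    n_states = P.shape[0]
    def sample_online(pi, h):
        s = rng.choice(n_states, p=d0)
        for t in range(h):
            s = rng.choice(n_states, p=P[s, pi[t, s]])
        a = pi[h, s]
        return (int(s), int(a), int(rng.choice(n_states, p=P[s, a])))
    return sample_online

def hytq_tabular(offline, sample_online, rewards, n_states, n_actions, H, lam, K, n_grid=11):
    """Tabular sketch of Algorithm 2; offline[h] is a list of (s, a, s_next) tuples."""
    grid = np.linspace(0.0, lam, n_grid)         # candidate dual values in [0, lam]
    Q = np.zeros((H + 1, n_states, n_actions))   # Q^0 = 0, and Q[H] = 0 throughout
    online = [[] for _ in range(H)]
    policies = []
    for k in range(K):
        pi = Q[:H].argmax(axis=2)                # step 4: greedy policy pi_k
        policies.append(pi)
        for h in range(H):                       # step 5: one online sample per h
            online[h].append(sample_online(pi, h))
        Q_new = np.zeros_like(Q)                 # step 6: Q^{k+1}_H = 0
        for h in range(H - 1, -1, -1):
            data = np.array(offline[h] + online[h])      # step 8: aggregate data
            s, a, s2 = data[:, 0], data[:, 1], data[:, 2]
            v_next = Q_new[h + 1][s2].max(axis=1)        # max_a' Q^{k+1}_{h+1}(s', a')
            for si in range(n_states):
                for ai in range(n_actions):
                    mask = (s == si) & (a == ai)
                    if not mask.any():
                        continue                 # unobserved pairs keep Q = 0
                    v = v_next[mask]
                    # step 9: minimize the empirical dual loss (15) over the grid
                    losses = [np.mean(np.maximum(c - v, 0.0) - c) for c in grid]
                    g_sa = grid[int(np.argmin(losses))]
                    # step 10: least-squares fit of (16) is the clipped mean target
                    target = rewards[h][si, ai] - np.mean(np.maximum(g_sa - v, 0.0)) + g_sa
                    Q_new[h, si, ai] = np.clip(target, 0.0, H)
        Q = Q_new
    return Q, policies
```

With general function classes, the two per-pair closed forms would be replaced by whatever optimizer fits $\mathcal{G}_h$ and $\mathcal{F}_h$; the data aggregation and backward sweep stay the same.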

3.3 Cumulative Suboptimality Guarantee

We now discuss the performance guarantee in terms of the cumulative suboptimality of our HyTQ Algorithm. We first mention all the assumptions before we present our main result and add a brief discussion. We provide detailed discussion in Section 4.

Assumption 4 (Robust Bellman Error Transfer Coefficient).

Let $\mu_{h}\in\Delta(\mathcal{S}\times\mathcal{A})$ be the offline data generating distribution. For any $f\in\mathcal{F}$, there exists a small positive constant $C(\pi^{*})$ for the optimal policy $\pi^{*}$ that satisfies

$$\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h,P^{o}}}[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\leq C(\pi^{*}).$$

We develop this assumption from non-robust offline RL work (Song et al., 2023).

Assumption 5 (Approximate Value Realizability and Robust Bellman Completeness).

Let $\varepsilon_{\mathcal{F},\mathrm{r}}\geq 0$ be a small constant. For any $h\in[H]$ and $g_{h}\in\mathcal{G}_{h}$, we have $\inf_{f\in\mathcal{F}_{h}}\sup_{\nu_{h}}\|f-\mathcal{T}_{g_{h}}f_{h+1}\|_{2,\nu_{h}}^{2}\leq\varepsilon_{\mathcal{F},\mathrm{r}}$ for all $\nu_{h}\in\mathbb{D}_{h}$. Furthermore, for any $f_{h+1}\in\mathcal{F}_{h+1}$, we have $\mathcal{T}_{g_{h}}f_{h+1}\in\mathcal{F}_{h}$.

Assumption 6 (Approximate Dual Realizability).

Let $\varepsilon_{\mathcal{G}}$ be some small positive constant. For any $h\in[H]$ and $f_{h+1}\in\mathcal{F}_{h+1}$, we have $\inf_{g\in\mathcal{G}_{h}}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})-\inf_{g\in L^{1}(\nu_{h})}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})\leq\varepsilon_{\mathcal{G}}$, for all $\nu_{h}\in\mathbb{D}_{h}$.

We adapt these two enhanced realizability assumptions from the non-robust offline RL literature (Xie et al., 2021; Foster et al., 2022; Song et al., 2023) to our problem. The assumptions in Section 2 are not directly comparable, but for the sake of exposition, let $\mathcal{F}_h,\mathcal{G}_h$ be the same across $h$. First, note that Assumption 3 with all-policy concentrability (Assumption 1) is equivalent to Assumption 6. Second, Assumption 2 implies $\inf_{f\in\mathcal{F}}\|f-\mathcal{T}_{g}f\|_{2,\mu}^{2}\leq\varepsilon_{\mathcal{F}}$. Again with all-policy concentrability (Assumption 1), this yields the approximate value realizability (Assumption 5). Non-robust offline RL is known to be hard (Foster et al., 2022) with just realizability and all-policy concentrability. As robust RL is at least as hard as its non-robust counterpart (Panaganti and Kalathil, 2022), we also assume Bellman completeness in Assumption 5.

Assumption 7 (Bilinear Models).

Consider any $f\in\mathcal{F}$, $g\in\mathcal{G}$ and $h\in[H]$. Let $\pi^{f}$ be the greedy policy w.r.t. $f$. There exist an unknown feature mapping $X_h:\mathcal{F}\mapsto\mathbb{R}^{d}$ and two unknown weight mappings $W^{\mathrm{q}}_h,W^{\mathrm{d}}_h:\mathcal{F}\times\mathcal{G}\mapsto\mathbb{R}^{d}$ with $\max_{f}\|X_h(f)\|_2\leq B_X$ and $\max_{f,g}\max\{\|W^{\mathrm{q}}_h(f,g)\|_2,\|W^{\mathrm{d}}_h(f,g)\|_2\}\leq B_W$ such that both $\mathbb{E}_{d^{\pi^{f}}_h}[(f_h(s,a)-T_{g_h}f_{h+1})_+]=\left\lvert\left\langle X_h(f),W^{\mathrm{q}}_h(f,g)\right\rangle\right\rvert$ and $\mathbb{E}_{d^{\pi^{f}}_h}[(T_{g_h}f_{h+1}-Tf_{h+1})_+]=\left\lvert\left\langle X_h(f),W^{\mathrm{d}}_h(f,g)\right\rangle\right\rvert$ hold.

We adapt this problem architecture assumption on $P^{o}$ with $\mathcal{F}$ and $\mathcal{G}$ for our setting from a series of non-robust online RL works (Jin et al., 2021a; Du et al., 2021).

Assumption 8 (Fail-state).

There is a fail state $s_{f,h}$ for all $h\in[H]$, such that $r_h(s_f,a)=0$ and $P_{h,s_f,a}(s_{f,h})=1$, for all $a\in\mathcal{A}$ and $P\in\mathcal{P}$ satisfying $D_{\mathrm{TV}}(P_{h',s',a'},P^{o}_{h',s',a'})\leq\max\{1,H/\lambda\}$ for all $h',s',a'$.

This assumption enables us to ground the value of such $P$'s at $s_{f,h}$ to zero, which helps us to get a tight duality (c.f. (17)) without having to know the minimum value across large $\mathcal{S}$. There are approximations to this in the literature (Wang and Zou, 2022). But we adopt this less restrictive assumption from Panaganti et al. (2022) for convenience.

Now we state our main theoretical result on the performance of the HyTQ algorithm. The proof is presented in Appendix E.

Theorem 2.

Let Assumptions 4, 5, 6, 7 and 8 hold. Fix any $\delta\in(0,1)$. Then, the policies $\{\pi_k\}_{k\in[K]}$ of the HyTQ algorithm satisfy $\sum_{k=0}^{K-1}(V^{\pi^*}-V^{\pi_k})\leq\widetilde{\mathcal{O}}(\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})+\widetilde{\mathcal{O}}(\max\{C(\pi^*),1\}\sqrt{dH^{2}K}(\lambda+H)\log(|\mathcal{F}||\mathcal{G}|/\delta))$ with probability at least $1-\delta$.

Remark 2.

We specialize this result for two bilinear model examples, the linear occupancy complexity model (Du et al., 2021, Definition 4.7) and the low-rank feature selection model (Du et al., 2021, Definition A.1), in Section E.2. We also specialize this result using the standard online-to-batch conversion (Shalev-Shwartz and Ben-David, 2014) for the uniform policy over the HyTQ policies $\{\pi_k\}_{k\in[K]}$ to provide the sample complexity $\widetilde{\mathcal{O}}(\max\{(C(\pi^*))^{2},1\}dH^{3}(\lambda+H)^{2}(\log(|\mathcal{F}||\mathcal{G}|/\delta))^{2})/\varepsilon^{2}$ in Section E.2.
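
The online-to-batch step can be made concrete with a short worked derivation showing how the regret bound of Theorem 2 is consistent with the stated sample complexity. This is only a sketch under assumptions that are not taken from the paper: the approximation errors $\varepsilon_{\mathcal{F},\mathrm{r}}$ and $\varepsilon_{\mathcal{G}}$ are set to zero, $B$ is a shorthand introduced here for the $K$-independent factor of the bound, and each iteration is assumed to consume on the order of $H$ samples. The precise statement is the one in Section E.2.

```latex
% B is a shorthand introduced here for illustration (not notation from the paper).
\begin{align*}
\sum_{k=0}^{K-1}\bigl(V^{\pi^*}-V^{\pi_k}\bigr) &\le B\sqrt{K},
\qquad B := \widetilde{\mathcal{O}}\bigl(\max\{C(\pi^*),1\}\sqrt{dH^{2}}\,(\lambda+H)\log(|\mathcal{F}||\mathcal{G}|/\delta)\bigr),\\
\frac{1}{K}\sum_{k=0}^{K-1}\bigl(V^{\pi^*}-V^{\pi_k}\bigr) &\le \frac{B}{\sqrt{K}},
\qquad\text{and}\qquad \frac{B}{\sqrt{K}}\le\varepsilon \iff K\ge\frac{B^{2}}{\varepsilon^{2}},\\
% Assumption made here: roughly H samples per iteration, so N = K H.
N = KH &\ge \frac{H B^{2}}{\varepsilon^{2}}
= \widetilde{\mathcal{O}}\!\left(\frac{\max\{(C(\pi^*))^{2},1\}\,dH^{3}(\lambda+H)^{2}\bigl(\log(|\mathcal{F}||\mathcal{G}|/\delta)\bigr)^{2}}{\varepsilon^{2}}\right).
\end{align*}
```

Under the same assumption $N=KH$, the average suboptimality $B/\sqrt{K}$ equals $B\sqrt{H}/\sqrt{N}$, which has the $\sqrt{dH^{3}}(\lambda+H)/\sqrt{N}$ scaling.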

4 Theoretical Discussions and Final Remarks

In this section, we first discuss the proof ideas for our results, focusing on the assumptions and their improvements. Next, we compare our results with the most relevant ones from the robust RL literature. Our Table 1 should be used as a reference. Finally, we discuss the bilinear model architecture in detail, as ours is the first work to consider it in the robust RL setting under general function approximation of the value and dual functions.

Discussions on Proof Sketch: We first discuss our RPQ algorithm (Algorithm 1) result. We note that the concentrability assumption (Assumption 1) requires the data-generating policy to be robust exploratory. That is, it covers the state-action occupancy induced by any policy and any transition model in the $\varphi$-divergence set. We reiterate the proof idea of the suboptimality result (Panaganti et al., 2022, Theorem 1) of the RFQI algorithm (Panaganti et al., 2022, Algorithm 1). We highlight the most important differences with Panaganti et al. (2022); Zhang et al. (2023) here. Firstly, we generalize the robust performance lemma ($\mathbb{E}_{s_0\sim d_0}[V^{\pi^*}]-\mathbb{E}_{s_0\sim d_0}[V^{\pi_K}]\leq 2\|Q^{\pi^*}-Q_K\|_{1,\nu}/(1-\gamma)$ at Eq. 26) for any general $\varphi$-divergence problem.
Secondly, we identify that it is hard to come up with a unified analysis for general $\varphi$-divergences in the robust RL setting via the dual reformulation of the distributionally robust optimization problem (Duchi and Namkoong, 2018, Proposition 1). Thus, a direct extension of the results in Panaganti et al. (2022) is hard for general $\varphi$-divergences. Through the RPQ analysis, we show that it is indeed possible to get a unified analysis for the robust RL problem using the RRMDP framework. Thirdly, we show the generalization bounds for the empirical risk minimization (Proposition 7) and least squares (Proposition 8) estimators for general $\varphi$-divergences with unified results. With these three points, equipped with the more general robust exploratory concentrability (Assumption 1), we obtain a unified suboptimality result (Theorem 1) for general $\varphi$-divergences for the RPQ algorithm.

We now discuss our HyTQ algorithm (Algorithm 2) result. We first make an important note. The concentrability assumption improvement is two-fold: from all-policy concentrability (Assumption 9) to single concentrability, and then to the robust Bellman error transfer coefficient (Assumption 4) via Lemma 8. We refer to Foster et al. (2022); Song et al. (2023) for further discussion on such concentrability assumption improvements and their tightness in non-robust offline RL. We leave tightening these assumptions in the robust RL setting to future work. We carry out a tighter analysis in our HyTQ algorithm result (Theorem 2) compared to our RPQ algorithm result specialized to the TV $\varphi$-divergence (Theorem 4). We summarize the steps as follows:
Step $(a)$: We meticulously arrive at the following robust performance lemma (c.f. Eqs. 37 and 39) for each algorithm iteration $k$: $\mathbb{E}_{s_0\sim d_0}[V^{\pi^*}_0(s_0)-V^{\pi_k}_0(s_0)]\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^*}_h}[(\mathcal{T}Q^{k}_{h+1}(s,a)-Q^{k}_h(s,a))_+]+\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_k}_h}[(Q^{k}_h(s,a)-\mathcal{T}Q^{k}_{h+1}(s,a))_+]$. We highlight that the first summand here depends on samples from the state-action occupancy of the optimal robust policy, whereas the second summand is w.r.t. the learned HyTQ policies. It is now intuitive to connect the first summand with the offline samples and the second with the online samples.
Finally, step $(b)$: With the above intuition, firstly, the history-dependent dataset collected by the different offline data-generating policy and the learned HyTQ policies on the nominal model $P^{o}$ warrants more sophisticated generalization bounds for the empirical risk minimization and least squares estimators. We prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process in Appendix C. This result is also applicable to other machine learning problems outside the scope of this paper. Finally, equipped with the transfer coefficient (Assumption 4) and bilinear model (Assumption 7) assumptions for the nominal model $P^{o}$, we formally show generalization bounds for the empirical risk minimization and least squares estimators in Propositions 9 and 10, respectively.
We complete the proof by combining these two steps.

Remark 3.

We offer computational tractability in our RPQ and HyTQ algorithms through the use of computationally tractable estimators: empirical risk minimization (Steps 4 & 9 resp.) over the general function class $\mathcal{G}$, and least squares (Steps 5 & 10 resp.) over the general function class $\mathcal{F}$. This two-step estimator update avoids the complexity of solving the inner problem for each state-action pair (leading to scaling issues for high-dimensional problems) in the original robust Bellman operators (Eqs. 3 and 10). To the best of our knowledge, no purely online or purely offline robust RL algorithms are known to be tractable in this sense, except other robust Q-iteration and actor-critic methods (discussed in Table 1) and except under much stronger coverage conditions (like single-policy and uniform) in the tabular setting.

Theoretical Guarantee Discussions: In the suboptimality result (Theorem 1) for the RPQ algorithm (Algorithm 1), we only mention the leading statistical bound with a problem-dependent (on the $\varphi$-divergence) constant $c_{\varphi}(\lambda,\gamma)$. We provide the exact constants pertaining to different $\varphi$-divergences in a restated statement of Theorem 1 in Theorem 3. Furthermore, the constants $c_1,c_2,c_3$ in Theorem 3 take different values for different $\varphi$-divergences, as provided in Proposition 3. Similarly, for the suboptimality result (Theorem 2) of the HyTQ algorithm (Algorithm 2), we provide a more detailed bound in a restated statement in Theorem 5.

In the following we provide comparisons of suboptimality results with relevant prior works. But first, we make an important note on $\rho$, the robustness radius parameter in RMDPs, and $\lambda$, the robustness penalization parameter in RRMDPs, mentioned briefly in Table 1. Levy et al. (2020) and Yang et al. (2023) establish that the regularized and constrained versions of the DRO and robust MDP problems, respectively, are equivalent by connecting their respective ($\lambda$ and $\rho$) robustness parameters. Moreover, both observe rigorously that $\lambda$ and $\rho$ are inversely related. This is intuitively true, as $\lambda\to\infty$ and $\rho\to 0$ both yield the non-robust solutions on the nominal model $P^{o}$, and as $\lambda\to 0$ and $\rho\to\infty$ both yield the conservative solutions considering the entire probability simplex for the transition dynamics. However, it is an interesting open problem to establish an exact analytical relation between the robustness parameters $\lambda$ and $\rho$. We leave this to future research as it is out of the scope of this work.
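
As a small numerical illustration of the two limits of $\lambda$, the sketch below evaluates a TV-regularized inner problem $\inf_{P}\{\mathbb{E}_{P}[v]+\lambda D_{\mathrm{TV}}(P,P^{o})\}$ on a toy three-state example. It is not the paper's algorithm: the function name `tv_regularized_value`, the convention $D_{\mathrm{TV}}(P,Q)=\frac{1}{2}\sum_{s}|P(s)-Q(s)|$, the full-support nominal distribution, and the toy numbers are all assumptions made here for illustration. Under these assumptions, shifting a unit of probability mass to the lowest-value state costs $\lambda$, so the infimum equals the nominal expectation of $v$ capped at $\min v+\lambda$.

```python
def tv_regularized_value(values, nominal, lam):
    """Closed form of inf_P { E_P[v] + lam * TV(P, nominal) } on a finite state space.

    Assumptions (illustrative, not from the paper):
      * TV(P, Q) = 0.5 * sum_s |P(s) - Q(s)|
      * `nominal` has full support, so every state is reachable by the adversary
    Moving a unit of mass from state s to the lowest-value state costs `lam`
    and saves v(s) - min(v), so each value is effectively capped at min(v) + lam.
    """
    v_min = min(values)
    return sum(p * min(v, v_min + lam) for p, v in zip(nominal, values))


# Toy example: three next states with values v and nominal probabilities P^o.
v = [1.0, 0.0, 3.0]
p_nominal = [0.5, 0.25, 0.25]

worst_case = tv_regularized_value(v, p_nominal, lam=0.0)     # no penalty: adversary is free
intermediate = tv_regularized_value(v, p_nominal, lam=0.5)   # moderate penalty
near_nominal = tv_regularized_value(v, p_nominal, lam=10.0)  # large penalty: adversary stays put
```

With no penalty the value collapses to the minimum of $v$, and with a large penalty it returns the nominal expectation, matching the conservative and non-robust limits described above.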

Here we specialize our result (Theorem 3) for the chi-square $\varphi$-divergence $\gamma$R3L problem. We get the suboptimality for the RPQ algorithm as $\widetilde{\mathcal{O}}\left(\frac{\max\{\frac{1}{\lambda(1-\gamma)^{2}},\lambda\}\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{2}\sqrt{N}}\right)$, where we have only presented the higher-order terms. The suboptimality of Algorithm 2 in Yang et al. (2023, Theorem 5.1) for the chi-square $\varphi$-divergence is stated for $\lambda=1/(1-\gamma)$ as $\widetilde{\mathcal{O}}\left(\frac{\max\{\frac{1}{(1-\gamma)^{2}},\sqrt{\log(|\mathcal{S}||\mathcal{A}|)}\}}{d^{3}_{\min}(1-\gamma)^{3}N^{1/3}}\right)$, where $d_{\min}$ is described in Table 1.
We use the typical equivalence from the RL literature for comparison between these two results in the tabular setting with the generative/simulator modeling assumption: function approximation classes with full dimension yield $\log(|\mathcal{F}||\mathcal{G}|)=O(|\mathcal{S}||\mathcal{A}|)$ (Panaganti et al., 2022), and uniform support data sampling yields $\mu_{\min}=1/(|\mathcal{S}||\mathcal{A}|)$ and $C\leq|\mathcal{S}||\mathcal{A}|$ (Shi et al., 2023). Now our result with $\lambda=1/(1-\gamma)$ reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}\right)$ and their result (Yang et al., 2023) reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}|^{3}|\mathcal{A}|^{3}\max\{\frac{1}{(1-\gamma)^{2}},\sqrt{\log(|\mathcal{S}||\mathcal{A}|)}\}}{(1-\gamma)^{3}N^{1/3}}\right)$. Two comments warrant attention here. Firstly, compared to a model-based robust regularized algorithm (robust value iteration using empirical estimates of the nominal model $P^{o}$) (Yang et al., 2023, Theorem 3.2), our suboptimality bound is worse off by the factors $\sqrt{|\mathcal{S}||\mathcal{A}|}$ and $1/(1-\gamma)$. We leave it to future work to fine-tune the analysis and get optimal rates. Secondly, their result (Yang et al., 2023, Theorem 5.1) exhibits inferior performance compared to ours in all parameters, but we do want to note that they make a first attempt to give suboptimality bounds for a stochastic approximation-based algorithm. The dependence on $|\mathcal{S}||\mathcal{A}|$ is typically known to be bad when using the stochastic approximation technical tool (Chen et al., 2022), and Yang et al. (2023, Discussion on Page 16) conjecture that the Polyak-averaging technique can improve their suboptimality bound rate to $N^{-1/2}$.
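
To get a feel for the gap between the two tabular rates discussed above, the following sketch evaluates their leading terms for given problem sizes. It is purely illustrative: the function names are hypothetical, and all constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$ are dropped, so the outputs indicate scaling only and are not actual suboptimality values.

```python
import math


def rpq_tabular_rate(num_states, num_actions, gamma, n_samples):
    """Leading term |S||A| / ((1 - gamma)^3 * sqrt(N)) of the tabular RPQ bound.

    Constants and log factors hidden by O-tilde are dropped (illustration only).
    """
    sa = num_states * num_actions
    return sa / ((1.0 - gamma) ** 3 * math.sqrt(n_samples))


def stochastic_approx_tabular_rate(num_states, num_actions, gamma, n_samples):
    """Leading term |S|^3 |A|^3 max{1/(1-gamma)^2, sqrt(log|S||A|)} / ((1-gamma)^3 N^(1/3)).

    This is the tabular reduction quoted in the text for the comparison bound,
    again with hidden constants and log factors dropped.
    """
    sa = num_states * num_actions
    lead = max(1.0 / (1.0 - gamma) ** 2, math.sqrt(math.log(sa)))
    return sa ** 3 * lead / ((1.0 - gamma) ** 3 * n_samples ** (1.0 / 3.0))


# Hypothetical problem size chosen only for illustration.
S, A, GAMMA, N = 10, 5, 0.9, 10 ** 6
rpq = rpq_tabular_rate(S, A, GAMMA, N)
sa_based = stochastic_approx_tabular_rate(S, A, GAMMA, N)
print(f"RPQ leading term: {rpq:.3g}, stochastic-approximation leading term: {sa_based:.3g}")
```

Quadrupling `N` halves the first quantity but shrinks the second only by a factor of $4^{1/3}$, which is the $N^{-1/2}$ versus $N^{-1/3}$ difference noted in the text.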

Here we discuss and compare our result for the total variation $\varphi$-divergence setting. As mentioned in Remark 1, we have a specialized result in Section D.2 for the total variation $\varphi$-divergence. We get the suboptimality result (Theorem 4) for the RPQ algorithm as $\widetilde{\mathcal{O}}\left(\frac{\lambda\sqrt{C_{\mathrm{tv}}\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{3}\sqrt{N}}\right)$, where we again have only presented the higher-order terms. Panaganti et al. (2022, Theorem 1), mentioned in Table 1, also exhibits the same suboptimality guarantee with $\lambda$ replaced by $\rho^{-1}$. As we noted before, $\rho$ (the robustness radius parameter in RMDPs) and $\lambda$ (the robustness penalization parameter in RRMDPs) are inversely related, and for the TV $\varphi$-divergence we observe a straightforward relation between the two, namely $\lambda=\rho^{-1}$. Using the earlier arguments for a tabular setting bound, our result further reduces to $\widetilde{\mathcal{O}}\left(\frac{\lambda|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}\right)$.
Now comparing this to the minimax lower bound (Shi et al., 2023, Theorem 2), our suboptimality bound is worse off by the factors $\sqrt{|\mathcal{S}||\mathcal{A}|}$ and $1/(1-\gamma)$. Nevertheless, we push the boundaries by providing a novel suboptimality guarantee for the robust RL problem in the hybrid RL setting. Furthermore, as mentioned earlier in Remark 2, we provide the offline+online robust RL suboptimality guarantee $\widetilde{\mathcal{O}}\left(\max\{C(\pi^*),1\}\sqrt{dH^{3}}(\lambda+H)\log(|\mathcal{F}||\mathcal{G}|/\delta)/\sqrt{N}\right)$ in Appendix E. We also remark that the HyTQ algorithm can be proposed under the RMDP setting with a similar suboptimality guarantee due to the similarity of the dual Bellman equations under the TV $\varphi$-divergence for RMDPs and RRMDPs (c.f. Eq. 33 and Xu et al. (2023, Lemma 8)). For the sake of consistency and novelty, we present our results solely for the RRMDP setting. As mentioned earlier, the concentrability assumption improvement is two-fold (Lemma 8): from all-policy concentrability (Assumption 9) to single concentrability, and then to the transfer coefficient. This is the first result of its kind, and there are not yet any existing lower bounds to compare against in the robust RL setting.
Under similar transfer coefficient, Bellman completeness, and bilinear model assumptions, the HyTQ algorithm sample complexity (Corollary 5) is comparable to that of a non-robust RL algorithm (Song et al., 2023), i.e., $\widetilde{\mathcal{O}}(\max\{(C(\pi^*))^{2},1\}dH^{5}\log(H|\mathcal{F}|/\delta)/\varepsilon^{2})$. We leave developing minimax rates and obtaining optimal algorithm guarantees to future work.

Here we specialize our result (Theorem 3) to the KL $\varphi$-divergence $\gamma$R3L problem. We get the suboptimality for RPQ as $\widetilde{\mathcal{O}}\left(\frac{(\lambda+(1-\gamma)^{-1})\exp\{(\lambda(1-\gamma))^{-1}\}\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{2}\sqrt{N}}\right)$, where we have presented only the higher-order terms. Using the earlier arguments for a tabular setting bound, our result with $\lambda=1/(1-\gamma)$ again reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}\right)$. Zhang et al. (2023, Theorem 5), mentioned in Table 1, exhibits the same suboptimality guarantee. Two remarks are in order here. First, our RPQ algorithm and its theoretical guarantee hold uniformly for a class of $\varphi$-divergences, whereas Zhang et al. (2023, Algorithm 1) is specialized to the KL $\varphi$-divergence. This supports our first main contribution discussed in Section 1. Second, the robust regularized Bellman operator (Eq. 3) for the KL $\varphi$-divergence has a special form due to the existence of an analytical worst-case transition model. This yields an exponential robust Bellman operator in a Q-value-variant space. This special structure helps avoid the dual variable function update (Step 4) in the RPQ algorithm and the $\log(|\mathcal{G}|)$ factor in the suboptimality guarantee. We choose not to include this specialized result in this work (in contrast to the TV $\varphi$-divergence, which we treat in Section D.2) and directly point to Zhang et al. (2023). We do highlight an important caveat behind this choice. The abovementioned special structure forces us to get online samples from all the transition kernels (cf. Assumption 1), which is unrealistic in practice, to achieve an improvement in the hybrid robust RL setting. We leave the development of such improved algorithm guarantees in the hybrid robust RL setting for other $\varphi$-divergences to future work.
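As a quick check of the tabular reduction in the KL case, the $\lambda$-dependent factor can be worked out explicitly. The sketch below only substitutes $\lambda=1/(1-\gamma)$ into the stated RPQ bound. The remaining step, which replaces $\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}$ by a quantity of order $|\mathcal{S}||\mathcal{A}|$, relies on the earlier tabular arguments and is not re-derived here.

```latex
\begin{align*}
\left(\lambda+\frac{1}{1-\gamma}\right)\exp\left\{\frac{1}{\lambda(1-\gamma)}\right\}\Bigg|_{\lambda=\frac{1}{1-\gamma}}
&= \frac{2}{1-\gamma}\cdot e, \\
\frac{(\lambda+(1-\gamma)^{-1})\exp\{(\lambda(1-\gamma))^{-1}\}\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{2}\sqrt{N}}\Bigg|_{\lambda=\frac{1}{1-\gamma}}
&= \frac{2e\,\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{3}\sqrt{N}}.
\end{align*}
```

With this choice of $\lambda$ the exponential term collapses to the constant $e$, which is why the horizon dependence is $(1-\gamma)^{-3}$ rather than exponential in $(1-\gamma)^{-1}$.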

Discussion of Bilinear Models in the Hybrid Robust RL setting: We emphasize that while our bilinear model for the HyTQ algorithm is specialized to low occupancy complexity (i.e., the occupancy measures themselves have a low-rank structure) and the low-rank feature selection model (i.e., the nominal model $P^{o}$ has a low-rank structure) in Section E.2, the function classes $\mathcal{F}$ (Q-value representations) and $\mathcal{G}$ (dual-value representations) can be arbitrary, potentially nonlinear function classes (neural tangent kernels, neural networks, etc.). Thus, even in the tabular setting with a large state space (e.g., $|\mathcal{S}|>O(10^{5})$) for the bilinear model, our suboptimality bounds scale only with the complexity of the function classes $\mathcal{F}$ and $\mathcal{G}$, which can be considerably lower than $|\mathcal{S}|$. Examples include linear function approximators (e.g., linear feature dimension $d=\log(|\mathcal{F}||\mathcal{G}|)\ll|\mathcal{S}||\mathcal{A}|$), RKHS approximators with low-dimensional features, and neural tangent kernels with low effective neural net dimension. Moreover, our work solves the robust RL problem, which has additional nuances and is at least as hard as the non-robust RL problem. Thus, given that robust RL with general function approximation is still an emerging research area, we believe it is currently out of scope for this work to cover more general bilinear model classes (Du et al., 2021).
Nevertheless, our initial findings for robust RL by the HyTQ algorithm in the hybrid learning setting reveal the hardness of finding larger model classes for RRMDPs with general φ𝜑\varphiitalic_φ-divergences.

We conclude this section with an exciting future research direction that remains unsolved in this paper: solving the hybrid robust RL problem for general $\varphi$-divergences. While building hybrid learning for robust RL, we noticed that one would require online samples from the worst-case model (i.e., the model that solves the inner problem in the robust Bellman operator, Eq. 10) for general $\varphi$-divergences, because the current analyses depend on the bilinear models. We use the dual reformulation for the total variation $\varphi$-divergence and provide results supporting the HyTQ algorithm. We remark that using the same approach for other general $\varphi$-divergences yields an exponential dependence on the horizon factor. This warrants more sophisticated algorithm designs for the hybrid robust RL problem under general $\varphi$-divergences.

5 Conclusion

In this work, we presented two robust RL algorithms. We proposed the Robust $\varphi$-regularized fitted Q-iteration (RPQ) algorithm for general $\varphi$-divergences in the offline RL setting. We provided performance guarantees with a unified analysis for all $\varphi$-divergences with arbitrarily large state spaces using function approximation. To mitigate the out-of-data-distribution issue by improving the assumptions on data generation, we proposed a novel framework called hybrid robust RL that uses both offline and online interactions. We proposed the Hybrid robust Total-variation-regularized Q-iteration (HyTQ) algorithm in this framework with an accompanying guarantee. We provided theoretical guarantees in terms of suboptimality and sample complexity for both the offline and offline+online robust RL settings. We also rigorously specialized our results to different $\varphi$-divergences and different bilinear modeling assumptions. We provided detailed comparisons with relevant prior works while also discussing interesting future directions in the field of robust reinforcement learning.

Acknowledgment

KP acknowledges support from the ‘PIMCO Postdoctoral Fellow in Data Science’ fellowship at the California Institute of Technology. This work acknowledges support from NSF CNS-2146814, CPS-2136197, CNS-2106403, NGSDI-2105648, and funding from the Resnick Institute. EM acknowledges support from NSF award 2240110. We thank several anonymous ICML 2024 reviewers for their constructive comments on an earlier draft of this paper.

References

  • Agarwal et al., (2019) Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
  • Antos et al., (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
  • Bertsimas et al., (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Math. Program., 167(2):235–292.
  • Blanchet et al., (2019) Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
  • Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. Advances in Neural Information Processing Systems, 36.
  • Botvinick et al., (2019) Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences, 23(5):408–422.
  • Bruns-Smith and Zhou, (2023) Bruns-Smith, D. and Zhou, A. (2023). Robust fitted-q-evaluation and iteration under sequentially exogenous unobserved confounders. arXiv preprint arXiv:2302.00662.
  • Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051.
  • Chen et al., (1996) Chen, J., Patton, R. J., and Zhang, H.-Y. (1996). Design of unknown input observers and robust fault detection filters. International Journal of control, 63(1):85–105.
  • Chen et al., (2020) Chen, R., Paschalidis, I. C., et al. (2020). Distributionally robust learning. Foundations and Trends® in Optimization, 4(1-2):1–243.
  • Chen et al., (2022) Chen, Z., Khodadadian, S., and Maguluri, S. T. (2022). Finite-sample analysis of off-policy natural actor–critic with linear function approximation. IEEE Control Systems Letters, 6:2611–2616.
  • Corporation, (2021) Corporation, N. (2021). Closing the sim2real gap with nvidia isaac sim and nvidia isaac replicator.
  • Csiszár, (1967) Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318.
  • Du et al., (2021) Du, S., Kakade, S., Lee, J., Lovett, S., Mahajan, G., Sun, W., and Wang, R. (2021). Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826–2836.
  • Duchi and Namkoong, (2018) Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
  • Farahmand et al., (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23.
  • Fawzi et al., (2022) Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. (2022). Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53.
  • Foster et al., (2022) Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. (2022). Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
  • Fujimoto and Gu, (2021) Fujimoto, S. and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145.
  • Fujimoto et al., (2019) Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062.
  • Gao and Kleywegt, (2022) Gao, R. and Kleywegt, A. (2022). Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research.
  • Huang et al., (2023) Huang, A., Chen, J., and Jiang, N. (2023). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning, pages 13710–13752.
  • Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • (24) Jin, C., Liu, Q., and Miryoosefi, S. (2021a). Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in neural information processing systems, 34:13406–13418.
  • (25) Jin, J., Zhang, B., Wang, H., and Wang, L. (2021b). Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34:2771–2782.
  • Kostrikov et al., (2021) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. (2021). Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783.
  • Kumar et al., (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794.
  • Kumar et al., (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191.
  • Lange et al., (2012) Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Lesort et al., (2020) Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., and Díaz-Rodríguez, N. (2020). Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information fusion, 58:52–68.
  • Levine et al., (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • Levy et al., (2020) Levy, D., Carmon, Y., Duchi, J. C., and Sidford, A. (2020). Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860.
  • Liang et al., (2023) Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
  • Liu et al., (2020) Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch off-policy reinforcement learning without great exploration. In Neural Information Processing Systems.
  • Liu et al., (2022) Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust $q$-learning. In International Conference on Machine Learning, pages 13623–13643.
  • Mankowitz et al., (2020) Mankowitz, D. J., Levine, N., Jeong, R., Abdolmaleki, A., Springenberg, J. T., Shi, Y., Kay, J., Hester, T., Mann, T., and Riedmiller, M. (2020). Robust reinforcement learning for continuous control with model misspecification. In International Conference on Learning Representations.
  • Mannor et al., (2016) Mannor, S., Mebel, O., and Xu, H. (2016). Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509.
  • Maraun, (2016) Maraun, D. (2016). Bias correcting climate change simulations-a critical review. Current Climate Change Reports, 2:211–220.
  • Mirhoseini et al., (2021) Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W., Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Nazi, A., et al. (2021). A graph placement methodology for fast chip design. Nature, 594(7862):207–212.
  • Munos, (2003) Munos, R. (2003). Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567.
  • Munos, (2007) Munos, R. (2007). Performance bounds in l_p-norm for approximate value iteration. SIAM journal on control and optimization, 46(2):541–561.
  • Munos and Szepesvári, (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857.
  • Namkoong and Duchi, (2016) Namkoong, H. and Duchi, J. C. (2016). Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in neural information processing systems, 29.
  • Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
  • Panaganti, (2023) Panaganti, K. (2023). Robust Reinforcement Learning: Theory and Algorithms. PhD thesis, Texas A&M University.
  • (47) Panaganti, K. and Kalathil, D. (2021a). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning (ICML), pages 511–520.
  • (48) Panaganti, K. and Kalathil, D. (2021b). Sample complexity of model-based robust reinforcement learning. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 2240–2245.
  • Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 9582–9602.
  • Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in Neural Information Processing Systems (NeurIPS).
  • (51) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023a). Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. arXiv preprint arXiv:2310.18434.
  • (52) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023b). Distributionally robust behavioral cloning for robust imitation learning. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 1342–1347.
  • Pioch et al., (2009) Pioch, N. J., Melhuish, J., Seidel, A., Santos Jr, E., Li, D., and Gorniak, M. (2009). Adversarial intent modeling using embedded simulation and temporal bayesian knowledge bases. In Modeling and Simulation for Military Operations IV, volume 7348, pages 115–126.
  • Robey et al., (2020) Robey, A., Hassani, H., and Pappas, G. J. (2020). Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247.
  • Rockafellar and Wets, (2009) Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis, volume 317. Springer Science & Business Media.
  • Russel and Petrik, (2019) Russel, R. H. and Petrik, M. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. Advances in Neural Information Processing Systems.
  • Scherrer et al., (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015). Approximate modified policy iteration and its application to the game of tetris. J. Mach. Learn. Res., 16(49):1629–1676.
  • Schmidt et al., (2015) Schmidt, T., Hertkorn, K., Newcombe, R., Marton, Z., Suppa, M., and Fox, D. (2015). Depth-based tracking with physical constraints for robot manipulation. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 119–126.
  • Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shah et al., (2018) Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, pages 621–635. Springer.
  • Shalev-Shwartz and Ben-David, (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
  • Shapiro, (2017) Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275.
  • Shi and Chi, (2022) Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. arXiv preprint arXiv:2208.05767.
  • Shi et al., (2023) Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M., and Chi, Y. (2023). The curious price of distributional robustness in reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 36.
  • Silver et al., (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.
  • Sinha et al., (2017) Sinha, A., Namkoong, H., and Duchi, J. C. (2017). Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
  • Song et al., (2023) Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. (2023). Hybrid rl: Using both offline and online data can make rl efficient. In The Eleventh International Conference on Learning Representations.
  • Sünderhauf et al., (2018) Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., et al. (2018). The limits and potentials of deep learning for robotics. The International journal of robotics research, 37(4-5):405–420.
  • Szepesvári and Munos, (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on Machine learning, pages 880–887.
  • Van Erven et al., (2015) Van Erven, T., Grunwald, P., Mehta, N. A., Reid, M., Williamson, R., et al. (2015). Fast rates in statistical and online learning. JMLR.
  • Vershynin, (2018) Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University press.
  • Wang et al., (2021) Wang, R., Foster, D., and Kakade, S. M. (2021). What are the statistical limits of offline RL with linear function approximation? In International Conference on Learning Representations.
  • (74) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023a). A finite sample complexity bound for distributionally robust q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3370–3398.
  • (75) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). Sample complexity of variance-reduced distributionally robust q-learning. arXiv preprint arXiv:2305.18420.
  • (76) Wang, Y., Hu, Y., Xiong, J., and Zou, S. (2023c). Achieving minimax optimal sample complexity of offline reinforcement learning: A dro-based approach. arXiv preprint arXiv:2305.13289v2.
  • Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206.
  • Wang and Zou, (2022) Wang, Y. and Zou, S. (2022). Policy gradient method for robust reinforcement learning. In International Conference on Machine Learning, pages 23484–23526.
  • Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
  • Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34.
  • Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505–2513.
  • Xu et al., (2023) Xu, Z., Panaganti, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
  • Yang et al., (2021) Yang, J., Zhou, K., Li, Y., and Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.
  • Yang et al., (2023) Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
  • Yu and Xu, (2015) Yu, P. and Xu, H. (2015). Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543.
  • Zhang et al., (2023) Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust mdps and risk-sensitive mdps: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
  • Zhou et al., (2023) Zhou, R., Liu, T., Cheng, M., Kalathil, D., Kumar, P., and Tian, C. (2023). Natural actor-critic for robust reinforcement learning with function approximation. In Thirty-seventh Conference on Neural Information Processing Systems.

☕ ☕ Supplementary Materials ☕ ☕

Appendix A Related Works ☕

Offline RL: Offline RL tackles the problem of learning an optimal policy using a minimal amount of offline/historical data collected according to a behavior policy (Lange et al., 2012; Levine et al., 2020). Due to offline data quality and the lack of access to simulators or any world models for exploration, the offline RL problem suffers from the out-of-distribution challenge (Robey et al., 2020; Yang et al., 2021). Many works (Fujimoto et al., 2019; Kumar et al., 2019, 2020; Fujimoto and Gu, 2021; Kostrikov et al., 2021) have introduced deep offline RL algorithms aimed at alleviating the out-of-distribution issue by some variants of trust-region optimization (Schulman et al., 2015, 2017). The earliest and most promising theoretical investigations into model-free offline RL methodologies, such as the approximate modified policy iteration (AMPI) algorithm (Scherrer et al., 2015) and the fitted Q-iteration (FQI) algorithm (Munos and Szepesvári, 2008), relied on the assumption of uniformly bounded concentrability. This assumption mandates that the ratio of the state-action occupancy distribution induced by any policy to the data-generating distribution remains uniformly bounded across all states and actions (Munos, 2007; Antos et al., 2008; Munos and Szepesvári, 2008; Farahmand et al., 2010; Chen and Jiang, 2019). This makes offline RL particularly challenging (Foster et al., 2022), and there have been efforts to understand the limits of this setting.

Robust RL: The robust Markov decision process framework (Nilim and El Ghaoui, 2005; Iyengar, 2005) tackles the challenge of formulating a policy resilient to model discrepancies between training and testing environments. The robust reinforcement learning problem pursues this objective in the data-driven domain. Deploying simplistic RL policies (Corporation, 2021) can lead to catastrophic outcomes when faced with evident disparities in models. The optimization techniques and analyses in robust RL draw inspiration from the distributionally robust optimization (DRO) toolkit in supervised learning (Duchi and Namkoong, 2018; Shapiro, 2017; Gao and Kleywegt, 2022; Bertsimas et al., 2018; Namkoong and Duchi, 2016; Blanchet et al., 2019). Many heuristic works (Xu and Mannor, 2010; Wiesemann et al., 2013; Yu and Xu, 2015; Mannor et al., 2016; Russel and Petrik, 2019) show that robust RL is valuable in scenarios involving disparities between a simulator model and the real-world model. Many recent works address fundamental issues of RMDPs, giving a concrete theoretical understanding in terms of sample complexity (Panaganti and Kalathil, 2021b; Panaganti and Kalathil, 2022; Xu et al., 2023; Shi and Chi, 2022; Shi et al., 2023). Many works (Panaganti and Kalathil, 2021a; Wang and Zou, 2021; Panaganti and Kalathil, 2022) devise model-free online and offline robust RL algorithms employing general function approximation to handle potentially infinite state spaces. Recent work (Panaganti et al., 2023b) introduces distributional robustness in the imitation learning setting. There have also been works (Panaganti, 2023; Panaganti et al., 2023a; Wang et al., 2023c) connecting robust RL with offline RL by linking notions of robustness and pessimism.

Appendix B Useful Technical Results ☕☕

We state the following result from the penalized distributionally robust optimization literature (Levy et al., 2020).

Lemma 1 (Levy et al., 2020, Section A.1.2).

Let $P^{o}$ be a distribution on the space $\mathcal{X}$ and let $l:\mathcal{X}\to\mathbb{R}$ be a loss function. For the $\varphi$-divergence (1), we have

$$\sup_{P\ll P^{o}}\mathbb{E}_{P}[l(X)-\lambda D_{\varphi}(P,P^{o})]=\inf_{\eta\in\mathbb{R}}~~\lambda\mathbb{E}_{P^{o}}\left[\varphi^{*}\left(\frac{l(X)-\eta}{\lambda}\right)\right]+\eta,$$

where $\varphi^{*}(s)=\sup_{t\geq 0}\{st-\varphi(t)\}$ is the Fenchel conjugate function of $\varphi$. Moreover, the optimization on the right-hand side is convex in $\eta$.
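To make the duality in Lemma 1 concrete, the following minimal sketch checks it numerically for the KL case on a small finite space. It assumes the KL generator $\varphi(t)=t\log t-t+1$, whose conjugate is $\varphi^{*}(s)=e^{s}-1$, and uses the standard facts that the primal supremum is attained at the exponentially tilted distribution $P\propto P^{o}\exp(l/\lambda)$ with value $\lambda\log\mathbb{E}_{P^{o}}[\exp(l/\lambda)]$. All function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def closed_form_value(loss, p0, lam):
    """lam * log E_{P0}[exp(l / lam)]: known closed form of both sides for KL."""
    return lam * math.log(sum(q * math.exp(l / lam) for l, q in zip(loss, p0)))


def primal_value_at_tilted(loss, p0, lam):
    """E_P[l] - lam * KL(P || P0), evaluated at the tilted P ∝ P0 * exp(l / lam)."""
    weights = [q * math.exp(l / lam) for l, q in zip(loss, p0)]
    total = sum(weights)
    p = [w / total for w in weights]
    expected_loss = sum(pi * l for pi, l in zip(p, loss))
    kl = sum(pi * math.log(pi / q) for pi, q in zip(p, p0) if pi > 0.0)
    return expected_loss - lam * kl


def dual_objective(eta, loss, p0, lam):
    """lam * E_{P0}[phi*((l - eta) / lam)] + eta, with phi*(s) = exp(s) - 1."""
    conj = sum(q * (math.exp((l - eta) / lam) - 1.0) for l, q in zip(loss, p0))
    return lam * conj + eta


def dual_value(loss, p0, lam, iters=200):
    """Minimize the convex dual objective over eta by ternary search.

    For KL the minimizer is lam * log E_{P0}[exp(l / lam)], which always lies
    in [min(loss), max(loss)], so that interval is a valid search bracket.
    """
    lo, hi = min(loss), max(loss)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(m1, loss, p0, lam) < dual_objective(m2, loss, p0, lam):
            hi = m2
        else:
            lo = m1
    return dual_objective(0.5 * (lo + hi), loss, p0, lam)


# Toy instance (assumed values): four outcomes, nominal distribution p0.
loss = [1.0, 0.2, -0.5, 2.0]
p0 = [0.4, 0.3, 0.2, 0.1]
lam = 0.7

print("primal at tilted P:", primal_value_at_tilted(loss, p0, lam))
print("dual (min over eta):", dual_value(loss, p0, lam))
print("closed form:", closed_form_value(loss, p0, lam))
```

The three printed numbers should agree up to numerical precision. The one-dimensional search over $\eta$ is the scalar analogue of the dual-variable update that the lemma makes possible, and it works here because the dual objective is convex in $\eta$.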

We state a standard concentration inequality here.

Lemma 2 (Bernstein’s Inequality (Vershynin, 2018, Theorem 2.8.4)).

Fix any $\delta\in(0,1)$. Let $X_{1},\cdots,X_{T}$ be independent and identically distributed random variables with finite second moment. Assume that $|X_{t}-\mathbb{E}[X_{t}]|\leq M$ for all $t$. Then we have with probability at least $1-\delta$:

$$\Bigg|\mathbb{E}[X_{1}]-\frac{1}{T}\sum_{t=1}^{T}X_{t}\Bigg|\leq\sqrt{\frac{2\mathbb{E}[X_{1}^{2}]\log(2/\delta)}{T}}+\frac{M\log(2/\delta)}{3T}.$$
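As an illustration of how the bound in Lemma 2 behaves, the sketch below computes its right-hand side and compares it against a small Monte Carlo experiment. The choice of Uniform[0, 1] samples, the helper names, and the default parameters are assumptions made for this example only.

```python
import math
import random


def bernstein_radius(second_moment, M, delta, T):
    """Right-hand side of Lemma 2: sqrt(2 E[X^2] log(2/delta) / T) + M log(2/delta) / (3T)."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * second_moment * log_term / T) + M * log_term / (3.0 * T)


def empirical_failure_rate(T=200, delta=0.05, trials=2000, seed=0):
    """Fraction of trials in which the sample mean deviates by more than the radius.

    Samples are Uniform[0, 1] (an assumed toy distribution), for which
    E[X] = 1/2, E[X^2] = 1/3, and |X - E[X]| <= 1/2.
    """
    rng = random.Random(seed)
    mean, second_moment, M = 0.5, 1.0 / 3.0, 0.5
    radius = bernstein_radius(second_moment, M, delta, T)
    failures = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(T)) / T
        if abs(mean - sample_mean) > radius:
            failures += 1
    return failures / trials


print("radius for T=200:", bernstein_radius(1.0 / 3.0, 0.5, 0.05, 200))
print("empirical failure rate:", empirical_failure_rate())
```

The lemma guarantees that the failure rate is at most $\delta$. In this toy setting the bound is loose, so the observed rate is typically far below $\delta$, and the radius shrinks at the $1/\sqrt{T}$ rate driven by the first term.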

We now state a useful concentration inequality when the samples are not necessarily i.i.d. but adapted to a stochastic process.

Lemma 3 (Freedman’s Inequality (Song et al., 2023, Lemma 14)).

Let $X_{1},\cdots,X_{T}$ be a sequence of $M>0$-bounded real-valued random variables where $X_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $X_{1},\cdots,X_{t-1}$. Then, for any $\delta>0$ and $\lambda\in[0,1/(2M)]$, we have with probability at least $1-\delta$:

\[
\Bigg|\sum_{t=1}^{T}(X_{t}-\mathbb{E}[X_{t}\mid P_{t}])\Bigg|\leq\lambda\sum_{t=1}^{T}\big(2M|\mathbb{E}[X_{t}\mid P_{t}]|+\mathbb{E}[X_{t}^{2}\mid P_{t}]\big)+\frac{\log(2/\delta)}{\lambda}.
\]
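The right-hand side of Lemma 3 trades off a term growing in $\lambda$ against one decaying in $\lambda$. The sketch below evaluates this trade-off and picks the best admissible $\lambda$. It assumes the moment sum $S=\sum_{t}(2M|\mathbb{E}[X_{t}\mid P_{t}]|+\mathbb{E}[X_{t}^{2}\mid P_{t}])$ is replaced by a fixed, data-independent upper bound, since the lemma is stated for a $\lambda$ chosen in advance. All names are hypothetical.

```python
import math


def freedman_rhs(lam, moment_sum, delta):
    """lam * S + log(2/delta) / lam, with S an upper bound on the moment sum in Lemma 3."""
    return lam * moment_sum + math.log(2.0 / delta) / lam


def best_lambda(moment_sum, M, delta):
    """Minimizer of freedman_rhs over (0, 1/(2M)] for a fixed, deterministic moment_sum."""
    # The map lam -> lam * S + L / lam is convex on lam > 0 with minimizer sqrt(L / S).
    unconstrained = math.sqrt(math.log(2.0 / delta) / moment_sum)
    return min(unconstrained, 1.0 / (2.0 * M))


if __name__ == "__main__":
    lam = best_lambda(50.0, 1.0, 0.05)
    print(lam, freedman_rhs(lam, 50.0, 0.05))
```

When the unconstrained minimizer is admissible, the resulting bound equals $2\sqrt{S\log(2/\delta)}$, which is the familiar square-root rate.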

We now state a result for the generalization bounds on empirical risk minimization (ERM) (Shalev-Shwartz and Ben-David, 2014).

Lemma 4 (ERM Generalization Bound (Panaganti et al., 2022, Lemma 3)).

Let $P$ be the data generating distribution on the space $\mathcal{X}$ and let $\mathcal{H}$ be a given hypothesis class of functions. Assume that the loss function $l$ satisfies $|l(h,x)|\leq c_{1}$ for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$, for some positive constant $c_{1}>0$, and that $l(h,x)$ is $c_{3}$-Lipschitz in $h$. Given a dataset $\mathcal{D}=\{X_{i}\}_{i=1}^{N}$, generated independently from $P$, denote $\hat{h}$ as the ERM solution, i.e., $\hat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}(1/N)\sum_{i=1}^{N}l(h,X_{i})$. Furthermore, let $\mathcal{H}$ be a finite hypothesis class, i.e., $|\mathcal{H}|<\infty$, with $|h\circ x|\leq c_{2}$ for all $h\in\mathcal{H}$ and $x\in\mathcal{X}$. For any fixed $\delta\in(0,1)$ and $h^{*}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\mathbb{E}_{X\sim P}[l(h,X)]$, we have

\[
\mathbb{E}_{X\sim P}[l(\hat{h},X)]-\mathbb{E}_{X\sim P}[l(h^{*},X)]\leq 2c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{H}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8/\delta)}{N}},
\]

with probability at least $1-\delta$.
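The bound in Lemma 4 has the form $K/\sqrt{N}$, so it can be inverted to obtain a sufficient sample size for a target accuracy. The following sketch, with hypothetical function names and illustrative constants, evaluates the bound and performs this inversion.

```python
import math


def erm_excess_risk_bound(c1, c2, c3, num_hypotheses, N, delta):
    """Right-hand side of Lemma 4 for a finite hypothesis class of size num_hypotheses."""
    complexity = 2.0 * c2 * c3 * math.sqrt(2.0 * math.log(num_hypotheses) / N)
    confidence = 5.0 * c1 * math.sqrt(2.0 * math.log(8.0 / delta) / N)
    return complexity + confidence


def samples_for_accuracy(c1, c2, c3, num_hypotheses, delta, eps):
    """Smallest N for which the bound is at most eps, using bound(N) = K / sqrt(N)."""
    K = erm_excess_risk_bound(c1, c2, c3, num_hypotheses, 1, delta)
    return math.ceil((K / eps) ** 2)


if __name__ == "__main__":
    N = samples_for_accuracy(1.0, 1.0, 1.0, 100, 0.05, 0.1)
    print(N, erm_excess_risk_bound(1.0, 1.0, 1.0, 100, N, 0.05))
```

Because both terms scale as $1/\sqrt{N}$, quadrupling the dataset halves the guaranteed excess risk, and the dependence on $|\mathcal{H}|$ is only logarithmic.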

We now state a result from variational analysis literature (Rockafellar and Wets, 2009) that is useful to relate minimization of integrals and the integrals of pointwise minimization under decomposable spaces.

Remark 4.

A few examples of decomposable spaces are $L^{p}(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}),\mu)$, for any $p\geq 1$, and $\mathcal{M}(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}))$, the space of all $\Sigma(\mathcal{S}\times\mathcal{A})$-measurable functions.

Lemma 5 (Rockafellar and Wets, 2009, Theorem 14.60).

Let $\mathcal{X}$ be a space of measurable functions from $\Omega$ to $\mathbb{R}$ that is decomposable relative to a $\sigma$-finite measure $\mu$ on the $\sigma$-algebra $\mathcal{A}$. Let $f:\Omega\times\mathbb{R}\to\mathbb{R}$ (finite-valued) be a normal integrand. Then, we have

\[
\inf_{x\in\mathcal{X}}\int_{\omega\in\Omega}f(\omega,x(\omega))\mu(\mathrm{d}\omega)=\int_{\omega\in\Omega}\left(\inf_{x\in\mathbb{R}}f(\omega,x)\right)\mu(\mathrm{d}\omega).
\]

Moreover, as long as the above infimum is finite, we have that $x'\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\int_{\omega\in\Omega}f(\omega,x(\omega))\mu(\mathrm{d}\omega)$ if and only if $x'(\omega)\in\operatorname*{arg\,min}_{x\in\mathbb{R}}f(\omega,x)$ for $\mu$-almost every $\omega\in\Omega$.

Now we state a few results that will be useful for the analysis of our finite-horizon results in this work. The following result (Song et al., 2023, Lemma 6) is useful under the use of bilinear model approximation. This result follows from the elliptical potential lemma (Lattimore and Szepesvári, 2020, Lemma 19.4) for deterministic vectors.

Lemma 6 (Elliptical Potential Lemma).

Let $X_{h}(f^{1}),\cdots,X_{h}(f^{T})\in\mathbb{R}^{d}$ be a sequence of vectors with $\|X_{h}(f^{t})\|\leq B_{X}<\infty$ for all $t\leq T$ and fix $\sigma\geq B^{2}_{X}$. Define $\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\sigma\mathds{1}_{d\times d}$ for $t\in[T]$. Then, the following holds:

\[
\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{2dT\log(1+(TB_{X}^{2}/(\sigma d)))}.
\]
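The inequality in Lemma 6 can be sanity-checked numerically. The sketch below accumulates $\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}$ for randomly drawn vectors and compares the sum with the stated bound. It assumes the convention $\Sigma_{0;h}=\sigma\mathds{1}_{d\times d}$ (the empty sum), uses a Sherman-Morrison rank-one update to maintain the inverse, and draws vectors at random purely for illustration; the function names are hypothetical.

```python
import math
import random


def elliptical_potential_sum(vectors, sigma):
    """Sum over t of ||x_t||_{Sigma_{t-1}^{-1}}, with Sigma_0 = sigma * I (assumed convention)."""
    d = len(vectors[0])
    inv = [[(1.0 / sigma if i == j else 0.0) for j in range(d)] for i in range(d)]
    total = 0.0
    for x in vectors:
        inv_x = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        quad = max(sum(x[i] * inv_x[i] for i in range(d)), 0.0)  # x^T Sigma_{t-1}^{-1} x
        total += math.sqrt(quad)
        # Sherman-Morrison rank-one update: Sigma_t^{-1} from Sigma_{t-1}^{-1}.
        for i in range(d):
            for j in range(d):
                inv[i][j] -= inv_x[i] * inv_x[j] / (1.0 + quad)
    return total


def elliptical_potential_bound(d, T, B_X, sigma):
    """Right-hand side of Lemma 6."""
    return math.sqrt(2.0 * d * T * math.log(1.0 + T * B_X ** 2 / (sigma * d)))


def random_vectors(T, d, B_X, seed=0):
    """T random vectors in R^d with Euclidean norm at most B_X."""
    rng = random.Random(seed)
    out = []
    for _ in range(T):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        scale = B_X * rng.random() / norm
        out.append([scale * v for v in g])
    return out


if __name__ == "__main__":
    vecs = random_vectors(200, 3, 1.0)
    print(elliptical_potential_sum(vecs, 1.0), elliptical_potential_bound(3, 200, 1.0, 1.0))
```

The sum grows like $\sqrt{dT}$ up to a logarithmic factor, which is what keeps the cumulative exploration cost sublinear in $T$.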

We now state a result for the generalization bounds on the least-squares regression problem when the data are not necessarily i.i.d. but adapted to a stochastic process. We refer to Van Erven et al. (2015) for more statistical and online learning generalization bounds for a wider class of loss functions.

Lemma 7 (Online Least-squares Generalization Bound (Song et al., 2023, Lemma 3)).

Let $L,M>0$, $\delta\in(0,1)$, and let $\mathcal{X}$ be an input space and $\mathcal{Y}$ be a target space. Let $\mathcal{H}:\mathcal{X}\mapsto[-M,M]$ be a given real-valued hypothesis class of functions with $|\mathcal{H}|<\infty$. Given a dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$, denote $\widehat{h}$ as the least-squares solution, i.e., $\widehat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\sum_{i=1}^{N}(h(x_{i})-y_{i})^{2}$. The dataset $\mathcal{D}$ is generated as $x_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $\{(x_{1},y_{1}),\dots,(x_{t-1},y_{t-1})\}$, and $y_{t}$ is sampled via the conditional probability $p(\cdot\mid x_{t})$ as $y_{t}\sim p(\cdot\mid x_{t})=h^{*}(x_{t})+\varepsilon_{t}$, where the function $h^{*}$ satisfies approximate realizability, i.e., $\inf_{h\in\mathcal{H}}(1/N)\sum_{t=1}^{N}\mathbb{E}_{x\sim P_{t}}(h^{*}(x)-h(x))^{2}\leq\gamma$, and $\{\varepsilon_{t}\}_{t=1}^{N}$ are independent random variables such that $\mathbb{E}[y_{t}\mid x_{t}]=h^{*}(x_{t})$. Suppose it also holds that $\max_{t}|y_{t}|\leq L$ and $\max_{x}|h^{*}(x)|\leq M$. Then, the least-squares solution satisfies with probability at least $1-\delta$:

\[
\sum_{t=1}^{N}\mathbb{E}_{x\sim P_{t}}(\widehat{h}(x)-h^{*}(x))^{2}\leq 3\gamma N+64(L+M)^{2}\log(2|\mathcal{H}|/\delta).
\]

Appendix C Useful Foundational Results

We provide the following result highlighting the necessary characteristics for specific examples of the Fenchel conjugate functions $\varphi^{*}$.

Proposition 3 ($\varphi$-Divergence Bounds).

Let $V\in[0,V_{\max}]^{|\mathcal{S}|}$ be any value function and fix a probability distribution $P^{o}\in\Delta(\mathcal{S})$. Define $h(y,\eta)=(\lambda\varphi^{*}\left((\eta-y)/\lambda\right)-\eta)$. Consider the following scalar convex optimization problem: $\inf_{\eta\in\Theta\subseteq\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$. Let the maximum absolute value in $\Theta$ be less than or equal to $c_{3}$, let $|h(V(s),\eta)|\leq c_{1}$ for all $\eta\in\Theta$, and let $h(V(s),\eta)$ be $c_{2}$-Lipschitz in $\eta$, for some positive constants $c_{1},c_{2},c_{3}$. We have the following results for different forms of $\varphi$:
(i) Let Assumption 8 hold. For TV distance, i.e., $\varphi(t)=|t-1|/2$, we have $\Theta\equiv[-\lambda/2,\lambda/2]$, hence $c_{3}=\lambda/2$, $c_{1}=2\lambda+V_{\max}$, and $c_{2}=2$.
(ii) For chi-square divergence, i.e., $\varphi(t)=(t-1)^{2}$, we have $\Theta\equiv[-\lambda,2V_{\max}+2\lambda]$, hence $c_{3}=2V_{\max}+2\lambda$, $c_{1}=\lambda+(2V_{\max}+4\lambda)(\frac{2V_{\max}}{4\lambda}+2)$, and $c_{2}=(3+\frac{V_{\max}}{\lambda})$.
(iii) For KL divergence, i.e., $\varphi(t)=t\log t$, we have $\Theta\equiv[\lambda,V_{\max}+\lambda]$, hence $c_{3}=V_{\max}+\lambda$, $c_{1}=\lambda(\exp(\frac{V_{\max}}{\lambda})-1)$, and $c_{2}=(\exp(\frac{V_{\max}}{\lambda})+1)$.
(iv) Fix $\alpha\in(0,1)$. For $\alpha$-CVaR, i.e., $\varphi(t)=\mathds{1}[0,1/\alpha)$, we have $\Theta\equiv[0,V_{\max}/(1-\alpha)]$, hence $c_{3}=V_{\max}/(1-\alpha)$, $c_{1}=2V_{\max}/(\alpha(1-\alpha))$, and $c_{2}=1+\alpha^{-1}$.
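For quick reference, the constants listed in items (i) to (iv) can be tabulated programmatically. The sketch below simply transcribes the stated expressions; the function name and the string keys are hypothetical, and the returned tuple is ordered as $(c_{1},c_{2},c_{3})$, i.e., the bound on $|h|$, the Lipschitz constant in $\eta$, and the bound on $|\eta|$ over $\Theta$.

```python
import math


def phi_divergence_constants(kind, lam, v_max, alpha=None):
    """Return (c1, c2, c3) as listed in Proposition 3 for the chosen divergence."""
    if kind == "tv":
        return 2.0 * lam + v_max, 2.0, lam / 2.0
    if kind == "chi2":
        c1 = lam + (2.0 * v_max + 4.0 * lam) * (2.0 * v_max / (4.0 * lam) + 2.0)
        return c1, 3.0 + v_max / lam, 2.0 * v_max + 2.0 * lam
    if kind == "kl":
        e = math.exp(v_max / lam)
        return lam * (e - 1.0), e + 1.0, v_max + lam
    if kind == "cvar":
        if alpha is None or not 0.0 < alpha < 1.0:
            raise ValueError("alpha must lie in (0, 1) for CVaR")
        return 2.0 * v_max / (alpha * (1.0 - alpha)), 1.0 + 1.0 / alpha, v_max / (1.0 - alpha)
    raise ValueError(f"unknown divergence: {kind}")


if __name__ == "__main__":
    for kind in ("tv", "chi2", "kl"):
        print(kind, phi_divergence_constants(kind, lam=1.0, v_max=10.0))
    print("cvar", phi_divergence_constants("cvar", lam=1.0, v_max=10.0, alpha=0.5))
```

Comparing the outputs makes the qualitative difference visible: the TV and chi-square constants are polynomial in $V_{\max}$ and $\lambda$, whereas the KL constants grow exponentially in $V_{\max}/\lambda$.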

Proof.

We first prove the statement for TV distance with $\varphi(t)=|t-1|/2$. From $\varphi$-divergence literature (Xu et al., 2023), we know

\[
\varphi^{*}(s)=\begin{cases}-\frac{1}{2}, & s\leq-\frac{1}{2},\\ s, & s\in[-\frac{1}{2},\frac{1}{2}],\\ +\infty, & s>\frac{1}{2}.\end{cases}
\]

Thus, we have

\begin{align*}
\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\varphi^{*}\Big(\frac{\eta-V(s)}{\lambda}\Big)\Big]-\eta\\
&\stackrel{(a)}{=}\inf_{\eta\in\mathbb{R},\,\frac{\eta-\min_{s}V(s)}{\lambda}\leq\frac{1}{2}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\max\Big\{\frac{\eta-V(s)}{\lambda},-\frac{1}{2}\Big\}\Big]-\eta\\
&\stackrel{(b)}{=}\inf_{\eta\in\mathbb{R},\,\frac{\eta}{\lambda}\leq\frac{1}{2}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\max\Big\{\frac{\eta-V(s)}{\lambda},-\frac{1}{2}\Big\}\Big]-\eta\\
&\stackrel{(c)}{=}\inf_{\eta\in\mathbb{R},\,\frac{\eta}{\lambda}\leq\frac{1}{2}}~\mathbb{E}_{s\sim P^{o}}[(\eta-V(s)+\lambda/2)_{+}]-\lambda/2-\eta\\
&\stackrel{(d)}{=}\inf_{\eta'\in\mathbb{R},\,\eta'\leq\lambda}~\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}]-\eta'\\
&\stackrel{(e)}{=}\inf_{0\leq\eta'\leq\lambda}~\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}]-\eta', \tag{17}
\end{align*}

where $(a)$ follows by definition of $\varphi^{*}$, $(b)$ by Assumption 8, $(c)$ by the fact $\max\{x,y\}=(x-y)_{+}+y$ for any $x,y\in\mathbb{R}$, and $(d)$ follows by making the substitution $\eta=\eta'-\lambda/2$. Finally, for $(e)$, notice that since $V(s)\geq 0$, $\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}]-\eta'=-\eta'\geq 0$ holds when $\eta'\leq 0$. So $\inf_{\eta'\in(-\infty,0]}\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}]-\eta'=0$ is achieved at $\eta'=0$.

We immediately have $\Theta\equiv[-\lambda/2,\lambda/2]$ since $\eta=\eta'-\lambda/2$. Since $\eta\leq\lambda/2$ and $V(s)\leq V_{\max}$, we further get $|h(V(s),\eta)|\leq 2\lambda+V_{\max}$. For $\eta_{1},\eta_{2}\in\Theta$, from the fact $|(x)_{+}-(y)_{+}|\leq|x-y|$ we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq 2|\eta_{1}-\eta_{2}|$. This proves statement $(i)$.
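As a small numerical sanity check of step $(e)$ and of the identity used in step $(c)$, the following sketch evaluates $\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}]-\eta'$ on a finite-support example. The support `values`, the weights `probs`, and the helper name `tv_shifted_objective` are illustrative assumptions, not part of the paper.

```python
import math


def tv_shifted_objective(eta_p, values, probs):
    """E[(eta' - V)_+] - eta' for V supported on `values` with weights `probs`."""
    return sum(p * max(eta_p - v, 0.0) for v, p in zip(values, probs)) - eta_p


# Hypothetical finite-support example with 0 <= V(s) <= 4.
values = [0.0, 1.0, 2.5, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

# For eta' <= 0 the positive part vanishes, so the objective equals -eta'.
nonpositive_etas = [-3.0, -1.5, -0.25, 0.0]
objective_values = [tv_shifted_objective(e, values, probs) for e in nonpositive_etas]

# Identity used in step (c): max{x, y} = (x - y)_+ + y.
pairs = [(0.3, 0.1), (-1.0, 2.0), (5.0, 5.0)]
identity_holds = all(math.isclose(max(x, y), max(x - y, 0.0) + y) for x, y in pairs)

print(objective_values, identity_holds)
```

On the non-positive half-line the objective is exactly $-\eta'$, so its smallest value there is attained at $\eta'=0$, which is what allows the search to be restricted to $\eta'\geq 0$.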

We now prove the statement for chi-square divergence with $\varphi(t)=(t-1)^{2}$ following similar steps as before. From the $\varphi$-divergence literature (Xu et al., 2023), we know $\varphi^{*}(s)=(s/2+1)_{+}^{2}-1$. Thus, we have

\begin{align*}
\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)
&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\varphi^{*}\Big(\frac{\eta-V(s)}{\lambda}\Big)\Big]-\eta \\
&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\Big(\frac{\eta-V(s)}{2\lambda}+1\Big)_{+}^{2}\Big]-\lambda-\eta \\
&\stackrel{(f)}{=}\inf_{\eta'\in\mathbb{R}}~\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}^{2}]+\lambda-\eta' \\
&\stackrel{(g)}{=}\inf_{\eta'\in[\lambda,2V_{\max}+4\lambda]}~\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}^{2}]+\lambda-\eta',
\end{align*}

where $(f)$ follows by making the substitution $\eta=\eta'-2\lambda$. Finally, for $(g)$, observe that the function $g(\eta')=\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}^{2}]+\lambda-\eta'$ is convex in the dual variable $\eta'$ and $\inf_{\eta'\in\mathbb{R}}g(\eta')\leq 0$ since it is a Lagrangian dual variable.
Since $V(s)\geq 0$, $\lambda-\eta'_{*}\leq 0$ where $\eta'_{*}$ is any solution of $\inf_{\eta'\in\mathbb{R}}g(\eta')\leq 0$. When $\eta'\geq 2V_{\max}+4\lambda$, notice that $g(\eta')\geq\frac{1}{4\lambda}(\eta'^{2}-2(V_{\max}+2\lambda)\eta'+4\lambda^{2})\geq\lambda>0$, since $0\leq V(s)\leq V_{\max}$.

We immediately have $\Theta\equiv[-\lambda,2V_{\max}+2\lambda]$ since $\eta=\eta'-2\lambda$. Since $\eta\leq 2V_{\max}+2\lambda$ and $V(s)\geq 0$, we further get $|h(V(s),\eta)|\leq\lambda+(2V_{\max}+4\lambda)(\frac{2V_{\max}}{4\lambda}+2)$. For $\eta_{1},\eta_{2}\in\Theta$, from the facts $|(x)_{+}-(y)_{+}|\leq|x-y|$ and $|(x)_{+}^{2}-(y)_{+}^{2}|=|(x)_{+}-(y)_{+}|((x)_{+}+(y)_{+})$, we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(3+V_{\max})|\eta_{1}-\eta_{2}|$. This proves statement $(ii)$.
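The compactness argument for the chi-square case can be illustrated numerically. The sketch below minimizes $g(\eta')=\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta'-V(s))_{+}^{2}]+\lambda-\eta'$ by ternary search, first over a deliberately wide interval and then over $[\lambda,2V_{\max}+4\lambda]$. The finite-support distribution, the value of $\lambda$, and the helper names are assumptions chosen for illustration only.

```python
def chi2_dual_objective(eta_p, values, probs, lam):
    """g(eta') = E[(eta' - V)_+^2] / (4 * lam) + lam - eta' for a finite-support V."""
    second_moment = sum(p * max(eta_p - v, 0.0) ** 2 for v, p in zip(values, probs))
    return second_moment / (4.0 * lam) + lam - eta_p


def minimize_convex(f, lo, hi, iters=200):
    """Ternary search for a minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Hypothetical example: V(s) in [0, v_max] under a nominal distribution P^o.
values = [0.0, 1.0, 2.5, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]
lam, v_max = 0.5, 4.0


def g(eta_p):
    return chi2_dual_objective(eta_p, values, probs, lam)


eta_wide = minimize_convex(g, -50.0, 50.0)                       # unconstrained proxy
eta_compact = minimize_convex(g, lam, 2.0 * v_max + 4.0 * lam)   # compact interval from (g)

print(eta_wide, eta_compact, g(eta_wide))
```

In this example both searches return the same minimizer, which lies inside $[\lambda,2V_{\max}+4\lambda]$ and has a non-positive objective value, consistent with step $(g)$.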

We now prove the statement for KL divergence with $\varphi(t)=t\log t$ following similar steps as before. From the $\varphi$-divergence literature (Xu et al., 2023), we know $\varphi^{*}(s)=\exp(s-1)$. Thus, we have

\begin{align*}
\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)
&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\varphi^{*}\Big(\frac{\eta-V(s)}{\lambda}\Big)\Big]-\eta \\
&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\exp\Big(\frac{\eta-V(s)}{\lambda}-1\Big)\Big]-\eta \\
&\stackrel{(h)}{=}\inf_{\eta'\in\mathbb{R}}~\lambda\mathbb{E}_{s\sim P^{o}}\Big[\exp\Big(\frac{-\eta'-V(s)}{\lambda}-1\Big)\Big]+\eta' \\
&\stackrel{(j)}{=}\inf_{\eta'\in[-\lambda-V_{\max},-\lambda]}~\lambda\mathbb{E}_{s\sim P^{o}}\Big[\exp\Big(\frac{-\eta'-V(s)}{\lambda}-1\Big)\Big]+\eta',
\end{align*}

where $(h)$ follows by making the substitution $\eta=-\eta'$. Finally, for $(j)$, observe that the function $g(\eta')=\lambda\mathbb{E}_{s\sim P^{o}}[\exp(\frac{-\eta'-V(s)}{\lambda}-1)]+\eta'$ is convex in the dual variable $\eta'$ since it is a Lagrangian dual variable. From calculus, the optimal $\eta'=-\lambda+\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$. So $\eta'\in[-\lambda-V_{\max},-\lambda]$ since $0\leq V(s)\leq V_{\max}$.

We immediately have $\Theta\equiv[\lambda,V_{\max}+\lambda]$ since $\eta=-\eta'$. Since $\eta\leq V_{\max}+\lambda$ and $V(s)\geq 0$, we further get $|h(V(s),\eta)|\leq\lambda(\exp(\frac{V_{\max}}{\lambda})-1)$. For $\eta_{1},\eta_{2}\in\Theta$, from the fact that $\exp(-x)$ is $1$-Lipschitz for $x\geq 0$, we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(\exp(\frac{V_{\max}}{\lambda})+1)|\eta_{1}-\eta_{2}|$. This proves statement $(iii)$.
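The closed-form minimizer in the KL case is easy to check numerically. The sketch below evaluates $\eta'=-\lambda+\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$ on a finite-support example, confirms that it lies in $[-\lambda-V_{\max},-\lambda]$, and compares $g$ at that point with nearby points. The distribution, $\lambda$, and the helper names are illustrative assumptions.

```python
import math


def kl_dual_objective(eta_p, values, probs, lam):
    """g(eta') = lam * E[exp((-eta' - V) / lam - 1)] + eta' for a finite-support V."""
    expectation = sum(p * math.exp((-eta_p - v) / lam - 1.0)
                      for v, p in zip(values, probs))
    return lam * expectation + eta_p


def kl_optimal_eta(values, probs, lam):
    """Stationary point eta' = -lam + lam * log E[exp(-V / lam)]."""
    mgf = sum(p * math.exp(-v / lam) for v, p in zip(values, probs))
    return -lam + lam * math.log(mgf)


# Hypothetical example: V(s) in [0, v_max] under a nominal distribution P^o.
values = [0.0, 1.0, 2.5, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]
lam, v_max = 2.0, 4.0

eta_star = kl_optimal_eta(values, probs, lam)
g_star = kl_dual_objective(eta_star, values, probs, lam)

print(eta_star, g_star)
```

Substituting the stationary point back into $g$ gives $g(\eta')=\lambda+\eta'=\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$, which the example reproduces up to floating-point error.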

We now prove the statement for $\alpha$-CVaR with $\varphi(t)=\mathds{1}[0,1/\alpha)$. From the $\varphi$-divergence literature (Levy et al., 2020), we know $\varphi^{*}(s)=(s)_{+}/\alpha$. Thus, we have

\begin{align*}
\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)
&=\inf_{\eta\in\mathbb{R}}~\mathbb{E}_{s\sim P^{o}}\Big[\lambda\varphi^{*}\Big(\frac{\eta-V(s)}{\lambda}\Big)\Big]-\eta \\
&=\inf_{\eta\in\mathbb{R}}~\frac{1}{\alpha}\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta \\
&\stackrel{(k)}{=}\inf_{0\leq\eta\leq V_{\max}/(1-\alpha)}~\frac{1}{\alpha}\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta. \tag{18}
\end{align*}

For $(k)$, notice that since $V(s)\geq 0$, $(1/\alpha)\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta=-\eta\geq 0$ holds when $\eta\leq 0$. Also, since $V(s)\leq V_{\max}$, $(1/\alpha)\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta\geq 0$ holds when $\eta\geq V_{\max}/(1-\alpha)$.

We immediately have $\Theta\equiv[0,V_{\max}/(1-\alpha)]$. We further get $|h(V(s),\eta)|\leq 2V_{\max}/(\alpha(1-\alpha))$. For $\eta_{1},\eta_{2}\in\Theta$, from the fact $|(x)_{+}-(y)_{+}|\leq|x-y|$ we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(1+\alpha^{-1})|\eta_{1}-\eta_{2}|$. This proves the final statement of this result. ∎
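For intuition, the $\alpha$-CVaR dual in (18) can be evaluated on a small discrete example. The sketch below relies on the standard Rockafellar-Uryasev representation, under which the infimum equals the negative of the mean of $V$ over its worst (smallest) $\alpha$-fraction of probability mass. That identification, the uniform example distribution, and the helper names are assumptions added for illustration and are not stated in the proof above.

```python
def cvar_dual_objective(eta, values, probs, alpha):
    """(1 / alpha) * E[(eta - V)_+] - eta for a finite-support V."""
    return sum(p * max(eta - v, 0.0) for v, p in zip(values, probs)) / alpha - eta


def lower_tail_mean(values, probs, alpha):
    """Mean of V over its smallest alpha-fraction of probability mass."""
    remaining, total = alpha, 0.0
    for v, p in sorted(zip(values, probs)):
        take = min(p, remaining)
        total += take * v
        remaining -= take
        if remaining <= 0.0:
            break
    return total / alpha


# Hypothetical example: V uniform on four points in [0, v_max].
values = [0.0, 1.0, 2.0, 3.0]
probs = [0.25, 0.25, 0.25, 0.25]
alpha, v_max = 0.5, 3.0

# The objective is piecewise linear and convex with kinks at the support points,
# so it suffices to search over the support.
best_eta = min(values, key=lambda e: cvar_dual_objective(e, values, probs, alpha))
best_val = cvar_dual_objective(best_eta, values, probs, alpha)

print(best_eta, best_val, lower_tail_mean(values, probs, alpha))
```

The minimizer found this way lies in $[0,V_{\max}/(1-\alpha)]$, in line with step $(k)$.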

We now state and prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process. This result is of independent interest for machine learning problems beyond the scope of this paper. Furthermore, it exhibits a better rate dependence on $N$, improving from $\widetilde{\mathcal{O}}(1/\sqrt{N})$ to $\widetilde{\mathcal{O}}(1/N)$, than the classical result Lemma 4 (Shalev-Shwartz and Ben-David, 2014). This result is not surprising, and we refer to Van Erven et al. (2015, Theorems 7.6 & 5.4), in the i.i.d. setting, for such $\widetilde{\mathcal{O}}(1/N)$ fast rates with bounded losses for empirical risk minimization and beyond.

Proposition 4 (Online ERM Generalization Bound).

Let $N>0$, $\delta\in(0,1)$, let $\mathcal{X}$ be an input space, and let $\mathcal{Y}$ be the target functional space. Let $\mathcal{H}\subseteq\mathcal{Y}$ be the given finite class of functions. Assume that for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$, the loss function $l$ satisfies $|l(h(x))|\leq c$ for some positive constant $c>0$. Given a dataset $\mathcal{D}=\{x_{i}\}_{i=1}^{N}$, denote $\widehat{h}$ as the ERM solution, i.e., $\widehat{h}\leftarrow\operatorname*{arg\,min}_{h\in\mathcal{H}}\sum_{i=1}^{N}l(h(x_{i}))$.
The dataset $\mathcal{D}$ is generated as $x_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $\{x_{1},\dots,x_{t-1}\}$, where the function $h^{*}_{t}\in\operatorname*{arg\,min}_{f\in\mathcal{Y}}\mathbb{E}_{x\sim P_{t}}[l(f(x))]$ satisfies approximate realizability, i.e.,

\[
\inf_{h\in\mathcal{H}}\frac{1}{N}\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}\big(l(h(x_{t}))-l(h^{*}_{t}(x_{t}))\big)\leq\gamma,
\]

and for all $x\in\mathcal{X}$, $|l(h^{*}_{t}(x))|\leq c$. Then, the ERM solution satisfies

\[
\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}l(\widehat{h}(x_{t}))-\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}l(h^{*}_{t}(x_{t}))\leq 3\gamma N+48c\log(2\lvert\mathcal{H}\rvert/\delta)
\]

with probability at least $1-\delta$.
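To make the rate in Proposition 4 concrete, the following sketch evaluates the right-hand side $3\gamma N+48c\log(2\lvert\mathcal{H}\rvert/\delta)$ and its per-sample average for a few values of $N$. The function name `online_erm_bound` and the chosen constants are hypothetical and serve only to illustrate the scaling.

```python
import math


def online_erm_bound(n, gamma, c, class_size, delta):
    """Right-hand side 3 * gamma * N + 48 * c * log(2 |H| / delta) of Proposition 4."""
    return 3.0 * gamma * n + 48.0 * c * math.log(2.0 * class_size / delta)


# Hypothetical constants: exact realizability (gamma = 0), losses bounded by c = 1,
# a class of 1000 functions, and failure probability delta = 0.05.
per_sample = {}
for n in (100, 1_000, 10_000):
    total = online_erm_bound(n, gamma=0.0, c=1.0, class_size=1_000, delta=0.05)
    per_sample[n] = total / n

print(per_sample)
```

When $\gamma=0$ the cumulative bound does not grow with $N$, so the averaged excess risk decays proportionally to $1/N$; a positive $\gamma$ adds a constant floor of $3\gamma$ to the average.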

Proof.

We adapt the proof of the least-squares generalization bound (Song et al., 2023, Lemma 3) here for the empirical risk minimization generalization bound under online data collection. Fix any function $h\in\mathcal{H}$. We define the random variable $Z_{t}^{h}=l(h(x_{t}))-l(h^{*}_{t}(x_{t}))$. Immediately, we note $|Z_{t}^{h}|\leq 2c$ for all $t$. By definition of $h^{*}_{t}$, we have a non-negative first moment of $Z_{t}^{h}$:

\begin{align*}
\mathbb{E}_{P_{t}}[Z^{h}_{t}]=\mathbb{E}_{x_{t}\sim P_{t}}l(h(x_{t}))-\mathbb{E}_{x_{t}\sim P_{t}}l(h^{*}_{t}(x_{t})). \tag{19}
\end{align*}

By symmetrization, assuming $l(h^*_t(x_t))^2 \leq l(h(x_t))^2$, we have that

\begin{align*}
0 \leq \mathbb{E}_{P_t}[(Z_t^h)^2] &\leq \mathbb{E}_{x_t\sim P_t}\big[2\, l(h(x_t))^2 - 2\cdot l(h(x_t))\cdot l(h^*_t(x_t))\big] \\
&\leq 2\,|l(h(x_t))|\, \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big) \\
&\leq 2c\cdot \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big).
\end{align*}

Similarly assuming $l(h^*_t(x_t))^2 \geq l(h(x_t))^2$, we get $0 \leq \mathbb{E}_{P_t}[(Z_t^h)^2] \leq 2c\cdot \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big)$. Thus, uniformly, we have

\begin{align}
0 \leq \mathbb{E}_{P_t}[(Z_t^h)^2] \leq 2c\cdot \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big). \tag{20}
\end{align}

We remark that (20) is called the Bernstein condition (Van Erven et al., 2015, Definition 5.1) when all sampling distributions $P_t$ are identical. It is one of the sufficient conditions on the loss function for obtaining $\mathcal{O}(1/N)$ generalization bounds for empirical risk minimization.

Now, applying Lemma 3 with $\lambda\in[0,1/(4c)]$ and $\delta>0$, we have

\begin{align*}
\left\lvert \sum_{t=1}^{N} Z^h_t - \mathbb{E}_{P_t}[Z_t^h] \right\rvert &\leq \lambda \sum_{t=1}^{N} \big(4c\,|\mathbb{E}_{P_t}[Z_t^h]| + \mathbb{E}_{P_t}[(Z_t^h)^2]\big) + \frac{\log(2/\delta)}{\lambda} \\
&\leq 6c\lambda \sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big) + \frac{\log(2/\delta)}{\lambda}
\end{align*}

with probability at least $1-\delta$, where the last inequality uses (19) and (20). Setting $\lambda = 1/(12c)$ in the above, we get for any $h\in\mathcal{H}$, with probability at least $1-\delta$:

\begin{align*}
\left\lvert \sum_{t=1}^{N} Z^h_t - \mathbb{E}_{P_t}[Z_t^h] \right\rvert \leq \frac{1}{2}\sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big) + 12c\log(2\lvert\mathcal{H}\rvert/\delta),
\end{align*}

by a union bound over $h\in\mathcal{H}$. Using (19), we rearrange the above to get:

\begin{align}
\sum_{t=1}^{N} Z_t^h \leq \frac{3}{2}\sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big) + 12c\log(2\lvert\mathcal{H}\rvert/\delta) \tag{21}
\end{align}
and
\begin{align}
\sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big) \leq 2\sum_{t=1}^{N} Z_t^h + 24c\log(2\lvert\mathcal{H}\rvert/\delta). \tag{22}
\end{align}
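
For completeness, the rearrangement can be spelled out as follows. The shorthand $S_h$, $E_h$, and $B$ is introduced only for this remark and is not notation used elsewhere in the paper. Write $S_h = \sum_{t=1}^{N} Z_t^h$, let $E_h = \sum_{t=1}^{N}\mathbb{E}_{P_t}[Z_t^h]$, which by (19) equals $\sum_{t=1}^{N}\mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big)$ and is non-negative, and let $B = 12c\log(2\lvert\mathcal{H}\rvert/\delta)$. With $\lambda = 1/(12c)$ we have $6c\lambda = 1/2$ and $\log(2/\delta)/\lambda = 12c\log(2/\delta)$, and the union bound replaces $\delta$ by $\delta/\lvert\mathcal{H}\rvert$. The two-sided bound $\lvert S_h - E_h\rvert \leq \tfrac{1}{2}E_h + B$ then yields
\begin{align*}
S_h - E_h \leq \tfrac{1}{2}E_h + B \;&\Longrightarrow\; S_h \leq \tfrac{3}{2}E_h + B, \\
E_h - S_h \leq \tfrac{1}{2}E_h + B \;&\Longrightarrow\; E_h \leq 2S_h + 2B = 2S_h + 24c\log(2\lvert\mathcal{H}\rvert/\delta),
\end{align*}
which are the two displayed inequalities, in that order.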

Define the function $\widetilde{h} \in \operatorname*{arg\,min}_{h\in\mathcal{H}} \sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(h(x_t)) - l(h^*_t(x_t))\big)$, which is independent of the dataset $\mathcal{D}$. By (21) for $\widetilde{h}$ and the approximate realizability assumption, we get

\begin{align*}
\sum_{t=1}^{N} Z_t^{\widetilde{h}} \leq \frac{3}{2}\sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\big(l(\widetilde{h}(x_t)) - l(h^*_t(x_t))\big) + 12c\log(2\lvert\mathcal{H}\rvert/\delta) \leq \frac{3}{2}\gamma N + 12c\log(2\lvert\mathcal{H}\rvert/\delta).
\end{align*}

By definitions of $\widetilde{h}$ and the ERM function $\widehat{h}$, we have that

\begin{align*}
\sum_{t=1}^{N} Z_t^{\widehat{h}} = \sum_{t=1}^{N} l(\widehat{h}(x_t)) - l(h^*_t(x_t)) \leq \sum_{t=1}^{N} l(\widetilde{h}(x_t)) - l(h^*_t(x_t)) = \sum_{t=1}^{N} Z_t^{\widetilde{h}}.
\end{align*}

From the above two relations, we get

\begin{align*}
\sum_{t=1}^{N} Z_t^{\widehat{h}} \leq \frac{3}{2}\gamma N + 12c\log(2\lvert\mathcal{H}\rvert/\delta).
\end{align*}

Now, using this and using (22) for the function $\widehat{h}$, we get

\begin{align*}
\sum_{t=1}^{N} \mathbb{E}_{x_t\sim P_t}\, l(\widehat{h}(x_t)) - l(h^*_t(x_t)) \leq 2\sum_{t=1}^{N} Z_t^{\widehat{h}} + 24c\log(2\lvert\mathcal{H}\rvert/\delta) \leq 3\gamma N + 48c\log(2\lvert\mathcal{H}\rvert/\delta),
\end{align*}

which holds with probability at least $1-\delta$. This completes the proof. ∎
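
As a sanity check on the constants, the following minimal Python sketch recombines the bound on $\sum_t Z_t^{\widehat{h}}$ with (22) and compares the result with the closed form $3\gamma N + 48c\log(2\lvert\mathcal{H}\rvert/\delta)$. The function names and the numerical values are illustrative assumptions, not part of the paper; here `gamma` stands for the approximate-realizability error appearing in the display above, not a discount factor.

```python
import math


def log_term(class_size, delta):
    """log(2|H|/delta), the confidence term shared by (21) and (22)."""
    return math.log(2.0 * class_size / delta)


def empirical_sum_bound(n, c, class_size, delta, gamma):
    """Bound on sum_t Z_t^{h_hat}: (3/2)*gamma*N + 12c*log(2|H|/delta)."""
    return 1.5 * gamma * n + 12.0 * c * log_term(class_size, delta)


def excess_risk_bound(n, c, class_size, delta, gamma):
    """Plug the empirical bound into (22): 2*(empirical bound) + 24c*log(2|H|/delta)."""
    return 2.0 * empirical_sum_bound(n, c, class_size, delta, gamma) \
        + 24.0 * c * log_term(class_size, delta)


def closed_form(n, c, class_size, delta, gamma):
    """Final right-hand side: 3*gamma*N + 48c*log(2|H|/delta)."""
    return 3.0 * gamma * n + 48.0 * c * log_term(class_size, delta)


# Hypothetical values chosen only for illustration.
n, c, class_size, delta = 10_000, 1.0, 1000, 0.05
for gamma in (0.0, 0.01):
    total = excess_risk_bound(n, c, class_size, delta, gamma)
    # Dividing by N gives the average excess risk; with gamma = 0 it decays as 1/N.
    print(f"gamma={gamma}: total={total:.3f}, per-sample={total / n:.6f}")
```

With `gamma = 0` (exact realizability) the per-sample bound scales as $1/N$, which is the fast rate mentioned in the remark on the Bernstein condition.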

We now state a useful result for an infinite-horizon discounted robust $\varphi$-regularized Markov decision process $(\mathcal{S},\mathcal{A},r,P^o,\lambda,\gamma,\varphi,d_0)$. This result allows the policy search space of our RPQ algorithm to be restricted to the class of deterministic Markov policies.

Proposition 5.

The robust regularized Bellman operator $\mathcal{T}$ (3)

\begin{align*}
(\mathcal{T}Q)(s,a) = r(s,a) + \gamma \inf_{P_{s,a}\in\mathcal{P}_{s,a}} \big(\mathbb{E}_{s'\sim P_{s,a}}[\max_{a'} Q(s',a')] + \lambda D_{\varphi}(P_{s,a}, P^o_{s,a})\big),
\end{align*}

and the value function operator $(\mathcal{T}_v V)(\cdot) = \max_a (\mathcal{T}Q)(\cdot,a)$ are both $\gamma$-contraction operators with respect to the sup-norm. Moreover, their respective unique fixed points $Q^*_{\lambda}$ and $V^*_{\lambda}$, for the optimal policy $\pi^*$, achieve the optimal robust value $\max_{\pi} V^{\pi}_{\lambda}$. Furthermore, the robust regularized optimal policy $\pi^*$ is a deterministic Markov policy satisfying $\pi^*(\cdot) = \operatorname*{arg\,max}_a Q^*_{\lambda}(\cdot,a)$.

Proof.

The $\gamma$-contraction property of both operators follows directly from the fact that $\inf_x p(x) - \inf_x q(x) \leq \sup_x (p(x) - q(x))$. Furthermore, this result is a direct corollary of (Yang et al., 2023, Proposition 3.1) and (Iyengar, 2005, Corollary 3.1). ∎
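
The contraction claim can be checked numerically on a small tabular example. The sketch below is illustrative only and rests on assumptions not made in the paper: it takes $\varphi$ to be the KL divergence, takes $\mathcal{P}_{s,a}$ to be the whole probability simplex, and uses the standard closed form $\inf_P \big(\mathbb{E}_P[V] + \lambda\,\mathrm{KL}(P \,\|\, P^o)\big) = -\lambda \log \mathbb{E}_{P^o}[e^{-V/\lambda}]$ for the inner problem. All function names and the random MDP are hypothetical.

```python
import math
import random


def kl_regularized_inf(values, nominal, lam):
    """Closed form of inf_P E_P[values] + lam * KL(P || nominal) (KL case only)."""
    shift = min(values)  # subtract the minimum for numerical stability
    total = sum(p * math.exp(-(v - shift) / lam) for p, v in zip(nominal, values))
    return shift - lam * math.log(total)


def robust_bellman(Q, reward, nominal, gamma, lam):
    """One application of the robust regularized Bellman operator on a table Q[s][a]."""
    V = [max(row) for row in Q]
    return [[reward[s][a] + gamma * kl_regularized_inf(V, nominal[s][a], lam)
             for a in range(len(Q[s]))] for s in range(len(Q))]


def sup_dist(Q1, Q2):
    """Sup-norm distance between two Q tables."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def random_mdp(n_states, n_actions, rng):
    """Random rewards in [0, 1] and random strictly positive nominal kernels."""
    reward = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    nominal = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            z = sum(w)
            row.append([x / z for x in w])
        nominal.append(row)
    return reward, nominal


rng = random.Random(0)
n_states, n_actions, gamma, lam = 4, 3, 0.9, 0.5
reward, nominal = random_mdp(n_states, n_actions, rng)
Q1 = [[rng.uniform(0, 10) for _ in range(n_actions)] for _ in range(n_states)]
Q2 = [[rng.uniform(0, 10) for _ in range(n_actions)] for _ in range(n_states)]

# Contraction check: ||T Q1 - T Q2||_inf <= gamma * ||Q1 - Q2||_inf.
lhs = sup_dist(robust_bellman(Q1, reward, nominal, gamma, lam),
               robust_bellman(Q2, reward, nominal, gamma, lam))
rhs = gamma * sup_dist(Q1, Q2)

# Fixed-point iteration from Q = 0, followed by the greedy deterministic policy.
Q = [[0.0] * n_actions for _ in range(n_states)]
for _ in range(500):
    Q = robust_bellman(Q, reward, nominal, gamma, lam)
residual = sup_dist(Q, robust_bellman(Q, reward, nominal, gamma, lam))
greedy_policy = [row.index(max(row)) for row in Q]
print(lhs <= rhs, residual, greedy_policy)
```

The greedy policy read off the approximate fixed point is deterministic and stationary, matching the structure asserted in Proposition 5. Other $\varphi$-divergences would need a different inner solver in place of `kl_regularized_inf`.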

We now state a similar result for a finite-horizon discounted robust $\varphi$-regularized Markov decision process $(\mathcal{S},\mathcal{A},P^o=(P^o_h)_{h=0}^{H-1},r=(r_h)_{h=0}^{H-1},\lambda,H,\varphi,d_0)$. This result allows the policy search space of our HyTQ algorithm to be restricted to the class of non-stationary deterministic Markov policies.

Proposition 6.

The robust regularized Bellman operator $\mathcal{T}$ (10) and the value function operator $\mathcal{T}_v$ are as follows:

\begin{align*}
(\mathcal{T}Q_{h+1})(s,a) &= r_h(s,a) + \inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}} \big(\mathbb{E}_{s'\sim P_{h,s,a}}[\max_{a'} Q_{h+1}(s',a')] + \lambda D_{\varphi}(P_{h,s,a}, P^o_{h,s,a})\big) \quad\text{and} \\
(\mathcal{T}_v V_{h+1})(s) &= \max_a \bigg[r_h(s,a) + \inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}} \big(\mathbb{E}_{s'\sim P_{h,s,a}}[V_{h+1}(s')] + \lambda D_{\varphi}(P_{h,s,a}, P^o_{h,s,a})\big)\bigg].
\end{align*}

The optimal robust value $V^*_{h,\lambda}$ satisfies the following robust dynamic programming procedure: Starting with $V^*_{H,\lambda}=0$, doing backward iteration of $\mathcal{T}_v$, i.e., $V^*_{h,\lambda} = \mathcal{T}_v V^*_{h+1,\lambda}$, we get $V^*_{h,\lambda}$ for all $h\in[H]$. Furthermore, the robust regularized optimal policy $\pi^*$ is a non-stationary deterministic Markov policy satisfying $\pi^*_h(\cdot) = \operatorname*{arg\,max}_a Q^*_{h,\lambda}(\cdot,a)$ for all $h\in[H]$ where

\begin{align*}
Q^*_{h,\lambda}(\cdot,a) = r_h(s,a) + \inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}} \big(\mathbb{E}_{s'\sim P_{h,s,a}}[V^*_{h+1}(s')] + \lambda D_{\varphi}(P_{h,s,a}, P^o_{h,s,a})\big).
\end{align*}

Moreover, as $V^{*}_{H,\lambda}=0=Q^{*}_{H,\lambda}$, it suffices to backward iterate $\mathcal{T}$, i.e., do $Q^{*}_{h,\lambda}=\mathcal{T}Q^{*}_{h+1,\lambda}$, to get $Q^{*}_{h,\lambda}$ for all $h\in[H]$.

Proof.

We start with the optimal robust value definition $V^{*}_{h,\lambda}=\max_{\pi}V^{\pi}_{h,\lambda}=\max_{\pi}\inf_{P\in\mathcal{P}}V^{h,\pi}_{P,r^{\lambda}_{h}}$. The value function claims in this statement are direct consequences of (Iyengar, 2005, Theorem 2.1 & 2.2) and (Zhang et al., 2023, Theorem 2) with the reward function $r^{\lambda}_{h}$.

It remains to prove the $Q^{*}$ dynamic programming with $\mathcal{T}$. That is, we establish $V^{*}_{h,\lambda}(\cdot)=\max_{a}Q^{*}_{h,\lambda}(\cdot,a)$ for all $h\in[H]$ with the dynamic programming of $\mathcal{T}$. We use induction to prove this. The base case is trivially true since $V^{*}_{H,\lambda}=0=Q^{*}_{H,\lambda}$. By $\mathcal{T}$, we have

\begin{align*}
Q^{*}_{h,\lambda}(s,a) &=(\mathcal{T}Q^{*}_{h+1,\lambda})(s,a)\\
&=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[\max_{a^{\prime}}Q^{*}_{h+1}(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big)\\
&=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[V^{*}_{h+1}(s^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big),
\end{align*}

where the last equality follows by the induction hypothesis $V^{*}_{h+1,\lambda}(\cdot)=\max_{a}Q^{*}_{h+1,\lambda}(\cdot,a)$. Maximizing both sides over the action $a$ and using the dynamic program $V^{*}_{h,\lambda}=\mathcal{T}_{v}V^{*}_{h+1,\lambda}$, we get $V^{*}_{h,\lambda}(\cdot)=\max_{a}Q^{*}_{h,\lambda}(\cdot,a)$. This completes the proof. ∎
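The backward iteration of $\mathcal{T}$ can be illustrated on a small tabular problem. The sketch below is not the paper's algorithm: it assumes a finite state and action space, the KL divergence as the $\varphi$-divergence, and an unconstrained uncertainty set, in which case the inner infimum has the well-known closed form $-\lambda\log\mathbb{E}_{P^{o}}[\exp(-V/\lambda)]$. The function names, array shapes, and the zero-based horizon index are all illustrative choices.

```python
import numpy as np

def kl_regularized_worst_case(v_next, p_nominal, lam):
    """Closed form of inf_P { E_P[v] + lam * KL(P || p_nominal) } (KL case only)."""
    z = -v_next / lam
    z_max = z.max()
    # Weighted log-sum-exp, shifted by z_max for numerical stability.
    return -lam * (z_max + np.log(np.dot(p_nominal, np.exp(z - z_max))))

def robust_backward_induction(rewards, nominal, lam):
    """Backward iteration Q_h = T Q_{h+1}, starting from Q_H = 0.

    rewards: array of shape (H, S, A); nominal: array of shape (H, S, A, S).
    Steps are indexed 0..H-1 here, with index H holding the terminal zeros.
    """
    H, S, A = rewards.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                Q[h, s, a] = rewards[h, s, a] + kl_regularized_worst_case(
                    V[h + 1], nominal[h, s, a], lam
                )
        # V_h(s) = max_a Q_h(s, a), the identity established in the proof.
        V[h] = Q[h].max(axis=1)
    policy = Q[:H].argmax(axis=2)  # greedy, non-stationary, deterministic
    return Q, V, policy

# Hypothetical random instance, for illustration only.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
rewards = rng.uniform(0.0, 1.0, size=(H, S, A))
nominal = rng.dirichlet(np.ones(S), size=(H, S, A))
Q, V, policy = robust_backward_induction(rewards, nominal, lam=0.5)
```

Because choosing $P=P^{o}$ is always feasible and incurs zero penalty, the values computed this way never exceed the non-robust values on the nominal model, which gives a quick sanity check for any implementation.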

Appendix D Offline Robust $\varphi$-regularized RL Results

In this section, we set $V_{\max}=1/(1-\gamma)$ whenever we use results from Proposition 3. In the following, we use constants $c_{1},c_{2},c_{3}$ from Proposition 3.

We first prove Proposition 1, which follows directly from Lemma 1.

Proof of Proposition 1.

For each $(s,a)$, consider the optimization problem in (3):

\begin{align*}
\inf_{P_{s,a}\in\mathcal{P}_{s,a}}&\big(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[V(s^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})\big)=-\sup_{P_{s,a}\in\mathcal{P}_{s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[-V(s^{\prime})]-\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})\big)\\
&\stackrel{(a)}{=}-\inf_{\eta^{\prime}\in\mathbb{R}}\Big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\Big[\varphi^{*}\Big(\frac{-\eta^{\prime}-V(s^{\prime})}{\lambda}\Big)\Big]+\eta^{\prime}\Big)\\
&\stackrel{(b)}{=}-\inf_{\eta\in\mathbb{R}}\Big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\Big[\varphi^{*}\Big(\frac{\eta-V(s^{\prime})}{\lambda}\Big)\Big]-\eta\Big)\\
&\stackrel{(c)}{=}-\inf_{\eta\in\Theta}\Big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\Big[\varphi^{*}\Big(\frac{\eta-V(s^{\prime})}{\lambda}\Big)\Big]-\eta\Big),
\end{align*}

where $(a)$ follows from Lemma 1, $(b)$ by setting $\eta=-\eta^{\prime}$, and $(c)$ by Proposition 3. This completes the proof. ∎
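The scalar dual form can be checked numerically in a case where the primal problem is solvable by hand. The sketch below assumes the KL divergence generated by $\varphi(x)=x\log x-x+1$, whose conjugate is $\varphi^{*}(t)=e^{t}-1$, and an unconstrained uncertainty set on a four-state example; the primal minimizer is then the exponential tilt $P\propto P^{o}\exp(-V/\lambda)$. The value vector, nominal distribution, and helper names are made up for illustration, and the search interval $[\min V,\max V]$ stands in for the bounded set $\Theta$.

```python
import numpy as np

def phi_star_kl(t):
    # Conjugate of phi(x) = x*log(x) - x + 1 (assumed KL generator).
    return np.exp(t) - 1.0

def dual_objective(eta, v, p_nominal, lam):
    # lam * E_{P^o}[phi*((eta - V(s')) / lam)] - eta
    return lam * np.dot(p_nominal, phi_star_kl((eta - v) / lam)) - eta

def ternary_min(fn, lo, hi, iters=200):
    # Minimizes a convex function of one variable on [lo, hi].
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fn(m1) < fn(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def primal_value(v, p_nominal, lam):
    # Evaluate E_P[V] + lam * KL(P || P^o) at the exponential-tilt minimizer.
    w = p_nominal * np.exp(-(v - v.min()) / lam)
    p = w / w.sum()
    kl = np.sum(p * np.log(p / p_nominal))
    return np.dot(p, v) + lam * kl

# Hypothetical inputs; p_nominal must be strictly positive for the KL term.
v = np.array([0.0, 1.0, 3.0, 4.0])
p_nominal = np.array([0.1, 0.4, 0.3, 0.2])
lam = 0.7

eta_star = ternary_min(lambda e: dual_objective(e, v, p_nominal, lam), v.min(), v.max())
dual_value = -dual_objective(eta_star, v, p_nominal, lam)
primal = primal_value(v, p_nominal, lam)
print(primal, dual_value)
```

Both numbers should agree with the closed form $-\lambda\log\mathbb{E}_{P^{o}}[\exp(-V/\lambda)]$ up to the tolerance of the one-dimensional search, which is what makes the scalar dual attractive: the infimum over distributions is replaced by a convex problem in a single variable $\eta$.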

We now prove Proposition 2, which mainly follows from Lemma 5.

Proof of Proposition 2.

Since the conjugate function $\varphi^{*}(\cdot)$ is continuous, for each $(s,a)\in\mathcal{S}\times\mathcal{A}$ we define the following function, which is continuous in $\eta$: $h((s,a),\eta)=\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\varphi^{*}\left((\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda\right)-\eta$. We observe that $h((s,a),\eta)$, viewed as a function of $(s,a)\in\mathcal{S}\times\mathcal{A}$, is $\Sigma(\mathcal{S}\times\mathcal{A})$-measurable for each $\eta\in\Theta$, where $\Theta$ is a bounded subset of the real line. The result now follows directly by arguments similar to those in the proof of Panaganti et al. (2022, Lemma 1). ∎

Now we state a result and provide its proof for the empirical risk minimization on the dual parameter.

Proposition 7 (Dual Optimization Error Bound).

Let $\widehat{g}_{f}$ be the dual optimization parameter from Algorithm 1 (Step 4) for the state-action value function $f$ and let $\mathcal{T}_{g}$ be as defined in (7). With probability at least $1-\delta$, we have

$$\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_{f}}f\|_{1,\mu}\leq 2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}}.$$
Proof.

We adapt the proof from Panaganti et al. (2022, Lemma 6). We first fix $f\in\mathcal{F}$ and later invoke a union bound over $\mathcal{F}$ to handle the supremum. We recall from (8) that $\widehat{g}_{f}=\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;f)$. From the robust Bellman equation, we directly obtain

\begin{align*}
\|\mathcal{T}_{\widehat{g}_{f}}f-\mathcal{T}f\|_{1,\mu}
&=\gamma\Big(\mathbb{E}_{s,a\sim\mu}\Big|\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\widehat{g}_{f}(s,a)\big)\\
&\qquad\qquad-\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\varphi^{*}((\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\eta\big)\Big|\Big)\\
&\stackrel{(a)}{=}\gamma\Big(\mathbb{E}_{s,a\sim\mu}\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\widehat{g}_{f}(s,a)\big)\\
&\qquad-\mathbb{E}_{s,a\sim\mu}\big[\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\varphi^{*}((\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\eta\big)\big]\Big)\\
&\stackrel{(b)}{=}\gamma\Big(\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\widehat{g}_{f}(s,a)\big)\\
&\qquad-\inf_{g\in L^{1}(\mu)}\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)\big)\Big)\\
&=\gamma\Big(\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\widehat{g}_{f}(s,a)\big)\\
&\qquad-\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)\big)\Big)\\
&\qquad+\gamma\Big(\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)\big)\\
&\qquad\qquad-\inf_{g\in L^{1}(\mu)}\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)\big)\Big)\\
&\stackrel{(c)}{\leq}\gamma\Big(\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-\widehat{g}_{f}(s,a)\big)\\
&\qquad-\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}\big(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)\big)\Big)+\gamma\varepsilon_{\mathcal{G}}\\
&\stackrel{(d)}{\leq}2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}}.
\end{align*}

$(a)$ follows since $\inf_{g} h(g) \leq h(\widehat{g}_{f})$. $(b)$ follows from Proposition 2. $(c)$ follows from the approximate dual realizability assumption (Assumption 3).

For $(d)$, we consider the loss function $l(g,(s,a,s')) = \lambda\varphi^{*}\big((g(s,a)-\max_{a'}f(s',a'))/\lambda\big) - g(s,a)$ (e.g., $l(g,(s,a,s')) = [(g(s,a)+2\lambda-\max_{a'}f(s',a'))_{+}^{2}]/(4\lambda) - \lambda - g(s,a)$) and the dataset $\mathcal{D}=\{s_{i},a_{i},s_{i}'\}_{i=1}^{N}$. Since $f\in\mathcal{F}$ and $g\in\mathcal{G}$, we note that $|l(g,(s,a,s'))|\leq c_{1}$, where the value of $c_{1}>0$ depends on the specific form of $\varphi^{*}$, as demonstrated in Proposition 3. Furthermore, we take $l(g,(s,a,s'))$ to be $c_{2}$-Lipschitz in $g$ and $|g(s,a)|\leq c_{3}$, since $g\in\mathcal{G}$, for some positive constants $c_{2}$ and $c_{3}$. Again, these constants depend on the specific form of $\varphi^{*}$, as demonstrated in Proposition 3. With these bounds in hand, we can apply the empirical risk minimization result in Lemma 4 to get $(d)$.
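As a sanity check on the example loss quoted above, the following minimal sketch evaluates the generic form $\lambda\varphi^{*}((g-v)/\lambda)-g$ and the quoted closed form side by side. It assumes the conjugate $\varphi^{*}(t) = (t+2)_{+}^{2}/4 - 1$, which is what the quoted example implies; the function names `phi_star`, `dual_loss`, `example_loss`, and `empirical_risk` are illustrative and do not come from the paper, and `v_next` stands in for $\max_{a'}f(s',a')$.

```python
def phi_star(t):
    # Assumed conjugate implied by the example in the text: (t + 2)_+^2 / 4 - 1.
    return max(t + 2.0, 0.0) ** 2 / 4.0 - 1.0


def dual_loss(g_sa, v_next, lam):
    # Generic form: l = lam * phi_star((g(s,a) - max_a' f(s',a')) / lam) - g(s,a).
    return lam * phi_star((g_sa - v_next) / lam) - g_sa


def example_loss(g_sa, v_next, lam):
    # Closed form quoted in the text: (g + 2*lam - v)_+^2 / (4*lam) - lam - g.
    return max(g_sa + 2.0 * lam - v_next, 0.0) ** 2 / (4.0 * lam) - lam - g_sa


def empirical_risk(g_values, v_next_values, lam):
    # Sample average of the loss over a dataset of (g(s_i,a_i), max_a' f(s_i',a')) pairs.
    losses = [dual_loss(g, v, lam) for g, v in zip(g_values, v_next_values)]
    return sum(losses) / len(losses)
```

Under this assumed conjugate the two forms agree pointwise, so minimizing `empirical_risk` over $g\in\mathcal{G}$ is the empirical counterpart of the infimum appearing in step $(c)$.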

Taking a union bound over $f\in\mathcal{F}$, with probability at least $1-\delta$, we finally get

\begin{align*}
\sup_{f\in\mathcal{F}} \|\mathcal{T}f - \mathcal{T}_{\widehat{g}_{f}}f\|_{1,\mu} \leq 2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}} + 5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}} + \gamma\varepsilon_{\mathcal{G}},
\end{align*}

which concludes the proof. ∎

We next prove the least-squares generalization bound for the RFQI algorithm.

Proposition 8 (Least squares generalization bound).

Let $\widehat{f}_{g}$ be the least-squares solution from Algorithm 1 (Step 5) for the state-action value function $f$ and dual variable function $g$. Let $\mathcal{T}_{g}$ be as defined in (7). Then, with probability at least $1-\delta$, we have

\begin{align*}
\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}} \|\mathcal{T}_{g}f - \widehat{f}_{g}\|_{2,\mu} \leq \sqrt{6\varepsilon_{\mathcal{F}}} + \sqrt{\frac{2}{(1-\gamma)^{2}} + 18(1+\gamma c_{1})}\,\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}.
\end{align*}
Proof.

We adapt the least-squares generalization bound given in Agarwal et al., (2019, Lemma A.11) to our setting. We recall from (9) that $\widehat{f}_{g} = \operatorname*{arg\,min}_{Q\in\mathcal{F}} \widehat{L}_{\mathrm{robQ}}(Q;f,g)$. We first fix functions $f\in\mathcal{F}$ and $g\in\mathcal{G}$. For any function $f'\in\mathcal{F}$, we define random variables $z_{i}^{f'}$ as

\begin{align*}
z_{i}^{f'} = \left(f'(s_{i},a_{i}) - y_{i}\right)^{2} - \left((\mathcal{T}_{g}f)(s_{i},a_{i}) - y_{i}\right)^{2},
\end{align*}

where $y_{i} = r_{i} - \gamma\lambda\varphi^{*}\big((g(s_{i},a_{i}) - \max_{a'}f(s_{i}',a'))/\lambda\big) + \gamma g(s_{i},a_{i})$, and $(s_{i},a_{i},s'_{i})\in\mathcal{D}$ with $(s_{i},a_{i})\sim\mu$, $s'_{i}\sim P^{o}_{s_{i},a_{i}}$. It is straightforward to note that for a given $(s_{i},a_{i})$, we have $\mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[y_{i}] = (\mathcal{T}_{g}f)(s_{i},a_{i})$. We note that the randomness of $z_{i}^{f'}$, given $f,f'\in\mathcal{F}$ and $g\in\mathcal{G}$, comes from the dataset tuples $(s_{i},a_{i},s_{i}')$.

Since $f,f'\in\mathcal{F}$ and $g\in\mathcal{G}$, from Proposition 3 we have $(\mathcal{T}_{g}f)(s_{i},a_{i}),\, y_{i} \leq 1+\gamma c_{1}$, where the value of $c_{1}>0$ depends on the specific form of $\varphi^{*}$. Using this, we obtain the first moment and an upper bound for the second moment of $z_{i}^{f'}$ as follows:

\begin{align*}
\mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f'}] &= \mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}\big[(f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i})) \cdot (f'(s_{i},a_{i}) + (\mathcal{T}_{g}f)(s_{i},a_{i}) - 2y_{i})\big] \\
&= (f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i}))^{2}, \\
\mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[(z_{i}^{f'})^{2}] &= \mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}\big[(f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i}))^{2} \cdot (f'(s_{i},a_{i}) + (\mathcal{T}_{g}f)(s_{i},a_{i}) - 2y_{i})^{2}\big] \\
&= (f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i}))^{2} \cdot \mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}\big[(f'(s_{i},a_{i}) + (\mathcal{T}_{g}f)(s_{i},a_{i}) - 2y_{i})^{2}\big] \\
&\leq C_{1}(f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i}))^{2},
\end{align*}

where $C_{1} = \frac{2}{(1-\gamma)^{2}} + 18(1+\gamma c_{1})$. This immediately implies that

\begin{align*}
\mathbb{E}_{s_{i},a_{i}\sim\mu,\, s'_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f'}] &= \left\|\mathcal{T}_{g}f - f'\right\|^{2}_{2,\mu}, \\
\mathbb{E}_{s_{i},a_{i}\sim\mu,\, s'_{i}\sim P^{o}_{s_{i},a_{i}}}[(z_{i}^{f'})^{2}] &\leq C_{1}\left\|\mathcal{T}_{g}f - f'\right\|^{2}_{2,\mu}.
\end{align*}

From these calculations, it is also straightforward to see that $|z_{i}^{f'} - \mathbb{E}_{s_{i},a_{i}\sim\mu,\, s'_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f'}]| \leq 2C_{1}$ almost surely.
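The first-moment identity rests on a cross term that vanishes in conditional expectation. A brief worked version follows, using the shorthand $\Delta_{i} = f'(s_{i},a_{i}) - (\mathcal{T}_{g}f)(s_{i},a_{i})$, which is introduced here only for illustration and is not notation from the paper. Since $\Delta_{i}$ is deterministic given $(s_{i},a_{i})$ and $\mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[y_{i}] = (\mathcal{T}_{g}f)(s_{i},a_{i})$, we can write

\begin{align*}
z_{i}^{f'} &= \Delta_{i}\,\big(\Delta_{i} + 2((\mathcal{T}_{g}f)(s_{i},a_{i}) - y_{i})\big) = \Delta_{i}^{2} + 2\Delta_{i}\,\big((\mathcal{T}_{g}f)(s_{i},a_{i}) - y_{i}\big), \\
\mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f'}] &= \Delta_{i}^{2} + 2\Delta_{i}\,\big((\mathcal{T}_{g}f)(s_{i},a_{i}) - \mathbb{E}_{s'_{i}\sim P^{o}_{s_{i},a_{i}}}[y_{i}]\big) = \Delta_{i}^{2}.
\end{align*}

Averaging $\Delta_{i}^{2}$ over $(s_{i},a_{i})\sim\mu$ then gives the squared $\|\cdot\|_{2,\mu}$ norm of $\mathcal{T}_{g}f - f'$.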

Now, using Bernstein's inequality (Lemma 2), together with a union bound over all $f'\in\mathcal{F}$, with probability at least $1-\delta$, we have

\begin{align*}
\Big| \|\mathcal{T}_{g}f - f'\|_{2,\mu}^{2} - \frac{1}{N}\sum_{i=1}^{N} z_{i}^{f'} \Big| \leq \sqrt{\frac{2C_{1}\|\mathcal{T}_{g}f - f'\|_{2,\mu}^{2}\log(2|\mathcal{F}|/\delta)}{N}} + \frac{2C_{1}\log(2|\mathcal{F}|/\delta)}{3N}, \tag{23}
\end{align*}

for all $f'\in\mathcal{F}$. This expression coincides with Panaganti et al., (2022, Eq.(15)). Thus, following the proof of Panaganti et al., (2022, Lemma 7), we finally get

\begin{align*}
\|\mathcal{T}_{g}f - \widehat{f}_{g}\|_{2,\mu}^{2} \leq 6\varepsilon_{\mathcal{F}} + \frac{9C_{1}\log(4|\mathcal{F}|/\delta)}{N}. \tag{24}
\end{align*}

We note the fact that $\sqrt{x+y} \leq \sqrt{x} + \sqrt{y}$ for $x,y\geq 0$. Now, using a union bound over $f\in\mathcal{F}$ and $g\in\mathcal{G}$, with probability at least $1-\delta$, we finally obtain

\begin{align*}
\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}} \|\mathcal{T}_{g}f - \widehat{f}_{g}\|_{2,\mu} \leq \sqrt{6\varepsilon_{\mathcal{F}}} + \sqrt{\frac{18C_{1}\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}.
\end{align*}

This completes the least-squares generalization bound analysis for the robust regularized Bellman updates. ∎
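To get a feel for how the bound of Proposition 8 behaves, the following minimal sketch evaluates its right-hand side numerically. The function name `least_squares_bound` and its argument names are illustrative choices rather than anything defined in the paper, and the inputs (class sizes, $c_{1}$, $\varepsilon_{\mathcal{F}}$) are treated as given constants.

```python
import math


def least_squares_bound(eps_F, gamma, c1, size_F, size_G, delta, N):
    # C_1 = 2 / (1 - gamma)^2 + 18 * (1 + gamma * c1), as in the proof.
    C1 = 2.0 / (1.0 - gamma) ** 2 + 18.0 * (1.0 + gamma * c1)
    # Statistical term: sqrt(18 * C_1 * log(2 |F| |G| / delta) / N).
    stat = math.sqrt(18.0 * C1 * math.log(2.0 * size_F * size_G / delta) / N)
    # Approximation term: sqrt(6 * eps_F).
    return math.sqrt(6.0 * eps_F) + stat
```

With $\varepsilon_{\mathcal{F}} = 0$, quadrupling $N$ halves the value returned, which reflects the $1/\sqrt{N}$ rate of the statistical term; the approximation term $\sqrt{6\varepsilon_{\mathcal{F}}}$ does not shrink with more data.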

We are now ready to prove the main theorem.

D.1 Proof of Theorem 1

Theorem 3 (Restatement of Theorem 1).

Let Assumptions 1, 2 and 3 hold. Let $\pi_{K}$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have

\begin{align*}
V^{\pi^*} - V^{\pi_K} \leq{}& \frac{2\gamma^{K}}{(1-\gamma)^{2}} + \frac{2\sqrt{C}}{(1-\gamma)^{2}}\Big(2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}} + 5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}} + \gamma\varepsilon_{\mathcal{G}}\Big) \\
& + \frac{2\sqrt{C}}{(1-\gamma)^{2}}\Big(\sqrt{6\varepsilon_{\mathcal{F}}} + \sqrt{\frac{2}{(1-\gamma)^{2}} + 18(1+\gamma c_{1})}\,\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}\Big).
\end{align*}
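As a rough numerical illustration of how the bound above scales, the following sketch evaluates its right-hand side. The function name, the argument names, and the labels `first_bracket` and `second_bracket` are ours (hypothetical). The constants $C$, $c_1$, $c_2$, $c_3$, $\varepsilon_{\mathcal{F}}$ and $\varepsilon_{\mathcal{G}}$ are treated as plain inputs, since their values come from the assumptions and earlier results, which are not reproduced here.

```python
import math


def rpq_suboptimality_bound(gamma, K, N, C, c1, c2, c3,
                            size_F, size_G, delta,
                            eps_F=0.0, eps_G=0.0):
    """Evaluate the right-hand side of the suboptimality bound.

    All arguments are illustrative inputs: size_F and size_G stand for
    |F| and |G|, and eps_F, eps_G for the two approximation errors.
    """
    inv_horizon_sq = 1.0 / (1.0 - gamma) ** 2

    # First term: 2 * gamma^K / (1 - gamma)^2.
    iteration_term = 2.0 * gamma ** K * inv_horizon_sq

    # Contents of the first pair of parentheses in the bound.
    first_bracket = (
        2.0 * gamma * c2 * c3 * math.sqrt(2.0 * math.log(size_G) / N)
        + 5.0 * c1 * math.sqrt(2.0 * math.log(8.0 * size_F / delta) / N)
        + gamma * eps_G
    )

    # Contents of the second pair of parentheses in the bound.
    second_bracket = (
        math.sqrt(6.0 * eps_F)
        + math.sqrt(2.0 * inv_horizon_sq + 18.0 * (1.0 + gamma * c1))
        * math.sqrt(18.0 * math.log(2.0 * size_F * size_G / delta) / N)
    )

    prefactor = 2.0 * math.sqrt(C) * inv_horizon_sq
    return iteration_term + prefactor * (first_bracket + second_bracket)
```

For fixed function classes, the sample-dependent terms shrink at rate $1/\sqrt{N}$, while the first term decays geometrically in $K$. With $\varepsilon_{\mathcal{F}}=\varepsilon_{\mathcal{G}}=0$, the value approaches $2\gamma^{K}/(1-\gamma)^{2}$ as $N$ grows.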
Proof.

We let $V_k(s)=Q_k(s,\pi_k(s))$ for every $s\in\mathcal{S}$. Since $\pi_k$ is the greedy policy w.r.t. $Q_k$, we also have $V_k(s)=Q_k(s,\pi_k(s))=\max_a Q_k(s,a)$. We recall that $V^*=V^{\pi^*}$ and $Q^*=Q^{\pi^*}$. We also recall from Section 2 that $Q^{\pi^*}$ is a fixed point of the robust Bellman operator $\mathcal{T}$ defined in (3).
We also note that an analogous statement holds for any stationary deterministic policy $\pi$: from Yang et al. (2023), $Q^{\pi}$ satisfies $Q^{\pi}(s,a)=r(s,a)+\gamma\min_{P_{s,a}\ll P^{o}_{s,a}}(\mathbb{E}_{s'\sim P_{s,a}}[V^{\pi}(s')]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a}))$. We now adapt the proof of Panaganti et al. (2022, Theorem 1) using the RRBE in its primal form (3) directly instead of its dual form (4).

We first characterize the performance decomposition between $V^{\pi^*}$ and $V^{\pi_K}$. We recall the initial state distribution $d_0$. Since $V^{\pi^*}(s)\geq V^{\pi_K}(s)$ for any $s\in\mathcal{S}$, we observe that

\begin{align*}
0 \leq{}& \mathbb{E}_{s_0\sim d_0}[V^{\pi^*}(s_0)-V^{\pi_K}(s_0)] = \mathbb{E}_{s_0\sim d_0}[(V^{\pi^*}(s_0)-V_K(s_0))-(V^{\pi_K}(s_0)-V_K(s_0))] \\
={}& \mathbb{E}_{s_0\sim d_0}[(Q^{\pi^*}(s_0,\pi^*(s_0))-Q_K(s_0,\pi_K(s_0)))-(Q^{\pi_K}(s_0,\pi_K(s_0))-Q_K(s_0,\pi_K(s_0)))] \\
\stackrel{(a)}{\leq}{}& \mathbb{E}_{s_0\sim d_0}[Q^{\pi^*}(s_0,\pi^*(s_0))-Q_K(s_0,\pi^*(s_0))+Q_K(s_0,\pi_K(s_0))-Q^{\pi_K}(s_0,\pi_K(s_0))] \\
={}& \mathbb{E}_{s_0\sim d_0}[Q^{\pi^*}(s_0,\pi^*(s_0))-Q_K(s_0,\pi^*(s_0))+Q_K(s_0,\pi_K(s_0))-Q^{\pi^*}(s_0,\pi_K(s_0)) \\
& \qquad\qquad + Q^{\pi^*}(s_0,\pi_K(s_0))-Q^{\pi_K}(s_0,\pi_K(s_0))] \\
\stackrel{(b)}{\leq}{}& \mathbb{E}_{s_0\sim d_0}[Q^{\pi^*}(s_0,\pi^*(s_0))-Q_K(s_0,\pi^*(s_0))+Q_K(s_0,\pi_K(s_0))-Q^{\pi^*}(s_0,\pi_K(s_0)) \\
& \quad + \gamma[\min_{P_{s_0,\pi_K(s_0)}\ll P^{o}_{s_0,\pi_K(s_0)}}(\mathbb{E}_{s_1\sim P_{s_0,\pi_K(s_0)}}[V^{\pi^*}(s_1)]+\lambda D_{\varphi}(P_{s_0,\pi_K(s_0)},P^{o}_{s_0,\pi_K(s_0)})) \\
& \qquad - \min_{P_{s_0,\pi_K(s_0)}\ll P^{o}_{s_0,\pi_K(s_0)}}(\mathbb{E}_{s_1\sim P_{s_0,\pi_K(s_0)}}[V^{\pi_K}(s_1)]+\lambda D_{\varphi}(P_{s_0,\pi_K(s_0)},P^{o}_{s_0,\pi_K(s_0)}))]] \\
\stackrel{(c)}{\leq}{}& \mathbb{E}_{s_0\sim d_0}[|Q^{\pi^*}(s_0,\pi^*(s_0))-Q_K(s_0,\pi^*(s_0))|]+\mathbb{E}_{s_0\sim d_0}[|Q^{\pi^*}(s_0,\pi_K(s_0))-Q_K(s_0,\pi_K(s_0))|] \\
& \qquad\qquad + \gamma\,\mathbb{E}_{s_0\sim d_0}\mathbb{E}_{s_1\sim P^{\pi_K,\min}_{s_0,\pi_K(s_0)}}(|V^{\pi^*}(s_1)-V^{\pi_K}(s_1)|) \\
\stackrel{(d)}{\leq}{}& \sum_{h=0}^{\infty}\gamma^{h}\cdot\bigg(\mathbb{E}_{s\sim d_{h,\pi_K}}[|Q^{\pi^*}(s,\pi^*(s))-Q_K(s,\pi^*(s))|+|Q^{\pi^*}(s,\pi_K(s))-Q_K(s,\pi_K(s))|]\bigg), \tag{25}
\end{align*}

where $(a)$ follows from the fact that $\pi_K$ is the greedy policy with respect to $Q_K$, so that $Q_K(s_0,\pi^*(s_0))\leq Q_K(s_0,\pi_K(s_0))$, $(b)$ from the Bellman equations, and $(c)$ from the following definition

$$P^{\pi_K,\min}_{s,\pi_K(s)}\in\operatorname*{arg\,min}_{P_{s,\pi_K(s)}\ll P^{o}_{s,\pi_K(s)}}(\mathbb{E}_{s'\sim P_{s,\pi_K(s)}}[V^{\pi_K}(s')]+\lambda D_{\varphi}(P_{s,\pi_K(s)},P^{o}_{s,\pi_K(s)})).$$
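
To see why this definition gives $(c)$, the following short derivation may help; the shorthand $f_V$ and the abbreviations $P^{\min}$, $P^{o}$ are ours and are used only here. For a fixed state $s$, write $P^{o}=P^{o}_{s,\pi_K(s)}$, $P^{\min}=P^{\pi_K,\min}_{s,\pi_K(s)}$, and $f_V(P)=\mathbb{E}_{s'\sim P}[V(s')]+\lambda D_{\varphi}(P,P^{o})$. Since $P^{\min}$ attains the minimum of $f_{V^{\pi_K}}$ and is feasible for the minimization of $f_{V^{\pi^*}}$, the regularization terms cancel:

\begin{align*}
\min_{P\ll P^{o}} f_{V^{\pi^*}}(P)-\min_{P\ll P^{o}} f_{V^{\pi_K}}(P)
&\leq f_{V^{\pi^*}}(P^{\min})-f_{V^{\pi_K}}(P^{\min}) \\
&= \mathbb{E}_{s'\sim P^{\min}}[V^{\pi^*}(s')-V^{\pi_K}(s')]
\leq \mathbb{E}_{s'\sim P^{\min}}[|V^{\pi^*}(s')-V^{\pi_K}(s')|].
\end{align*}

The remaining $Q$-differences in step $(b)$ are then bounded by their absolute values.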

We note that this worst-case model distribution can be non-unique, and we simply pick one by an arbitrary deterministic rule. We emphasize that this model distribution is used only in the analysis and is not required by the algorithm. Finally, $(d)$ follows by telescoping over $|V^{\pi^*}-V^{\pi_K}|$ after defining a state distribution $d_{h,\pi_K}\in\Delta(\mathcal{S})$, for all natural numbers $h\geq 0$, as

$$d_{h,\pi_K}=\begin{cases}d_0 & \text{if } h=0,\\ P^{\pi_K,\min}_{s',\pi_K(s')} & \text{otherwise, with } s'\sim d_{h-1,\pi_K}.\end{cases}$$

We note that such state distribution proof ideas are commonly used in the offline RL literature (Agarwal et al., 2019; Panaganti et al., 2022; Bruns-Smith and Zhou, 2023; Zhang et al., 2023).
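
The two analysis objects above, the worst-case transition distribution and the state distributions it induces, can be made concrete in a small tabular setting. The sketch below is illustrative only and rests on assumptions not made in the proof: it fixes $D_{\varphi}$ to the KL divergence, for which the minimizer of $\mathbb{E}_{P}[V]+\lambda\,\mathrm{KL}(P\,\|\,P^{o})$ has the well-known closed form $P(s')\propto P^{o}(s')\exp(-V(s')/\lambda)$, and it uses a hypothetical 2-state, 2-action nominal model. All function and variable names are ours.

```python
import math


def kl_worst_case(p_nominal, values, lam):
    """Minimizer of E_P[V] + lam * KL(P || P^o) over P << P^o (KL case only)."""
    v_min = min(values)  # shift for numerical stability; cancels on normalization
    weights = [p * math.exp(-(v - v_min) / lam)
               for p, v in zip(p_nominal, values)]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_objective(p, p_nominal, values, lam):
    """E_P[V] + lam * KL(P || P^o) for finite distributions."""
    expected = sum(pi * v for pi, v in zip(p, values))
    kl = sum(pi * math.log(pi / qi)
             for pi, qi in zip(p, p_nominal) if pi > 0.0)
    return expected + lam * kl


def propagate(d_prev, kernel_min, policy):
    """One step of d_h(s') = sum_s d_{h-1}(s) * P^min_{s, policy(s)}(s')."""
    n_states = len(d_prev)
    d_next = [0.0] * n_states
    for s, mass in enumerate(d_prev):
        row = kernel_min[s][policy[s]]
        for s_next in range(n_states):
            d_next[s_next] += mass * row[s_next]
    return d_next


# Hypothetical nominal model P_o[s][a], value function (playing the role of
# V^{pi_K}), deterministic policy, and regularization weight.
P_o = [[[0.9, 0.1], [0.5, 0.5]],
       [[0.2, 0.8], [0.6, 0.4]]]
V = [1.0, 3.0]
policy = [0, 1]
lam = 1.0

# Worst-case kernel for every (s, a), then d_0, d_1, d_2, d_3.
P_min = [[kl_worst_case(P_o[s][a], V, lam) for a in range(2)] for s in range(2)]
d_h = [[1.0, 0.0]]
for _ in range(3):
    d_h.append(propagate(d_h[-1], P_min, policy))
```

In this toy example the worst-case kernel shifts probability mass toward the low-value state, which is the behavior the regularized minimization is meant to capture. For other $\varphi$-divergences the minimizer generally has a different form, so `kl_worst_case` would need to be replaced accordingly.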

For (25), with the $\nu$-norm notation, i.e., $\|f\|_{p,\nu}=(\mathbb{E}_{s,a\sim\nu}|f(s,a)|^{p})^{1/p}$ for any $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$, we have

\begin{align*}
\mathbb{E}_{s_0\sim d_0}[V^{\pi^*}]-\mathbb{E}_{s_0\sim d_0}[V^{\pi_K}]\leq\sum_{h=0}^{\infty}\gamma^{h}\bigg(\|Q^{\pi^*}-Q_K\|_{1,d_{h,\pi_K}\circ\pi^*}+\|Q^{\pi^*}-Q_K\|_{1,d_{h,\pi_K}\circ\pi_K}\bigg), \tag{26}
\end{align*}

where the state-action distributions are $d_{h,\pi_K}\circ\pi^*(s,a)\propto d_{h,\pi_K}(s)\mathds{1}(a=\pi^*(s))$ and $d_{h,\pi_K}\circ\pi_K(s,a)\propto d_{h,\pi_K}(s)\mathds{1}(a=\pi_K(s))$. We now analyze the above two terms treating either $d_{h,\pi_K}\circ\pi^*$ or $d_{h,\pi_K}\circ\pi_K$ as a state-action distribution $\nu$ satisfying Assumption 1.
First, considering any $s,a\sim\nu$ satisfying $Q^{\pi^*}(s,a)\geq Q_K(s,a)$, we have

\begin{align*}
0 \leq{} & Q^{\pi^*}(s,a)-Q_K(s,a) \leq Q^{\pi^*}(s,a)-\mathcal{T}Q_{K-1}(s,a)+|\mathcal{T}Q_{K-1}(s,a)-Q_K(s,a)| \\
& \leq Q^{\pi^*}(s,a)-\mathcal{T}Q_{K-1}(s,a)+\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\nu} \\
& \stackrel{(e)}{\leq} Q^{\pi^*}(s,a)-\mathcal{T}Q_{K-1}(s,a)+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu} \\
& \stackrel{(f)}{=} \gamma\Big[\min_{P_{s,a}\ll P^o_{s,a}}\big(\mathbb{E}_{s'\sim P_{s,a}}[\max_{a'}Q^{\pi^*}(s',a')]+\lambda D_{\varphi}(P_{s,a},P^o_{s,a})\big) \\
& \hspace{3cm} -\min_{P_{s,a}\ll P^o_{s,a}}\big(\mathbb{E}_{s'\sim P_{s,a}}[\max_{a'}Q_{K-1}(s',a')]+\lambda D_{\varphi}(P_{s,a},P^o_{s,a})\big)\Big] \\
& \hspace{6cm} +\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu} \\
& \stackrel{(g)}{\leq} \gamma\big(\mathbb{E}_{s'\sim P^{Q_{K-1},\min}_{s,a}}(\max_{a'}Q^{\pi^*}(s',a')-\max_{a'}Q_{K-1}(s',a'))\big)+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu} \\
& \stackrel{(h)}{\leq} \gamma\big(\mathbb{E}_{s'\sim P^{Q_{K-1},\min}_{s,a}}\max_{a'}|Q^{\pi^*}(s',a')-Q_{K-1}(s',a')|\big)+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu}, \tag{27}
\end{align*}

where $(e)$ follows by the concentrability assumption (Assumption 1), $(f)$ follows from the Bellman equation and the definition of the operator $\mathcal{T}$, and $(g)$ follows, similarly to step $(c)$, from the following definition

\begin{align*}
P^{Q_{K-1},\min}_{s,a}\in\operatorname*{arg\,min}_{P_{s,a}\ll P^o_{s,a}}\big(\mathbb{E}_{s'\sim P_{s,a}}[\max_{a'}Q_{K-1}(s',a')]+\lambda D_{\varphi}(P_{s,a},P^o_{s,a})\big).
\end{align*}

We again emphasize that this model distribution is analysis-specific and we just pick one by an arbitrary deterministic rule since it may not be unique. $(h)$ follows by the fact $|\sup_x p(x)-\sup_x q(x)|\leq\sup_x|p(x)-q(x)|$. Now, by replacing $P^{Q_{K-1},\min}$ with $P^{Q^{\pi^*},\min}$ in step $(g)$ and repeating the steps for any $s,a\sim\nu$ satisfying $Q^{\pi^*}(s,a)\leq Q_K(s,a)$, we get

\begin{align*}
0\leq Q_K(s,a)-Q^{\pi^*}(s,a)\leq\gamma\big(\mathbb{E}_{s'\sim P^{Q^{\pi^*},\min}_{s,a}}\max_{a'}|Q^{\pi^*}(s',a')-Q_{K-1}(s',a')|\big)+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu}. \tag{28}
\end{align*}
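Step $(h)$ relies only on the elementary inequality $|\sup_x p(x)-\sup_x q(x)|\leq\sup_x|p(x)-q(x)|$. The following is a minimal numerical sanity check of that fact, assuming a finite domain represented by Python lists; the helper name `sup_gap_holds` and the random test data are illustrative and not part of the paper.

```python
import random

def sup_gap_holds(p, q, tol=1e-12):
    """Check |max_x p(x) - max_x q(x)| <= max_x |p(x) - q(x)| on a finite domain."""
    lhs = abs(max(p) - max(q))
    rhs = max(abs(pi - qi) for pi, qi in zip(p, q))
    return lhs <= rhs + tol

# Illustrative random functions on a 5-point domain (hypothetical data).
random.seed(0)
trials = []
for _ in range(1000):
    p = [random.uniform(0.0, 10.0) for _ in range(5)]
    q = [random.uniform(0.0, 10.0) for _ in range(5)]
    trials.append(sup_gap_holds(p, q))
all_hold = all(trials)
```

In the proof, $x$ plays the role of the action $a'$, with $p=Q^{\pi^*}(s',\cdot)$ and $q=Q_{K-1}(s',\cdot)$.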

We immediately note that both $P^{Q_{K-1},\min}_{s,a}$ and $P^{Q^{\pi^*},\min}_{s,a}$ satisfy $D_{\varphi}(P^{Q_{K-1},\min}_{s,a},P^o_{s,a})\leq 1/(\lambda(1-\gamma))$ and $D_{\varphi}(P^{Q^{\pi^*},\min}_{s,a},P^o_{s,a})\leq 1/(\lambda(1-\gamma))$, which follows from their definitions and the facts $Q_{K-1}\in\mathcal{F}$ and $\|Q^{\pi^*}\|_{\infty}\leq 1/(1-\gamma)$. Define the state-action probability distribution $\nu'$ as, for any $s',a'$,

\begin{align*}
\nu'(s',a') &= \sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^*}(s,a)>Q_K(s,a)\}\,P^{Q_{K-1},\min}_{s,a}(s')\,\mathds{1}\{a'=\operatorname*{arg\,max}_{b}|Q^{\pi^*}(s',b)-Q_{K-1}(s',b)|\} \\
&\hspace{1cm} +\sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^*}(s,a)\leq Q_K(s,a)\}\,P^{Q^{\pi^*},\min}_{s,a}(s')\,\mathds{1}\{a'=\operatorname*{arg\,max}_{b}|Q^{\pi^*}(s',b)-Q_{K-1}(s',b)|\}.
\end{align*}
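The divergence bounds noted just before the definition of $\nu'$ can be verified in a few lines. The sketch below is one way to see it, under two assumptions that are not spelled out in this passage: every $f\in\mathcal{F}\cup\{Q^{\pi^*}\}$ takes values in $[0,1/(1-\gamma)]$, and $D_{\varphi}(P^o_{s,a},P^o_{s,a})=0$ (which holds when $\varphi(1)=0$). Here $P^{f,\min}_{s,a}$ denotes the selected minimizer for $f$.

```latex
\begin{align*}
\lambda D_{\varphi}(P^{f,\min}_{s,a},P^o_{s,a})
&\leq \mathbb{E}_{s'\sim P^{f,\min}_{s,a}}[\max_{a'}f(s',a')]+\lambda D_{\varphi}(P^{f,\min}_{s,a},P^o_{s,a})
 && \text{(since } f\geq 0\text{)} \\
&\leq \mathbb{E}_{s'\sim P^o_{s,a}}[\max_{a'}f(s',a')]+\lambda D_{\varphi}(P^o_{s,a},P^o_{s,a})
 && \text{(} P^o_{s,a}\text{ is feasible)} \\
&= \mathbb{E}_{s'\sim P^o_{s,a}}[\max_{a'}f(s',a')]
 \leq \frac{1}{1-\gamma}.
\end{align*}
```

Dividing by $\lambda$ gives $D_{\varphi}(P^{f,\min}_{s,a},P^o_{s,a})\leq 1/(\lambda(1-\gamma))$ for both $f=Q_{K-1}$ and $f=Q^{\pi^*}$.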

Now, we can combine (27)-(28) as follows

\begin{align*}
\|Q^{\pi^*}-Q_K\|_{1,\nu} &\leq \gamma\|Q^{\pi^*}-Q_{K-1}\|_{1,\nu'}+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu} \\
&\stackrel{(i)}{\leq} \gamma\|Q^{\pi^*}-Q_{K-1}\|_{1,\nu'}+\sqrt{C}\,\|\mathcal{T}_{g_{K-1}}Q_{K-1}-Q_K\|_{2,\mu}+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-\mathcal{T}_{g_{K-1}}Q_{K-1}\|_{1,\mu},
\end{align*}

where $(i)$ uses the triangle inequality and the fact $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$.
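For readers who want the intermediate steps, the combination of (27) and (28) and step $(i)$ can be spelled out as follows. This is a sketch that assumes a finite state-action space (as in the summation defining $\nu'$) and the usual definition $\|h\|_{p,\mu}=(\mathbb{E}_{\mu}|h|^p)^{1/p}$; the shorthand $\Delta_{K-1}$ is introduced here only for brevity.

```latex
\begin{align*}
&\text{Let } \Delta_{K-1}(s') := \max_{a'}|Q^{\pi^*}(s',a')-Q_{K-1}(s',a')|. \text{ Averaging (27) and (28) over } \nu: \\
\|Q^{\pi^*}-Q_K\|_{1,\nu}
&= \sum_{s,a}\nu(s,a)\,|Q^{\pi^*}(s,a)-Q_K(s,a)| \\
&\leq \gamma\sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^*}(s,a)>Q_K(s,a)\}\sum_{s'}P^{Q_{K-1},\min}_{s,a}(s')\,\Delta_{K-1}(s') \\
&\quad +\gamma\sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^*}(s,a)\leq Q_K(s,a)\}\sum_{s'}P^{Q^{\pi^*},\min}_{s,a}(s')\,\Delta_{K-1}(s')
 +\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu} \\
&= \gamma\sum_{s',a'}\nu'(s',a')\,|Q^{\pi^*}(s',a')-Q_{K-1}(s',a')|+\sqrt{C}\,\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu}. \\[4pt]
&\text{Step } (i)\text{: by the triangle inequality and Jensen's inequality,} \\
\|\mathcal{T}Q_{K-1}-Q_K\|_{1,\mu}
&\leq \|\mathcal{T}Q_{K-1}-\mathcal{T}_{g_{K-1}}Q_{K-1}\|_{1,\mu}+\|\mathcal{T}_{g_{K-1}}Q_{K-1}-Q_K\|_{1,\mu}, \\
\|h\|_{1,\mu} &= \mathbb{E}_{\mu}|h| \leq \sqrt{\mathbb{E}_{\mu}h^2} = \|h\|_{2,\mu}.
\end{align*}
```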

Now, by recursion until iteration 0, we get

\begin{align*}
\|Q^{\pi^*}-Q_K\|_{1,\nu} &\leq \gamma^K\sup_{\bar{\nu}}\|Q^{\pi^*}-Q_0\|_{1,\bar{\nu}}+\sqrt{C}\sum_{t=0}^{K-1}\gamma^t\|\mathcal{T}Q_{K-1-t}-\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}\|_{1,\mu} \\
&\hspace{4cm} +\sqrt{C}\sum_{t=0}^{K-1}\gamma^t\|\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}-Q_{K-t}\|_{2,\mu} \\
&\stackrel{(j)}{\leq} \frac{\gamma^K}{1-\gamma}+\sqrt{C}\sum_{t=0}^{K-1}\gamma^t\|\mathcal{T}Q_{K-1-t}-\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}\|_{1,\mu} \\
&\hspace{4cm} +\sqrt{C}\sum_{t=0}^{K-1}\gamma^t\|\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}-Q_{K-t}\|_{2,\mu} \\
&\stackrel{(k)}{\leq} \frac{\gamma^K}{1-\gamma}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_f}f\|_{1,\mu}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}_{\widehat{g}_f}f-\widehat{f}_{\widehat{g}_f}\|_{2,\mu} \\
&\leq \frac{\gamma^K}{1-\gamma}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_f}f\|_{1,\mu}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|\mathcal{T}_g f-\widehat{f}_g\|_{2,\mu}, \tag{29}
\end{align*}

where $(j)$ follows since $|Q^{\pi^*}(s,a)|\leq 1/(1-\gamma)$ and $Q_0(s,a)=0$, and $(k)$ follows from $\sum_{t=0}^{K-1}\gamma^t\leq 1/(1-\gamma)$ and since $\widehat{g}_f$ is the dual variable function from the algorithm for the state-action value function $f$, and $\widehat{f}_g$ is the least-squares solution from the algorithm for the pair of state-action value function $f$ and dual variable function $g$.
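The unrolling behind (29) is the standard scalar recursion $e_k\leq\gamma e_{k-1}+b_k$. As a rough illustration only, the sketch below iterates the worst case of that recursion and compares it with the closed-form bound $\gamma^K/(1-\gamma)+\max_k b_k/(1-\gamma)$ used in steps $(j)$ and $(k)$. The names `unroll` and `closed_form_bound` and the numeric values are hypothetical; `b_k` stands in for the two $\sqrt{C}$-weighted error terms at iteration $k$.

```python
def unroll(e0, per_step_errors, gamma):
    """Iterate e_k = gamma * e_{k-1} + b_k, the worst case of the recursion."""
    e = e0
    for b in per_step_errors:
        e = gamma * e + b
    return e

def closed_form_bound(num_iters, b_max, gamma):
    """gamma^K / (1 - gamma) + b_max / (1 - gamma), as in steps (j) and (k)."""
    return gamma ** num_iters / (1 - gamma) + b_max / (1 - gamma)

# Hypothetical numbers: discount 0.9, K = 50 iterations, per-step error 0.01.
gamma, K = 0.9, 50
errors = [0.01] * K
e0 = 1 / (1 - gamma)  # worst-case initial error, since Q_0 = 0 and |Q*| <= 1/(1-gamma)
recursive_value = unroll(e0, errors, gamma)
bound = closed_form_bound(K, max(errors), gamma)
```

With these values the recursion gives $\gamma^K e_0+b\,(1-\gamma^K)/(1-\gamma)$, which sits just below the closed-form bound because the geometric sum is truncated at $K$ terms.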

Now, using Lemma 7 and Lemma 8 to bound (29), and then combining it with (26), completes the proof of this theorem. ∎

D.2 Specialized Result for TV $\varphi$-divergence

We now state and prove the improved (in terms of assumptions) result for TV $\varphi$-divergence.

Assumption 9 (Concentrability).

There exists a finite constant $C_{\mathrm{tv}}>0$ such that for any $\nu\in\{d_{\pi,P^o}\}\subseteq\Delta(\mathcal{S}\times\mathcal{A})$ for any policy $\pi$ (can be non-stationary as well), we have $\|\nu/\mu\|_{\infty}\leq\sqrt{C_{\mathrm{tv}}}$.
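A concentrability condition of this kind is typically used through a change of measure, mirroring step $(e)$ above: $\|f\|_{1,\nu}\leq\|\nu/\mu\|_{\infty}\,\|f\|_{1,\mu}$. The following is a small numerical sketch of that inequality on a four-point space; the distributions, the function `f`, and the helper names are made up for illustration, and the code assumes $\mu>0$ wherever $\nu>0$.

```python
def weighted_l1(f, dist):
    """||f||_{1,dist} = sum_x dist(x) * |f(x)| on a finite space."""
    return sum(d * abs(v) for v, d in zip(f, dist))

def density_ratio_sup(nu, mu):
    """||nu / mu||_inf, assuming mu(x) > 0 wherever nu(x) > 0."""
    return max(n / m for n, m in zip(nu, mu) if n > 0)

# Hypothetical data distribution mu, target distribution nu, and function f.
mu = [0.25, 0.25, 0.25, 0.25]
nu = [0.7, 0.1, 0.1, 0.1]
f = [1.0, -2.0, 0.5, 3.0]

ratio = density_ratio_sup(nu, mu)     # plays the role of sqrt(C_tv)
lhs = weighted_l1(f, nu)              # ||f||_{1,nu}
rhs = ratio * weighted_l1(f, mu)      # ||nu/mu||_inf * ||f||_{1,mu}
```

Here `ratio` is the quantity that Assumption 9 requires to be at most $\sqrt{C_{\mathrm{tv}}}$ uniformly over the occupancy distributions $d_{\pi,P^o}$.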

Assumption 10 (Fail-state).

There is a fail state $s_{f}$ such that $r(s_{f},a)=0$ and $P_{s_{f},a}(s_{f})=1$, for all $a\in\mathcal{A}$ and $P\in\mathcal{P}$ satisfying $D_{\mathrm{TV}}(P_{s^{\prime},a^{\prime}},P^{o}_{s^{\prime},a^{\prime}})\leq\max\{1,1/(\lambda(1-\gamma))\}$ for all $s^{\prime},a^{\prime}$.

Theorem 4.

Let Assumptions 9, 2, 3 and 10 hold. Let $\pi_{K}$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have

\begin{align*}
V^{\pi^{*}}-V^{\pi_{K}}\leq{}& \frac{2\gamma^{K}}{(1-\gamma)^{2}}+\frac{2\sqrt{C_{\mathrm{tv}}}}{(1-\gamma)^{2}}\left(2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}}\right) \\
&+\frac{2\sqrt{C_{\mathrm{tv}}}}{(1-\gamma)^{2}}\left(\sqrt{6\varepsilon_{\mathcal{F}}}+\sqrt{\frac{2}{(1-\gamma)^{2}}+18(1+\gamma c_{1})}\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}\right),
\end{align*}

with $c_{1}=2\lambda+(1/(1-\gamma))$, $c_{2}=2$, $c_{3}=\lambda/2$.
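As a rough aid for reading the bound, the sketch below evaluates its right-hand side term by term for given problem parameters. It is only a numerical transcription of the displayed expression, not part of the paper's method; the function name, the argument names, and the choice to pass $\log|\mathcal{F}|$ and $\log|\mathcal{G}|$ directly are assumptions made for illustration.

```python
import math


def theorem4_bound(gamma, K, N, c_tv, lam, log_card_f, log_card_g, delta,
                   eps_f=0.0, eps_g=0.0):
    """Evaluate the right-hand side of the Theorem 4 bound.

    log_card_f and log_card_g stand for log|F| and log|G|.
    """
    c1 = 2.0 * lam + 1.0 / (1.0 - gamma)
    c2 = 2.0
    c3 = lam / 2.0
    prefactor = 2.0 * math.sqrt(c_tv) / (1.0 - gamma) ** 2

    # Optimization error: decays geometrically in the iteration count K.
    opt_term = 2.0 * gamma ** K / (1.0 - gamma) ** 2

    # First bracket: dual-variable estimation terms.
    dual_term = prefactor * (
        2.0 * gamma * c2 * c3 * math.sqrt(2.0 * log_card_g / N)
        + 5.0 * c1 * math.sqrt(
            2.0 * (math.log(8.0) + log_card_f - math.log(delta)) / N)
        + gamma * eps_g
    )

    # Second bracket: least-squares (value function) estimation terms.
    ls_term = prefactor * (
        math.sqrt(6.0 * eps_f)
        + math.sqrt(2.0 / (1.0 - gamma) ** 2 + 18.0 * (1.0 + gamma * c1))
        * math.sqrt(
            18.0 * (math.log(2.0) + log_card_f + log_card_g - math.log(delta)) / N)
    )
    return opt_term + dual_term + ls_term


# Hypothetical parameter values, chosen only to show the scaling in N.
bound_small_n = theorem4_bound(0.9, 50, 10_000, 4.0, 1.0, 10.0, 10.0, 0.05)
bound_large_n = theorem4_bound(0.9, 50, 40_000, 4.0, 1.0, 10.0, 10.0, 0.05)
```

With zero approximation errors, everything except the $\gamma^{K}$ term scales as $1/\sqrt{N}$, so quadrupling `N` halves the statistical part of the bound.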

Proof.

We again start by characterizing the performance decomposition between $V^{\pi^{*}}$ and $V^{\pi_{K}}$. This proof largely follows the proofs of Theorem 1 and Panaganti et al., (2022, Theorem 1). In particular, we use the total variation RRBE in its dual form (4), which is available under Assumption 10. That is, for all $\pi$ and $Q\in\mathcal{F}$, from (17) we have

\begin{align*}
Q^{\pi}(s,a) &= r(s,a)-\inf_{\eta\in[0,\lambda]}\left(\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[(\eta-V^{\pi}(s^{\prime}))_{+}]-\eta\right) \text{ and} \tag{30} \\
(\mathcal{T}Q)(s,a) &= r(s,a)-\inf_{\eta\in[0,\lambda]}\left(\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[(\eta-\max_{a^{\prime}}Q(s^{\prime},a^{\prime}))_{+}]-\eta\right).
\end{align*}
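The scalar minimization over $\eta$ in (30) is easy to carry out exactly when the next-state distribution has finite support, because the objective $\eta\mapsto\mathbb{E}[(\eta-v(s^{\prime}))_{+}]-\eta$ is convex and piecewise linear with kinks only at the values $v(s^{\prime})$. The sketch below is a tabular illustration under that finite-support assumption, not the paper's function-approximation algorithm, and it follows the right-hand side of (30) exactly as displayed. The function name and the example numbers are hypothetical.

```python
def tv_dual_backup(reward, next_probs, next_values, lam):
    """Compute r - inf_{eta in [0, lam]} ( E[(eta - v(s'))_+] - eta ).

    next_probs[i] is the nominal probability of next state i and
    next_values[i] is v at that state (V^pi(s'), or max_a' Q(s', a') for T Q).
    """
    def objective(eta):
        hinge = sum(p * max(eta - v, 0.0)
                    for p, v in zip(next_probs, next_values))
        return hinge - eta

    # A convex piecewise-linear function on [0, lam] attains its minimum
    # at an endpoint or at a kink lying inside the interval.
    candidates = [0.0, lam] + [v for v in next_values if 0.0 <= v <= lam]
    return reward - min(objective(eta) for eta in candidates)


# Hypothetical two-state example: values 0 and 2, each with probability 1/2.
tight = tv_dual_backup(1.0, [0.5, 0.5], [0.0, 2.0], lam=1.0)
loose = tv_dual_backup(1.0, [0.5, 0.5], [0.0, 2.0], lam=10.0)
```

In this example a small penalty weight (`lam=1.0`) lets the adversary shift mass toward the low-value state, so the backup is smaller than with `lam=10.0`, where the result coincides with the nominal expectation plus the reward.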

We recall the initial state distribution $d_{0}$. Since $V^{\pi^{*}}(s)\geq V^{\pi_{K}}(s)$ for any $s\in\mathcal{S}$, we begin with step $(b)$ in Theorem 1:

\begin{align*}
0\leq{}& \mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}(s_{0})-V^{\pi_{K}}(s_{0})] \\
\leq{}& \mathbb{E}_{s_{0}\sim d_{0}}\Big[Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))+Q_{K}(s_{0},\pi_{K}(s_{0}))-Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0})) \\
&\quad+\gamma\Big[\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}\big(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi^{*}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})})\big) \\
&\qquad-\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}\big(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi_{K}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})})\big)\Big]\Big] \\
\stackrel{(a)}{\leq}{}& \mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))|]+\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0}))|] \\
&\quad+\gamma\mathbb{E}_{s_{0}\sim d_{0}}\sup_{\eta}\big(\mathbb{E}_{s_{1}\sim P^{o}_{s_{0},\pi_{K}(s_{0})}}((\eta-V^{\pi_{K}}(s_{1}))_{+}-(\eta-V^{\pi^{*}}(s_{1}))_{+})\big) \\
\stackrel{(b)}{\leq}{}& \mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))|]+\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0}))|] \\
&\quad+\gamma\mathbb{E}_{s_{0}\sim d_{0}}\mathbb{E}_{s_{1}\sim P^{o}_{s_{0},\pi_{K}(s_{0})}}(|V^{\pi^{*}}(s_{1})-V^{\pi_{K}}(s_{1})|) \\
\stackrel{(c)}{\leq}{}& \sum_{h=0}^{\infty}\gamma^{h}\times\bigg(\mathbb{E}_{s\sim d_{h,\pi_{K}}}[|Q^{\pi^{*}}(s,\pi^{*}(s))-Q_{K}(s,\pi^{*}(s))|+|Q^{\pi^{*}}(s,\pi_{K}(s))-Q_{K}(s,\pi_{K}(s))|]\bigg), \tag{31}
\end{align*}

where $(a)$ follows from (30) and the fact $|\sup_{x}f(x)-\sup_{x}g(x)|\leq\sup_{x}|f(x)-g(x)|$, and $(b)$ follows from the facts $(x)_{+}-(y)_{+}\leq(x-y)_{+}$ and $(x)_{+}\leq|x|$ for any $x,y\in\mathbb{R}$. Note that the bound in step $(b)$ depends only on the nominal model $P^{o}$, unlike step $(c)$ in the proof of Theorem 1. This is the step that lets us improve the concentrability assumption in the subsequent analysis. Finally, $(c)$ follows by telescoping over $|V^{\pi^{*}}-V^{\pi_{K}}|$ and defining a new state distribution $d_{h,\pi_{K}}\in\Delta(\mathcal{S})$, for all natural numbers $h\geq 0$, as

$$d_{h,\pi_{K}}=\begin{cases}d_{0}&\text{if }h=0,\\ P^{o}_{s^{\prime},\pi_{K}(s^{\prime})}&\text{otherwise, with }s^{\prime}\sim d_{h-1,\pi_{K}}.\end{cases}$$
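For intuition, the recursion defining $d_{h,\pi_{K}}$ can be unrolled explicitly in a tabular setting: each step pushes the previous state distribution through the nominal kernel under the (here deterministic) policy. The sketch below assumes finite state and action spaces and a nested-list layout `P_nominal[s][a][s_next]`. Both are illustrative assumptions, and the transition numbers are made up.

```python
def next_state_distribution(d_prev, P_nominal, policy):
    """One step of the recursion: d_h(s') = sum_s d_{h-1}(s) * P^o_{s, pi(s)}(s')."""
    d_next = [0.0] * len(d_prev)
    for s, mass in enumerate(d_prev):
        a = policy[s]  # deterministic policy: state index -> action index
        for s_next, p in enumerate(P_nominal[s][a]):
            d_next[s_next] += mass * p
    return d_next


def state_distributions(d0, P_nominal, policy, horizon):
    """Return [d_0, d_1, ..., d_horizon] under the nominal model."""
    dists = [list(d0)]
    for _ in range(horizon):
        dists.append(next_state_distribution(dists[-1], P_nominal, policy))
    return dists


# Hypothetical 2-state, 2-action nominal model: P_nominal[s][a] is a
# probability vector over next states.
P_nominal = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.0, 1.0], [0.2, 0.8]],
]
policy = [0, 1]    # plays the role of pi_K
d0 = [1.0, 0.0]    # initial state distribution

dists = state_distributions(d0, P_nominal, policy, horizon=2)
```

Every distribution produced this way is generated by the nominal model $P^{o}$ alone, which is why the analysis only needs concentrability with respect to nominal-model occupancies.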

For (31), with the $\nu$-norm notation, i.e. $\|f\|_{p,\nu}=(\mathbb{E}_{s,a\sim\nu}|f(s,a)|^{p})^{1/p}$ for any $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$, we have

\begin{align*}
\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}]-\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi_{K}}] &\leq\sum_{h=0}^{\infty}\gamma^{h}\bigg(\|Q^{\pi^{*}}-Q_{K}\|_{1,d_{h,\pi_{K}}\circ\pi^{*}}+\|Q^{\pi^{*}}-Q_{K}\|_{1,d_{h,\pi_{K}}\circ\pi_{K}}\bigg) \\
&\leq\sum_{h=0}^{\infty}\gamma^{h}\Big(2\sup_{\nu}\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu}\Big), \tag{32}
\end{align*}

where the second inequality follows since both $d_{h,\pi_{K}}\circ\pi^{*}$ and $d_{h,\pi_{K}}\circ\pi_{K}$ satisfy Assumption 9, with the supremum taken over all $\nu$ satisfying that assumption. We now analyze the summand in (32):

\begin{align*}
\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu} &\leq\|Q^{\pi^{*}}-\mathcal{T}Q_{K-1}\|_{1,\nu}+\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\nu} \\
&\stackrel{(d)}{\leq}\|Q^{\pi^{*}}-\mathcal{T}Q_{K-1}\|_{1,\nu}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu} \\
&=(\mathbb{E}_{s,a\sim\nu}|Q^{\pi^{*}}(s,a)-\mathcal{T}Q_{K-1}(s,a)|)+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu} \\
&\stackrel{(e)}{\leq}\big(\mathbb{E}_{s,a\sim\nu}\gamma\sup_{\eta}|\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}((\eta-\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime}))_{+}-(\eta-\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime}))_{+})|\big) \\
&\qquad+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}
\end{align*}
(f)(𝔼s,aν|𝔼sPs,ao(maxaQπ(s,a)maxaQK1(s,a))+|)+Ctv𝒯QK1QK1,μsuperscript𝑓absentsubscript𝔼similar-to𝑠𝑎𝜈subscript𝔼similar-tosuperscript𝑠subscriptsuperscript𝑃𝑜𝑠𝑎subscriptsubscriptsuperscript𝑎superscript𝑄superscript𝜋superscript𝑠superscript𝑎subscriptsuperscript𝑎subscript𝑄𝐾1superscript𝑠superscript𝑎subscript𝐶tvsubscriptnorm𝒯subscript𝑄𝐾1subscript𝑄𝐾1𝜇\displaystyle\stackrel{{\scriptstyle(f)}}{{\leq}}(\mathbb{E}_{s,a\sim\nu}|% \mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}(\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime% },a^{\prime})-\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime}))_{+}|)+\sqrt{C_{% \mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_f ) end_ARG end_RELOP ( blackboard_E start_POSTSUBSCRIPT italic_s , italic_a ∼ italic_ν end_POSTSUBSCRIPT | blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT | ) + square-root start_ARG italic_C start_POSTSUBSCRIPT roman_tv end_POSTSUBSCRIPT end_ARG ∥ caligraphic_T italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_μ end_POSTSUBSCRIPT
(g)γ(𝔼s,aν𝔼sPs,aomaxa|Qπ(s,a)QK1(s,a)|)+Ctv𝒯QK1QK1,μsuperscript𝑔absent𝛾subscript𝔼similar-to𝑠𝑎𝜈subscript𝔼similar-tosuperscript𝑠subscriptsuperscript𝑃𝑜𝑠𝑎subscriptsuperscript𝑎superscript𝑄superscript𝜋superscript𝑠superscript𝑎subscript𝑄𝐾1superscript𝑠superscript𝑎subscript𝐶tvsubscriptnorm𝒯subscript𝑄𝐾1subscript𝑄𝐾1𝜇\displaystyle\stackrel{{\scriptstyle(g)}}{{\leq}}\gamma(\mathbb{E}_{s,a\sim\nu% }\mathbb{E}_{s^{\prime}\sim{P}^{o}_{s,a}}\max_{a^{\prime}}|Q^{\pi^{*}}(s^{% \prime},a^{\prime})-Q_{K-1}(s^{\prime},a^{\prime})|)+\sqrt{C_{\mathrm{tv}}}\|% \mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_g ) end_ARG end_RELOP italic_γ ( blackboard_E start_POSTSUBSCRIPT italic_s , italic_a ∼ italic_ν end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) | ) + square-root start_ARG italic_C start_POSTSUBSCRIPT roman_tv end_POSTSUBSCRIPT end_ARG ∥ caligraphic_T italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_μ end_POSTSUBSCRIPT
(h)γQπQK11,ν+Ctv𝒯QK1QK1,μsuperscriptabsent𝛾subscriptnormsuperscript𝑄superscript𝜋subscript𝑄𝐾11superscript𝜈subscript𝐶tvsubscriptnorm𝒯subscript𝑄𝐾1subscript𝑄𝐾1𝜇\displaystyle\stackrel{{\scriptstyle(h)}}{{\leq}}\gamma\|Q^{\pi^{*}}-Q_{K-1}\|% _{1,\nu^{\prime}}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_h ) end_ARG end_RELOP italic_γ ∥ italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + square-root start_ARG italic_C start_POSTSUBSCRIPT roman_tv end_POSTSUBSCRIPT end_ARG ∥ caligraphic_T italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_μ end_POSTSUBSCRIPT
(i)γQπQK11,ν+Ctv𝒯gK1QK1QK2,μ+Ctv𝒯QK1𝒯gK1QK11,μ,superscript𝑖absent𝛾subscriptnormsuperscript𝑄superscript𝜋subscript𝑄𝐾11superscript𝜈subscript𝐶tvsubscriptnormsubscript𝒯subscript𝑔𝐾1subscript𝑄𝐾1subscript𝑄𝐾2𝜇subscript𝐶tvsubscriptnorm𝒯subscript𝑄𝐾1subscript𝒯subscript𝑔𝐾1subscript𝑄𝐾11𝜇\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\gamma\|Q^{\pi^{*}}-Q_{K-1}\|% _{1,\nu^{\prime}}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}_{g_{K-1}}Q_{K-1}-Q_{K}\|% _{2,\mu}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-\mathcal{T}_{g_{K-1}}Q_{K-% 1}\|_{1,\mu},start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_i ) end_ARG end_RELOP italic_γ ∥ italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + square-root start_ARG italic_C start_POSTSUBSCRIPT roman_tv end_POSTSUBSCRIPT end_ARG ∥ caligraphic_T start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT - italic_Q start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 , italic_μ end_POSTSUBSCRIPT + square-root start_ARG italic_C start_POSTSUBSCRIPT roman_tv end_POSTSUBSCRIPT end_ARG ∥ caligraphic_T italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT - caligraphic_T start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_K - 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 , italic_μ end_POSTSUBSCRIPT ,

where $(d)$ follows by Assumption 9, $(e)$ from Eq. 30 and the fact $|\sup_{x}p(x)-\sup_{x}q(x)|\leq\sup_{x}|p(x)-q(x)|$, $(f)$ from the fact $|(x)_{+}-(y)_{+}|\leq|(x-y)_{+}|$, $(g)$ follows by Jensen's inequality and by the facts $|\sup_{x}p(x)-\sup_{x}q(x)|\leq\sup_{x}|p(x)-q(x)|$ and $(x)_{+}\leq|x|$, $(h)$ follows by defining the distribution $\nu^{\prime}$ as $\nu^{\prime}(s^{\prime},a^{\prime})=\sum_{s,a}\nu(s,a)P^{o}_{s,a}(s^{\prime})\mathds{1}\{a^{\prime}=\operatorname*{arg\,max}_{b}|Q^{\pi^{*}}(s^{\prime},b)-Q_{K-1}(s^{\prime},b)|\}$, and $(i)$ follows from the triangle inequality together with the fact that $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$. The rest of the proof follows as in the proof of Theorem 1. ∎
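Step $(h)$ can be checked numerically on a small tabular example. The sketch below is illustrative only: the sizes `S` and `A`, the random kernel `P`, and the array `diff` (standing in for $Q^{\pi^{*}}-Q_{K-1}$) are hypothetical choices and not quantities from the paper. It builds $\nu^{\prime}$ as defined above and compares $\mathbb{E}_{s,a\sim\nu}\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\max_{a^{\prime}}|\cdot|$ with $\|\cdot\|_{1,\nu^{\prime}}$.

```python
import random

def weighted_l1(f, dist):
    """||f||_{1,dist}: expectation of |f(s,a)| under dist."""
    return sum(dist[s][a] * abs(f[s][a])
               for s in range(len(f)) for a in range(len(f[0])))

def weighted_l2(f, dist):
    """||f||_{2,dist}: square root of the expectation of f(s,a)^2 under dist."""
    return sum(dist[s][a] * f[s][a] ** 2
               for s in range(len(f)) for a in range(len(f[0]))) ** 0.5

def pushforward(nu, P, diff):
    """nu'(s',a') = sum_{s,a} nu(s,a) P[s][a][s'] 1{a' = argmax_b |diff(s',b)|}."""
    S, A = len(diff), len(diff[0])
    nu_prime = [[0.0] * A for _ in range(S)]
    for sp in range(S):
        # Ties in the argmax are broken by taking the first maximizer.
        a_star = max(range(A), key=lambda b: abs(diff[sp][b]))
        nu_prime[sp][a_star] = sum(nu[s][a] * P[s][a][sp]
                                   for s in range(S) for a in range(A))
    return nu_prime

def random_simplex(n, rng):
    """Random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
S, A = 4, 3  # hypothetical toy sizes
flat = random_simplex(S * A, rng)
nu = [flat[s * A:(s + 1) * A] for s in range(S)]             # distribution over (s, a)
P = [[random_simplex(S, rng) for _ in range(A)] for _ in range(S)]  # nominal kernel
diff = [[rng.uniform(-1, 1) for _ in range(A)] for _ in range(S)]   # Q^{pi*} - Q_{K-1}

# Left side of step (h): E_{(s,a)~nu} E_{s'~P} max_{a'} |diff(s',a')|.
lhs = sum(nu[s][a] * P[s][a][sp] * max(abs(diff[sp][b]) for b in range(A))
          for s in range(S) for a in range(A) for sp in range(S))
nu_prime = pushforward(nu, P, diff)
rhs = weighted_l1(diff, nu_prime)
```

In this finite setting the two sides agree up to floating-point error, and `nu_prime` sums to one, so it is a valid state-action distribution. The same helpers also illustrate the norm ordering used in step $(i)$: `weighted_l1(diff, nu)` never exceeds `weighted_l2(diff, nu)` when `nu` is a probability distribution.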

Appendix E Hybrid Robust $\varphi$-regularized RL Results

In this section, we set $V_{\max}=H$ whenever we use results from Proposition 3. We remark that we have attempted to optimize the absolute constants inside $\log$ factors of the performance guarantees. In the following, we use constants $c_{1},c_{2},c_{3}$ from Proposition 3.

Now we provide an extension of Proposition 7 using Proposition 4 when the data comes from adaptive sampling.

Proposition 9 (Online Dual Optimization Error Bound).

Fix $\delta\in(0,1)$. For $k\in\{0,1,\cdots,K-1\}$, $h\in\{0,1,\cdots,H-1\}$, let $g^{k}_{h}$ be the dual optimization function from Algorithm 2 (Step 4) for the state-action value function $Q^{k}_{h+1}$ using samples in the dataset $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{k-1}_{h}\}$. Let $\mathcal{T}_{g}$ be as defined in (14) and let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$. Then, with probability at least $1-\delta$, we have

\begin{align*}
\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,\mu_{h}}
&\leq\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48c_{1}\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)=\Delta_{\mathrm{dual,off}}\quad\text{and} \\
\sum_{\tau=0}^{k-1}\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,d_{h}^{\pi_{\tau}}}
&\leq\frac{1}{m_{\mathrm{on}}}\left(3\varepsilon_{\mathcal{G}}N+48c_{1}\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)=\Delta_{\mathrm{dual,on}}.
\end{align*}
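To get a feel for how the two bounds in Proposition 9 scale, the following sketch evaluates them for given problem parameters. It is only an illustration under stated assumptions: the function name `dual_error_bounds` and all numeric inputs are hypothetical, the classes $\mathcal{G}$ and $\mathcal{F}$ are treated as finite with known cardinalities, and the constant $c_{1}$ from Proposition 3 is passed in as a free parameter because its value is not given in this section.

```python
import math

def dual_error_bounds(m_off, m_on, K, H, eps_G, c1, size_G, size_F, delta):
    """Evaluate (Delta_dual_off, Delta_dual_on) from Proposition 9.

    Both bounds share the numerator
        3 * eps_G * N + 48 * c1 * log(2 * H * K * |G| * |F| / delta),
    with N = m_off + K * m_on, and differ only in the normalizer.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    N = m_off + K * m_on
    numerator = 3.0 * eps_G * N + 48.0 * c1 * math.log(
        2.0 * H * K * size_G * size_F / delta)
    return numerator / m_off, numerator / m_on

# Hypothetical inputs chosen only to exercise the formula.
delta_off, delta_on = dual_error_bounds(
    m_off=100, m_on=10, K=5, H=2, eps_G=0.0, c1=1.0,
    size_G=1, size_F=1, delta=0.5)
```

Since the numerator is shared, `delta_off * m_off` equals `delta_on * m_on`; the offline bound is the tighter of the two whenever `m_off` exceeds `m_on`.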
Proof.

Fix $k\in\{0,1,\cdots,K-1\}$, $h\in\{0,1,\cdots,H-1\}$, $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$. The algorithm solves for $g^{k}_{h}$ in the empirical risk minimization step as:

\begin{align*}
g^{k}_{h}=\operatorname*{arg\,min}_{g\in\mathcal{G}_{h}}\widehat{L}_{\mathrm{dual}}(g;Q^{k}_{h+1},\mathcal{D}),
\end{align*}

where dataset $\mathcal{D}=\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i\leq N}$ with $N=m_{\mathrm{off}}+k\cdot m_{\mathrm{on}}$. The first $m_{\mathrm{off}}$ samples in $\mathcal{D}$ are $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{\mu}_{h}$ (recall that these are generated by the offline state-action distribution $\mu_{h}$), the next $m_{\mathrm{on}}$ samples are $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}=\mathcal{D}^{0}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{0}}$), and so on where the samples $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i=m_{\mathrm{off}}+\tau\cdot m_{\mathrm{on}}+1}^{m_{\mathrm{off}}+(\tau+1)m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{\tau}}$) for all $\tau\leq k-1$. We first have the following from step (b) in the proof of Proposition 7:

\begin{align*}
&m_{\mathrm{off}}\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,\mu}+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,d_{h}^{\pi_{\tau}}} \\
&=m_{\mathrm{off}}\big[\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g^{k}_{h}(s,a)) \\
&\qquad-\inf_{g\in L^{1}(\mu_{h})}\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g(s,a))\big] \\
&\qquad+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\big[\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g^{k}_{h}(s,a)) \\
&\qquad-\inf_{g\in L^{1}(d_{h}^{\pi_{\tau}})}\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g(s,a))\big] \\
&\stackrel{(a)}{=}m_{\mathrm{off}}\big[\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g^{k}_{h}(s,a)) \\
&\qquad-\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g^{*}_{-1}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g^{*}_{-1}(s,a))\big] \\
&\qquad+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\big[\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}((g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))/\lambda)-g^{k}_{h}(s,a))
\end{align*}
𝔼s,adhπτ,sPs,ao(λφ((gτ(s,a)maxaQh+1k(s,a))/λ)gτ(s,a))]\displaystyle\hskip 28.45274pt-\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{% \prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{*}_{\tau}(s,a)-\max_{a^{% \prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{*}_{\tau}(s,a))]- blackboard_E start_POSTSUBSCRIPT italic_s , italic_a ∼ italic_d start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_λ italic_φ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ( italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_s , italic_a ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) / italic_λ ) - italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_s , italic_a ) ) ]
=i=1moff𝔼shi,ahiμh,sh+1iPshi,ahio[(λφ((ghk(shi,ahi)maxaQh+1k(sh+1i,a))/λ)ghk(shi,ahi))\displaystyle=\sum_{i=1}^{m_{\mathrm{off}}}\mathbb{E}_{s^{i}_{h},a^{i}_{h}\sim% \mu_{h},s^{i}_{h+1}\sim P^{o}_{s^{i}_{h},a^{i}_{h}}}[(\lambda\varphi^{*}({(g^{% k}_{h}(s^{i}_{h},a^{i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime% }))}/{\lambda})-g^{k}_{h}(s^{i}_{h},a^{i}_{h}))= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ∼ italic_μ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( italic_λ italic_φ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ( italic_g start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 
end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) / italic_λ ) - italic_g start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) )
(λφ((g1(shi,ahi)maxaQh+1k(sh+1i,a))/λ)g1(shi,ahi))]\displaystyle\hskip 28.45274pt-(\lambda\varphi^{*}({(g^{*}_{-1}(s^{i}_{h},a^{i% }_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{*}% _{-1}(s^{i}_{h},a^{i}_{h}))]- ( italic_λ italic_φ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ( italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) / italic_λ ) - italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) ) ]
+i=moff+1moff+mon𝔼shi,ahidhπ0,sh+1iPshi,ahio[(λφ((ghk(shi,ahi)maxaQh+1k(sh+1i,a))/λ)ghk(shi,ahi))\displaystyle\hskip 28.45274pt+\sum_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m% _{\mathrm{on}}}\mathbb{E}_{s^{i}_{h},a^{i}_{h}\sim d_{h}^{\pi_{0}},s^{i}_{h+1}% \sim P^{o}_{s^{i}_{h},a^{i}_{h}}}[(\lambda\varphi^{*}({(g^{k}_{h}(s^{i}_{h},a^% {i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{% k}_{h}(s^{i}_{h},a^{i}_{h}))+ ∑ start_POSTSUBSCRIPT italic_i = italic_m start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT + italic_m start_POSTSUBSCRIPT roman_on end_POSTSUBSCRIPT end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ∼ italic_d start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( italic_λ italic_φ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ( italic_g start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a 
start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) / italic_λ ) - italic_g start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) )
(λφ((g0(shi,ahi)maxaQh+1k(sh+1i,a))/λ)g0(shi,ahi))]\displaystyle\hskip 56.9055pt-(\lambda\varphi^{*}({(g^{*}_{0}(s^{i}_{h},a^{i}_% {h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{*}_{% 0}(s^{i}_{h},a^{i}_{h}))]- ( italic_λ italic_φ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ( italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) / italic_λ ) - italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) ) ]
+\displaystyle\hskip 56.9055pt+\cdots+ ⋯
(b)3ε𝒢N+48c1log(2|𝒢|||/δ),superscript𝑏absent3subscript𝜀𝒢𝑁48subscript𝑐12𝒢𝛿\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}3\varepsilon_{\mathcal{G}}N+4% 8c_{1}\log(2|\mathcal{G}||\mathcal{F}|/\delta),start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_b ) end_ARG end_RELOP 3 italic_ε start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT italic_N + 48 italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_log ( 2 | caligraphic_G | | caligraphic_F | / italic_δ ) ,

where $(a)$ follows by defining the corresponding true solutions $g^{*}_{\tau}$ for all $\tau\in\{-1,0,1,\cdots,k-1\}$. For $(b)$, with the empirical risk minimization solution $g^{k}_{h}$, we use Proposition 4 with $c=c_{1}$ (where $c_{1}$ is the constant from Proposition 3, dependent on $H$ and $\lambda$), since $g^{k}_{h}\in\mathcal{G}_{h}$ and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ with sizes $|\mathcal{G}_{h}|\leq|\mathcal{G}|$ and $|\mathcal{F}_{h+1}|\leq|\mathcal{F}|$ under the union bound. Taking a union bound over $k\in\{0,1,\cdots,K-1\}$ and $h\in\{0,1,\cdots,H-1\}$, and bounding each term separately, completes the proof. ∎

Now we provide an extension of Proposition 8 using Lemma 7 when the data comes from adaptive sampling.

Proposition 10 (Online Least-squares Generalization Bound).

Fix $\delta\in(0,1)$. For $k\in\{0,1,\cdots,K-1\}$, $h\in\{0,1,\cdots,H-1\}$, let $Q^{k}_{h}$ be the least-squares solution from Algorithm 2 (Step 5) for the state-action value function $Q^{k}_{h+1}$ and dual variable function $g^{k}_{h}$ using samples in the dataset $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{k-1}_{h}\}$. Let $\mathcal{T}_{g}$ be as defined in (14) and let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$. Then, with probability at least $1-\delta$, we have

\begin{align*}
\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu_{h}} &\leq \frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+c_{1}+H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)=\Delta_{\mathrm{rQ,off}}\quad\text{and} \\
\sqrt{\sum_{\tau=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d_{h}^{\pi_{\tau}}}^{2}} &\leq \frac{1}{\sqrt{m_{\mathrm{on}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+c_{1}+H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)=\Delta_{\mathrm{rQ,on}}.
\end{align*}
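As a rough numerical illustration (not part of the original analysis), the two right-hand sides $\Delta_{\mathrm{rQ,off}}$ and $\Delta_{\mathrm{rQ,on}}$ can be evaluated directly from the problem constants. The function name `generalization_bounds` and the choice to pass $\log|\mathcal{G}|$ and $\log|\mathcal{F}|$ rather than the cardinalities themselves are assumptions made for this sketch; $c_{1}$ is treated as a given input since its value comes from Proposition 3.

```python
import math


def generalization_bounds(eps_fr, m_off, m_on, K, H, c1, log_card_G, log_card_F, delta):
    """Evaluate (Delta_rQ_off, Delta_rQ_on) as written in Proposition 10.

    eps_fr      : approximation error epsilon_{F,r}
    log_card_G  : log |G| (hypothetical input convention, avoids overflow)
    log_card_F  : log |F|
    """
    # Total sample count used in the statement: N = m_off + K * m_on.
    N = m_off + K * m_on
    # log(2 H K |G| |F| / delta), assembled from the log-cardinalities.
    log_term = math.log(2.0 * H * K / delta) + log_card_G + log_card_F
    # The bracketed factor is shared by both bounds.
    common = math.sqrt(3.0 * eps_fr * N) + 8.0 * (1.0 + c1 + H) * math.sqrt(log_term)
    return common / math.sqrt(m_off), common / math.sqrt(m_on)
```

The two bounds differ only in the normalisation, $1/\sqrt{m_{\mathrm{off}}}$ versus $1/\sqrt{m_{\mathrm{on}}}$, so in this sketch the shared factor is computed once.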
Proof.

We adapt the proof of Song et al. (2023, Lemma 7) here. Fix $k\in\{0,1,\cdots,K-1\}$, $h\in\{0,1,\cdots,H-1\}$, $g^{k}_{h}\in\mathcal{G}_{h}$, and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$. The algorithm solves for $Q^{k}_{h}$ in the least-squares regression step as:

\begin{align*}
Q^{k}_{h}=\operatorname*{arg\,min}_{Q\in\mathcal{F}_{h}}\widehat{L}_{\mathrm{robQ}}(Q;Q^{k}_{h+1},g^{k}_{h},\mathcal{D}),
\end{align*}

where the dataset is $\mathcal{D}=\{(x_{i},y_{i})\}_{i\leq N}$ with $N=m_{\mathrm{off}}+k\cdot m_{\mathrm{on}}$ and

\begin{align*}
x_{i}=(s^{i}_{h},a^{i}_{h})\qquad\text{and}\qquad y_{i}=r_{h}(s^{i}_{h},a^{i}_{h})-\lambda\varphi^{*}((g^{k}_{h}(s^{i}_{h},a^{i}_{h})-\max_{a'}Q^{k}_{h+1}(s^{i}_{h+1},a'))/\lambda)+g^{k}_{h}(s^{i}_{h},a^{i}_{h}).
\end{align*}
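To make the regression target $y_{i}$ concrete, the following is a minimal sketch of how a single target could be computed from one transition. The conjugate used by default is an illustrative assumption: it corresponds to the KL-type generator $\varphi(t)=t\log t-t+1$, whose convex conjugate is $\varphi^{*}(s)=e^{s}-1$; any other conjugate can be passed in instead. The names `kl_conjugate` and `regression_target` are hypothetical and do not come from the paper.

```python
import math


def kl_conjugate(s):
    # Assumption for illustration: phi(t) = t*log(t) - t + 1, so phi*(s) = exp(s) - 1.
    return math.exp(s) - 1.0


def regression_target(reward, g_sa, next_q_values, lam, phi_conj=kl_conjugate):
    """Compute y = r - lam * phi*((g(s,a) - max_a' Q(s',a')) / lam) + g(s,a).

    reward        : r_h(s, a)
    g_sa          : dual variable function value g_h^k(s, a)
    next_q_values : iterable of Q_{h+1}^k(s', a') over actions a'
    lam           : regularization parameter lambda > 0
    """
    v_next = max(next_q_values)  # max over next actions
    return reward - lam * phi_conj((g_sa - v_next) / lam) + g_sa
```

For example, when $g^{k}_{h}(s,a)$ equals $\max_{a'}Q^{k}_{h+1}(s',a')$, the conjugate is evaluated at zero, and under the assumed generator the target reduces to $r_{h}(s,a)+g^{k}_{h}(s,a)$.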

The first $m_{\mathrm{off}}$ samples in $\mathcal{D}$ are $\{(x_{i},y_{i})\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{\mu}_{h}$ (recall that these are generated by the offline state-action distribution $\mu_{h}$), the next $m_{\mathrm{on}}$ samples are $\{(x_{i},y_{i})\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}=\mathcal{D}^{0}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{0}}$), and so on: the samples $\{(x_{i},y_{i})\}_{i=m_{\mathrm{off}}+\tau\cdot m_{\mathrm{on}}+1}^{m_{\mathrm{off}}+(\tau+1)m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}$ are generated by the state-action distribution $d_{h}^{\pi_{\tau}}$, for all $\tau\leq k-1$.
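The index layout of the aggregated dataset described above can be sketched as follows. This is only an illustration of the bookkeeping; the function name `block_bounds` and the block labels are hypothetical.

```python
def block_bounds(m_off, m_on, k):
    """Return (label, first_index, last_index) for each block of the dataset D.

    Indices are 1-based and inclusive, matching the text: the offline block
    occupies 1..m_off, and online block tau occupies
    m_off + tau*m_on + 1 .. m_off + (tau+1)*m_on for tau = 0, ..., k-1.
    """
    blocks = [("offline", 1, m_off)]
    for tau in range(k):
        first = m_off + tau * m_on + 1
        last = m_off + (tau + 1) * m_on
        blocks.append((f"online_{tau}", first, last))
    return blocks
```

The last index of the final block equals $m_{\mathrm{off}}+k\cdot m_{\mathrm{on}}$, which is the sample count $N$ used in the proof.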

To use Lemma 7, we first note that for any sample $(x,y)$ in $\mathcal{D}$ with $x=(s_{h},a_{h})$ and $y=(r_{h}(s_{h},a_{h})-\lambda\varphi^{*}((g^{k}_{h}(s_{h},a_{h})-\max_{a'\in\mathcal{A}_{h+1}}Q^{k}_{h+1}(s_{h+1},a'))/\lambda)+g^{k}_{h}(s_{h},a_{h}))$, there exists some $f_{h+1}\in\mathcal{F}_{h+1}$ by Assumption 5 such that the following holds:

\begin{align*}
\mathbb{E}[y\mid x] &=\mathbb{E}_{s_{h+1}\sim P^{o}_{h,s_{h},a_{h}}}(r_{h}(s_{h},a_{h})-\lambda\varphi^{*}((g^{k}_{h}(s_{h},a_{h})-\max_{a'\in\mathcal{A}_{h+1}}Q^{k}_{h+1}(s_{h+1},a'))/\lambda)+g^{k}_{h}(s_{h},a_{h})) \\
&=\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}(s_{h},a_{h})\leq f_{h+1}(s_{h},a_{h}).
\end{align*}

We also note that, for any sample in $\mathcal{D}$, $|y|\leq 1+c_{1}$ (where $c_{1}$ is a constant depending on $H$ and $\lambda$, from Proposition 3) and $f_{h+1}(s,a)\leq H$ for all $s,a$. With these observations, applying Lemma 7, we get that the least-squares regression solution $Q^{k}_{h}$ satisfies

\[
\sum_{i=1}^{N}\mathbb{E}\big[(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}(x_{i})-Q^{k}_{h}(x_{i}))^{2}\mid\mathcal{D}\big] \leq 3\varepsilon_{\mathcal{F},\mathrm{r}}N+64(1+c_{1}+H)^{2}\log(2|\mathcal{G}||\mathcal{F}|/\delta)
\]

with probability at least $1-\delta$, since $g^{k}_{h}\in\mathcal{G}_{h}$ and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ with sizes $|\mathcal{G}_{h}|\leq|\mathcal{G}|$ and $|\mathcal{F}_{h+1}|\leq|\mathcal{F}|$ under the union bound. Recall that the samples in $\mathcal{D}^{\mu}_{h}$ are drawn independently and identically from the offline distribution $\mu_{h}$, and the samples in $\mathcal{D}^{\tau}_{h}$ are drawn independently and identically from the state-action distribution $d_{h}^{\pi_{\tau}}$. Thus we can further write this as

\[
m_{\mathrm{off}}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu}^{2}+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d_{h}^{\pi_{\tau}}}^{2} \leq 3\varepsilon_{\mathcal{F},\mathrm{r}}N+64(1+c_{1}+H)^{2}\log(2|\mathcal{G}||\mathcal{F}|/\delta).
\]

Taking a union bound over $k\in\{0,1,\cdots,K-1\}$ and $h\in\{0,1,\cdots,H-1\}$, bounding each term separately, and using the fact that $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ completes the proof. ∎
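
As a sketch of how this last step can be read, under the assumptions that both norms on the left-hand side are non-negative and that the failure probability is split evenly over the $KH$ pairs $(k,h)$ (so $\delta$ is replaced by $\delta/(KH)$), dropping the online sum and dividing by $m_{\mathrm{off}}$ gives

\begin{align*}
\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu}
&\leq \sqrt{\frac{3\varepsilon_{\mathcal{F},\mathrm{r}}N+64(1+c_{1}+H)^{2}\log(2KH|\mathcal{G}||\mathcal{F}|/\delta)}{m_{\mathrm{off}}}} \\
&\leq \sqrt{\frac{3\varepsilon_{\mathcal{F},\mathrm{r}}N}{m_{\mathrm{off}}}}+8(1+c_{1}+H)\sqrt{\frac{\log(2KH|\mathcal{G}||\mathcal{F}|/\delta)}{m_{\mathrm{off}}}}.
\end{align*}

The same argument, dropping the offline term and dividing by $m_{\mathrm{on}}$ instead, bounds the online sum. The constants in the lemma statement itself may be arranged differently; this display only illustrates the mechanics of the step.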

We are now ready to prove the main theorem.

E.1 Proof of Theorem 2

Theorem 5 (Restatement of Theorem 2).

Let Assumptions 4, 5, 6, 7 and 8 hold and fix any $\delta\in(0,1)$. Then the policies $\{\pi_{k}\}_{k\in[K]}$ of the HyTQ algorithm satisfy

\begin{align*}
\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq{}& \mathcal{O}\big((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{5/2}H\big) \\
&+\widetilde{\mathcal{O}}\Big(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}\Big)
\end{align*}

with probability at least $1-\delta$.

Proof.

We let $V^{k}_{h}(s)=Q^{k}_{h}(s,\pi_{k}(s))$ for every $s,h$. Since $\pi_{k}$ is the greedy policy with respect to $Q^{k}$, we also have $V^{k}_{h}(s)=Q^{k}_{h}(s,\pi_{k}(s))=\max_{a}Q^{k}_{h}(s,a)$. We recall that $V^{*}=V^{\pi^{*}}$ and $Q^{*}=Q^{\pi^{*}}$.
We also note, following (Zhang et al., 2023), that the same holds for any stationary Markov policy $\pi$: $Q^{\pi}$ satisfies $Q^{\pi}_{h}(s,a)=r_{h}(s,a)+\gamma\min_{P_{h,s,a}\ll P^{o}_{h,s,a}}\big(\mathbb{E}_{s'\sim P_{h,s,a}}[V^{\pi}_{h}(s')]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big)$.
We can now further use the dual form (4) under Assumption 8, that is, for all $\pi$ and $f_{h+1}\in\mathcal{F}_{h+1}$,

\begin{align}
Q^{\pi}_{h}(s,a) &= r_{h}(s,a)-\inf_{\eta\in[0,\lambda]}\big(\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[(\eta-V^{\pi}_{h+1}(s'))_{+}]-\eta\big),\text{ and} \tag{33}\\
(\mathcal{T}f_{h+1})(s,a) &= r_{h}(s,a)-\inf_{\eta\in[0,\lambda]}\big(\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[(\eta-\max_{a'}f_{h+1}(s',a'))_{+}]-\eta\big) \notag\\
(\mathcal{T}_{g_{h}}f_{h+1})(s,a) &= r_{h}(s,a)-\mathbb{E}_{s'\sim P^{o}_{h,s,a}}[(g_{h}(s,a)-\max_{a'}f_{h+1}(s',a'))_{+}]+g_{h}(s,a). \notag
\end{align}
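
To make the two operators concrete, here is a minimal numerical sketch for a single state-action pair with a finite next-state space. The function names (`dual_objective`, `robust_backup`, `fixed_dual_backup`) and the toy numbers are illustrative assumptions, not part of the HyTQ algorithm. The sketch relies on the fact that $\eta\mapsto\mathbb{E}[(\eta-v(s'))_{+}]-\eta$ is convex and piecewise linear, so its infimum over $[0,\lambda]$ is attained at an endpoint or at a kink $\eta=v(s')$.

```python
import numpy as np

def dual_objective(eta, v_next, p_nominal):
    """E_{s'~P^o}[(eta - v(s'))_+] - eta, with v(s') = max_a' f(s', a')."""
    return float(np.dot(p_nominal, np.maximum(eta - v_next, 0.0)) - eta)

def robust_backup(r, v_next, p_nominal, lam):
    """(T f)(s,a) = r - inf_{eta in [0, lam]} dual_objective(eta)."""
    # The objective is convex piecewise linear: check endpoints and kinks only.
    kinks = v_next[(v_next >= 0.0) & (v_next <= lam)]
    candidates = np.concatenate(([0.0, lam], kinks))
    return r - min(dual_objective(e, v_next, p_nominal) for e in candidates)

def fixed_dual_backup(r, v_next, p_nominal, g):
    """(T_g f)(s,a): the same backup with the dual variable fixed to g."""
    return r - dual_objective(g, v_next, p_nominal)

# Toy instance (hypothetical numbers): three next states.
p_nominal = np.array([0.5, 0.3, 0.2])
v_next = np.array([1.0, 0.2, 0.6])
r, lam = 0.5, 0.8

t_f = robust_backup(r, v_next, p_nominal, lam)
t_g = [fixed_dual_backup(r, v_next, p_nominal, g) for g in np.linspace(0.0, lam, 9)]
nominal = r + float(np.dot(p_nominal, v_next))
print(t_f, max(t_g), nominal)
```

Because $\mathcal{T}$ takes the infimum over $\eta$, every fixed $g\in[0,\lambda]$ yields $\mathcal{T}_{g}f\leq\mathcal{T}f$, with equality at a minimizer; and since $(x)_{+}\geq x$, the robust backup never exceeds the nominal (non-robust) backup.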

We first characterize the performance decomposition between $V^{\pi^{*}}_{0}$ and $V^{\pi_{k}}_{0}$. We recall the initial state distribution $d_{0}$. Since $V^{\pi^{*}}(s)\geq V^{\pi_{k}}(s)$ for any $s\in\mathcal{S}$, we observe that

\begin{align}
0\leq{}& \sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]=\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(V^{\pi^{*}}_{0}(s_{0})-V^{k}_{0}(s_{0}))-(V^{\pi_{k}}_{0}(s_{0})-V^{k}_{0}(s_{0}))] \notag\\
={}& \sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))-(Q^{\pi_{k}}_{0}(s_{0},\pi_{k}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))] \notag\\
\leq{}& \underbrace{\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]}_{(I)}+\underbrace{\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}_{0}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_{k}}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]}_{(II)}. \tag{34}
\end{align}
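
The last inequality in (34) rests on an elementary fact, spelled out here for completeness. For any reals $x$ and $y$,

\[
x-y\leq(x)_{+}+(-y)_{+},
\]

because $x\leq(x)_{+}$ and $-y\leq(-y)_{+}$. Applying this pointwise with $x=Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0}))$ and $y=Q^{\pi_{k}}_{0}(s_{0},\pi_{k}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0}))$, then taking the expectation over $s_{0}\sim d_{0}$ and summing over $k$, gives the terms $(I)$ and $(II)$.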

We rewrite the state-action distribution $d^{h,\pi}_{P^{o}}$, dropping $P^{o}$, as $d^{\pi}_{h}$ for simplicity. Letting $d^{\pi}_{h}$ also denote a state distribution (in $\Delta(\mathcal{S})$), we can write it as, for all $h$,

\begin{equation}
d^{\pi}_{h}=\begin{cases}d_{0}&\text{if }h=0,\\ P^{o}_{h,s',a'}&\text{otherwise, with }s'\sim d^{\pi}_{h-1},\ a'\sim\pi_{h}(s').\end{cases} \tag{35}
\end{equation}
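
The recursion in (35) can be illustrated on a small tabular example. The sketch below is only an illustration under stated assumptions: finite state and action spaces, a nominal kernel stored as an array `P[h, s, a, s_next]`, a stochastic policy `pi[h, s, a]`, and the indexing convention that step `h` of the policy and kernel maps $d_{h}$ to $d_{h+1}$. None of these names or conventions come from the paper.

```python
import numpy as np

def state_distributions(d0, P, pi):
    """Return [d_0, ..., d_H], where
    d_{h+1}(s') = sum_{s,a} d_h(s) * pi[h, s, a] * P[h, s, a, s']."""
    dists = [np.asarray(d0, dtype=float)]
    for h in range(P.shape[0]):
        # Joint state-action occupancy at step h: d_h(s) * pi_h(a | s).
        occupancy = dists[-1][:, None] * pi[h]
        # Push the occupancy through the nominal kernel at step h.
        dists.append(np.einsum("sa,sat->t", occupancy, P[h]))
    return dists

# Hypothetical random instance: H steps, S states, A actions.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # shape (H, S, A, S)
pi = rng.dirichlet(np.ones(A), size=(H, S))     # shape (H, S, A)
d0 = np.full(S, 1.0 / S)

dists = state_distributions(d0, P, pi)
```

Each entry of `dists` is a probability vector over states; the intermediate `occupancy` array plays the role of the state-action distribution that the analysis also denotes by $d^{\pi}_{h}$.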

We now analyze one term of $(I)$ in (34), starting from the facts that $\pi_{k}$ is the greedy policy with respect to $Q^{k}$ and that the function $(x)_{+}$ is non-decreasing in $x\in\mathbb{R}$:

\begin{align*}
&\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]\leq\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(Q^{\pi^{*}}_{0}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}] \\
&\stackrel{(a)}{\leq}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(Q^{\pi^{*}}_{0}(s_{0},a_{0})-\mathcal{T}Q^{k}_{1}(s_{0},a_{0}))_{+}]+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}] \\
&\stackrel{(b)}{\leq}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}\Big(\sup_{\eta}\big(\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}[(\eta-\max_{a'}Q^{k}_{1}(s_{1},a'))_{+}-(\eta-\max_{a'}Q^{\pi^{*}}_{1}(s_{1},a'))_{+}]\big)\Big)_{+} \\
&\hspace{85.35826pt}+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]
\end{align*}
(c)𝔼s0,a0d0π(𝔼s1P0,s0,a0o(maxaQ1π(s1,a)maxaQ1k(s1,a))+)++𝔼s0,a0d0π[(𝒯Q1k(s0,a0)Q0k(s0,a0))+]superscript𝑐absentsubscript𝔼similar-tosubscript𝑠0subscript𝑎0subscriptsuperscript𝑑superscript𝜋0subscriptsubscript𝔼similar-tosubscript𝑠1subscriptsuperscript𝑃𝑜0subscript𝑠0subscript𝑎0subscriptsubscriptsuperscript𝑎subscriptsuperscript𝑄superscript𝜋1subscript𝑠1superscript𝑎subscriptsuperscript𝑎subscriptsuperscript𝑄𝑘1subscript𝑠1superscript𝑎subscript𝔼similar-tosubscript𝑠0subscript𝑎0subscriptsuperscript𝑑superscript𝜋0delimited-[]subscript𝒯subscriptsuperscript𝑄𝑘1subscript𝑠0subscript𝑎0subscriptsuperscript𝑄𝑘0subscript𝑠0subscript𝑎0\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d% ^{\pi^{*}}_{0}}(\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}(\max_{a^{\prime}}% Q^{\pi^{*}}_{1}(s_{1},a^{\prime})-\max_{a^{\prime}}Q^{k}_{1}(s_{1},a^{\prime})% )_{+})_{+}+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(% s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_c ) end_ARG end_RELOP blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ 
end_POSTSUPERSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT + blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( caligraphic_T italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ]
(d)𝔼s0,a0d0π𝔼s1P0,s0,a0o(Q1π(s1,π(s1))Q1k(s1,πk(s1)))++𝔼s0,a0d0π[(𝒯Q1k(s0,a0)Q0k(s0,a0))+]superscript𝑑absentsubscript𝔼similar-tosubscript𝑠0subscript𝑎0subscriptsuperscript𝑑superscript𝜋0subscript𝔼similar-tosubscript𝑠1subscriptsuperscript𝑃𝑜0subscript𝑠0subscript𝑎0subscriptsubscriptsuperscript𝑄superscript𝜋1subscript𝑠1superscript𝜋subscript𝑠1subscriptsuperscript𝑄𝑘1subscript𝑠1subscript𝜋𝑘subscript𝑠1subscript𝔼similar-tosubscript𝑠0subscript𝑎0subscriptsuperscript𝑑superscript𝜋0delimited-[]subscript𝒯subscriptsuperscript𝑄𝑘1subscript𝑠0subscript𝑎0subscriptsuperscript𝑄𝑘0subscript𝑠0subscript𝑎0\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d% ^{\pi^{*}}_{0}}\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}(Q^{\pi^{*}}_{1}(s_% {1},\pi^{*}(s_{1}))-Q^{k}_{1}(s_{1},\pi_{k}(s_{1})))_{+}+\mathbb{E}_{s_{0},a_{% 0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{% 0}))_{+}]start_RELOP SUPERSCRIPTOP start_ARG ≤ end_ARG start_ARG ( italic_d ) end_ARG end_RELOP blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ italic_P start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) - italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 
end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT + blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( caligraphic_T italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ]
=𝔼s0d1π[(Q1π(s1,π(s1))Q1k(s1,πk(s1)))+]+𝔼s0,a0d0π[(𝒯Q1k(s0,a0)Q0k(s0,a0))+],absentsubscript𝔼similar-tosubscript𝑠0subscriptsuperscript𝑑superscript𝜋1delimited-[]subscriptsubscriptsuperscript𝑄superscript𝜋1subscript𝑠1superscript𝜋subscript𝑠1subscriptsuperscript𝑄𝑘1subscript𝑠1subscript𝜋𝑘subscript𝑠1subscript𝔼similar-tosubscript𝑠0subscript𝑎0subscriptsuperscript𝑑superscript𝜋0delimited-[]subscript𝒯subscriptsuperscript𝑄𝑘1subscript𝑠0subscript𝑎0subscriptsuperscript𝑄𝑘0subscript𝑠0subscript𝑎0\displaystyle=\mathbb{E}_{s_{0}\sim d^{\pi^{*}}_{1}}[(Q^{\pi^{*}}_{1}(s_{1},% \pi^{*}(s_{1}))-Q^{k}_{1}(s_{1},\pi_{k}(s_{1})))_{+}]+\mathbb{E}_{s_{0},a_{0}% \sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}% ))_{+}],= blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) - italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ] + blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_d start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( caligraphic_T italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 
end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_Q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ] , (36)

where $(a)$ follows by the triangle inequality for the $(\cdot)_+$ operation, $(b)$ from the Bellman equation, the operator $\mathcal{T}$, and the fact $\inf_x p(x) - \inf_x q(x) \leq \sup_x (p(x) - q(x))$, $(c)$ from the fact $(x)_+ - (y)_+ \leq (x-y)_+$ for any $x, y \in \mathbb{R}$, and $(d)$ follows by Jensen's inequality and by the definitions of the policies $\pi^*$ and $\pi_k$. Now, recursively applying this method to the first term in (36) over the horizon, we get

$$
\begin{aligned}
&\mathbb{E}_{s_0\sim d_0}\big[(Q^{\pi^*}_0(s_0,\pi^*(s_0))-Q^k_0(s_0,\pi_k(s_0)))_+\big] \\
&\leq \mathbb{E}_{s_H\sim d_H}\big[(Q^{\pi^*}_H(s_H,\pi^*(s_H))-Q^k_H(s_H,\pi_k(s_H)))_+\big] + \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^*}_h}\big[(\mathcal{T}Q^k_{h+1}(s,a)-Q^k_h(s,a))_+\big] \\
&\leq \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^*}_h}\big[(\mathcal{T}Q^k_{h+1}(s,a)-Q^k_h(s,a))_+\big],
\end{aligned}
\tag{37}
$$

where the last inequality holds since $V^{\pi}_H(s_H) = 0$ for all $\pi$ and $Q^k_H(s_H,\pi_k(s_H)) = 0$.
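
To make the recursion explicit, the following sketch writes the one-step bound at a generic step $h$. It assumes that the argument leading to (36) applies verbatim at every step, and it writes $d^{\pi^*}_h$ for the state distribution at step $h$ (the final-step distribution appears as $d_H$ in (37)). Summing over $h = 0, \dots, H-1$ telescopes the first term on the right and gives the first inequality in (37).

```latex
\begin{align*}
\mathbb{E}_{s_h\sim d^{\pi^*}_h}\big[(Q^{\pi^*}_h(s_h,\pi^*(s_h)) - Q^k_h(s_h,\pi_k(s_h)))_+\big]
&\leq \mathbb{E}_{s_{h+1}\sim d^{\pi^*}_{h+1}}\big[(Q^{\pi^*}_{h+1}(s_{h+1},\pi^*(s_{h+1})) - Q^k_{h+1}(s_{h+1},\pi_k(s_{h+1})))_+\big] \\
&\quad + \mathbb{E}_{s,a\sim d^{\pi^*}_h}\big[(\mathcal{T}Q^k_{h+1}(s,a) - Q^k_h(s,a))_+\big].
\end{align*}
```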

Recall

$$
C(\pi^*) = \max_{f\in\mathcal{F}} \frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^*}_h}\big[(\mathcal{T}f_{h+1}(s,a)-f_h(s,a))_+\big]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim \mu_h}\big[|\mathcal{T}f_{h+1}(s,a)-f_h(s,a)|\big]}.
$$
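
As an illustration of how this ratio behaves, here is a minimal sketch that evaluates it for a finite function class on a finite state-action space. Everything in it is an assumption made for illustration: the function name `concentrability`, the array layout, and the convention of skipping functions whose denominator is zero. The residuals $\mathcal{T}f_{h+1}(s,a) - f_h(s,a)$ are taken as given inputs rather than computed from a robust Bellman operator.

```python
import numpy as np

def concentrability(residuals, d_star, mu):
    """Evaluate the ratio defining C(pi*) for a finite class (illustrative only).

    residuals: shape (F, H, N), entry [f, h, n] is T f_{h+1} - f_h at the n-th (s, a) pair.
    d_star:    shape (H, N), probabilities of each (s, a) pair under d^{pi*}_h.
    mu:        shape (H, N), probabilities of each (s, a) pair under the data distribution mu_h.
    """
    # Numerator: positive part of the residual, averaged under d^{pi*}_h and summed over h.
    num = np.einsum("hn,fhn->f", d_star, np.maximum(residuals, 0.0))
    # Denominator: absolute residual, averaged under mu_h and summed over h.
    den = np.einsum("hn,fhn->f", mu, np.abs(residuals))
    # Assumed convention: ignore functions with zero residual under mu.
    valid = den > 0
    return float(np.max(num[valid] / den[valid]))

# Two functions, one step, two state-action pairs.
res = np.array([[[1.0, -1.0]], [[2.0, 0.0]]])
d_star = np.array([[1.0, 0.0]])
mu = np.array([[0.5, 0.5]])
c_value = concentrability(res, d_star, mu)
c_same = concentrability(res, mu, mu)
```

When the data distribution coincides with $d^{\pi^*}_h$, the ratio is at most 1 because $(x)_+ \leq |x|$; it grows as $\mu_h$ places less mass where the positive residuals under $d^{\pi^*}_h$ are concentrated.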

Now, using (37) in $(I)$ of (34), the following holds with probability at least $1-\delta/2$:

$$
\begin{aligned}
\sum_{k=0}^{K-1}\mathbb{E}_{s_0\sim d_0}\big[(Q^{\pi^*}(s_0,\pi^*(s_0)) - Q^k(s_0,\pi_k(s_0)))_+\big]
&\leq \sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^*}_h}\big[(\mathcal{T}Q^k_{h+1}(s,a)-Q^k_h(s,a))_+\big] \\
&\stackrel{(e)}{\leq} \sum_{k=0}^{K-1}C(\pi^*)\sum_{h=0}^{H-1}\|\mathcal{T}Q^k_{h+1}-Q^k_h\|_{1,\mu_h} \\
&\stackrel{(f)}{\leq} \sum_{k=0}^{K-1}C(\pi^*)\sum_{h=0}^{H-1}\big(\|\mathcal{T}Q^k_{h+1}-\mathcal{T}_{g^k_h}Q^k_{h+1}\|_{1,\mu_h} + \|\mathcal{T}_{g^k_h}Q^k_{h+1}-Q^k_h\|_{2,\mu_h}\big) \\
&\stackrel{(g)}{\leq} KHC(\pi^*)\big(\Delta_{\mathrm{dual,off}}+\Delta_{\mathrm{rQ,off}}\big),
\end{aligned}
\tag{38}
$$

where $(e)$ follows from the definition of $C(\pi^*)$ in Assumption 4, $(f)$ from the triangle inequality and the fact $\|\cdot\|_{1,\mu} \leq \|\cdot\|_{2,\mu}$, and $(g)$ follows from Propositions 9 and 10.
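
The norm comparison used in step $(f)$ is an instance of Jensen's inequality, $(\mathbb{E}_\mu|f|)^2 \leq \mathbb{E}_\mu f^2$. The sketch below spot-checks it on random inputs, assuming the weighted norm is defined as $\|f\|_{p,\mu} = (\mathbb{E}_{\mu}|f|^p)^{1/p}$ over a finite support; the helper name `weighted_norm` is hypothetical.

```python
import numpy as np

def weighted_norm(f, mu, p):
    """(E_mu |f|^p)^(1/p) for a probability vector mu on a finite support (assumed definition)."""
    return float(np.sum(mu * np.abs(f) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
norm_gaps = []
for _ in range(1000):
    f = rng.normal(size=6)            # arbitrary function values
    mu = rng.dirichlet(np.ones(6))    # random probability vector
    norm_gaps.append(weighted_norm(f, mu, 2) - weighted_norm(f, mu, 1))

# The gap ||f||_{2,mu} - ||f||_{1,mu} should never be negative (up to rounding).
min_gap = min(norm_gaps)
```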

For $(II)$, we first note that $\mathbb{E}_{s_0\sim d_0}[(Q^k(s_0,\pi_k(s_0))-Q^{\pi_k}(s_0,\pi_k(s_0)))_+] = \mathbb{E}_{s_0,a_0\sim d^{\pi_k}_0}[(Q^k(s_0,a_0)-Q^{\pi_k}(s_0,a_0))_+]$. So, following the same analysis as in $(I)$, we get

$$
\begin{aligned}
&\mathbb{E}_{s_0\sim d_0}\big[(Q^k(s_0,\pi_k(s_0))-Q^{\pi_k}(s_0,\pi_k(s_0)))_+\big] \leq \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_k}_h}\big[(Q^k_h(s,a)-\mathcal{T}Q^k_{h+1}(s,a))_+\big] \\
&\leq \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_k}_h}\big[(Q^k_h(s,a)-(\mathcal{T}_{g^k_h}Q^k_{h+1})(s,a))_+ + ((\mathcal{T}_{g^k_h}Q^k_{h+1})(s,a)-(\mathcal{T}Q^k_{h+1})(s,a))_+\big],
\end{aligned}
\tag{39}
$$

where the last inequality follows by the triangle inequality for the $(\cdot)_+$ operation.
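
The elementary inequalities for the positive-part operation used in (36) and (39) can be spot-checked numerically. The following is a minimal sketch on random inputs, not a proof; the helper `pos` and the variable names are introduced here only for illustration.

```python
import numpy as np

def pos(x):
    """Positive part (x)_+ = max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 10000))
tol = 1e-12  # slack for floating-point rounding

# Triangle inequality for (.)_+: (x - z)_+ <= (x - y)_+ + (y - z)_+.
tri_ok = bool(np.all(pos(x - z) <= pos(x - y) + pos(y - z) + tol))

# Difference of positive parts: (x)_+ - (y)_+ <= (x - y)_+.
diff_ok = bool(np.all(pos(x) - pos(y) <= pos(x - y) + tol))

# inf_x p(x) - inf_x q(x) <= sup_x (p(x) - q(x)), for 200 random pairs of
# functions p, q tabulated on a common grid of 50 points.
p, q = rng.normal(size=(2, 200, 50))
inf_ok = bool(np.all(p.min(axis=1) - q.min(axis=1) <= (p - q).max(axis=1) + tol))
```

The triangle inequality is the one applied in (39) with $x = Q^k_h$, $y = \mathcal{T}_{g^k_h}Q^k_{h+1}$ and $z = \mathcal{T}Q^k_{h+1}$, evaluated pointwise at $(s,a)$.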

Now, using (39) in $(II)$ of (34), we have

$$
\begin{aligned}
&\sum_{k=0}^{K-1}\mathbb{E}_{s_0\sim d_0}\big[(Q^k(s_0,\pi_k(s_0))-Q^{\pi_k}(s_0,\pi_k(s_0)))_+\big] \leq \\
&\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_k}_h}\big[(Q^k_h(s,a)-(\mathcal{T}_{g^k_h}Q^k_{h+1})(s,a))_+\big] + \sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_k}_h}\big[((\mathcal{T}_{g^k_h}Q^k_{h+1})(s,a)-\mathcal{T}Q^k_{h+1}(s,a))_+\big].
\end{aligned}
\tag{40}
$$

Recall the bilinear model from Assumption 7: $\mathbb{E}_{d^{\pi^{f}}_{h}}[(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle\right\rvert$.

Analyzing the first part of (40), the following holds with probability at least $1-\delta/2$:

\begin{align*}
\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} &\mathbb{E}_{d^{\pi_{k}}_{h}}[(Q^{k}_{h}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1})_{+}]\stackrel{(h)}{=}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\left\lvert\left\langle X_{h}(Q^{k}),W^{\mathrm{q}}_{h}(Q^{k},g^{k})\right\rangle\right\rvert \\
&\stackrel{(i)}{\leq}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\|W^{\mathrm{q}}_{h}(Q^{k},g^{k})\|_{\Sigma_{k-1;h}} \\
&=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{(W^{\mathrm{q}}_{h}(Q^{k},g^{k}))^{\top}\Sigma_{k-1;h}W^{\mathrm{q}}_{h}(Q^{k},g^{k})} \\
&=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{(W^{\mathrm{q}}_{h}(Q^{k},g^{k}))^{\top}\Big(\sum_{i=0}^{k-1}X_{h}(Q^{i})X_{h}(Q^{i})^{\top}+\sigma\mathds{1}\Big)W^{\mathrm{q}}_{h}(Q^{k},g^{k})} \\
&=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\left\lvert\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle\right\rvert^{2}+\sigma\left\|W^{\mathrm{q}}_{h}(Q^{k},g^{k})\right\|^{2}} \\
&\stackrel{(j)}{\leq}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\left\lvert\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle\right\rvert^{2}+\sigma B_{W}^{2}} \\
&\stackrel{(k)}{\leq}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}^{2}+\sigma B_{W}^{2}} \\
&\stackrel{(l)}{\leq}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\Bigg(\sqrt{\sum_{i=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}^{2}}+\sqrt{\sigma B_{W}^{2}}\Bigg) \\
&\stackrel{(m)}{\leq}(\Delta_{\mathrm{rQ,on}}+\sqrt{\sigma B_{W}^{2}})\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}} \\
&\stackrel{(n)}{\leq}(\Delta_{\mathrm{rQ,on}}+B_{X}B_{W})\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}, \tag{41}
\end{align*}

where $(h)$ follows from Assumption 7, $(i)$ from the matrix Cauchy-Schwarz inequality, $(j)$ from Assumption 7, and $(k)$ from Assumption 7 together with $\|\cdot\|_{1,d^{\pi_{i}}_{h}}\leq\|\cdot\|_{2,d^{\pi_{i}}_{h}}$:

\begin{align*}
|\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle| &=\mathbb{E}_{s,a\sim d^{\pi_{i}}_{h}}[(Q^{k}_{h}(s,a)-(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1})(s,a))_{+}]\leq\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}.
\end{align*}

Finally, $(l)$ follows from the fact that $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0$, $(m)$ follows from Proposition 10, and $(n)$ follows from Lemma 6.
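The linear-algebra steps in (41) can be sanity-checked numerically. The sketch below uses random vectors as stand-ins for $X_{h}(Q^{i})$ and $W^{\mathrm{q}}_{h}(Q^{k},g^{k})$; the dimensions, the seed, and the value of $\sigma$ are illustrative assumptions and are not taken from the paper. It checks step $(i)$, the expansion of the quadratic form in $\Sigma_{k-1;h}$, and the square-root inequality used in step $(l)$.

```python
import numpy as np

def weighted_norm(v, M):
    """Return sqrt(v^T M v) for a symmetric positive-definite matrix M."""
    return float(np.sqrt(v @ M @ v))

rng = np.random.default_rng(0)
d, k, sigma = 4, 10, 0.5           # illustrative sizes (assumptions)
X_prev = rng.normal(size=(k, d))   # stand-ins for X_h(Q^0), ..., X_h(Q^{k-1})
x = rng.normal(size=d)             # stand-in for X_h(Q^k)
w = rng.normal(size=d)             # stand-in for W_h^q(Q^k, g^k)

# Sigma_{k-1;h} = sum_i X_h(Q^i) X_h(Q^i)^T + sigma * I
Sigma = X_prev.T @ X_prev + sigma * np.eye(d)

# Step (i): |<x, w>| <= ||x||_{Sigma^{-1}} * ||w||_{Sigma}
lhs = abs(x @ w)
rhs = weighted_norm(x, np.linalg.inv(Sigma)) * weighted_norm(w, Sigma)

# Unlabeled equalities: w^T Sigma w = sum_i <w, X_i>^2 + sigma * ||w||^2
quad_direct = float(w @ Sigma @ w)
sum_sq = float(np.sum((X_prev @ w) ** 2))
reg = float(sigma * (w @ w))
quad_expanded = sum_sq + reg

# Step (l): sqrt(a + b) <= sqrt(a) + sqrt(b) for nonnegative a, b
sqrt_joint = np.sqrt(sum_sq + reg)
sqrt_split = np.sqrt(sum_sq) + np.sqrt(reg)

print(lhs <= rhs, np.isclose(quad_direct, quad_expanded), sqrt_joint <= sqrt_split)
```

Step $(i)$ holds because $\langle x,w\rangle=\langle\Sigma^{-1/2}x,\Sigma^{1/2}w\rangle$, so the ordinary Cauchy-Schwarz inequality applies to the two rescaled vectors.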

Now recall the bilinear model from Assumption 7: $\mathbb{E}_{d^{\pi^{f}}_{h}}[(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h+1})_{+}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle\right\rvert$. Following the analysis in (41) for the second part of (40), now using Assumption 7 and Proposition 9, the following holds with probability at least $1-\delta/2$:

\begin{align*}
\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} &\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-\mathcal{T}Q^{k}_{h+1})_{+}]\leq(\Delta_{\mathrm{dual,on}}+B_{X}B_{W})\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}. \tag{42}
\end{align*}
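Both (41) and (42) end by bounding $\sum_{k}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}$. The sketch below checks that bound numerically for a single step $h$. It assumes that Lemma 6 is the standard elliptical potential lemma, that $\|X_{h}(\cdot)\|\leq B_{X}$, and that $\sigma=B_{X}^{2}$; the last choice is suggested by the passage from $\sqrt{\sigma B_{W}^{2}}$ to $B_{X}B_{W}$ in step $(n)$ but is not stated in this section. The random features and all numeric values are illustrative.

```python
import numpy as np

def elliptical_potential_sum(X, sigma):
    """Sum over k of ||x_k||_{Sigma_{k-1}^{-1}}, where Sigma_{-1} = sigma * I
    and Sigma_k = Sigma_{k-1} + x_k x_k^T (rows of X are the x_k)."""
    _, d = X.shape
    Sigma = sigma * np.eye(d)
    total = 0.0
    for x in X:
        # x^T Sigma^{-1} x computed with a linear solve instead of an explicit inverse
        total += float(np.sqrt(x @ np.linalg.solve(Sigma, x)))
        Sigma += np.outer(x, x)
    return total

rng = np.random.default_rng(1)
K, d, B_X = 200, 5, 2.0            # illustrative values (assumptions)
X = rng.normal(size=(K, d))
# Rescale every row so that its Euclidean norm is at most B_X
X *= B_X * rng.uniform(size=(K, 1)) / np.linalg.norm(X, axis=1, keepdims=True)

sigma = B_X ** 2                   # assumed regularizer, see lead-in
potential = elliptical_potential_sum(X, sigma)
bound = float(np.sqrt(2 * d * K * np.log(1 + K / d)))

print(potential, bound)
```

Under these assumptions the per-step sum is at most $\sqrt{2dK\log(1+K/d)}$, and summing over the $H$ steps gives the factor $\sqrt{2dH^{2}\log(1+K/d)K}$ that appears in (41) and (42).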

Combining (41) and (42) with (40), we have

\begin{align*}
\sum_{k=0}^{K-1}\sum_{h=0}^{H-1} &\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[(Q^{k}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_{k}}(s_{0},\pi_{k}(s_{0})))_{+}] \\
&\leq(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on}}+2B_{X}B_{W})\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K},
\end{align*}

with probability at least $1-\delta$, by a union bound over the two events above, each of which holds with probability at least $1-\delta/2$. Finally, we combine this and (38) with (34):

\begin{align*}
0 &\leq\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})(\Delta_{\mathrm{dual,off}}+\Delta_{\mathrm{rQ,off}}) \\
&\qquad+(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on}}+2B_{X}B_{W})\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}.
\end{align*}

Let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$. Using the offline bounds from Propositions 9 and 10 with $c_{1}=2\lambda+H$ from Proposition 3, we have:

\begin{align*}
0 &\leq\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})\cdot \\
&\Bigg(\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)+\frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)\Bigg) \\
&+(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on}}+2B_{X}B_{W})\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}.
\end{align*}
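To see how the two parts of this bound scale, the following sketch evaluates them as plain functions. All function and parameter names are hypothetical, and the on-policy errors $\Delta_{\mathrm{dual,on}}$ and $\Delta_{\mathrm{rQ,on}}$ are passed in as placeholder numbers because their explicit form is substituted in the next step of the proof. The quantity `log_card_GF` stands for $\log(|\mathcal{G}||\mathcal{F}|)$.

```python
import math

def offline_term(C_pi, K, H, m_off, m_on, eps_G, eps_F, lam, log_card_GF, delta):
    """K * H * C(pi^*) times the bracketed offline error in the display above.
    All argument names are hypothetical; log_card_GF = log(|G| |F|)."""
    N = m_off + K * m_on
    # log(2 H K |G| |F| / delta), split to avoid forming |G||F| explicitly
    log_term = math.log(2 * H * K / delta) + log_card_GF
    dual = (3 * eps_G * N + 48 * (2 * lam + H) * log_term) / m_off
    rq = (math.sqrt(3 * eps_F * N)
          + 8 * (1 + 2 * lam + 2 * H) * math.sqrt(log_term)) / math.sqrt(m_off)
    return K * H * C_pi * (dual + rq)

def online_term(delta_dual_on, delta_rq_on, B_X, B_W, d, H, K):
    """(Delta_dual,on + Delta_rQ,on + 2 B_X B_W) * sqrt(2 d H^2 log(1 + K/d) K)."""
    return (delta_dual_on + delta_rq_on + 2 * B_X * B_W) * math.sqrt(
        2 * d * H ** 2 * math.log(1 + K / d) * K)

# Illustrative numbers only (assumptions, not values from the paper)
common = dict(C_pi=1.0, K=10, H=5, m_on=10, eps_G=0.0, eps_F=0.0,
              lam=1.0, log_card_GF=math.log(1e6), delta=0.1)
small_data = offline_term(m_off=100, **common)
large_data = offline_term(m_off=10_000, **common)
print(small_data, large_data)
print(online_term(0.1, 0.1, 1.0, 1.0, d=5, H=5, K=10))
```

With zero approximation errors $\varepsilon_{\mathcal{G}}$ and $\varepsilon_{\mathcal{F},\mathrm{r}}$, the offline term shrinks as $m_{\mathrm{off}}$ grows, with the $1/\sqrt{m_{\mathrm{off}}}$ part dominating for large $m_{\mathrm{off}}$.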

Now using on-policy bounds from Propositions 9 and 10 with $c_{1}=2\lambda+H$ from Proposition 3, we have:

\begin{align*}
0 &\leq \sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})\cdot\\
&\quad\Big(\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)+\frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)\Big)\\
&\quad+\Big(\frac{1}{m_{\mathrm{on}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)\\
&\qquad\quad+\frac{1}{\sqrt{m_{\mathrm{on}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)+2B_{X}B_{W}\Big)\cdot\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}.
\end{align*}

Finally, keeping the higher-order terms and setting $m_{\mathrm{on}}=1$ and $m_{\mathrm{off}}=K$, we get

\begin{align*}
0 &\leq \sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\\
&\leq \sqrt{K}HC(\pi^{*})\big(6(\varepsilon_{\mathcal{G}}+\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}})K^{2}+(8+112\lambda+64H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\big)\\
&\quad+\big(6(\varepsilon_{\mathcal{G}}+\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}})K^{2}+8+112\lambda+64H\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)+2B_{X}B_{W}\big)\cdot\sqrt{2dH^{2}\log\Big(1+\frac{K}{d}\Big)K}\\
&\leq \mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{5/2}H)+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}).
\end{align*}

The proof is now complete. ∎
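As a quick reading of the final bound, dividing both sides by $K$ gives the suboptimality averaged over the $K$ policies. This is plain arithmetic on the last display rather than an additional result:

\begin{align*}
\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]
&\leq \mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{3/2}H)\\
&\quad+\widetilde{\mathcal{O}}\Big(\max\{C(\pi^{*}),1\}\sqrt{\frac{dH^{2}}{K}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}\Big).
\end{align*}

In particular, when $\varepsilon_{\mathcal{F},\mathrm{r}}=0=\varepsilon_{\mathcal{G}}$, the averaged suboptimality decays at rate $1/\sqrt{K}$ up to logarithmic factors.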

E.2 HyTQ Algorithm Specialized Results

In this section we specialize our main result Theorem 2 for different bilinear model classes and also provide an equivalent sample complexity guarantee in the offline robust RL setting.

Before we move ahead, we showcase an important property of our robust transfer coefficient $C(\pi)$ for any fixed policy. Fixing a nominal model $P^{o}$, the transfer coefficient considers the distribution shift w.r.t. the data-generating distribution along the general function class which the algorithm uses. It is in fact no larger than the existing density-ratio-based concentrability coefficient (Assumption 9). We state this result in the following lemma.

Lemma 8.

For any policy $\pi$ and offline distribution $\mu$, we have $C(\pi)\leq\sup_{h,s,a}{d^{\pi}_{h}(s,a)}/{\mu_{h}(s,a)}$.

Proof.

By definition in Assumption 4, we get that

\begin{align*}
C(\pi)&=\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi}_{h}}[(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a))_{+}]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\\
&\leq\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\\
&\stackrel{(a)}{\leq}\max_{f\in\mathcal{F},h\in[H]}\frac{\mathbb{E}_{s,a\sim d^{\pi}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\leq\sup_{h,s,a}\frac{d^{\pi}_{h}(s,a)}{\mu_{h}(s,a)},
\end{align*}

where $(a)$ follows from the mediant inequality, i.e. $\frac{\sum_{h}a_{h}}{\sum_{h}b_{h}}\leq\max_{h}\frac{a_{h}}{b_{h}}$ for $a_{h}\geq 0$ and $b_{h}>0$. ∎
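The chain of inequalities in this proof can be sanity-checked numerically. The following is a minimal sketch for one fixed $f$ (so the maximum over $\mathcal{F}$ is not taken), on a small tabular example where the occupancies, the offline distribution, and the residual values standing in for $\mathcal{T}f_{h+1}-f_{h}$ are all randomly generated; the function names and sizes are hypothetical.

```python
import random

def transfer_chain(d, mu, resid):
    """Return the four quantities of the proof of Lemma 8, left to right.

    d[h][i], mu[h][i]: probabilities of the i-th (s, a) pair at step h.
    resid[h][i]: stand-in for T f_{h+1}(s, a) - f_h(s, a) for one fixed f.
    """
    H = len(d)
    num_pos = sum(sum(p * max(r, 0.0) for p, r in zip(d[h], resid[h])) for h in range(H))
    num_abs = [sum(p * abs(r) for p, r in zip(d[h], resid[h])) for h in range(H)]
    den_abs = [sum(q * abs(r) for q, r in zip(mu[h], resid[h])) for h in range(H)]
    c_pos = num_pos / sum(den_abs)                       # ratio with the (.)_+ numerator
    c_abs = sum(num_abs) / sum(den_abs)                  # ratio with |.| in the numerator
    c_step = max(a / b for a, b in zip(num_abs, den_abs))  # per-step maximum (mediant step)
    c_density = max(p / q for h in range(H) for p, q in zip(d[h], mu[h]))
    return c_pos, c_abs, c_step, c_density

def random_distribution(n, rng):
    # Strictly positive weights so that every density ratio is finite.
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
H, n = 4, 6  # hypothetical horizon and number of (s, a) pairs
d = [random_distribution(n, rng) for _ in range(H)]
mu = [random_distribution(n, rng) for _ in range(H)]
resid = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(H)]

c_pos, c_abs, c_step, c_density = transfer_chain(d, mu, resid)
print(c_pos, c_abs, c_step, c_density)
```

The four printed numbers should be non-decreasing, mirroring the four expressions in the display above.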

Remark 5.

The concentrability assumption (Assumption 9) is in fact the same as the concentrability assumption used in non-robust RL (Munos and Szepesvári, 2008; Chen and Jiang, 2019). We make two important points here. Firstly, our transfer coefficient is larger than the transfer coefficient of (Song et al., 2023, Definition 1), using the fact that $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$. Secondly, our transfer coefficient is not directly comparable with the $\ell_{2}$-norm version of the transfer coefficient (Xie et al., 2021, Definition 1). It is an interesting open question for future research to investigate minimax lower bound guarantees w.r.t. different transfer coefficients for both non-robust and robust RL problems.

We now define a bilinear model called Low Occupancy Complexity (Du et al., 2021, Definition 4.7). The nominal model $P^{o}$ and realizable function class $\mathcal{F}$ have low occupancy complexity w.r.t. a (possibly unknown to the learner) feature map $\psi=(\psi_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{Y})$, where $\mathcal{Y}$ is a Hilbert space, and w.r.t. a (possibly unknown to the learner) map $\nu_{h}:\mathcal{F}\mapsto\mathcal{Y}$, such that for each $h\in[H]$, all $f\in\mathcal{F}$ with greedy policy $\pi^{f}$ w.r.t. $f$, and all $(s,a)$, we have

\begin{equation*}
d^{h,\pi^{f}}_{P^{o}}(s,a)=\langle\psi_{h}(s,a),\nu_{h}(f)\rangle. \tag{43}
\end{equation*}

We make the following assumption on the offline data-generating distribution (or policy by slight notational override for convenience).

Assumption 11.

Consider the Low Occupancy Complexity model (bilinear model) on $\mathcal{Y}=\mathbb{R}^{d}$. Let the offline data distribution $\mu=\{\mu_{h}\}_{h\in[H]}$ satisfy a low-rank structure, i.e. $\mu_{h}(s,a)=\langle\psi_{h}(s,a),\nu_{h}(f^{\mathrm{off}})\rangle=\sum_{i\in[d]}\psi_{h,i}(s,a)\nu_{h,i}(f^{\mathrm{off}})$, for some $f^{\mathrm{off}}\in\mathcal{F}$.

We now specialize our main result, Theorem 2, to the Low Occupancy Complexity bilinear model (43).

Corollary 3 (Cumulative Suboptimality of Theorem 2 in Low Occupancy Complexity (43) bilinear model).

Consider the Low Occupancy Complexity (43) bilinear model. Let Assumptions 4, 5, 6 and 8 hold and fix any $\delta\in(0,1)$. Then, the HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$ satisfy

\begin{align*}
\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq{}&\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{5/2}H)\\
&+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))})\\
&+\widetilde{\mathcal{O}}\Big(\sqrt{dKH^{4}}\max_{f\in\mathcal{F}}\|\nu_{h}(f)\|_{2}\Big\|\sum_{s,a}\psi_{h}(s,a)\Big\|_{2}\sqrt{\log(1+(K/d))}\Big)
\end{align*}

with probability at least $1-\delta$. Now, consider the offline data distribution as in Assumption 11 with perfect robust Bellman completeness, i.e. $\varepsilon_{\mathcal{F},\mathrm{r}}=0=\varepsilon_{\mathcal{G}}$. We have $C(\pi^{*})\leq\sup_{h,i\in[d]}({\nu_{h,i}^{*}}/{\nu_{h,i}(f^{\mathrm{off}})})$.

Proof.

Using the Low Occupancy Complexity (43) bilinear model, we have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h+1})_{+}]=\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle$, where

\begin{equation*}
X_{h}(f)=\nu_{h}(f),\qquad W^{\mathrm{d}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi_{h}(s,a)((\mathcal{T}_{g_{h}}f_{h+1})(s,a)-(\mathcal{T}f_{h+1})(s,a))_{+}.
\end{equation*}

We also have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}]=\left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle$, where

\begin{equation*}
W^{\mathrm{q}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi_{h}(s,a)(f_{h}(s,a)-(\mathcal{T}_{g_{h}}f_{h+1})(s,a))_{+}.
\end{equation*}

Furthermore, we set $B_{X}=\max_{f\in\mathcal{F}}\|\nu_{h}(f)\|_{2}$. Since $\mathcal{F}$ is realizable and $\mathcal{T}_{g}$ is complete, we set $B_{W}=H\|\sum_{s,a}\psi_{h}(s,a)\|_{2}$. Then the result directly follows by Theorem 2.
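The bilinear identities used here amount to exchanging a sum over $(s,a)$ with an inner product. The sketch below checks this on a small random tabular instance for a single step $h$. It is only an illustration: the sizes are hypothetical, the residual is a random stand-in for a $(\cdot)_{+}$ term bounded by $H$, and the features and weights are drawn nonnegative, which is an extra assumption of the sketch that makes the norm comparison for $W$ hold coordinate-wise.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(1)
n, dim, H = 8, 3, 5  # hypothetical: number of (s, a) pairs, feature dimension, horizon

# Nonnegative features psi(s, a) and weights nu(f) (assumption of this sketch).
psi = [[rng.random() for _ in range(dim)] for _ in range(n)]
nu = [rng.random() for _ in range(dim)]
scale = sum(dot(psi[i], nu) for i in range(n))
nu = [x / scale for x in nu]  # normalise so the occupancy sums to one

# Occupancy d(s, a) = <psi(s, a), nu(f)> as in (43).
occupancy = [dot(psi[i], nu) for i in range(n)]
# Stand-in for a positive-part residual taking values in [0, H].
resid_plus = [rng.uniform(0.0, H) for _ in range(n)]

# Left-hand side: expectation of the residual under the occupancy.
expectation = sum(p * r for p, r in zip(occupancy, resid_plus))
# Right-hand side: <X, W> with X = nu(f) and W = sum_{s,a} psi(s, a) * residual(s, a).
W = [sum(psi[i][j] * resid_plus[i] for i in range(n)) for j in range(dim)]
bilinear = dot(nu, W)

# Norm of W against H * || sum_{s,a} psi(s, a) ||_2.
psi_sum = [sum(psi[i][j] for i in range(n)) for j in range(dim)]
norm_W = dot(W, W) ** 0.5
bound_W = H * dot(psi_sum, psi_sum) ** 0.5
print(expectation, bilinear, norm_W, bound_W)
```

The first two printed values agree up to floating-point error, and the third does not exceed the fourth.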

For the second statement, first note that the occupancy $d^{\pi^{*}}_{h}$ is low-rank as well since we assume perfect Bellman completeness. Following the proof of Lemma 8 we get

\begin{align*}
C(\pi^{*})&=\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a))_{+}]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\\
&\leq\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\\
&\stackrel{(a)}{\leq}\max_{f\in\mathcal{F},h\in[H]}\frac{\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\\
&\leq\sup_{h,s,a}\frac{d^{\pi^{*}}_{h}(s,a)}{\mu_{h}(s,a)}\stackrel{(b)}{\leq}\sup_{h,i\in[d]}\frac{\nu_{h,i}^{*}}{\nu_{h,i}(f^{\mathrm{off}})},
\end{align*}

where $(a)$ and $(b)$ follow from the mediant inequality. This completes the proof. ∎
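To illustrate the mediant inequality invoked above: for nonnegative numerators $a_h$ and strictly positive denominators $b_h$, one has $\sum_h a_h/\sum_h b_h\leq\max_h a_h/b_h$. The following is a minimal Python sketch that checks this numerically; the function names and the random numbers are illustrative assumptions and do not come from the paper.

```python
import random


def mediant_ratio(numerators, denominators):
    """Ratio of sums: (sum_h a_h) / (sum_h b_h)."""
    return sum(numerators) / sum(denominators)


def max_ratio(numerators, denominators):
    """Largest term-wise ratio: max_h a_h / b_h."""
    return max(a / b for a, b in zip(numerators, denominators))


# Hypothetical per-step Bellman-error expectations over a horizon of 5 steps.
rng = random.Random(0)
num = [rng.uniform(0.0, 1.0) for _ in range(5)]   # nonnegative numerators
den = [rng.uniform(0.1, 1.0) for _ in range(5)]   # strictly positive denominators

# The ratio of sums never exceeds the largest individual ratio.
assert mediant_ratio(num, den) <= max_ratio(num, den) + 1e-12
```

The inequality holds because $\sum_h a_h=\sum_h b_h\,(a_h/b_h)\leq(\max_h a_h/b_h)\sum_h b_h$, which is why positivity of the denominators matters.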

We now define a bilinear model called the Low-rank Feature Selection Model (Du et al., 2021, Definition A.1). The nominal model $P^o$ is a low-rank feature selection model if it satisfies $P^o_{h,s,a}(s')=\langle\theta_h(s,a),\psi_h(s')\rangle$ for each $h\in[H]$ and all $(s,a,s')$, with a (possibly unknown to the learner) map $\theta=(\theta_h:\mathcal{S}\times\mathcal{A}\to\mathcal{Y})$ and a (possibly unknown to the learner) map $\psi_h:\mathcal{S}\to\mathcal{Y}$, where $\mathcal{Y}$ is a Hilbert space.

This model specializes to the kernel MDP model when the map $\theta$ is known to the learner (Jin et al., 2021a, Definition 30). It also specializes to the low-rank MDP model when $\mathcal{Y}=\mathbb{R}^d$ (Huang et al., 2023, Assumption 1), and furthermore to the linear MDP model when the map $\theta$ is also known to the learner (Du et al., 2021, Definition A.4).
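As a concrete illustration of the bilinear factorization $P^o_{h,s,a}(s')=\langle\theta_h(s,a),\psi_h(s')\rangle$ in the special case $\mathcal{Y}=\mathbb{R}^d$, the sketch below builds a toy kernel with $d=2$. All numbers, and the names `THETA`, `PSI` and `transition`, are hypothetical; the construction (mixture weights times latent emission distributions) is one simple way to obtain a valid kernel, not the only one.

```python
def inner(u, v):
    """Euclidean inner product, standing in for the Hilbert-space pairing."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical toy instance: 3 states, 2 actions, feature dimension d = 2.
# THETA[(s, a)] is a probability vector over two latent factors.
THETA = {
    (0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5],
    (1, 0): [0.25, 0.75], (1, 1): [0.0, 1.0],
    (2, 0): [0.5, 0.5], (2, 1): [0.75, 0.25],
}
# PSI[s_next][i] is the probability of s_next under latent factor i,
# so each coordinate of PSI sums to one over the next states.
PSI = {0: [0.5, 0.0], 1: [0.5, 0.25], 2: [0.0, 0.75]}


def transition(s, a):
    """Row P(. | s, a) obtained from the bilinear form <theta(s, a), psi(s')>."""
    return [inner(THETA[(s, a)], PSI[s_next]) for s_next in sorted(PSI)]


# Every row is a valid probability distribution over next states.
for (s, a) in THETA:
    row = transition(s, a)
    assert all(p >= 0.0 for p in row)
    assert abs(sum(row) - 1.0) < 1e-12
```

In this toy instance both maps are known, so it also falls under the linear MDP special case mentioned above.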

We make the following assumption on the offline data-generating distribution (or policy, with a slight abuse of notation for convenience).

Assumption 12.

Consider the Low-rank MDP Model (bilinear model). Let the offline data distribution $\mu=\{\mu_h\}_{h\in[H]}$ satisfy $\max_{h,s,a}\pi^*_h(a|s)/\mu_h(a|s)\leq\alpha$ and suppose that $\mu$ is induced by the nominal model, i.e., $\mu_0(s)=d_0(s)$ (starting state distribution) and $\mu_h(s)=\mathbb{E}_{s',a'\sim\mu_{h-1}}P^o_{h-1,s',a'}(s)$ for any $h\geq 1$.
Furthermore, suppose that $\mu$ satisfies that the feature covariance matrix $\Sigma_{\mu_{h-1},\theta}=\mathbb{E}_{s,a\sim\mu_{h-1}}[\theta_h(s,a)\theta_h(s,a)^\top]$ is invertible for all $h\in[H]$ and $\mathbb{E}_{s,a\sim\mu_h}[|\mathcal{T}f_{h+1}(s,a)-f_h(s,a)|]\geq 1$ for at least one $h\in[H]$ and all $f\in\mathcal{F}$.
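To make the two coverage conditions in this assumption concrete, the following minimal Python sketch computes the policy density ratio $\alpha$ and checks invertibility of the feature covariance matrix on a toy instance with $d=2$. The data, the helper names, and the use of a $2\times 2$ determinant are illustrative assumptions only.

```python
def density_ratio(pi_star, mu):
    """max over (s, a) of pi*(a|s) / mu(a|s); assumes mu(a|s) > 0 everywhere."""
    return max(pi_star[key] / mu[key] for key in pi_star)


def covariance(features, weights):
    """2x2 matrix E_{(s,a) ~ weights}[theta(s,a) theta(s,a)^T]."""
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for key, w in weights.items():
        t = features[key]
        for i in range(2):
            for j in range(2):
                cov[i][j] += w * t[i] * t[j]
    return cov


def det2(m):
    """Determinant of a 2x2 matrix; nonzero means the matrix is invertible."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical toy data: 2 states, 2 actions, keys are (s, a).
PI_STAR = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}   # deterministic
MU_COND = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}   # uniform behavior
MU_JOINT = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
THETA = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
         (1, 0): [0.5, 0.5], (1, 1): [1.0, 0.0]}

alpha = density_ratio(PI_STAR, MU_COND)
sigma = covariance(THETA, MU_JOINT)
assert det2(sigma) != 0.0   # covariance is invertible for this toy data
```

Here a uniform behavior policy over two actions yields $\alpha=2$, and the features span $\mathbb{R}^2$ under $\mu$, so the covariance is invertible.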

We now specialize our main result, Theorem 2, to the Low-rank Feature Selection Model (bilinear model).

Corollary 4 (Cumulative Suboptimality of Theorem 2 in Low-rank Feature Selection Model (bilinear model)).

Consider the Low-rank Feature Selection Model (bilinear model). Let Assumptions 4, 5, 6 and 8 hold and fix any $\delta\in(0,1)$. Then the policies $\{\pi_k\}_{k\in[K]}$ of the HyTQ algorithm satisfy

$$\begin{aligned}
\sum_{k=0}^{K-1}(V^{\pi^*}-V^{\pi_k})\leq{}&\mathcal{O}\big((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{5/2}H\big)\\
&+\widetilde{\mathcal{O}}\big(\max\{C(\pi^*),1\}\sqrt{dKH^2}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}\big)\\
&+\widetilde{\mathcal{O}}\big(\sqrt{dKH^4}\,\|\textstyle\sum_{s,a}\theta_h(s,a)\|_2\,\|\sum_{s}\psi_h(s)\|_2\sqrt{\log(1+(K/d))}\big)
\end{aligned}$$

with probability at least $1-\delta$. Now, consider the offline data distribution as in Assumption 12 with a low-rank MDP model. We have

$$C(\pi^*)\leq\sqrt{2\alpha H}\sum_{h=1}^{H}\mathbb{E}_{s,a\sim d^{h-1,\pi^*}_{P^o}}\left\|\theta_h(s,a)\right\|_{\Sigma_{\mu_{h-1},\theta}^{-1}}+\sqrt{\alpha}.$$
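The right-hand side of this bound can be evaluated directly once the per-step expected elliptical norms $\mathbb{E}\|\theta_h(s,a)\|_{\Sigma^{-1}}$ are available, where $\|x\|_{A}=\sqrt{x^\top A x}$. The sketch below is a hedged illustration for $d=2$; the helper names `inv2`, `elliptical_norm` and `transfer_bound` are hypothetical, and the inputs are made-up numbers rather than quantities from the paper.

```python
import math


def inv2(m):
    """Inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def elliptical_norm(theta, sigma):
    """||theta||_{Sigma^{-1}} = sqrt(theta^T Sigma^{-1} theta)."""
    s_inv = inv2(sigma)
    quad = sum(theta[i] * s_inv[i][j] * theta[j]
               for i in range(2) for j in range(2))
    return math.sqrt(quad)


def transfer_bound(alpha, expected_norms):
    """sqrt(2 * alpha * H) * sum_h E||theta_h||_{Sigma^{-1}} + sqrt(alpha).

    expected_norms[h] is assumed to already hold the expectation under the
    optimal policy's visitation measure at step h; H = len(expected_norms).
    """
    horizon = len(expected_norms)
    return math.sqrt(2 * alpha * horizon) * sum(expected_norms) + math.sqrt(alpha)


# Hypothetical example: isotropic covariance 0.25 * I, unit feature vector.
norm = elliptical_norm([1.0, 0.0], [[0.25, 0.0], [0.0, 0.25]])
bound = transfer_bound(alpha=1.0, expected_norms=[1.0, 1.0])
```

A poorly covered feature direction (small eigenvalue of $\Sigma_{\mu_{h-1},\theta}$) inflates the elliptical norm and hence the bound, which matches the role of $C(\pi^*)$ as a coverage measure.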
Proof.

We begin by establishing a Q-value-dependent linearity property for the state-action visitation measure $d^{h,\pi^f}_{P^o}(s,a)$. To do this, we adapt the proof of Huang et al. (2023, Lemma 17). We start by writing the state-visitation measure, recalling Eq. 35:

$$\begin{aligned}
d^{h,\pi^f}_{P^o}(s_h)&=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}P^o_{h,s,a}(s_h)\,\pi^f_{h-1}(a|s)\,d^{h-1,\pi^f}_{P^o}(s)\\
&\stackrel{(a)}{=}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\langle\theta_h(s,a),\psi_h(s_h)\rangle\,\pi^f_{h-1}(a|s)\,d^{h-1,\pi^f}_{P^o}(s)\\
&=\Big\langle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\theta_h(s,a)\,\pi^f_{h-1}(a|s)\,d^{h-1,\pi^f}_{P^o}(s),\psi_h(s_h)\Big\rangle=\langle\psi_h(s_h),\nu_{h,\pi^f}(f)\rangle,
\end{aligned}$$

where $(a)$ follows by the low-rank feature selection model definition, and the last equality follows by taking a functional $\nu_{h,\pi^f}(f)=\sum_{s,a}\theta_h(s,a)\pi^f_{h-1}(a|s)d^{h-1,\pi^f}_{P^o}(s)$. Since we consider the finite action space with possibly large state space setting for our results, the state-action visitation measure for the deterministic non-stationary policy $\pi^f$ is now given by $d^{h,\pi^f}_{P^o}(s_h,a_h)=\langle\psi'_{h,\pi^f}(s_h,a_h),\nu_{h,\pi^f}(f)\rangle$ with $\psi'_{h,\pi^f}(s_h,a_h)=C\psi_h(s_h)1\{a_h=\pi^f_h(s)\}$ for features $\psi'_{h,\pi^f}:\mathcal{S}\times\mathcal{A}\to\mathcal{Y}$. Here $C>0$ is a normalizing constant such that the state-action visitation measure is a probability measure.
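The linearity property above can be checked numerically on a small tabular instance: propagating the previous visitation measure through the kernel gives the same result as pairing $\psi_h(s_h)$ with the vector $\nu_{h,\pi^f}(f)$. The following Python sketch uses a hypothetical toy model with $d=2$, three states and two actions; the names `THETA`, `PSI`, `POLICY` and `D_PREV` and all numbers are assumptions for illustration.

```python
def inner(u, v):
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical low-rank model: P(s' | s, a) = <THETA[(s, a)], PSI[s']>.
THETA = {
    (0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5],
    (1, 0): [0.25, 0.75], (1, 1): [0.0, 1.0],
    (2, 0): [0.5, 0.5], (2, 1): [0.75, 0.25],
}
PSI = {0: [0.5, 0.0], 1: [0.5, 0.25], 2: [0.0, 0.75]}
POLICY = {0: 1, 1: 0, 2: 1}       # deterministic policy at step h-1: state -> action
D_PREV = [0.5, 0.25, 0.25]        # state-visitation measure at step h-1


def next_visitation_direct():
    """d^h(s') = sum_s P(s' | s, pi(s)) d^{h-1}(s), computed from the kernel."""
    return [sum(inner(THETA[(s, a)], PSI[s_next]) * D_PREV[s]
                for s, a in POLICY.items())
            for s_next in sorted(PSI)]


def nu():
    """nu = sum_s theta(s, pi(s)) d^{h-1}(s), a single vector in R^d."""
    vec = [0.0, 0.0]
    for s, a in POLICY.items():
        for i in range(2):
            vec[i] += THETA[(s, a)][i] * D_PREV[s]
    return vec


def next_visitation_bilinear():
    """d^h(s') = <psi(s'), nu>, the bilinear form used in the proof."""
    v = nu()
    return [inner(PSI[s_next], v) for s_next in sorted(PSI)]


# Both routes agree, and the result is a probability distribution.
direct, bilinear = next_visitation_direct(), next_visitation_bilinear()
assert all(abs(x - y) < 1e-12 for x, y in zip(direct, bilinear))
assert abs(sum(bilinear) - 1.0) < 1e-12
```

For a deterministic policy the sum over $(s,a)$ weighted by $\pi^f_{h-1}(a|s)$ collapses to a sum over states with $a=\pi^f_{h-1}(s)$, which is what the loops over `POLICY.items()` implement.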

We now have $\mathbb{E}_{d^{h,\pi^f}_{P^o}}[(\mathcal{T}_{g_h}f_{h+1}-\mathcal{T}f_{h+1})_+]=\left\langle X_h(f),W^{\mathrm{d}}_h(f,g)\right\rangle$, where

$$X_h(f)=\nu_{h,\pi^f}(f),\qquad W^{\mathrm{d}}_h(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi'_{h,\pi^f}(s,a)\big((\mathcal{T}_{g_h}f_{h+1})(s,a)-(\mathcal{T}f_{h+1})(s,a)\big)_+.$$

We also have $\mathbb{E}_{d^{h,\pi^f}_{P^o}}[(f_h-\mathcal{T}_{g_h}f_{h+1})_+]=\left\langle X_h(f),W^{\mathrm{q}}_h(f,g)\right\rangle$, where

$$W^{\mathrm{q}}_h(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi'_{h,\pi^f}(s,a)\big(f_h(s,a)-(\mathcal{T}_{g_h}f_{h+1})(s,a)\big)_+.$$

Furthermore, we set

$$\max_{f\in\mathcal{F}}\|\nu_h(f)\|_2=\max_{f\in\mathcal{F}}\Big\|\sum_{s,a}\theta_h(s,a)\pi^f(a|s)d^{h-1,\pi^f}_{P^o}(s)\Big\|_2\leq\Big\|\sum_{s,a}\theta_h(s,a)\Big\|_2=B_X.$$

Since $\mathcal{F}$ is realizable and $\mathcal{T}_g$ is complete for all $g\in\mathcal{G}$, we set

$$H\Big\|\sum_{s,a}\psi'_{h,\pi^f}(s,a)\Big\|_2=HC\Big\|\sum_{s,a}\psi_h(s)1\{a=\pi^f_h(s)\}\Big\|_2\leq HC\Big\|\sum_{s}\psi_h(s)\Big\|_2=B_W.$$

Then the first result directly follows by Theorem 2. Following the proof of Song et al. (2023, Lemma 13) for our transfer coefficient $C(\pi^*)$, with the facts $(x-y)^2\leq|x-y||x+y|$ for $x,y\geq 0$ and $\|f_h\|_\infty\leq H$ for all $h\in[H]$, the last statement for $C(\pi^*)$ follows. This completes the proof. ∎
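For completeness, the elementary fact $(x-y)^2\leq|x-y||x+y|$ used in the proof can be verified in one line. The following is a short sketch of the standard argument and is not taken from the cited proof:

```latex
% For x, y >= 0 we have |x - y| <= max{x, y} <= x + y = |x + y|, hence
\[
  (x-y)^2 \;=\; |x-y|\cdot|x-y| \;\le\; |x-y|\,|x+y| .
\]
```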

We next use our main result, Theorem 2, to derive a sample complexity bound for comparison with the offline+online RL setting.

Corollary 5 (Offline+Online RL Sample Complexity of the HyTQ algorithm).

Let Assumptions 4, 5, 6, 7 and 8 hold. Fix any $\delta\in(0,1)$ and any $\varepsilon>0$, and let $N_{\mathrm{tot}}$ be the total number of sample tuples used in the HyTQ algorithm. Then the uniform policy $\widehat{\pi}$ (uniform convex combination) of the HyTQ algorithm policies $\{\pi_k\}_{k\in[K]}$ satisfies, with probability at least $1-\delta$,

$$V^{\pi^*}-V^{\widehat{\pi}}\leq\varepsilon,\quad\text{ if }N\geq N_{\mathrm{tot}}=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^*))^2,1\}\,dH^3(\lambda+H)^2}{\varepsilon^2}\log^2(H|\mathcal{F}||\mathcal{G}|/\delta)\right).$$
Proof.

The proof follows from Theorem 2 using a standard online-to-batch conversion (Shalev-Shwartz and Ben-David, 2014, Theorem 14.8 & Chapter 21). Define the policy $\widehat{\pi}=\text{Uniform}\{\pi_0,\dots,\pi_{K-1}\}$. From Theorem 2, we get

$$\begin{aligned}
0&\leq\mathbb{E}_{s_0\sim d_0}[V^{\pi^*}_0(s_0)-V^{\widehat{\pi}}_0(s_0)]=\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}_{s_0\sim d_0}[V^{\pi^*}_0(s_0)-V^{\pi_k}_0(s_0)]\\
&\leq\mathcal{O}\big((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{3/2}H\big)+\widetilde{\mathcal{O}}\big(\max\{C(\pi^*),1\}\sqrt{dH^2/K}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}\big).
\end{aligned}$$
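The equality in the first line of this display is the online-to-batch step: the uniform mixture runs each $\pi_k$ with probability $1/K$, so its suboptimality is the cumulative suboptimality divided by $K$. The sketch below illustrates this averaging with made-up value numbers; the function name `mixture_suboptimality` is a hypothetical helper, not part of the HyTQ algorithm.

```python
def mixture_suboptimality(v_star, policy_values):
    """Suboptimality of the uniform mixture over K policies.

    The mixture's value is the average of the individual values, so its gap
    to v_star equals (1/K) * sum_k (v_star - V^{pi_k}).
    """
    num_policies = len(policy_values)
    cumulative_gap = sum(v_star - v for v in policy_values)
    return cumulative_gap / num_policies


# Hypothetical values of three iterates against an optimal value of 1.0.
values = [0.5, 0.75, 1.0]
gap = mixture_suboptimality(1.0, values)

# Equivalent view: gap to the averaged value.
assert abs(gap - (1.0 - sum(values) / len(values))) < 1e-12
```

This is why a cumulative bound growing like $\sqrt{K}$ translates into a per-policy bound decaying like $1/\sqrt{K}$.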

Recall that our algorithm uses $m_{\mathrm{off}}H$ offline samples and $m_{\mathrm{on}}HK$ on-policy samples in the datasets $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{K-1}_{h}\}$ for all $h\in[H]$. Since we set $m_{\mathrm{on}}=1$ and $m_{\mathrm{off}}=K$, the total number of offline and on-policy samples is $2HK$.
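The sample count above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `hybrid_sample_counts` and the example values of `H` and `K` are our own assumptions, not part of the algorithm's specification.

```python
from typing import Dict, Optional


def hybrid_sample_counts(H: int, K: int,
                         m_off: Optional[int] = None,
                         m_on: int = 1) -> Dict[str, int]:
    """Count offline and on-policy samples over horizon H and K iterations.

    Hypothetical helper: defaults follow the proof's choice m_on = 1, m_off = K.
    """
    if m_off is None:
        m_off = K  # the proof sets m_off = K
    offline = m_off * H       # m_off samples per step h, collected once
    online = m_on * H * K     # m_on samples per step h in each of K iterations
    return {"offline": offline, "online": online, "total": offline + online}


# Example values (assumed for illustration only).
counts = hybrid_sample_counts(H=10, K=500)
print(counts)
```

With the default choices, the offline and on-policy budgets are equal, so the total is $2HK$.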

Fix any $\varepsilon>0$. For approximations $\varepsilon_{\mathcal{F},\mathrm{r}},\varepsilon_{\mathcal{G}}$, we first assume there exists $K_{1}=\widetilde{\mathcal{O}}(H^{4})$ such that $\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{3/2}H)\leq\varepsilon/2$ for all $K\geq K_{1}$. Let

$$
K_{2}=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^{*}))^{2},1\}\,dH^{2}(\lambda+H)^{2}}{\varepsilon^{2}}\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)\right).
$$
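As a rough sketch of where this choice of $K_{2}$ comes from, one can require the second term of the bound to be at most $\varepsilon/2$ and solve for $K$. The shorthand $L$ below is our own notation, not the paper's. We ignore the constants hidden in $\widetilde{\mathcal{O}}$, treat the factors $\log(HK\cdot)$ and $\sqrt{\log(1+K/d)}$ as polylogarithmic in $K$ (hence absorbed by $\widetilde{\mathcal{O}}$), and assume $C(\pi^{*})\geq 0$ so that $(\max\{C(\pi^{*}),1\})^{2}=\max\{(C(\pi^{*}))^{2},1\}$.

```latex
% Shorthand (ours): L := \log(H|\mathcal{F}||\mathcal{G}|/\delta)
\begin{align*}
\max\{C(\pi^{*}),1\}\sqrt{\frac{dH^{2}}{K}}\,(\lambda+H)\,L \leq \frac{\varepsilon}{2}
&\iff \sqrt{K} \geq \frac{2\max\{C(\pi^{*}),1\}\sqrt{d}\,H(\lambda+H)\,L}{\varepsilon} \\
&\iff K \geq \frac{4\max\{(C(\pi^{*}))^{2},1\}\,dH^{2}(\lambda+H)^{2}\,L^{2}}{\varepsilon^{2}}.
\end{align*}
```

Up to the constant factor $4$ and the logarithmic terms in $K$, the right-hand side matches the stated order of $K_{2}$.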

Then, for $K\geq K_{1}+K_{2}$, we have $\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\widehat{\pi}}_{0}(s_{0})]\leq\varepsilon$ with probability at least $1-\delta$. So, the total number of samples is at least $N_{\mathrm{tot}}$:

$$
N_{\mathrm{tot}}=2H(K_{1}+K_{2})=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^{*}))^{2},1\}\,dH^{3}(\lambda+H)^{2}}{\varepsilon^{2}}\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)\right).
$$

This completes the proof. ∎