Model-Free Robust $\varphi$-Divergence Reinforcement Learning
Using Both Offline and Online Data

Kishan Panaganti, Adam Wierman, Eric Mazumdar
Computing + Mathematical Sciences Department, California Institute of Technology
Emails: {kpb, adamw, mazumdar}@caltech.edu

Abstract

The robust $\varphi$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust $\varphi$-regularized fitted Q-iteration (RPQ) for learning an $\varepsilon$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $\varphi$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $\varphi$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $\varphi$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrarily large state spaces.

To appear in the proceedings of the International Conference on Machine Learning (ICML) 2024.

Keywords: Robust reinforcement learning, model uncertainty, general function approximation

1 Introduction

Online Reinforcement Learning (RL) agents learn through online interactions and exploration in environments and have been shown to perform well in structured domains such as Chess and Go (Silver et al., 2018), fast chip placements in semiconductors (Mirhoseini et al., 2021), fast transform computations in mathematics (Fawzi et al., 2022), and more. However, online RL agents (Botvinick et al., 2019) are known to suffer from sample inefficiency due to complex exploration strategies in sophisticated environments. To overcome this, learning from available historical data has been studied using offline RL protocols (Levine et al., 2020). However, offline RL agents suffer from the out-of-data-distribution issue (Yang et al., 2021; Robey et al., 2020) due to the lack of online exploration. Recent work (Song et al., 2023) proposes another learning setting called hybrid RL that makes the best of both offline and online RL worlds. In particular, hybrid RL agents have access to both offline data (to reduce exploration overhead) and online interaction with the environment (to mitigate the out-of-data-distribution issue).

All three of these approaches (online, offline, and hybrid RL) require training environments (simulators) that closely represent real-world environments. However, time-varying real-world environments (Maraun, 2016), sensor degradations (Chen et al., 1996), and other adversarial disturbances in practice (Pioch et al., 2009) mean that even high-fidelity simulators are not enough (Schmidt et al., 2015; Shah et al., 2018). RL agents are known to fail due to these mismatches between training and testing environments (Sünderhauf et al., 2018; Lesort et al., 2020). As a result, robust RL (Mankowitz et al., 2020; Panaganti and Kalathil, 2021a) has received increasing attention due to its potential to alleviate the issue of mismatches between the simulator and real-world environments.

Robust RL agents are built using the robust Markov Decision Process (RMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005) framework. In this framework, the goal is to find an optimal policy that is robust, i.e., performs uniformly well across a set of models (transition probability functions). This is formulated via a max-min problem, and the set of models is typically constructed around a simulator model (transition probability function) with some notion of divergence or distance function. We refer to the simulator model as any nominal model that is provided to RL agents.

The RMDP framework in RL is identical to the Distributionally Robust Optimization (DRO) framework in supervised learning (Duchi and Namkoong, 2018; Chen et al., 2020). Similar to RMDP, DRO is a min-max problem aiming to minimize a loss function uniformly over the set of distributions constructed around the training distribution of the input space. However, developing model-free algorithms for DRO problems with general $\varphi$-divergences (see Eq. 1) is known to be hard (Namkoong and Duchi, 2016) due to their inherent non-linear and multi-level optimization structure. Additionally, developing model-free robust RL agents is also challenging (Iyengar, 2005; Duchi and Namkoong, 2018) for high-dimensional sequential decision-making systems under general function approximation.

To overcome this issue, in this work, we develop robust RL agents for the RRMDP framework, which is an equivalent alternative form of RMDP. A natural $\varphi$-divergence regularization extension to the problem of RMDP gives rise to this new RRMDP framework, introduced under different names in Yang et al. (2023) and Zhang et al. (2023). It is built upon the penalized DRO problem (Levy et al., 2020; Jin et al., 2021b), that is, the $\varphi$-divergence regularization version of the DRO problem. In particular, we focus on developing an offline robust RL algorithm for a class of $\varphi$-divergences under the RRMDP framework with arbitrarily large state spaces, using only offline data with general function approximation. Towards this, as the first main contribution, we propose the Robust $\varphi$-regularized fitted Q-iteration model-free algorithm and provide its performance guarantee for a class of $\varphi$-divergences with a unified analysis. We refer to algorithms as model-free if they do not explicitly estimate the underlying nominal model. We address the following important (suboptimality and sample complexity) questions: What is the rate of suboptimality gap achieved between the optimal robust value and the value of the RPQ policy? How many offline data samples from the nominal model are required to learn an $\varepsilon$-optimal robust policy? We discuss challenges and present these results in Section 2.

Algorithm	Algorithm-type	Data Coverage	Dataset Type	Robust	Suboptimality
(Panaganti et al., 2022, Alg. 1)	FQI	all-policy	offline	TV	$\frac{V_{\max}^{3}\sqrt{\log(|\mathcal{F}||\mathcal{G}|)}}{\rho N^{1/2}}$
(Zhang et al., 2023, Alg. 1)	FQI	all-policy^∗	offline	KL	$\frac{\lambda V_{\max}^{2}\sqrt{\log(|\mathcal{F}|)}}{e^{-V_{\max}/\lambda}N^{1/2}}$
(Yang et al., 2023, Alg. 2)	QL	uniform-policy	offline Markov	$\varphi$	$\frac{V_{\max}^{3}\sqrt{\log(|\mathcal{S}||\mathcal{A}|)}}{d_{\min}^{3}c(\lambda)N^{1/3}}$
RPQ (ours: Algorithm 1)	FQI	all-policy^∗	offline	$\varphi$	$\frac{V_{\max}^{3}\sqrt{\log(|\mathcal{F}||\mathcal{G}|)}}{c(\lambda)N^{1/2}}$
HyTQ (ours: Algorithm 2)^†	FQI	single-policy	offline + online non-Markov	TV	$\frac{V_{\max}(\lambda+V_{\max})\log(|\mathcal{F}||\mathcal{G}|)}{N^{1/2}}$

Table 1: Comparison of model-free $\varphi$-divergence robust RL algorithms. In the algorithm-type column, Fitted Q-Iteration (FQI) uses least-squares regression and Q-Learning (QL) uses stochastic approximation updates. In the data coverage column, uniform-policy stipulates a data-generating policy to cover the entire state-action space. all-policy is where the data-generating policy should cover the state-action space covered by all non-stationary policies, and single-policy is where it covers the state-action space covered by the optimal robust policy, on the nominal model. ^∗ denotes the coverage should include all the models in robust sets designed by the divergences in the robust column. The dataset type column mentions the type of dataset collected with a data-generating policy for training corresponding algorithms where offline indicates i.i.d. historical dataset on the nominal model, offline Markov indicates Markovian dataset induced on the nominal model, and online non-Markov indicates a history dependent dataset as a collection of Markovian datasets induced on the nominal model by a set of learned policies. Finally, the suboptimality column is the statistical upper bound for the difference between the optimal robust value and the robust value achieved by the algorithm. Here $V_{\max}$ is either $H$ or $(1-\gamma)^{-1}$, the effective horizon factors. $\rho$ is the robustness radius parameter in RMDPs and $\lambda$ is the robustness penalization parameter in RRMDPs, which are inversely related (Yang et al., 2023, Theorem 3.1). $c(\lambda)$ is some function of $\lambda$ that varies according to different $\varphi$-divergences. $N$ is the dataset size used by algorithms. ^† The bound of HyTQ is not directly comparable with others in terms of $V_{\max}$ since the non-stationary finite-horizon setting requires $H$ multiplicity in dataset size. $d_{\min}$ is the minimal positive value of the data generating stationary distribution $d$, i.e. $\min_{s,a}d(s,a)$. $\mathcal{F}$ and $\mathcal{G}$ are two function representations, and $(\mathcal{S},\mathcal{A})$ is the state-action space.

In this work, we also develop and study a novel hybrid robust RL algorithm under the RRMDP framework using both offline data and online interactions with the nominal model. This is our second main contribution, motivated by the fact that hybrid RL overcomes the out-of-data-distribution issue in offline RL. Towards this, we propose the Hybrid robust Total-variation-regularized Q-iteration algorithm and provide its performance guarantee under improved assumptions. Notably, the offline data-generating distribution must only cover the distribution induced by the optimal robust policy on the nominal model, whereas in the purely offline setting we needed it to cover any such distribution uniformly. This is how online interactions help mitigate the out-of-data-distribution issue of offline RL and offline robust RL. We now address the cumulative suboptimality question in addition to sample complexity: What is the rate of cumulative suboptimality gap achieved between the optimal robust value and the value of HyTQ iteration policies? We discuss challenges and present these results in Section 3.

Related Work. Among all the previous works that provide model-free methods, here we only mention the ones closest to ours. We discuss more related works in Appendix A. Panaganti et al. (2022) proposed a Q-iteration offline robust RL algorithm in the RMDP framework only for the total variation $\varphi$-divergence. Bruns-Smith and Zhou (2023) proposed a Q-iteration offline robust RL algorithm in the RMDP framework to solve causal inference under unobserved confounders. Zhou et al. (2023) proposed an actor-critic robust RL algorithm in RMDP for the integral probability metric. Zhang et al. (2023) proposed a Q-iteration offline robust RL algorithm in the RRMDP framework only for the Kullback-Leibler $\varphi$-divergence. Blanchet et al. (2023) proposed specialized robust RL algorithms for the total variation and Kullback-Leibler $\varphi$-divergences offering unified analyses for linear, kernel, and factored function approximation models under the finite state-action setting. Another line of work (Liu et al., 2022; Liang et al., 2023; Wang et al., 2023a; Wang et al., 2023b; Yang et al., 2023) provides model-free robust RL algorithms based on classical Q-learning methods in finite state-action spaces. We provide more insightful comparisons in Table 1. To the best of our knowledge, this is the first work that addresses a wide class of robust RL problems (like the general $\varphi$-divergence) with arbitrarily large state spaces using general function approximation under mild assumptions (like the robust Bellman error transfer coefficient).

Notation. We use the equality sign (=) for pointwise equality in vectors and matrices. For any $x\in\mathbb{R}$, let $(x)_{+}=\max\{x,0\}$. For any vector $x$ and positive semidefinite matrix $A$, the squared matrix norm is $\|x\|_{A}^{2}=x^{\top}Ax$. The set of probability distributions over $\mathcal{X}$, with cardinality $|\mathcal{X}|$, is denoted as $\Delta(\mathcal{X})$, and its power set sigma algebra as $\Sigma(\mathcal{X})$. For any function $f$ that takes $(s,a,r,s^{\prime})$ as input, define the expectation w.r.t. the dataset $\mathcal{D}$ (or empirical expectation) as $\mathbb{E}_{\mathcal{D}}[f(s_{i},a_{i},r_{i},s^{\prime}_{i})]=\frac{1}{N}\sum_{(s_{i},a_{i},r_{i},s^{\prime}_{i})\in\mathcal{D}}f(s_{i},a_{i},r_{i},s^{\prime}_{i})$. For any positive integer $H$, the set $[H]$ denotes $\{0,1,\cdots,H-1\}$. Define the $\ell_{2}$ and $\ell_{1}$ norms as $\left\|x\right\|_{2,\mu}=\sqrt{\mathbb{E}_{\mu}[x^{2}]}$ and $\left\|x\right\|_{1,\mu}=\mathbb{E}_{\mu}[|x|]$. $p\ll q$ denotes that a probability distribution $p$ is absolutely continuous w.r.t. a probability distribution $q$. We use $\mathcal{O}(\cdot)$ to ignore universal constants less than $300$ and $\widetilde{\mathcal{O}}(\cdot)$ to ignore universal constants less than $300$ and the polylog terms depending on problem parameters.

2 Offline Robust $\varphi$-Regularized Reinforcement Learning

We start with preliminaries and the problem formulation.

Infinite-Horizon Markov Decision Process: An infinite-horizon discounted Markov Decision Process ($\gamma$MDP) is a tuple $(\mathcal{S},\mathcal{A},R,P,\gamma,d_{0})$ where $\mathcal{S}$ is a countably large state space, $\mathcal{A}$ is a finite set of actions, $R:\mathcal{S}\times\mathcal{A}\to[0,1]$ is a known stochastic reward function, $P\in\Delta(\mathcal{S})^{|\mathcal{S}||\mathcal{A}|}$ is a probability transition function describing an environment, $\gamma$ is a discount factor, and $d_{0}$ is the starting state distribution. A stationary (stochastic) policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ specifies a distribution over actions in each state. We denote the transition dynamic distribution at state-action $(s,a)$ as $P_{s,a}\in\Delta(\mathcal{S})$. For convenience, we write $r(s,a)=\mathbb{E}_{r\sim R(s,a)}[r]$ and assume it is deterministic, as in the RL literature (Agarwal et al., 2019), since the performance guarantee will be identical up to a constant factor.

The value function of a policy $\pi$ is $V^{\pi}_{P,r}(s)=\mathbb{E}_{P,\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\;|\;s_{0}=s]$ starting at state $s_{0}=s$ and $a_{t}\sim\pi(s_{t}),s_{t+1}\sim P_{s_{t},a_{t}}$ for all $t\geq 0$. Similarly, we define an action-value function of a policy $\pi$ as $Q^{\pi}_{P,r}(s,a)=\mathbb{E}_{P,\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\;|\;s_{0}=s,a_{0}=a]$. Each policy $\pi$ induces a discounted occupancy density over state-action pairs $d^{\pi}_{P}:\mathcal{S}\times\mathcal{A}\to[0,1]$ defined as $d^{\pi}_{P}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{t}(s_{t}=s,a_{t}=a;\pi)$, where $P_{t}(s_{t}=s,a_{t}=a;\pi)$ denotes the visitation probability of state-action pair $(s,a)$ at time step $t$, starting at $s_{0}\sim d_{0}(\cdot)$ and following $\pi$ on the model $P$. The optimal policy $\pi^{*}_{P}$ achieves the maximum value $V^{\pi}_{P,r}$ over all policies $\pi$.

Offline Reinforcement Learning: The goal of offline RL on $\gamma$MDP $(P^{o},r)$ is to learn a good policy $\hat{\pi}$ (a policy with a high $V^{\hat{\pi}}_{P^{o},r}$) based only on the offline dataset. An offline dataset is a historical and fixed dataset of interactions $\mathcal{D}_{P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})\}_{i=1}^{N}$, where $s_{i}^{\prime}\sim P^{o}_{s_{i},a_{i}}$ and the $(s_{i},a_{i})$ pairs are independently and identically generated according to a data distribution $\mu\in\Delta(\mathcal{S}\times\mathcal{A})$. For convenience, $\mu$ also denotes the offline/behavior policy that generates $\mathcal{D}_{P^{o}}$. One classical offline RL algorithm with general function approximation capabilities and provable performance guarantees is Fitted Q-Iteration (FQI) (Szepesvári and Munos, 2005; Chen and Jiang, 2019; Liu et al., 2020). A function class $\mathcal{F}=\{f:\mathcal{S}\times\mathcal{A}\to[0,1/(1-\gamma)]\}$ (e.g., neural networks, kernel functions, linear functions, etc.) represents $Q$-value functions of $\gamma$MDP $(P^{o},r)$. At each iteration, given $f_{k}\in\mathcal{F}$ and $\mathcal{D}_{P^{o}}$, FQI does the following least-squares regression for the approximate squared Bellman error: $f_{k+1}=\operatorname*{arg\,min}_{f\in\mathcal{F}}\mathbb{E}_{\mathcal{D}_{P^{o}}}[(y_{f_{k}}-f)^{2}]$, where $y_{f_{k}}(s,a,s^{\prime})=r(s,a)+\gamma\max_{b}f_{k}(s^{\prime},b)$. In this regression step, FQI aims to find the optimal action-value $Q^{\pi^{*}}_{P^{o},r}$ by approximating the non-robust squared Bellman error ($\|r+\gamma\mathbb{E}_{P^{o}}V^{\pi^{*}}_{P^{o},r}(\cdot)-Q^{\pi^{*}}_{P^{o},r}\|_{2,\mu}^{2}$) using offline data $\mathcal{D}_{P^{o}}$ with function approximation $\mathcal{F}$. Finally, for some starting state $s_{0}\sim d_{0}$, the performance guarantee of an algorithm policy $\hat{\pi}$ is given by bounding the suboptimality quantity $0\leq V^{\pi^{*}}_{P^{o},r}(s_{0})-V^{\hat{\pi}}_{P^{o},r}(s_{0})$.
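The FQI regression step described above can be sketched in a few lines. This is a minimal illustration, assuming a linear function class $f(s,a)=\phi(s,a)^{\top}w$ solved by ordinary least squares; the names `fitted_q_iteration` and `features` are hypothetical and not part of the paper, and the value clipping to $[0,1/(1-\gamma)]$ is omitted for brevity.

```python
import numpy as np

def fitted_q_iteration(features, data, n_actions, gamma, K=50):
    """Non-robust FQI with a linear class f(s, a) = features(s, a) @ w.

    features: hypothetical map (s_array, a_array) -> (N, d) feature matrix.
    data: tuple (s, a, r, s_next) of equal-length NumPy arrays.
    """
    s, a, r, s_next = data
    Phi = features(s, a)                          # design matrix, shape (N, d)
    w = np.zeros(Phi.shape[1])                    # f_0 = 0
    for _ in range(K):
        # Evaluate f_k(s', b) for every action b, then take the max over b.
        q_next = np.stack(
            [features(s_next, np.full_like(a, b)) @ w for b in range(n_actions)],
            axis=1,
        )
        y = r + gamma * q_next.max(axis=1)        # regression targets y_{f_k}
        # Least-squares regression of the targets onto the function class.
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```

With one-hot features this reduces to tabular value iteration on the empirical transitions, which makes it convenient for sanity checks on small deterministic problems.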

Infinite-Horizon Robust $\varphi$-Regularized Markov Decision Process: Let $P^{o}$ be the nominal model, that is, a probability transition function describing a training environment. An infinite-horizon discounted Robust $\varphi$-Regularized Markov Decision Process ($\gamma$RRMDP) is a tuple $(\mathcal{S},\mathcal{A},r,P^{o},\lambda,\gamma,\varphi,d_{0})$ where $\lambda>0$ is a robustness parameter and $\varphi:\mathbb{R}\to\mathbb{R}$ is a convex function. The robust regularized reward function is defined as $r^{\lambda}_{P}(s,a)=r(s,a)+\lambda\gamma D_{\varphi}(P_{s,a},P^{o}_{s,a})$ for any state-action pair and any $P$ such that $P_{s,a}\ll P^{o}_{s,a}$. Here $D_{\varphi}$ is the $\varphi$-divergence (Csiszár, 1967) defined as

$$D_{\varphi}(p,q)=\int\varphi\left(\frac{\mathrm{d}p}{\mathrm{d}q}\right)\mathrm{d}q \tag{1}$$

for two probability distributions $p$ and $q$ with $p\ll q$, where $\varphi$ is convex on $\mathbb{R}$ and differentiable on $\mathbb{R}_{+}$ satisfying $\varphi(1)=0$ and $\varphi(t)=+\infty$ for $t<0$. Examples of $\varphi$-divergence include Total Variation (TV), Kullback-Leibler (KL), chi-square, Conditional Value at Risk (CVaR), and more (cf. Proposition 3). The robust regularized value function of a policy $\pi$ is defined as

$$V^{\pi}_{\lambda}=\inf_{P\in\mathcal{P}}V^{\pi}_{P,r^{\lambda}_{P}}, \tag{2}$$

where $\mathcal{P}=\otimes_{s,a}\mathcal{P}_{s,a}$ and $\mathcal{P}_{s,a}=\{P_{s,a}\in\Delta(\mathcal{S}):P_{s,a}\ll P^{o}_{s,a},\forall(s,a)\in\mathcal{S}\times\mathcal{A}\}$. By definition, for any $\pi$, it follows that $V^{\pi}_{\lambda}\leq V^{\pi}_{P^{o},r}\leq 1/(1-\gamma)$. The optimal robust regularized value function is $V^{*}_{\lambda}=\max_{\pi}V^{\pi}_{\lambda}$ (similarly we can define $Q^{*}_{\lambda}$), and $\pi^{*}$ is the robust regularized optimal policy that achieves this optimal value. For convenience, we denote $V^{*}_{\lambda}$ ($Q^{*}_{\lambda}$) as $V^{*}$ ($Q^{*}$). We note that $\mathcal{P}$ satisfies the $(s,a)$-rectangularity condition (Iyengar, 2005) by definition. This is a sufficient condition for the optimization problem in (2) to be tractable. It also enables the existence of a deterministic policy for $\pi^{*}$ (Yang et al., 2023). We formally mention this in Proposition 5. For any policy $\pi$, denote $V^{\pi}=\mathbb{E}_{s\sim d_{0}}[V^{\pi}(s)]$ as the expected total reward with $d_{0}$ as initial state distribution.
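For finite state spaces the integral in (1) reduces to the sum $\sum_{s}q(s)\varphi(p(s)/q(s))$ over the support of $q$. The sketch below evaluates it for three common choices of $\varphi$. The specific normalizations used here ($\varphi(t)=|t-1|/2$ for TV, $\varphi(t)=t\log t-t+1$ for KL, $\varphi(t)=(t-1)^{2}$ for chi-square) are assumptions for illustration; the paper's exact forms are those of its Proposition 3. The names `PHI` and `phi_divergence` are hypothetical.

```python
import numpy as np

def _xlogx(t):
    # t * log(t) with the convention 0 * log(0) = 0.
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, t * np.log(safe), 0.0)

# Assumed normalizations; each satisfies phi(1) = 0 and is convex on t >= 0.
PHI = {
    "tv": lambda t: 0.5 * np.abs(t - 1.0),
    "kl": lambda t: _xlogx(t) - t + 1.0,
    "chi2": lambda t: (t - 1.0) ** 2,
}

def phi_divergence(p, q, name):
    """D_phi(p, q) = sum_s q(s) * phi(p(s) / q(s)) for discrete p << q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = q > 0
    if np.any(p[~support] > 0):
        raise ValueError("p must be absolutely continuous w.r.t. q")
    ratio = p[support] / q[support]               # density ratio dp/dq
    return float(np.sum(q[support] * PHI[name](ratio)))
```

For example, with $p=(1,0)$ and $q=(1/2,1/2)$ this gives $1/2$ for TV and $\log 2$ for KL under the conventions above.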

Denote the robust regularized Bellman operator $\mathcal{T}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\to\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as

$$(\mathcal{T}Q)(s,a)=r(s,a)+\gamma\inf_{P_{s,a}\in\mathcal{P}_{s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})\big). \tag{3}$$

Since $\mathcal{T}$ is a contraction (Yang et al., 2023), the robust Q-iteration (RQI) $Q_{k+1}=\mathcal{T}Q_{k}$ converges to $Q^{*}$. We get the robust optimal policy as $\pi^{*}(s)=\operatorname*{arg\,max}_{a}Q^{*}(s,a)$.
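To make the RQI recursion concrete, the following sketch runs it on a small finite MDP with a known nominal model. It is specialized to the KL divergence, for which the inner infimum in (3) has the well-known closed form $\inf_{P}(\mathbb{E}_{P}[V]+\lambda\,\mathrm{KL}(P,P^{o}))=-\lambda\log\mathbb{E}_{P^{o}}[e^{-V/\lambda}]$ (the Donsker-Varadhan identity). That specialization, and the function names, are assumptions made for illustration; this is a model-based check, not the paper's model-free algorithm.

```python
import numpy as np

def kl_regularized_backup(Q, P0, r, lam, gamma):
    """One application of the operator in (3), specialized to KL.

    Q, r: arrays of shape (S, A); P0: nominal model of shape (S, A, S).
    """
    V = Q.max(axis=1)                             # V(s') = max_a' Q(s', a')
    m = V.min()                                   # shift for numerical stability
    # -lam * log E_{P0}[exp(-V / lam)], computed with the shift m.
    soft_min = m - lam * np.log(P0 @ np.exp(-(V - m) / lam))
    return r + gamma * soft_min

def robust_q_iteration(P0, r, lam, gamma, iters=500):
    """Iterate Q_{k+1} = T Q_k from Q_0 = 0."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        Q = kl_regularized_backup(Q, P0, r, lam, gamma)
    return Q
```

As $\lambda$ grows the penalty on deviating from $P^{o}$ dominates and the iterates approach the non-robust optimal action-values, while small $\lambda$ yields more pessimistic values.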

2.1 Problem Conceptualization

In this section, we study the offline infinite-horizon robust $\varphi$-regularized RL ($\gamma$R³L) problem, acquiring useful insights to construct our algorithm (Algorithm 1) in the next section. The goal here is to learn a good robust policy $\hat{\pi}$ (a policy with a high $V^{\hat{\pi}}_{\lambda}$) based on the offline dataset. We start by noting one key challenge in the estimation of the robust regularized Bellman operator $\mathcal{T}$ (3): One may require many offline datasets from each $P\in\mathcal{P}$ to achieve our offline $\gamma$R³L goal. In this work, we use the penalized Distributionally Robust Optimization (DRO) tool (Sinha et al., 2017; Levy et al., 2020; Jin et al., 2021b) to avoid requiring the unrealistic existence of such offline datasets. In particular, as in non-robust offline RL, we only rely on the offline dataset $\mathcal{D}_{P^{o}}$ generated on the nominal model $P^{o}$ by an offline policy $\mu$. This statement is justified via the following proposition.

Proposition 1.

Consider a robust $\varphi$-regularized MDP. For any $Q:\mathcal{S}\times\mathcal{A}\to[0,1/(1-\gamma)]$, the robust regularized Bellman operator $\mathcal{T}$ (3) can be equivalently written as

$$(\mathcal{T}Q)(s,a)=r(s,a)-\gamma\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}\left((\eta-V(s^{\prime}))/\lambda\right)]-\eta\big), \tag{4}$$

where $V(s)=\max_{a\in\mathcal{A}}Q(s,a)$, $\varphi^{*}(x)=\sup_{t}(xt-\varphi(t))$ is the convex conjugate of $\varphi$, and $\Theta\subset\mathbb{R}$ is a bounded interval of the real line that depends on $\varphi^{*}$.

A proof of this proposition is given in Appendix D and follows from Levy et al. (2020, Section A.1.2). We refer to (4) as the robust regularized Bellman dual operator. Observing the sole dependence on the nominal model $P^{o}$ in (4), one can come up with estimators for data-driven approaches that naturally depend only on the dataset $\mathcal{D}_{P^{o}}$. We remark that we consider a class of $\varphi$-divergences satisfying the conditions in Proposition 3 for all the results in this paper.
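As a worked instance of (4), consider the KL case under the assumed normalization $\varphi(t)=t\log t-t+1$, whose convex conjugate is $\varphi^{*}(x)=e^{x}-1$ (the paper's own conventions are those of Proposition 3, which is not reproduced in this section). Writing $h(\eta)$ for the objective inside the infimum, and assuming the minimizer $\eta^{*}$ lies in $\Theta$, the first-order condition gives the inner problem in closed form:

```latex
\begin{align*}
h(\eta) &= \lambda\,\mathbb{E}_{s'\sim P^{o}_{s,a}}\!\big[e^{(\eta-V(s'))/\lambda}-1\big]-\eta, \\
h'(\eta^{*}) &= \mathbb{E}_{s'\sim P^{o}_{s,a}}\!\big[e^{(\eta^{*}-V(s'))/\lambda}\big]-1 = 0
\;\Longrightarrow\;
\eta^{*} = -\lambda\log\mathbb{E}_{s'\sim P^{o}_{s,a}}\!\big[e^{-V(s')/\lambda}\big], \\
h(\eta^{*}) &= \lambda(1-1)-\eta^{*} = -\eta^{*}, \\
(\mathcal{T}Q)(s,a) &= r(s,a)-\gamma\,h(\eta^{*})
 = r(s,a)-\gamma\lambda\log\mathbb{E}_{s'\sim P^{o}_{s,a}}\!\big[e^{-V(s')/\lambda}\big].
\end{align*}
```

Since $V\in[0,1/(1-\gamma)]$, the minimizer $\eta^{*}$ lies between the minimum and the mean of $V$ under $P^{o}_{s,a}$, so it is bounded, in line with the bounded set $\Theta$ in the proposition. The final expression depends on $P^{o}$ only through an expectation, which is what makes sample-based estimation from $\mathcal{D}_{P^{o}}$ possible.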

We now remark on a natural first attempt at performing the squared Bellman error least-squares regression, like FQI, on the robust regularized Bellman dual operator (4). Observe that the true Bellman error $\mathbb{E}_{s,a\sim\mu}[|\mathcal{T}Q^{*}(s,a)-Q^{*}(s,a)|]$ involves solving an inner convex minimization problem in $\mathcal{T}Q^{*}(s,a)$ (4) for every $(s,a)$. Since we are in a countably large state space regime, it is infeasible to devise approximations to this true Bellman error. In addition, we also have to enable general function architectures for action-values. To alleviate this challenging task, we now turn our attention to the inner convex minimization problem in the robust regularized Bellman dual operator (4). Due to the $(s,a)$-rectangularity assumption, we note that the $\eta$'s are not correlated across all $(s,a)$. With this note, for every $(s,a)$, we can replace $\eta$ in $(\mathcal{T}Q)(s,a)$ (4) with a dual-variable function $g(s,a)$. Thus, intuitively, multiple point-wise minimizations can be replaced by a single dual-variable functional minimization over the function space of $g$. We formalize this intuition using variational functional analysis (Rockafellar and Wets, 2009) for a countably large state space regime in the following.

We denote $L^{1}(\mu)$ as the set of all absolutely integrable functions defined on the probability (measure) space $(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}),\mu)$ with $\mu$, the data generating distribution, as the $\sigma$-finite probability measure. To elucidate, $L^{1}(\mu)$ is the set of all functions $g:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{C}\subset\mathbb{R}$ such that $\left\|g\right\|_{1,\mu}$ is finite. We set $\mathcal{C}=\Theta$ considering the inner minimization in (4). Fixing any given function $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$, we define the loss function $L_{\mathrm{dual}}(g;f,\mu)$, for all $g\in L^{1}(\mu)$, as

$$L_{\mathrm{dual}}(g;f,\mu)=\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}[\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)-g(s,a)]. \tag{5}$$

We now state the result formalizing the single dual-variable functional minimization intuition developed in the previous paragraph. A variant of this result appears in the distributionally robust RL work (Panaganti et al., 2022).

Proposition 2.

Let $L_{\mathrm{dual}}$ be the loss function defined in (5). Then, for any function $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$, we have

$$\inf_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;f,\mu)=\mathbb{E}_{s,a\sim\mu}\Big[\inf_{\eta\in\Theta}\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}((\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))/\lambda)]-\eta\big)\Big]. \tag{6}$$

We provide a proof in Appendix D, which relies on Rockafellar and Wets (2009, Theorem 14.60).
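On a finite state-action space the interchange in (6) can be checked numerically, because the loss (5) separates into one scalar problem per $(s,a)$. The sketch below does this for the KL case, assuming the normalization $\varphi(t)=t\log t-t+1$ with conjugate $\varphi^{*}(x)=e^{x}-1$ and minimizing over all real $\eta$ rather than a bounded $\Theta$; the function names are hypothetical.

```python
import numpy as np

def phi_star_kl(x):
    # Convex conjugate of the assumed phi(t) = t*log(t) - t + 1.
    return np.exp(x) - 1.0

def dual_loss(g, f, mu, P0, lam):
    """L_dual(g; f, mu) from (5) on a finite space.

    g, f, mu: arrays of shape (S, A); P0: nominal model of shape (S, A, S).
    """
    v = f.max(axis=1)                                     # max_a' f(s', a')
    inner = lam * phi_star_kl((g[..., None] - v) / lam)   # shape (S, A, S)
    return float(np.sum(mu * ((P0 * inner).sum(axis=-1) - g)))

def pointwise_minimizer(f, P0, lam):
    """Per-(s, a) minimizer eta*(s, a) of the inner problem in (6) for KL."""
    v = f.max(axis=1)
    return -lam * np.log(P0 @ np.exp(-v / lam))
```

Evaluating `dual_loss` at `pointwise_minimizer(f, P0, lam)` reproduces the right-hand side of (6), which for this $\varphi^{*}$ equals $-\mathbb{E}_{\mu}[\eta^{*}(s,a)]$, and any other $g$ gives a loss at least as large.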

For any given $f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, we define an operator $\mathcal{T}_{g}$, for all $g\in L^{1}(\mu)$, as

$$(\mathcal{T}_{g}f)(s,a)=r(s,a)-\gamma\big(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}\left((g(s,a)-V(s^{\prime}))/\lambda\right)]-g(s,a)\big), \tag{7}$$

where $V(s^{\prime})=\max_{a^{\prime}}f(s^{\prime},a^{\prime})$.

This operator is useful in view of Propositions 1 and 2. To see this, we first define $g^{*}(Q)\in\operatorname*{arg\,min}_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;Q,\mu)$ for any action-value function $Q$. Now, by taking an expectation w.r.t. the data generating distribution $\mu$ on (4), we observe $\mathcal{T}Q=\mathcal{T}_{g^{*}(Q)}Q$ by utilizing (6). Due to this observation, in the following subsection, we develop an algorithm by approximating both the optimal dual-variable function of optimal robust value $g^{*}(Q^{*})$ and the robust squared Bellman error ($\|\mathcal{T}_{g^{*}(Q^{*})}Q^{*}-Q^{*}\|_{2,\mu}^{2}$) using offline data $\mathcal{D}_{P^{o}}$. Panaganti et al. (2022) similarly conceptualized their total variation $\varphi$-divergence robust RL algorithm. Here, Proposition 1 enables us to conceptualize for general $\varphi$-divergences.

2.2 Robust $\varphi$-regularized fitted Q-iteration

In this section, we formally propose our algorithm based on the tools developed so far. Our proposed algorithm is called the Robust $\varphi$-regularized fitted Q-iteration (RPQ) Algorithm and is summarized in Algorithm 1. We first discuss the inputs to our algorithm. As mentioned above, we only use the offline dataset $\mathcal{D}_{P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})\}_{i=1}^{N}$, generated according to a data distribution $\mu$ on the nominal model $P^{o}$. We also consider two general function classes $\mathcal{F}\subset(f:\mathcal{S}\times\mathcal{A}\rightarrow[0,1/(1-\gamma)])$ and $\mathcal{G}\subset(g:\mathcal{S}\times\mathcal{A}\rightarrow\Theta)$ representing action-value functions and dual-variable functions, respectively. We now define useful approximation quantities for $g\in\mathcal{G}$ and $f\in\mathcal{F}$. For given $f$, the empirical loss function of the true loss $L_{\mathrm{dual}}$ (5) on $\mathcal{D}_{P^{o}}$ is

$$\widehat{L}_{\mathrm{dual}}(g;f)=\mathbb{E}_{\mathcal{D}_{P^{o}}}[\lambda\varphi^{*}((g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))/\lambda)-g(s_{i},a_{i})]. \tag{8}$$

For given $f,g$, the empirical squared robust regularized Bellman error on $\mathcal{D}_{P^{o}}$ is

$$\widehat{L}_{\mathrm{robQ}}(Q;f,g)=\mathbb{E}_{\mathcal{D}_{P^{o}}}\big[\big(r(s_{i},a_{i})-\gamma\lambda\varphi^{*}((g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))/\lambda)+\gamma g(s_{i},a_{i})-Q(s_{i},a_{i})\big)^{2}\big]. \tag{9}$$

We start with an initial action-value function $Q_{0}(s,a)=0$ and execute the following two steps for $K$ iterations. At iteration $k$ of the algorithm with input $Q_{k}$, as a first step, we compute a dual-variable function $g_{k}\in\mathcal{G}$ through the empirical risk minimization approach, that is, we solve $\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;Q_{k})$ (Line 4 of Algorithm 1). As a second step, given inputs $Q_{k}$ and $g_{k}$, we compute the next iterate $Q_{k+1}\in\mathcal{F}$ through the least-squares regression method, that is, we solve $\operatorname*{arg\,min}_{f\in\mathcal{F}}\widehat{L}_{\mathrm{robQ}}(f;Q_{k},g_{k})$ (Line 5 of Algorithm 1). After $K$ iterations, we extract the greedy policy from $Q_{K}$ (Line 7 of Algorithm 1).

Algorithm 1 Robust $\varphi$-regularized fitted Q-iteration (RPQ) Algorithm

1: Input: Regularization $\varphi$, offline dataset $\mathcal{D}_{P^{o}}=(s_{i},a_{i},r_{i},s^{\prime}_{i})_{i=1}^{N}$, general function classes $\mathcal{F}$ and $\mathcal{G}$
2: Initialize: $Q_{0}\equiv 0\in\mathcal{F}$
3: for $k=0,\cdots,K-1$ do
4: Dual variable function minimization: $g_{k}=\widehat{g}_{Q_{k}}=\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;Q_{k})$ (cf. (8))
5: Robust $\varphi$-regularized Q-update: $Q_{k+1}=\operatorname*{arg\,min}_{Q\in\mathcal{F}}\widehat{L}_{\mathrm{robQ}}(Q;Q_{k},g_{k})$ (cf. (9))
6: end for
7: Output: $\pi_{K}=\operatorname*{arg\,max}_{a}Q_{K}(s,a)$
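The two steps of Algorithm 1 can be sketched on a small finite problem. The code below is an illustrative special case, not the general algorithm: it assumes tabular classes $\mathcal{F}$ and $\mathcal{G}$ (one free value per state-action pair) and the KL divergence with the assumed conjugate $\varphi^{*}(x)=e^{x}-1$, so that both minimizations have per-cell closed forms. With general function classes, Lines 4 and 5 would instead be solved by numerical optimization. The name `rpq_tabular_kl` is hypothetical.

```python
import numpy as np

def rpq_tabular_kl(data, n_states, n_actions, lam, gamma, K=100):
    """Tabular KL sketch of Algorithm 1; data = (s, a, r, s_next) arrays."""
    s, a, r, s_next = (np.asarray(x) for x in data)
    n_cells = n_states * n_actions
    idx = s * n_actions + a                       # flat index of (s_i, a_i)
    counts = np.maximum(np.bincount(idx, minlength=n_cells), 1)

    def cell_mean(values):
        # Empirical mean of `values` within each (s, a) cell (0 if unseen).
        return np.bincount(idx, weights=values, minlength=n_cells) / counts

    Q = np.zeros((n_states, n_actions))           # Line 2: Q_0 = 0
    for _ in range(K):                            # Line 3
        v_next = Q.max(axis=1)[s_next]            # max_a' Q_k(s'_i, a')
        # Line 4: minimize (8) over tabular g. With phi*(x) = exp(x) - 1 the
        # per-cell minimizer is g = -lam * log(mean_i exp(-v'_i / lam)).
        m = cell_mean(np.exp(-v_next / lam))
        g = -lam * np.log(np.where(m > 0, m, 1.0))
        # Line 5: least squares (9) over tabular Q is the per-cell mean target.
        y = r - gamma * lam * (np.exp((g[idx] - v_next) / lam) - 1.0) + gamma * g[idx]
        Q = cell_mean(y).reshape(n_states, n_actions)
    return Q, Q.argmax(axis=1)                    # Line 7: greedy policy
```

State-action pairs that never appear in the dataset keep the value 0 in this sketch, which is one concrete way the coverage requirement on $\mu$ shows up in practice.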

2.3 Performance Guarantee: Suboptimality

We now discuss the performance guarantee of our RPQ Algorithm. In particular, we characterize how close the robust regularized value function of our RPQ Algorithm is to the optimal robust regularized value function. We first mention all the assumptions about the data generating distribution $\mu$ and the representation power of $\mathcal{F}$ and $\mathcal{G}$ before we present our main results.

Assumption 1 (Concentrability).

There exists a finite constant $C>0$ such that for any $\nu\in\{d_{\pi,P}\mid\text{any policy }\pi\text{ and }P\in\mathcal{P}\text{ satisfying }D_{\varphi}(P_{s,a},P^{o}_{s,a})\leq 1/(\lambda(1-\gamma))\text{ for all }s,a\text{ (both can be non-stationary)}\}\subseteq\Delta(\mathcal{S}\times\mathcal{A})$, we have $\left\|\nu/\mu\right\|_{\infty}\leq\sqrt{C}$.

Assumption 1 stipulates the support set of the data generating distribution $\mu$, i.e. $\{(s,a)\in\mathcal{S}\times\mathcal{A}:\mu(s,a)>0\}$, to cover the union of all support sets of the distributions $\nu$, leading to a robust exploratory behavior. This assumption is widely used in the offline RL literature (Munos, 2003; Agarwal et al., 2019; Chen and Jiang, 2019; Wang et al., 2021; Xie et al., 2021) in different forms. We adapt this assumption from the robust offline RL literature (Panaganti et al., 2022; Zhang et al., 2023).

Assumption 2 (Approximate Robust Bellman Completeness).

Let $\varepsilon_{\mathcal{F}}$ be some small positive constant. For any $g\in\mathcal{G}$, we have $\sup_{f\in\mathcal{F}}\inf_{f^{\prime}\in\mathcal{F}}\|f^{\prime}-\mathcal{T}_{g}f\|_{2,\mu}^{2}\leq\varepsilon_{\mathcal{F}}$ for the data generating distribution $\mu$.

We note that Assumption 2 holds trivially with $\varepsilon_{\mathcal{F}}=0$ if $\mathcal{F}$ is closed under $\mathcal{T}_{g}$, that is, if $\mathcal{T}_{g}f\in\mathcal{F}$ for any $f\in\mathcal{F}$ and $g\in\mathcal{G}$. This assumption has been widely used in different forms in the non-robust offline RL literature (Agarwal et al., 2019; Wang et al., 2021; Xie et al., 2021) and robust offline RL literature (Panaganti et al., 2022; Bruns-Smith and Zhou, 2023; Zhang et al., 2023).

Assumption 3 (Approximate Dual Realizability).

There exists a uniform constant $\varepsilon_{\mathcal{G}}$ such that, for all $f\in\mathcal{F}$, $\inf_{g\in\mathcal{G}}L_{\mathrm{dual}}(g;f)-\inf_{g\in L^{1}(\mu)}L_{\mathrm{dual}}(g;f)\leq\varepsilon_{\mathcal{G}}$.

Assumption 3 holds trivially with $\varepsilon_{\mathcal{G}}=0$ if $g^{*}(f)\in\mathcal{G}$ for any $f\in\mathcal{F}$. This assumption has been used in earlier robust offline RL literature (Panaganti et al.,, 2022; Bruns-Smith and Zhou,, 2023).

Now we state our main theoretical result on the performance of the RPQ algorithm. In Appendix D we restate the result including the constant factors.

Theorem 1.

Let Assumptions 1, 2 and 3 hold. Let $c_{\varphi}(\lambda,\gamma)$ be problem-dependent constants for $\varphi$ . Let $\pi_{K}$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

$\displaystyle V^{\pi^{*}}-V^{\pi_{K}}\leq\frac{\sqrt{C}(\gamma^{K}+\sqrt{6\varepsilon_{\mathcal{F}}}+\gamma\varepsilon_{\mathcal{G}})}{(1-\gamma)^{2}}+\frac{c_{\varphi}(\lambda,\gamma)}{(1-\gamma)^{3}}\mathcal{O}\Big(\sqrt{{C\log(|\mathcal{F}||\mathcal{G}|/\delta)}/{N}}\Big).$

Theorem 1 states that the RPQ algorithm is approximately optimal. This theorem also gives the sample complexity guarantee for finding an $\varepsilon$-suboptimal policy w.r.t. the optimal policy $\pi^{*}$. To see this, neglect the first term, which collects the optimization error $\gamma^{K}$ (vanishing as $K$ grows) and the inevitable function class approximation errors. Then, for $N\geq\mathcal{O}(\frac{(c_{\varphi}(\lambda,\gamma))^{2}}{\varepsilon^{2}(1-\gamma)^{4}}\log\frac{|\mathcal{F}||\mathcal{G}|}{\delta})$, we get $V^{\pi^{*}}-V^{\pi_{K}}\leq{\varepsilon}/{(1-\gamma)}$ with probability at least $1-\delta$ for any fixed $\varepsilon,\delta\in(0,1)$.
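
The sample-size reading of Theorem 1 can be checked numerically. The following is a minimal sketch, assuming the unspecified constant inside $\mathcal{O}(\cdot)$ equals `hidden_const` (a hypothetical placeholder, set to 1 by default) and keeping the concentrability constant $C$ explicit rather than absorbing it into the $\mathcal{O}(\cdot)$ notation. The function names are illustrative and do not come from the paper.

```python
import math


def rpq_bound_terms(C, gamma, K, eps_F, eps_G, c_phi, log_term, N, hidden_const=1.0):
    """Return (first_term, statistical_term) of the Theorem 1 bound.

    log_term stands for log(|F||G|/delta); hidden_const replaces the
    unspecified constant inside O(.) and is an assumption of this sketch.
    """
    first = math.sqrt(C) * (gamma**K + math.sqrt(6.0 * eps_F) + gamma * eps_G) / (1.0 - gamma) ** 2
    stat = hidden_const * c_phi / (1.0 - gamma) ** 3 * math.sqrt(C * log_term / N)
    return first, stat


def samples_for_statistical_term(C, gamma, c_phi, log_term, eps, hidden_const=1.0):
    """Smallest integer N for which the statistical term is at most eps / (1 - gamma)."""
    n = (hidden_const * c_phi) ** 2 * C * log_term / (eps**2 * (1.0 - gamma) ** 4)
    return math.ceil(n)


if __name__ == "__main__":
    log_term = math.log(1e6 / 0.05)
    N = samples_for_statistical_term(C=4.0, gamma=0.9, c_phi=2.0, log_term=log_term, eps=0.1)
    print(N, rpq_bound_terms(4.0, 0.9, 200, 0.0, 0.0, 2.0, log_term, N))
```

With zero approximation errors and large $K$, the first term vanishes and only the statistical term remains, which is the regime described above.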

Remark 1.

Note that the guarantee for the TV case in Theorem 1 requires an additional assumption on the existence of a fail-state (Panaganti et al.,, 2022, Lemma 3), namely Assumption 8 with $H$ replaced by $1/(1-\gamma)$. However, we specialize Theorem 1 for the TV case by relaxing Assumption 1 while retaining the same guarantee, which we present in Appendix D. In particular, we relax Assumption 1 to the non-robust offline RL concentrability assumption (Foster et al.,, 2022), i.e. we only need the distribution $\nu$ to be in the collection of discounted state-action occupancies on the nominal model $P^{o}$.

3 Hybrid Robust $\varphi$ -Regularized Reinforcement Learning

In this section, we provide a hybrid robust $\varphi$ -Regularized RL protocol to overcome the out-of-data-distribution issue in offline robust RL. As in Song et al., (2023), we reformulate the problem in the finite-horizon setting to use its backward induction feature that enables RPQ iterates to run in each episode. We again start by discussing preliminaries and the problem formulation.

Finite-Horizon Markov Decision Process: A finite-horizon Markov Decision Process ($h$MDP) is a tuple $(\mathcal{S},\mathcal{A},P=(P_{h})_{h=0}^{H-1},r=(r_{h})_{h=0}^{H-1},{H})$, where $H$ is the horizon length and, for any $h\in[H]$, $r_{h}:\mathcal{S}\times\mathcal{A}\to[0,1]$ is a known deterministic reward function and $P_{h}\in\Delta(\mathcal{S})^{|\mathcal{S}||\mathcal{A}|}$ is the transition probability function at time $h$. A non-stationary (stochastic) policy is $\pi=(\pi_{h})_{h=0}^{H-1}$, where $\pi_{h}:\mathcal{S}\to\Delta(\mathcal{A})$. We denote the transition dynamic distribution at time $h$ and state-action $(s,a)$ as $P_{h,s,a}\in\Delta(\mathcal{S})$. Given $\pi$, we define the state and action value functions in the usual manner: $V^{h,\pi}_{P,r}(s)=\mathbb{E}[\sum_{t=h}^{H-1}r_{t}(s_{t},a_{t})|s_{h}=s]$ starting at state $s_{h}=s$ with $a_{t}\sim\pi_{t}(s_{t}),s_{t+1}\sim P_{t+1,s_{t},a_{t}}$, and $Q^{h,\pi}_{P,r}(s,a)=\mathbb{E}[\sum_{t=h}^{H-1}r_{t}(s_{t},a_{t})|s_{h}=s,a_{h}=a]$ starting at state-action $s_{h}=s,a_{h}=a$ with $s_{t+1}\sim P_{t+1,s_{t},a_{t}},a_{t+1}\sim\pi_{t+1}(s_{t+1})$. Given $\pi$, the occupancy measure over state-action pairs is $d^{h,\pi}_{P}(s,a)=P_{h}(s_{h}=s,a_{h}=a;\pi)$. We write $\pi^{*}_{P}=(\pi^{*}_{h})_{h=0}^{H-1}$ to denote an optimal non-stationary deterministic policy, which maximizes $V^{\pi}_{P,r}=(V^{h,\pi}_{P,r})_{h=0}^{H-1}$.

Hybrid Reinforcement Learning: The goal of hybrid RL on an $h$MDP $(P^{o},r)$ is to learn a good policy $\hat{\pi}$ based on adaptive datasets consisting of both offline datasets and on-policy datasets. Given timestep $h\in[H]$, the offline dataset $\mathcal{D}^{\mu}_{h,P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})_{i=1}^{m_{\mathrm{off}}}\}$ is generated by $s_{i}^{\prime}\sim P^{o}_{h,s_{i},a_{i}}$ with the $(s_{i},a_{i})$ pairs sampled i.i.d. from the offline data distribution $\mu_{h}\in\Delta(\mathcal{S}\times\mathcal{A})$. For convenience, $\mu=(\mu_{h})_{h=0}^{H-1}$ also denotes the offline policy that generates $\mathcal{D}^{\mu}_{P^{o}}$. Given timestep $h\in[H]$, the on-policy dataset $\mathcal{D}^{\pi}_{h,P^{o}}=\{(s_{i},a_{i},s_{i}^{\prime})_{i=1}^{m_{\mathrm{on}}}\}$ is generated by $(s_{i},a_{i})\sim d^{h,\pi}_{P^{o}}$ and $s_{i}^{\prime}\sim P^{o}_{h,s_{i},a_{i}}$ for all the policies $\pi$ previously learned by the algorithm. Song et al., (2023) propose the Hybrid Q-learning (HyQ) algorithm with general function approximation capabilities and provable guarantees for hybrid RL. The HyQ algorithm (c.f. Song et al., (2023, Algorithm 1)) is quite straightforward: for each iteration $k\in[K]$, do backward induction of the FQI algorithm on timesteps $h\in[H]$ using the adaptive datasets described above. Finally, for some starting state $s_{0}\sim d_{0}$, the performance guarantee of the algorithm policies $\{\pi_{k}\}_{k\in[K]}$ is given by bounding the cumulative suboptimality $0\leq\sum_{k\in[K]}[V^{0,\pi^{*}}_{P^{o},r}(s_{0})-V^{0,\pi_{k}}_{P^{o},r}(s_{0})]$. We take the total adaptive dataset size to be $N$ to provide results comparable with offline RL.

Finite-Horizon Robust $\varphi$-Regularized Markov Decision Process: Again, let $P^{o}$ be the nominal model. A finite-horizon Robust $\varphi$-Regularized Markov Decision Process ($h$RRMDP) is a tuple $(\mathcal{S},\mathcal{A},P^{o}=(P^{o}_{h})_{h=0}^{H-1},r=(r_{h})_{h=0}^{H-1},\lambda,H,\varphi,d_{0})$, where $\lambda>0$ is a robustness parameter and $\varphi:\mathbb{R}\to\mathbb{R}$ is as before. For $h\in[H]$, the robust regularized reward function is $r^{\lambda}_{h}(s,a)=r_{h}(s,a)+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})$. For $h\in[H]$, the robust regularized value function of a policy $\pi$ is defined as $V^{\pi}_{h,\lambda}=\inf_{P\in\mathcal{P}}V^{h,\pi}_{P,r^{\lambda}_{h}}$, where $\mathcal{P}=\otimes_{h,s,a}\mathcal{P}_{h,s,a}$ and $\mathcal{P}_{h,s,a}=\{P_{h,s,a}\in\Delta(\mathcal{S}):P_{h,s,a}\ll P^{o}_{h,s,a},\forall(s,a)\in\mathcal{S}\times\mathcal{A}\text{ and }h\in[H]\}$. By definition, for any $\pi$, it follows that $V^{\pi}_{h,\lambda}\leq V^{h,\pi}_{P^{o},r}\leq H$. For $h\in[H]$, the optimal robust regularized value function is $V^{*}_{h,\lambda}=\max_{\pi}V^{\pi}_{h,\lambda}$, and $\pi^{*}$ is the robust regularized optimal policy that achieves this optimal value. For convenience, we denote $V^{*}_{h,\lambda}$ ($Q^{*}_{h,\lambda}$) as $V^{*}_{h}$ ($Q^{*}_{h}$) for all $h\in[H]$. We again note that, for each $h\in[H]$, $\mathcal{P}$ satisfies the $(s,a)$-rectangularity condition (Iyengar,, 2005) by definition. This enables the existence of a non-stationary deterministic policy for $\pi^{*}$ (Zhang et al.,, 2023). We formalize this in Proposition 6. We denote $V^{\pi}=\mathbb{E}_{s\sim d_{0}}[V^{\pi}_{0}(s)]$ as the expected total reward.

For convenience, we let $Q_{H,\lambda}^{\pi}=0$ for any $\pi$. For any $h\in[H]$, denote the robust regularized Bellman operator $\mathcal{T}:\mathbb{R}^{\mathcal{S}\times\mathcal{A}}\to\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as

		$\displaystyle(\mathcal{T}Q_{h+1})(s,a)=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big(\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big).$		(10)

As $Q^{*}_{H}=0$ , doing backward iteration of $\mathcal{T}$ , i.e., the robust dynamic programming $Q^{*}_{h}=\mathcal{T}Q^{*}_{h+1}$ , we get $Q^{*}_{h}$ for all $h\in[H]$ . For each timestep $h\in[H]$ , we also get the robust optimal policy as $\pi^{*}_{h}(s)=\operatorname*{arg\,max}_{a}Q^{*}_{h}(s,a)$ .
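
The backward iteration $Q^{*}_{h}=\mathcal{T}Q^{*}_{h+1}$ can be sketched in a small tabular setting. The example below is only an illustration under stated assumptions: it picks the KL divergence as the $\varphi$-divergence because the inner infimum then has the standard closed form $\inf_{P\ll P^{o}}\{\mathbb{E}_{P}[V]+\lambda D_{\mathrm{KL}}(P,P^{o})\}=-\lambda\log\mathbb{E}_{P^{o}}[\exp(-V/\lambda)]$, and it assumes the nominal model is a known array, which is not the model-free setting of this paper. All function and variable names are hypothetical.

```python
import numpy as np


def kl_regularized_backup(v_next, p_nominal, lam):
    """Closed form of inf_P { E_P[v] + lam * KL(P || P^o) } via a stable log-sum-exp."""
    support = p_nominal > 0
    z = -v_next[support] / lam
    z_max = z.max()  # subtracting the max keeps the exponentials bounded by 1
    return -lam * (z_max + np.log(np.sum(p_nominal[support] * np.exp(z - z_max))))


def robust_backward_induction(P_nominal, r, lam):
    """P_nominal: (H, S, A, S) nominal kernels; r: (H, S, A) rewards in [0, 1]."""
    H, S, A, _ = P_nominal.shape
    Q = np.zeros((H + 1, S, A))            # Q[H] = 0 is the terminal condition
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):         # backward iteration Q_h = T Q_{h+1}
        v_next = Q[h + 1].max(axis=1)
        for s in range(S):
            for a in range(A):
                Q[h, s, a] = r[h, s, a] + kl_regularized_backup(v_next, P_nominal[h, s, a], lam)
        policy[h] = Q[h].argmax(axis=1)    # greedy robust policy at timestep h
    return Q, policy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(4), size=(3, 4, 2))   # shape (H, S, A, S)
    r = rng.uniform(size=(3, 4, 2))
    Q, pi = robust_backward_induction(P, r, lam=1.0)
    print(Q[0], pi[0])
```

Since $P=P^{o}$ is always feasible with zero penalty, the robust values computed this way never exceed the nominal ones, and they approach the nominal values as $\lambda$ grows.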

3.1 Problem Conceptualization

In this section, we study the hybrid finite-horizon robust TV-regularized RL problem, acquiring the necessary insights to construct our algorithm (Algorithm 2) in the next section. We develop the concepts for a general $\varphi$-divergence, but propose our algorithm only for the total variation $\varphi$-divergence. The goal here is to learn a good robust policy $\hat{\pi}$ based on adaptive datasets consisting of both offline datasets and on-policy datasets. We start by noting a direct consequence of Proposition 1, which follows from the similar inner minimization problems in the infinite-horizon (3) and finite-horizon (10) operators.

Corollary 1.

For any $h\in[H]$ and $Q_{h+1}:\mathcal{S}\times\mathcal{A}\to[0,H]$, the robust regularized Bellman operator $\mathcal{T}$ (10) can be equivalently written as

		$\displaystyle(\mathcal{T}Q_{h+1})(s,a)=r_{h}(s,a)-\inf_{\eta\in\Theta}(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{h,s,a}}[\varphi^{*}\left({(\eta-V_{h+1}(s^{\prime}))}/{\lambda}\right)]-\eta),$		(11)

where $V_{h+1}(s)=\max_{a\in\mathcal{A}}Q_{h+1}(s,a)$ and $\Theta\subset\mathbb{R}$ is some bounded interval of the real line that depends on $\varphi^{*}$.

As in Section 2, this dual reformulation enables us to use the datasets from only the nominal model $P^{o}$ for estimating the robust regularized operator in its primal form (10).
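
To make the primal-dual equivalence concrete, the following sketch compares the two sides numerically for the total variation case with $\varphi(t)=|t-1|/2$. It rests on our own assumptions, not on statements from the paper: after shifting $\eta$ by $\lambda/2$, the conjugate term is taken to reduce to the hinge form $\mathbb{E}_{P^{o}}[(\eta-V)_{+}]-\eta$ with $\eta$ ranging over $[v_{\min},v_{\min}+\lambda]$, where $v_{\min}$ is the smallest value on the support of $P^{o}$, and the primal infimum is taken to equal $\mathbb{E}_{P^{o}}[\min(V,v_{\min}+\lambda)]$ by a mass-moving argument. Function names are hypothetical.

```python
import numpy as np


def tv_primal_closed_form(v, p_nominal, lam):
    """inf over P << P^o of E_P[v] + lam * D_TV(P, P^o), with D_TV = 0.5 * ||P - P^o||_1.

    Moving mass eps from s' to the lowest-value support state costs lam * eps and
    gains v(s') - v_min, so mass is moved exactly from states with v(s') > v_min + lam.
    """
    support = p_nominal > 0
    v_min = v[support].min()
    return float(np.sum(p_nominal[support] * np.minimum(v[support], v_min + lam)))


def tv_dual_grid(v, p_nominal, lam, num=20001):
    """-inf_eta ( E_{P^o}[(eta - v)_+] - eta ) over a grid of eta in [v_min, v_min + lam]."""
    support = p_nominal > 0
    v_s, p_s = v[support], p_nominal[support]
    etas = np.linspace(v_s.min(), v_s.min() + lam, num)
    objective = np.maximum(etas[:, None] - v_s[None, :], 0.0) @ p_s - etas
    return float(-objective.min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.uniform(0.0, 5.0, size=6)
    p = rng.dirichlet(np.ones(6))
    print(tv_primal_closed_form(v, p, 1.5), tv_dual_grid(v, p, 1.5))
```

The dual side only needs expectations under the nominal model $P^{o}$, which is what makes estimation from nominal-model data possible.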

We start by recalling the philosophy of the HyQ algorithm (Song et al.,, 2023), which uses the FQI algorithm on adaptive datasets. We do the same for our hybrid finite-horizon robust $\varphi$-regularized RL problem here. For each $h\in[H]$, we need to estimate the true Bellman error $\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}Q^{*}_{h+1}(s,a)-Q^{*}_{h}(s,a)|]+\sum_{t=0}^{k-1}\mathbb{E}_{s,a\sim d^{\pi_{t}}_{h,P^{o}}}[|\mathcal{T}Q^{*}_{h+1}(s,a)-Q^{*}_{h}(s,a)|]$ using the offline dataset from $\mu_{h}$ and the on-policy datasets from $d^{\pi_{t}}_{h,P^{o}}$ generated by the policies learned by the algorithm. We remark that the out-of-data-distribution issue appears when we only have access to the offline dataset to estimate the summation term above, which depends on $d^{\pi_{t}}_{h,P^{o}}$.

As discussed in Section 2, the true Bellman error itself involves solving an inner convex minimization problem in $\mathcal{T}Q^{*}_{h+1}(s,a)$ (11) for every $(s,a)$ and $h$, which is challenging when the state space is large. To alleviate this challenging task, we again utilize the functional minimization result (Proposition 2) developed in Section 2. For any $h$, we denote the set of admissible distributions of the nominal model $P^{o}$ as $\mathbb{D}_{h}=\{\mu_{h}\}\cup\{d^{\pi}_{h,P^{o}}\,|\text{\,for any policy (including non-stationary)\,}\pi\}$. Now we redefine the dual loss for any $f_{h+1}\in\mathcal{F}_{h+1},\nu_{h}\in\mathbb{D}_{h}$, as

		$\displaystyle L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})=\mathbb{E}_{s,a\sim\nu_{h},s^{\prime}\sim P^{o}_{h,s,a}}[\lambda\varphi^{*}((g(s,a)-\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime}))/{\lambda})-g(s,a)].$		(12)

We state a direct consequence of Proposition 2 here.

Corollary 2.

Let $L_{\mathrm{dual}}$ be the loss function defined in (12). Fix $h\in[H]$ and consider any policy $\pi$ . Then, for any function $f_{h+1}:\mathcal{S}\times\mathcal{A}\rightarrow[0,H]$ and any $\nu_{h}\in\mathbb{D}_{h}$ , we have

		$\displaystyle\inf_{g\in L^{1}(\nu_{h})}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})=\mathbb{E}_{s,a\sim\nu_{h}}\Big[\inf_{\eta\in\Theta}(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{h,s,a}}[\varphi^{*}({(\eta-\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})]-\eta)\Big].$		(13)

For any $h$ and any given $f_{h+1}:\mathcal{S}\times\mathcal{A}\rightarrow[0,H]$, we redefine the operator $\mathcal{T}_{g}$ for all $g\in\mathcal{G}_{h}$ as

		$\displaystyle(\mathcal{T}_{g}f_{h+1})(s,a)=r_{h}(s,a)-\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{h,s,a}}[\varphi^{*}({(g(s,a)-\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})]+g(s,a).$		(14)

We have all the necessary tools now. In the following subsection, we develop an algorithm that naturally extends our RPQ algorithm using adaptive datasets.

3.2 Hybrid Robust regularized Q-iteration

In this section, we propose our algorithm based on the tools developed so far. Our proposed algorithm is called the Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q) algorithm, summarized in Algorithm 2. The total variation $\varphi$-divergence $D_{\mathrm{TV}}$ (1) is defined with $\varphi(t)=|t-1|/2$. The inputs to this algorithm are the offline dataset and two general function classes $\mathcal{F}=\otimes_{h\in[H]}\mathcal{F}_{h},\mathcal{G}=\otimes_{h\in[H]}\mathcal{G}_{h}$. For any $h\in[H]$, $\mathcal{F}_{h}\subset(f:\mathcal{S}\times\mathcal{A}\rightarrow[0,H])$ and $\mathcal{G}_{h}\subset(g:\mathcal{S}\times\mathcal{A}\rightarrow[0,\lambda])$ represent action-value functions and dual-variable functions at $h$, respectively. We redefine, using (17), the empirical dual loss and the empirical squared robust regularized Bellman error for dataset $\mathcal{D}$ as

		$\displaystyle\widehat{L}_{\mathrm{dual}}(g;f,\mathcal{D})=\mathbb{E}_{\mathcal{D}}[(g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))_{+}-g(s_{i},a_{i})]\quad\text{and}$		(15)
		$\displaystyle\widehat{L}_{\mathrm{robQ}}(Q;f,g,\mathcal{D})=\mathbb{E}_{\mathcal{D}}[(r_{h}(s_{i},a_{i})-(g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}^{\prime},a^{\prime}))_{+}+g(s_{i},a_{i})-Q(s_{i},a_{i}))^{2}].$		(16)
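
The two empirical losses (15) and (16) can be written down directly. The sketch below assumes, purely for illustration, that $g$, $f$, $Q$ and $r_{h}$ are stored as tabular arrays of shape `(S, A)` and that the dataset is a dictionary of integer index arrays; neither the data layout nor the function names come from the paper.

```python
import numpy as np


def empirical_dual_loss(g, f_next, batch):
    """Eq. (15): average of (g(s,a) - max_a' f(s',a'))_+ - g(s,a) over the dataset."""
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]
    g_sa = g[s, a]
    v_next = f_next[s_next].max(axis=1)
    return float(np.mean(np.maximum(g_sa - v_next, 0.0) - g_sa))


def empirical_robust_q_loss(Q, f_next, g, r_h, batch):
    """Eq. (16): mean squared gap between Q(s,a) and the dual-based regression target."""
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]
    g_sa = g[s, a]
    v_next = f_next[s_next].max(axis=1)
    target = r_h[s, a] - np.maximum(g_sa - v_next, 0.0) + g_sa
    return float(np.mean((target - Q[s, a]) ** 2))


if __name__ == "__main__":
    batch = {"s": np.array([0, 1]), "a": np.array([0, 0]), "s_next": np.array([0, 1])}
    g = np.array([[1.0], [0.5]])
    f_next = np.array([[0.2], [2.0]])
    r_h = np.array([[1.0], [0.0]])
    Q = np.array([[1.0], [0.5]])
    print(empirical_dual_loss(g, f_next, batch))
    print(empirical_robust_q_loss(Q, f_next, g, r_h, batch))
```

In practice $\mathcal{F}_{h}$ and $\mathcal{G}_{h}$ would be parametric classes such as neural networks, and the same two expressions would serve as training objectives.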

Algorithm 2 HyTQ Algorithm

1: Input: Offline dataset $\mathcal{D}^{\mu}_{h}\sim\mu_{h}$ of size $m_{\mathrm{off}}=T$ for $h\in[H]$, general function classes $\mathcal{F}$ and $\mathcal{G}$
2: Initialize: $Q^{0}_{h}\equiv 0\in\mathcal{F}_{h}$
3: for $k=0,\cdots,K-1$
4:     Compute $\pi_{k}$: $\pi_{k,h}(s)=\operatorname*{arg\,max}_{a}Q^{k}_{h}(s,a)$ $\forall h$, collect $m_{\mathrm{on}}=1$ online dataset $\mathcal{D}^{k}_{h}\sim d_{h,P^{o}}^{\pi_{k}}$
6:     Initialize: $Q^{k+1}_{H}\equiv 0\in\mathcal{F}_{H}$
7:     for $h=H-1,\cdots,0$
8:         Aggregate adaptive dataset $\mathcal{D}^{k}_{h}=\mathcal{D}^{\mu}_{h}+\sum_{\tau=0}^{k}\mathcal{D}^{\tau}_{h}$
9:         Dual variable function minimization: (c.f. (15)) $g^{k+1}_{h}=\operatorname*{arg\,min}_{g\in\mathcal{G}_{h}}\widehat{L}_{\mathrm{dual}}(g;Q^{k+1}_{h+1},\mathcal{D}^{k}_{h})$
10:        Robust $\varphi$-regularized Q-update: (c.f. (16)) $Q^{k+1}_{h}=\operatorname*{arg\,min}_{Q\in\mathcal{F}_{h}}\widehat{L}_{\mathrm{robQ}}(Q;Q^{k+1}_{h+1},g^{k+1}_{h},\mathcal{D}^{k}_{h})$
11:    end for
12: end for
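
The loop structure of Algorithm 2 can be illustrated with a small tabular sketch. It is not the authors' implementation and makes several simplifying assumptions: the nominal model is a known array used only as a sampler, $\mu_{h}$ is uniform over state-action pairs, $\mathcal{F}_{h}$ is the class of all tabular functions (so the least-squares step is a per-pair average of targets), and $\mathcal{G}_{h}$ is a finite grid on $[0,\lambda]$ per state-action pair (so the dual minimization separates across pairs). State-action pairs never observed at a timestep keep the value zero.

```python
import numpy as np


def hytq_tabular(P, r, d0, lam, K, T, grid=21, seed=0):
    """Tabular sketch of Algorithm 2; P has shape (H, S, A, S), r has shape (H, S, A)."""
    rng = np.random.default_rng(seed)
    H, S, A, _ = P.shape
    g_grid = np.linspace(0.0, lam, grid)           # candidate dual values in [0, lam]

    # Step 1: offline data with (s, a) uniform (an assumed mu_h) and s' from the nominal model.
    data = []
    for h in range(H):
        batch = []
        for _ in range(T):
            s, a = int(rng.integers(S)), int(rng.integers(A))
            batch.append((s, a, int(rng.choice(S, p=P[h, s, a]))))
        data.append(batch)

    Q = np.zeros((H + 1, S, A))                    # Step 2: Q^0 = 0
    policies = []
    for k in range(K):                             # Step 3
        pi_k = Q[:H].argmax(axis=2)                # Step 4: greedy policy w.r.t. Q^k
        policies.append(pi_k)
        s = int(rng.choice(S, p=d0))               # one rollout gives m_on = 1 sample per h
        for h in range(H):
            a = int(pi_k[h, s])
            s_next = int(rng.choice(S, p=P[h, s, a]))
            data[h].append((s, a, s_next))         # Step 8: aggregate the adaptive dataset
            s = s_next
        Q_new = np.zeros((H + 1, S, A))            # Step 6: Q^{k+1}_H = 0
        for h in range(H - 1, -1, -1):             # Step 7: backward induction
            v_next = Q_new[h + 1].max(axis=1)
            groups = {}
            for s_i, a_i, s_n in data[h]:
                groups.setdefault((s_i, a_i), []).append(v_next[s_n])
            for (s_i, a_i), vals in groups.items():
                v = np.asarray(vals)
                # Step 9: minimize the empirical dual loss (15) over the grid.
                loss = np.maximum(g_grid[:, None] - v[None, :], 0.0).mean(axis=1) - g_grid
                g = g_grid[loss.argmin()]
                # Step 10: the least-squares fit (16) is the mean regression target.
                Q_new[h, s_i, a_i] = r[h, s_i, a_i] + np.mean(g - np.maximum(g - v, 0.0))
        Q = Q_new
    return Q, policies


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(4), size=(3, 4, 2))
    r = rng.uniform(size=(3, 4, 2))
    Q, policies = hytq_tabular(P, r, np.full(4, 0.25), lam=1.0, K=5, T=50)
    print(Q[0])
```

With general function classes, Steps 9 and 10 become an empirical risk minimization and a regression problem over $\mathcal{G}_{h}$ and $\mathcal{F}_{h}$ respectively, which is what keeps the update tractable beyond the tabular case.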

3.3 Cumulative Suboptimality Guarantee

We now discuss the performance guarantee in terms of the cumulative suboptimality of our HyTQ Algorithm. We first mention all the assumptions before we present our main result and add a brief discussion. We provide detailed discussion in Section 4.

Assumption 4 (Robust Bellman Error Transfer Coefficient).

Let $\mu_{h}\in\Delta(\mathcal{S}\times\mathcal{A})$ be the offline data generating distribution. For any $f\in\mathcal{F}$ , there exists a small positive constant $C(\pi^{*})$ for the optimal policy $\pi^{*}$ that satisfies

$\displaystyle\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{h,\pi^{*}}_{P^{o}}}[\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\leq C(\pi^{*}).$

We adapt this assumption from the non-robust hybrid RL work of Song et al., (2023).

Assumption 5 (Approximate Value Realizability and Robust Bellman Completeness).

Let $\varepsilon_{\mathcal{F},\mathrm{r}}\geq 0$ be a small constant. For any $h\in[H]$ and $g_{h}\in\mathcal{G}_{h}$, we have $\inf_{f\in\mathcal{F}_{h}}\sup_{\nu_{h}\in\mathbb{D}_{h}}\|f-\mathcal{T}_{g_{h}}f_{h+1}\|_{2,\nu_{h}}^{2}\leq\varepsilon_{\mathcal{F},\mathrm{r}}$. Furthermore, for any $f_{h+1}\in\mathcal{F}_{h+1}$, we have $\mathcal{T}_{g_{h}}f_{h+1}\in\mathcal{F}_{h}$.

Assumption 6 (Approximate Dual Realizability).

Let $\varepsilon_{\mathcal{G}}$ be some small positive constant. For any $h\in[H]$ and $f_{h+1}\in\mathcal{F}_{h+1}$, we have $\inf_{g\in\mathcal{G}_{h}}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})-\inf_{g\in L^{1}(\nu_{h})}L_{\mathrm{dual}}(g;f_{h+1},\nu_{h})\leq\varepsilon_{\mathcal{G}}$, for all $\nu_{h}\in\mathbb{D}_{h}$.

We adapt these two enhanced realizability assumptions from the non-robust offline RL literature (Xie et al.,, 2021; Foster et al.,, 2022; Song et al.,, 2023) to our problem. The assumptions in Section 2 are not directly comparable, but for the sake of exposition, let $\mathcal{F}_{h},\mathcal{G}_{h}$ be the same across $h$. First, note that Assumption 3 combined with all-policy concentrability (Assumption 1) is equivalent to Assumption 6. Second, Assumption 2 implies $\inf_{f\in\mathcal{F}}\|f-\mathcal{T}_{g}f\|_{2,\mu}^{2}\leq\varepsilon_{\mathcal{F}}$. Combined again with all-policy concentrability (Assumption 1), this yields the approximate value realizability in Assumption 5. We know that non-robust offline RL is hard (Foster et al.,, 2022) with just realizability and all-policy concentrability. As robust RL is at least as hard as its non-robust counterpart (Panaganti and Kalathil,, 2022), we also assume Bellman completeness in Assumption 5.

Assumption 7 (Bilinear Models).

Consider any $f\in\mathcal{F},g\in\mathcal{G}$ and $h\in[H]$. Let $\pi^{f}$ be the greedy policy w.r.t. $f$. There exist an unknown feature mapping $X_{h}:\mathcal{F}\mapsto\mathbb{R}^{d}$ and two unknown weight mappings $W^{\mathrm{q}}_{h},W^{\mathrm{d}}_{h}:\mathcal{F}\times\mathcal{G}\mapsto\mathbb{R}^{d}$ with $\max_{f}\|X_{h}(f)\|_{2}\leq B_{X}$ and $\max_{f,g}\max\{\|W^{\mathrm{q}}_{h}(f,g)\|_{2},\|W^{\mathrm{d}}_{h}(f,g)\|_{2}\}\leq B_{W}$ such that both $\mathbb{E}_{d^{\pi^{f}}_{h}}[(f_{h}(s,a)-\mathcal{T}_{g_{h}}f_{h+1}(s,a))_{+}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle\right\rvert$ and $\mathbb{E}_{d^{\pi^{f}}_{h}}[(\mathcal{T}_{g_{h}}f_{h+1}(s,a)-\mathcal{T}f_{h+1}(s,a))_{+}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle\right\rvert$ hold.

We adapt this problem architecture assumption on $P^{o}$ with $\mathcal{F}$ and $\mathcal{G}$ for our setting from a series of non-robust online RL works (Jin et al., 2021a, ; Du et al.,, 2021).

Assumption 8 (Fail-state).

There is a fail state $s_{f,h}$ for all $h\in[H]$, such that $r_{h}(s_{f},a)=0$ and $P_{h,s_{f},a}(s_{f,h})=1$, for all $a\in\mathcal{A}$ and $P\in\mathcal{P}$ satisfying $D_{\mathrm{TV}}(P_{h^{\prime},s^{\prime},a^{\prime}},P^{o}_{h^{\prime},s^{\prime},a^{\prime}})\leq\max\{1,H/\lambda\}$ for all $h^{\prime},s^{\prime},a^{\prime}$.

This assumption enables us to ground the value of such $P$'s at $s_{f,h}$ to zero, which helps us to get a tight duality (c.f. (17)) without having to know the minimum value across a large $\mathcal{S}$. There are approximations to this in the literature (Wang and Zou,, 2022), but we adopt this less restrictive assumption from Panaganti et al., (2022) for convenience.

Now we state our main theoretical result on the performance of the HyTQ algorithm. The proof is presented in Appendix E.

Theorem 2.

Let Assumptions 4, 5, 6, 7 and 8 hold. Fix any $\delta\in(0,1)$. Then, the HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$ satisfy $\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq\widetilde{\mathcal{O}}(\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dH^{2}K}(\lambda+H)\log(|\mathcal{F}||\mathcal{G}|/\delta))$ with probability at least $1-\delta$.

Remark 2.

We specialize this result for bilinear model examples, namely the linear occupancy complexity model (Du et al.,, 2021, Definition 4.7) and the low-rank feature selection model (Du et al.,, 2021, Definition A.1), in Section E.2. We also specialize this result using the standard online-to-batch conversion (Shalev-Shwartz and Ben-David,, 2014) for the uniform policy over the HyTQ policies $\{\pi_{k}\}_{k\in[K]}$ to provide the sample complexity $\widetilde{\mathcal{O}}(\max\{(C(\pi^{*}))^{2},1\}dH^{3}(\lambda+H)^{2}(\log(|\mathcal{F}||\mathcal{G}|/\delta))^{2})/{\varepsilon^{2}}$ in Section E.2.
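
The online-to-batch step can be traced with a short calculation. The derivation below is a sketch under our own assumptions: the approximation errors $\varepsilon_{\mathcal{F},\mathrm{r}}$ and $\varepsilon_{\mathcal{G}}$ are set to zero, $\bar{\pi}$ (our notation) denotes the policy that picks one of $\pi_{0},\dots,\pi_{K-1}$ uniformly at random, and the total number of samples $N$ is taken to be proportional to $HK$ since each iteration collects one online sample per timestep.

```latex
\begin{align*}
V^{\pi^{*}}-V^{\bar{\pi}}
  &=\frac{1}{K}\sum_{k=0}^{K-1}\bigl(V^{\pi^{*}}-V^{\pi_{k}}\bigr)
   \leq\widetilde{\mathcal{O}}\Bigl(\max\{C(\pi^{*}),1\}\sqrt{\tfrac{dH^{2}}{K}}\,(\lambda+H)\log(|\mathcal{F}||\mathcal{G}|/\delta)\Bigr),\\
V^{\pi^{*}}-V^{\bar{\pi}}\leq\varepsilon
  &\quad\text{whenever}\quad
   K\geq\widetilde{\mathcal{O}}\Bigl(\frac{\max\{(C(\pi^{*}))^{2},1\}\,dH^{2}(\lambda+H)^{2}(\log(|\mathcal{F}||\mathcal{G}|/\delta))^{2}}{\varepsilon^{2}}\Bigr),\\
N\propto HK
  &\quad\Longrightarrow\quad
   N\geq\widetilde{\mathcal{O}}\Bigl(\frac{\max\{(C(\pi^{*}))^{2},1\}\,dH^{3}(\lambda+H)^{2}(\log(|\mathcal{F}||\mathcal{G}|/\delta))^{2}}{\varepsilon^{2}}\Bigr).
\end{align*}
```

The last line matches the stated sample complexity, and substituting $K\propto N/H$ into the first line gives a suboptimality of order $\sqrt{dH^{3}}/\sqrt{N}$ up to the remaining factors.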

4 Theoretical Discussions and Final Remarks

In this section, we first discuss the proof ideas for our results, focusing on discussions of the assumptions and their improvements. Next, we compare our results with the most relevant ones from the robust RL literature. Our Table 1 should be used as a reference. Finally, we discuss the bilinear model architecture in detail, as ours is the first work to consider it in the robust RL setting under the general function architecture for the value and dual functions approximations.

Discussions on Proof Sketch: We first discuss our RPQ algorithm (Algorithm 1) result. We note that the concentrability assumption (Assumption 1) requires the data-generating policy to be robust exploratory. That is, it covers the state-action occupancy induced by any policy and any transition model in the $\varphi$-divergence set. We follow the proof idea of the suboptimality result (Panaganti et al.,, 2022, Theorem 1) of the RFQI algorithm (Panaganti et al.,, 2022, Algorithm 1). We highlight the most important differences with Panaganti et al., (2022); Zhang et al., (2023) here. Firstly, we generalize the robust performance lemma ($\mathbb{E}_{s_{0}\sim d_{0}}[{V}^{\pi^{*}}]-\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi_{K}}]\leq 2\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu}/(1-\gamma)$ at Eq. 26) for any general $\varphi$-divergence problem. Secondly, we identify that it is hard to come up with a unified analysis for general $\varphi$-divergences in the robust RL setting via the dual reformulation of the distributionally robust optimization problem (Duchi and Namkoong,, 2018, Proposition 1). Thus, a direct extension of the results in Panaganti et al., (2022) is hard for general $\varphi$-divergences. Through the RPQ analysis, we show that it is indeed possible to get a unified analysis for the robust RL problem using the RRMDP framework. Thirdly, we show the generalization bounds for the empirical risk minimization (Proposition 7) and least squares (Proposition 8) estimators for general $\varphi$-divergences with unified results. With these three points, equipped with the more general robust exploratory concentrability (Assumption 1), we obtain a unified suboptimality result for general $\varphi$-divergences (Theorem 1) for the RPQ algorithm.

We now discuss our HyTQ algorithm (Algorithm 2) result. We first make an important note. The concentrability assumption improvement is two-fold: from all-policy concentrability (Assumption 9) to single-policy concentrability, and then to the robust Bellman error transfer coefficient (Assumption 4) via Lemma 8. We refer to Foster et al., (2022); Song et al., (2023) for further discussion on such concentrability assumption improvements and their tightness in non-robust offline RL. We leave tightening these assumptions in the robust RL setting to future work. We execute a tighter analysis in our HyTQ algorithm result (Theorem 2) compared to our RPQ algorithm result specialized to the TV $\varphi$-divergence (Theorem 4). We summarize the steps as follows:
Step $(a)$: We carefully arrive at the following robust performance lemma (c.f. Eqs. 37 and 39) for each algorithm iteration $k$: $\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq\textstyle\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(\mathcal{T}Q^{k}_{h+1}(s,a)-Q^{k}_{h}(s,a))_{+}]+\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[(Q^{k}_{h}(s,a)-\mathcal{T}Q^{k}_{h+1}(s,a))_{+}].$ We highlight that the first summand depends on samples from the state-action occupancy of the optimal robust policy, whereas the second summand depends on samples from the occupancy of the learned HyTQ policies. It is now intuitive to connect the first summand with the offline samples and the second with the online samples.
Step $(b)$: With the above intuition, we first note that the history-dependent dataset, collected by the offline data-generating policy and the learned HyTQ policies on the nominal model $P^{o}$, warrants more sophisticated generalization bounds for the empirical risk minimization and least squares estimators. We prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process in Appendix C. This result is applicable to other machine learning problems beyond the scope of this paper as well. Then, equipped with the transfer coefficient (Assumption 4) and bilinear model (Assumption 7) assumptions for the nominal model $P^{o}$, we formally show generalization bounds for the empirical risk minimization and least squares estimators in Propositions 9 and 10, respectively.
We complete the proof by combining these two steps.

Remark 3.

Our RPQ and HyTQ algorithms are computationally tractable because they rely on two computationally tractable estimators: empirical risk minimization (Steps 4 & 9 resp.) over the general function class $\mathcal{G}$, and least squares (Steps 5 & 10 resp.) over the general function class $\mathcal{F}$. This two-step estimator update avoids the complexity of solving the inner problem for each state-action pair (which leads to scaling issues for high-dimensional problems) in the original robust Bellman operators (Eqs. 3 and 10). To the best of our knowledge, no purely online or purely offline robust RL algorithms are known to be tractable in this sense, except other robust Q-iteration and actor-critic methods (discussed in Table 1) and except under much stronger coverage conditions (like single-policy and uniform) in the tabular setting.

Theoretical Guarantee Discussions: In the suboptimality result (Theorem 1) for the RPQ algorithm (Algorithm 1), we only mention the leading statistical bound with a problem-dependent (on $\varphi$ -divergence) constant $c_{\varphi}(\lambda,\gamma)$ . We provide the exact constants pertaining to different $\varphi$ -divergences in a restated statement of Theorem 1 in Theorem 3. Furthermore, the constants $c_{1},c_{2},c_{3}$ in Theorem 3 take different values for different $\varphi$ -divergences provided in Proposition 3. Similarly, for the suboptimality result (Theorem 2) of the HyTQ algorithm (Algorithm 2), we provide a more detailed bound in a restated statement in Theorem 5.

In the following, we compare our suboptimality results with relevant prior works. But first, we make an important note on $\rho$, the robustness radius parameter in RMDPs, and $\lambda$, the robustness penalization parameter in RRMDPs, mentioned briefly in Table 1. Levy et al., (2020) and Yang et al., (2023) establish that the regularized and constrained versions of DRO and robust MDP problems, respectively, are equivalent by connecting their respective robustness parameters ($\lambda$ and $\rho$). Moreover, both rigorously observe that $\lambda$ and $\rho$ are inversely related. This is intuitively true, as $\lambda\to\infty$ and $\rho\to 0$ both yield the non-robust solutions on the nominal model $P^{o}$, while $\lambda\to 0$ and $\rho\to\infty$ both yield the conservative solutions that consider the entire probability simplex for the transition dynamics. However, it is an interesting open problem to establish an exact analytical relation between the robustness parameters $\lambda$ and $\rho$. We leave this to future research as it is out of the scope of this work.
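
The two limits of $\lambda$ can be seen on a three-state toy example. The sketch below is an illustration under our own assumptions: it uses the KL divergence, for which the regularized inner problem has the standard closed form $-\lambda\log\mathbb{E}_{P^{o}}[\exp(-V/\lambda)]$, and a nominal distribution with full support. The numbers and the function name are hypothetical.

```python
import numpy as np


def kl_regularized_value(v, p_nominal, lam):
    """inf_P { E_P[v] + lam * KL(P || P^o) } for a full-support nominal distribution."""
    z = -v / lam
    z_max = z.max()  # stable log-sum-exp
    return float(-lam * (z_max + np.log(np.sum(p_nominal * np.exp(z - z_max)))))


if __name__ == "__main__":
    v = np.array([0.0, 1.0, 3.0])
    p = np.array([0.2, 0.5, 0.3])
    for lam in [1e-3, 0.1, 1.0, 10.0, 1e4]:
        # Small lam approaches the worst case min(v); large lam approaches E_{P^o}[v].
        print(lam, kl_regularized_value(v, p, lam))
```

The printed values increase with $\lambda$ from roughly the worst-case value towards the nominal expectation, mirroring the behavior of a shrinking radius $\rho$ in the constrained formulation.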

Here we specialize our result (Theorem 3) for the chi-square $\varphi$-divergence $\gamma$ R³L problem. We get the suboptimality for the RPQ algorithm as $\widetilde{\mathcal{O}}\left(\frac{\max\{\frac{1}{\lambda(1-\gamma)^{2}},\lambda\}\sqrt{C\log({|\mathcal{F}||\mathcal{G}|})}}{(1-\gamma)^{2}\sqrt{N}}\right)$, where we have presented only the higher-order terms. The suboptimality of Algorithm 2 in Yang et al., (2023, Theorem 5.1) for the chi-square $\varphi$-divergence is stated for $\lambda=1/(1-\gamma)$ as $\widetilde{\mathcal{O}}\left(\frac{\max\{\frac{1}{(1-\gamma)^{2}},\sqrt{\log(|\mathcal{S}||\mathcal{A}|)}\}}{d^{3}_{\min}(1-\gamma)^{3}N^{1/3}}\right)$, where $d_{\min}$ is described in Table 1. We use the typical equivalence from the RL literature for comparison between these two results in the tabular setting with a generative/simulator modeling assumption: function approximation classes with full dimension yield $\log(|\mathcal{F}||\mathcal{G}|)=O(|\mathcal{S}||\mathcal{A}|)$ (Panaganti et al.,, 2022), and uniform support data sampling yields $\mu_{\min}=1/(|\mathcal{S}||\mathcal{A}|)$ and $C\leq|\mathcal{S}||\mathcal{A}|$ (Shi et al.,, 2023). Now our result with $\lambda=1/(1-\gamma)$ reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}\right)$ and their result (Yang et al.,, 2023) reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}|^{3}|\mathcal{A}|^{3}\max\{\frac{1}{(1-\gamma)^{2}},\sqrt{\log(|\mathcal{S}||\mathcal{A}|)}\}}{(1-\gamma)^{3}N^{1/3}}\right)$. Two comments warrant attention here. Firstly, compared to a model-based robust regularized algorithm (robust value iteration using empirical estimates of the nominal model $P^{o}$) (Yang et al.,, 2023, Theorem 3.2), our suboptimality bound is worse by the factors $\sqrt{|\mathcal{S}||\mathcal{A}|}$ and $1/(1-\gamma)$. We leave it to future work to fine-tune and get optimal rates. Secondly, the result of Yang et al., (2023, Theorem 5.1) exhibits inferior performance compared to ours in all parameters, but we do want to note that they make a first attempt to give suboptimality bounds for a stochastic approximation-based algorithm. The dependence on $|\mathcal{S}||\mathcal{A}|$ is typically known to be bad when using stochastic approximation as a technical tool (Chen et al.,, 2022), and Yang et al., (2023, Discussion on Page 16) conjecture that the Polyak-averaging technique can improve their suboptimality bound rate to $N^{-1/2}$.
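
The tabular reduction of the RPQ bound used in this comparison can be spelled out step by step. The lines below simply substitute $\lambda=1/(1-\gamma)$, $\log(|\mathcal{F}||\mathcal{G}|)=O(|\mathcal{S}||\mathcal{A}|)$ and $C\leq|\mathcal{S}||\mathcal{A}|$ into the chi-square bound; constants and logarithmic factors are suppressed, which we indicate with $\lesssim$.

```latex
\begin{align*}
\max\Bigl\{\tfrac{1}{\lambda(1-\gamma)^{2}},\,\lambda\Bigr\}\Big|_{\lambda=1/(1-\gamma)}
  &=\max\Bigl\{\tfrac{1}{1-\gamma},\,\tfrac{1}{1-\gamma}\Bigr\}=\frac{1}{1-\gamma},\\
\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}
  &\lesssim\sqrt{|\mathcal{S}||\mathcal{A}|\cdot|\mathcal{S}||\mathcal{A}|}=|\mathcal{S}||\mathcal{A}|,\\
\frac{\max\{\frac{1}{\lambda(1-\gamma)^{2}},\lambda\}\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{2}\sqrt{N}}
  &\lesssim\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}.
\end{align*}
```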

Here we discuss and compare our result for the total variation $\varphi$ -divergence setting. As mentioned in Remark 1, we have a specialized result in Section D.2 for the total variation $\varphi$ -divergence. We get the suboptimality result (Theorem 4) for the RPQ algorithm as $\widetilde{\mathcal{O}}\left(\frac{\lambda\sqrt{C_{\mathrm{tv}}\log({|\mathcal% {F}||\mathcal{G}|})}}{(1-\gamma)^{3}\sqrt{N}}\right)$ , where we again only have presented the higher-order terms. Panaganti et al., (2022, Theorem 1) mentioned in Table 1 also exhibits same suboptimality guarantee replacing $\lambda$ with $\rho^{-1}$ . As we noted before, $\rho$ (the robustness radius parameter in RMDPs) and $\lambda$ (the robustness penalization parameter in RRMDPs) are inversely related, and for the TV $\varphi$ -divergence we observe a straightforward relation between the two as $\lambda=\rho^{-1}$ . Using the earlier arguments for a tabular setting bound, our result further reduces to $\widetilde{\mathcal{O}}\left(\frac{\lambda|\mathcal{S}||\mathcal{A}|}{(1-% \gamma)^{3}\sqrt{N}}\right)$ . Now comparing this to the minimax lower bound (Shi et al.,, 2023, Theorem 2), our suboptimality bound is worse off by the factors $\sqrt{|\mathcal{S}||\mathcal{A}|}$ and $1/(1-\gamma)$ . Nevertheless, we push the boundaries by providing novel suboptimality guarantee studying the robust RL problem in the hybrid RL setting. Furthermore, as mentioned earlier in Remark 2, we provide the offline+online robust RL suboptimality guarantee $\widetilde{\mathcal{O}}\left({\max\{C(\pi^{*}),1\}\sqrt{dH^{3}}(\lambda+H)\log% (|\mathcal{F}||\mathcal{G}|/\delta)}/{\sqrt{N}}\right)$ in the Appendix E. We also remark that the HyTQ algorithm can be proposed under the RMDP setting with a similar suboptimality guarantee due to the similarity of the dual Bellman equations under the TV $\varphi$ -divergence for RMDPs and RRMDPs (c.f. Eq. 33 and Xu^∗ et al., (2023, Lemma 8)). 
For the sake of consistency and novelty, we present our results solely for the RRMDP setting. As mentioned earlier, the concentrability assumption improvement is two-fold (Lemma 8): from all-policy concentrability (Assumption 9) to single-policy concentrability, and then to the transfer coefficient. This is a first-of-its-kind result, and there are not yet any lower bounds to compare against in the robust RL setting. Under similar transfer coefficient, Bellman completeness, and bilinear model assumptions, the HyTQ algorithm sample complexity (Corollary 5) is comparable to that of a non-robust RL algorithm (Song et al.,, 2023), i.e., $\widetilde{\mathcal{O}}({\max\{(C(\pi^{*}))^{2},1\}dH^{5}}\log(H|\mathcal{F}|/\delta)/{\varepsilon^{2}})$ . We leave developing minimax rates and optimal algorithm guarantees to future work.

Here we specialize our result (Theorem 3) for the KL $\varphi$ -divergence $\gamma$ R³L problem. We get the suboptimality for RPQ as $\widetilde{\mathcal{O}}\left(\frac{(\lambda+(1-\gamma)^{-1})\exp\{(\lambda(1-\gamma))^{-1}\}\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}}{(1-\gamma)^{2}\sqrt{N}}\right)$ , where we have presented only the higher-order terms. Using the earlier arguments for a tabular setting bound, our result with $\lambda=1/(1-\gamma)$ again reduces to $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\sqrt{N}}\right)$ . Zhang et al., (2023, Theorem 5), mentioned in Table 1, also exhibits the same suboptimality guarantee. Two remarks are in order here. Firstly, we remark that our RPQ algorithm and its theoretical guarantee apply in a unified way to a class of $\varphi$ -divergences, whereas Zhang et al., (2023, Algorithm 1) is specialized for the KL $\varphi$ -divergence. This ties back to our first main contribution discussed in Section 1. Secondly, we remark that the robust regularized Bellman operator (Eq. 3) for the KL $\varphi$ -divergence has a special form due to the existence of an analytical worst-case transition model. This leads to a special structure in the form of an exponential robust Bellman operator in a Q-value-variant space. This special structure helps avoid the dual variable function update (Step 4) in the RPQ algorithm and the $\log(|\mathcal{G}|)$ factor in the suboptimality guarantee. We choose not to include this specialized result in this work (as we did for the TV $\varphi$ -divergence in Section D.2) and directly point to Zhang et al., (2023). We do highlight here an important note for such a choice in our paper. The above-mentioned special structure forces us to get online samples from all the transition kernels (cf. Assumption 1), which is unrealistic in practice, to achieve an improvement in the hybrid robust RL setting. 
We leave developing such improved algorithm guarantees in the hybrid robust RL setting for other $\varphi$ -divergences to future work.
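
As a quick check of the $\lambda$ -dependence in the KL bound above, the substitution $\lambda=1/(1-\gamma)$ can be worked out explicitly. The sketch below tracks only the prefactor in $\lambda$ and $\gamma$ ; the step that replaces $\sqrt{C\log(|\mathcal{F}||\mathcal{G}|)}$ by a tabular quantity relies on the earlier arguments and is not reproduced here.

```latex
\begin{align*}
\lambda(1-\gamma) = 1
  \;&\Longrightarrow\; \exp\{(\lambda(1-\gamma))^{-1}\} = e, \\
\lambda + (1-\gamma)^{-1} &= \frac{2}{1-\gamma}, \\
\frac{(\lambda+(1-\gamma)^{-1})\exp\{(\lambda(1-\gamma))^{-1}\}}{(1-\gamma)^{2}}
  &= \frac{2e}{(1-\gamma)^{3}} .
\end{align*}
```

With this choice the exponential term becomes a constant, and the horizon dependence is $(1-\gamma)^{-3}$ , which agrees with the tabular rate stated above up to constants.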

Discussion of Bilinear Models in the Hybrid Robust RL setting: We emphasize that while our bilinear model for the HyTQ algorithm is specialized to low occupancy complexity (i.e. the occupancy measures themselves have a low-rank structure) and low-rank feature selection model (i.e. the nominal model $P^{o}$ has a low-rank structure) in Section E.2, the function classes $\mathcal{F}$ (Q-value representations) and $\mathcal{G}$ (dual-value representations) can be arbitrary, potentially nonlinear function classes (neural tangent kernels, neural networks, etc). Thus, even in the tabular setting with large state space (e.g. $|\mathcal{S}|>O(10^{5})$ ) for the bilinear model, our suboptimality bounds only scale with the complexity of the function classes $\mathcal{F}$ and $\mathcal{G}$ , which can be considerably lower than $|\mathcal{S}|$ . Examples include linear function approximators (e.g. linear feature dimension $d=\log(|\mathcal{F}||\mathcal{G}|)\ll|\mathcal{S}||\mathcal{A}|$ ), RKHS approximators with low-dimensional features, and neural tangent kernels with low effective neural net dimension, among others. Moreover, our work solves the robust RL problem, which carries more nuances and is at least as hard as the non-robust RL problem. Thus, since robust RL in the general function approximation setting is still an emerging research area, we believe it is currently out of scope for this work to cover more general bilinear model classes (Du et al.,, 2021). Nevertheless, our initial findings for robust RL by the HyTQ algorithm in the hybrid learning setting reveal the hardness of finding larger model classes for RRMDPs with general $\varphi$ -divergences.

We conclude this section with an exciting future research direction that remains unsolved in this paper: solving the hybrid robust RL problem for general $\varphi$ -divergences. In this work, we noticed while building hybrid learning for robust RL that one would require online samples from the worst-case model (i.e., the model that solves the inner problem in the robust Bellman operator Eq. 10) for general $\varphi$ -divergences, because the current analyses depend on the bilinear models. We use the dual reformulation for the total variation $\varphi$ -divergence and provide current results supporting the HyTQ algorithm. We remark that using the same approach for other general $\varphi$ -divergences, we get exponential dependence on the horizon factor. This warrants more sophisticated algorithm designs for the hybrid robust RL problem under general $\varphi$ -divergences.

5 Conclusion

In this work, we presented two robust RL algorithms. We proposed the Robust $\varphi$ -regularized fitted Q-iteration (RPQ) algorithm for general $\varphi$ -divergences in the offline RL setting. We provided performance guarantees with a unified analysis for all $\varphi$ -divergences with arbitrarily large state space using function approximation. To mitigate the out-of-data-distribution issue by improving the assumptions on data generation, we proposed a novel framework called hybrid robust RL that uses both offline and online interactions. We proposed the Hybrid robust Total-variation-regularized Q-iteration (HyTQ) algorithm in this framework with an accompanying guarantee. We have provided our theoretical guarantees in terms of suboptimality and sample complexity for both offline and offline+online robust RL settings. We also rigorously specialized our results to different $\varphi$ -divergences and different bilinear modeling assumptions. We have provided detailed comparisons with relevant prior works while also discussing interesting future directions in the field of robust reinforcement learning.

Acknowledgment

KP acknowledges support from the ‘PIMCO Postdoctoral Fellow in Data Science’ fellowship at the California Institute of Technology. This work acknowledges support from NSF CNS-2146814, CPS-2136197, CNS-2106403, NGSDI-2105648, and funding from the Resnick Institute. EM acknowledges support from NSF award 2240110. We thank several anonymous ICML 2024 reviewers for their constructive comments on an earlier draft of this paper.

References

Agarwal et al., (2019) Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
Antos et al., (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
Bertsimas et al., (2018) Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Math. Program., 167(2):235–292.
Blanchet et al., (2019) Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
Blanchet et al., (2023) Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. Advances in Neural Information Processing Systems, 36.
Botvinick et al., (2019) Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences, 23(5):408–422.
Bruns-Smith and Zhou, (2023) Bruns-Smith, D. and Zhou, A. (2023). Robust fitted-q-evaluation and iteration under sequentially exogenous unobserved confounders. arXiv preprint arXiv:2302.00662.
Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051.
Chen et al., (1996) Chen, J., Patton, R. J., and Zhang, H.-Y. (1996). Design of unknown input observers and robust fault detection filters. International Journal of control, 63(1):85–105.
Chen et al., (2020) Chen, R., Paschalidis, I. C., et al. (2020). Distributionally robust learning. Foundations and Trends® in Optimization, 4(1-2):1–243.
Chen et al., (2022) Chen, Z., Khodadadian, S., and Maguluri, S. T. (2022). Finite-sample analysis of off-policy natural actor–critic with linear function approximation. IEEE Control Systems Letters, 6:2611–2616.
Corporation, (2021) Corporation, N. (2021). Closing the sim2real gap with nvidia isaac sim and nvidia isaac replicator.
Csiszár, (1967) Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318.
Du et al., (2021) Du, S., Kakade, S., Lee, J., Lovett, S., Mahajan, G., Sun, W., and Wang, R. (2021). Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826–2836.
Duchi and Namkoong, (2018) Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
Farahmand et al., (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23.
Fawzi et al., (2022) Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. (2022). Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53.
Foster et al., (2022) Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. (2022). Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
Fujimoto and Gu, (2021) Fujimoto, S. and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145.
Fujimoto et al., (2019) Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062.
Gao and Kleywegt, (2022) Gao, R. and Kleywegt, A. (2022). Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research.
Huang et al., (2023) Huang, A., Chen, J., and Jiang, N. (2023). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning, pages 13710–13752.
Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
Jin et al., (2021a) Jin, C., Liu, Q., and Miryoosefi, S. (2021a). Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in neural information processing systems, 34:13406–13418.
Jin et al., (2021b) Jin, J., Zhang, B., Wang, H., and Wang, L. (2021b). Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34:2771–2782.
Kostrikov et al., (2021) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. (2021). Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783.
Kumar et al., (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794.
Kumar et al., (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191.
Lange et al., (2012) Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer.
Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
Lesort et al., (2020) Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., and Díaz-Rodríguez, N. (2020). Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information fusion, 58:52–68.
Levine et al., (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
Levy et al., (2020) Levy, D., Carmon, Y., Duchi, J. C., and Sidford, A. (2020). Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860.
Liang et al., (2023) Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
Liu et al., (2020) Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch off-policy reinforcement learning without great exploration. In Neural Information Processing Systems.
Liu et al., (2022) Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust $q$ -learning. In International Conference on Machine Learning, pages 13623–13643.
Mankowitz et al., (2020) Mankowitz, D. J., Levine, N., Jeong, R., Abdolmaleki, A., Springenberg, J. T., Shi, Y., Kay, J., Hester, T., Mann, T., and Riedmiller, M. (2020). Robust reinforcement learning for continuous control with model misspecification. In International Conference on Learning Representations.
Mannor et al., (2016) Mannor, S., Mebel, O., and Xu, H. (2016). Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509.
Maraun, (2016) Maraun, D. (2016). Bias correcting climate change simulations-a critical review. Current Climate Change Reports, 2:211–220.
Mirhoseini et al., (2021) Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W., Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Nazi, A., et al. (2021). A graph placement methodology for fast chip design. Nature, 594(7862):207–212.
Munos, (2003) Munos, R. (2003). Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567.
Munos, (2007) Munos, R. (2007). Performance bounds in l_p-norm for approximate value iteration. SIAM journal on control and optimization, 46(2):541–561.
Munos and Szepesvári, (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857.
Namkoong and Duchi, (2016) Namkoong, H. and Duchi, J. C. (2016). Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in neural information processing systems, 29.
Nilim and El Ghaoui, (2005) Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
Panaganti, (2023) Panaganti, K. (2023). Robust Reinforcement Learning: Theory and Algorithms. PhD thesis, Texas A&M University.
Panaganti and Kalathil, (2021a) Panaganti, K. and Kalathil, D. (2021a). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning (ICML), pages 511–520.
Panaganti and Kalathil, (2021b) Panaganti, K. and Kalathil, D. (2021b). Sample complexity of model-based robust reinforcement learning. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 2240–2245.
Panaganti and Kalathil, (2022) Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 9582–9602.
Panaganti et al., (2022) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022). Robust reinforcement learning using offline data. Advances in Neural Information Processing Systems (NeurIPS).
Panaganti et al., (2023a) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023a). Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. arXiv preprint arXiv:2310.18434.
Panaganti et al., (2023b) Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023b). Distributionally robust behavioral cloning for robust imitation learning. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 1342–1347.
Pioch et al., (2009) Pioch, N. J., Melhuish, J., Seidel, A., Santos Jr, E., Li, D., and Gorniak, M. (2009). Adversarial intent modeling using embedded simulation and temporal bayesian knowledge bases. In Modeling and Simulation for Military Operations IV, volume 7348, pages 115–126.
Robey et al., (2020) Robey, A., Hassani, H., and Pappas, G. J. (2020). Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247.
Rockafellar and Wets, (2009) Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis, volume 317. Springer Science & Business Media.
Russel and Petrik, (2019) Russel, R. H. and Petrik, M. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. Advances in Neural Information Processing Systems.
Scherrer et al., (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015). Approximate modified policy iteration and its application to the game of tetris. J. Mach. Learn. Res., 16(49):1629–1676.
Schmidt et al., (2015) Schmidt, T., Hertkorn, K., Newcombe, R., Marton, Z., Suppa, M., and Fox, D. (2015). Depth-based tracking with physical constraints for robot manipulation. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 119–126.
Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.
Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shah et al., (2018) Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, pages 621–635. Springer.
Shalev-Shwartz and Ben-David, (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
Shapiro, (2017) Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275.
Shi and Chi, (2022) Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. arXiv preprint arXiv:2208.05767.
Shi et al., (2023) Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M., and Chi, Y. (2023). The curious price of distributional robustness in reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 36.
Silver et al., (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.
Sinha et al., (2017) Sinha, A., Namkoong, H., and Duchi, J. C. (2017). Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
Song et al., (2023) Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. (2023). Hybrid rl: Using both offline and online data can make rl efficient. In The Eleventh International Conference on Learning Representations.
Sünderhauf et al., (2018) Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., et al. (2018). The limits and potentials of deep learning for robotics. The International journal of robotics research, 37(4-5):405–420.
Szepesvári and Munos, (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on Machine learning, pages 880–887.
Van Erven et al., (2015) Van Erven, T., Grunwald, P., Mehta, N. A., Reid, M., Williamson, R., et al. (2015). Fast rates in statistical and online learning. JMLR.
Vershynin, (2018) Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University press.
Wang et al., (2021) Wang, R., Foster, D., and Kakade, S. M. (2021). What are the statistical limits of offline RL with linear function approximation? In International Conference on Learning Representations.
Wang et al., (2023a) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023a). A finite sample complexity bound for distributionally robust q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3370–3398.
Wang et al., (2023b) Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). Sample complexity of variance-reduced distributionally robust q-learning. arXiv preprint arXiv:2305.18420.
Wang et al., (2023c) Wang, Y., Hu, Y., Xiong, J., and Zou, S. (2023c). Achieving minimax optimal sample complexity of offline reinforcement learning: A dro-based approach. arXiv preprint arXiv:2305.13289v2.
Wang and Zou, (2021) Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206.
Wang and Zou, (2022) Wang, Y. and Zou, S. (2022). Policy gradient method for robust reinforcement learning. In International Conference on Machine Learning, pages 23484–23526.
Wiesemann et al., (2013) Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
Xie et al., (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34.
Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505–2513.
Xu^∗ et al., (2023) Xu^∗, Z., Panaganti^∗, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
Yang et al., (2021) Yang, J., Zhou, K., Li, Y., and Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.
Yang et al., (2023) Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
Yu and Xu, (2015) Yu, P. and Xu, H. (2015). Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543.
Zhang et al., (2023) Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust mdps and risk-sensitive mdps: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
Zhou et al., (2023) Zhou, R., Liu, T., Cheng, M., Kalathil, D., Kumar, P., and Tian, C. (2023). Natural actor-critic for robust reinforcement learning with function approximation. In Thirty-seventh Conference on Neural Information Processing Systems.

☕ ☕ Supplementary Materials ☕ ☕

Appendix A Related Works ☕

Offline RL: Offline RL tackles the problem of learning an optimal policy using a minimal amount of offline/historical data collected according to a behavior policy (Lange et al.,, 2012; Levine et al.,, 2020). Due to offline data quality and no access to simulators or any world models for exploration, the offline RL problem suffers from the out-of-distribution (Robey et al.,, 2020; Yang et al.,, 2021) challenge. Many works (Fujimoto et al.,, 2019; Kumar et al.,, 2019, 2020; Fujimoto and Gu,, 2021; Kostrikov et al.,, 2021) have introduced deep offline RL algorithms aimed at alleviating the out-of-distribution issue by some variants of trust-region optimization (Schulman et al.,, 2015, 2017). The earliest and most promising theoretical investigations into model-free offline RL methodologies relied on the assumption of uniformly bounded concentrability, such as the approximate modified policy iteration (AMPI) algorithm (Scherrer et al.,, 2015) and the fitted Q-iteration (FQI) algorithm (Munos and Szepesvári,, 2008). This assumption mandates that the ratio of the state-action occupancy distribution induced by any policy to the data generating distribution remains uniformly bounded across all states and actions (Munos,, 2007; Antos et al.,, 2008; Munos and Szepesvári,, 2008; Farahmand et al.,, 2010; Chen and Jiang,, 2019). This makes offline RL particularly challenging (Foster et al.,, 2022) and there have been efforts to understand the limits of this setting.

Robust RL: The robust Markov decision process framework (Nilim and El Ghaoui,, 2005; Iyengar,, 2005) tackles the challenge of formulating a policy resilient to model discrepancies between training and testing environments. The robust reinforcement learning problem pursues this objective in the data-driven domain. Deploying simplistic RL policies (Corporation,, 2021) can lead to catastrophic outcomes when faced with evident disparities in models. The optimization techniques and analyses in robust RL draw inspiration from the distributionally robust optimization (DRO) toolkit in supervised learning (Duchi and Namkoong,, 2018; Shapiro,, 2017; Gao and Kleywegt,, 2022; Bertsimas et al.,, 2018; Namkoong and Duchi,, 2016; Blanchet et al.,, 2019). Many heuristic works (Xu and Mannor,, 2010; Wiesemann et al.,, 2013; Yu and Xu,, 2015; Mannor et al.,, 2016; Russel and Petrik,, 2019) show that robust RL is valuable in such scenarios involving disparities between a simulator model and the real-world model. Many recent works address fundamental issues of RMDPs, giving a concrete theoretical understanding in terms of sample complexity (Panaganti and Kalathil, 2021b; Panaganti and Kalathil,, 2022; Xu^∗ et al.,, 2023; Shi and Chi,, 2022; Shi et al.,, 2023). Many works (Panaganti and Kalathil, 2021a; Wang and Zou,, 2021; Panaganti and Kalathil,, 2022) devise model-free online and offline robust RL algorithms employing general function approximation to handle potentially infinite state spaces. Recent work (Panaganti et al., 2023b) introduces distributional robustness in the imitation learning setting. There have been works (Panaganti,, 2023; Panaganti et al., 2023a; Wang et al., 2023c) connecting robust RL with offline RL by linking notions of robustness and pessimism.

Appendix B Useful Technical Results ☕☕

We state the following result from the penalized distributionally robust optimization literature (Levy et al.,, 2020).

Lemma 1 (Levy et al.,, 2020, Section A.1.2).

Let $P^{o}$ be a distribution on the space $\mathcal{X}$ and let $l:\mathcal{X}\to\mathbb{R}$ be a loss function. For $\varphi$ -divergence (1), we have

\displaystyle\sup_{P\ll P^{o}}\mathbb{E}_{P}[l(X)-\lambda D_{\varphi}(P,P^{o})]=\inf_{\eta\in\mathbb{R}}\;\lambda\mathbb{E}_{P^{o}}\left[\varphi^{*}\left(\frac{l(X)-\eta}{\lambda}\right)\right]+\eta,

where $\varphi^{*}(s)=\sup_{t\geq 0}\{st-\varphi(t)\}$ is the Fenchel conjugate function of $\varphi$ . Moreover, the optimization on the right hand side is convex in $\eta$ .
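
The duality in Lemma 1 can be checked numerically on a small finite space. The sketch below is only an illustration under stated assumptions: it takes the KL case with $\varphi(t)=t\log t$ (the normalization in definition (1) may differ by an affine term, which does not change the divergence between probability distributions), so that $\varphi^{*}(s)=e^{s-1}$ . It uses the standard fact that the KL-penalized supremum is attained at the Gibbs distribution $P^{*}(x)\propto P^{o}(x)\exp(l(x)/\lambda)$ with value $\lambda\log\mathbb{E}_{P^{o}}[\exp(l(X)/\lambda)]$ . The function names and the example numbers are hypothetical.

```python
import numpy as np

def primal_kl(loss, p0, lam):
    """sup_P E_P[l] - lam * KL(P || P0), evaluated at the Gibbs maximizer P*."""
    w = p0 * np.exp((loss - loss.max()) / lam)  # shift by max(l) for stability
    p = w / w.sum()                              # P*(x) proportional to P0(x) exp(l(x) / lam)
    kl = np.sum(p * np.log(p / p0))
    return float(p @ loss - lam * kl)

def dual_objective(eta, loss, p0, lam):
    """lam * E_{P0}[phi*((l - eta) / lam)] + eta with phi*(s) = exp(s - 1)."""
    return float(lam * np.sum(p0 * np.exp((loss - eta) / lam - 1.0)) + eta)

def dual_kl(loss, p0, lam, iters=200):
    """Minimize the convex one-dimensional dual over eta by golden-section search."""
    # The minimizer lies in [min(l) - lam, max(l) - lam], so this bracket contains it.
    a, b = loss.min() - 2.0 * lam, loss.max()
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if dual_objective(c, loss, p0, lam) < dual_objective(d, loss, p0, lam):
            b = d
        else:
            a = c
    return dual_objective(0.5 * (a + b), loss, p0, lam)

# Hypothetical example: four outcomes under the nominal distribution p0.
loss = np.array([0.0, 0.5, 1.0, 2.0])
p0 = np.array([0.4, 0.3, 0.2, 0.1])
lam = 0.5
print(primal_kl(loss, p0, lam), dual_kl(loss, p0, lam))
```

The two printed values should agree up to numerical tolerance. The point of the dual form is that it replaces an optimization over distributions with a scalar convex problem in $\eta$ .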

We state a standard concentration inequality here.

Lemma 2 (Bernstein’s Inequality (Vershynin,, 2018, Theorem 2.8.4)).

Fix any $\delta\in(0,1)$ . Let $X_{1},\cdots,X_{T}$ be independent and identically distributed random variables with finite second moment, and assume that $|X_{t}-\mathbb{E}[X_{t}]|\leq M$ for all $t$ . Then we have with probability at least $1-\delta$ :

\Bigg{|}\mathbb{E}[X_{1}]-\frac{1}{T}\sum_{t=1}^{T}X_{t}\Bigg{|}\leq\sqrt{\frac{2\mathbb{E}[X_{1}^{2}]\log(2/\delta)}{T}}+\frac{M\log(2/\delta)}{3T}.
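
To get a feel for the size of the deviation term in Lemma 2, the following sketch evaluates the right-hand side and compares it with simulated Bernoulli samples. It is a sanity check under assumed parameters, not part of the analysis: the Bernoulli model, the sample sizes, and the function names are hypothetical choices.

```python
import numpy as np

def bernstein_radius(second_moment, M, T, delta):
    """Right-hand side of the bound in Lemma 2."""
    log_term = np.log(2.0 / delta)
    return float(np.sqrt(2.0 * second_moment * log_term / T) + M * log_term / (3.0 * T))

def violation_rate(p=0.3, T=1000, delta=0.05, trials=2000, seed=0):
    """Fraction of trials where the empirical mean of Bernoulli(p) samples leaves the radius."""
    rng = np.random.default_rng(seed)
    samples = rng.binomial(1, p, size=(trials, T))
    means = samples.mean(axis=1)
    # For Bernoulli(p): E[X] = E[X^2] = p and |X - E[X]| <= max(p, 1 - p).
    radius = bernstein_radius(second_moment=p, M=max(p, 1.0 - p), T=T, delta=delta)
    return float(np.mean(np.abs(means - p) > radius))

print(bernstein_radius(second_moment=0.3, M=0.7, T=1000, delta=0.05))
print(violation_rate())
```

The lemma guarantees that the violation frequency is at most $\delta$ in expectation; in this simulation it is typically much smaller, since the bound is conservative.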

We now state a useful concentration inequality when the samples are not necessarily i.i.d. but adapted to a stochastic process.

Lemma 3 (Freedman’s Inequality (Song et al.,, 2023, Lemma 14)).

Let $X_{1},\cdots,X_{T}$ be a sequence of real-valued random variables bounded by $M>0$ , where $X_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $X_{1},\cdots,X_{t-1}$ . Then, for any $\delta>0$ and $\lambda\in[0,1/(2M)]$ , we have with probability at least $1-\delta$ :

\displaystyle\Bigg{|}\sum_{t=1}^{T}(X_{t}-\mathbb{E}[X_{t}\mid P_{t}])\Bigg{|}\leq\lambda\sum_{t=1}^{T}(2M|\mathbb{E}[X_{t}\mid P_{t}]|+\mathbb{E}[X_{t}^{2}\mid P_{t}])+\frac{\log(2/\delta)}{\lambda}.

We now state a result for the generalization bounds on empirical risk minimization (ERM) (Shalev-Shwartz and Ben-David,, 2014).

Lemma 4 (ERM Generalization Bound (Panaganti et al.,, 2022, Lemma 3)).

Let $P$ be the data generating distribution on the space $\mathcal{X}$ and let $\mathcal{H}$ be a given hypothesis class of functions. Assume that the loss function $l$ satisfies $|l(h,x)|\leq c_{1}$ for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$ , for some positive constant $c_{1}>0$ , and that $l(h,x)$ is $c_{3}$ -Lipschitz in $h$ . Given a dataset $\mathcal{D}=\{X_{i}\}_{i=1}^{N}$ , generated independently from $P$ , denote $\hat{h}$ as the ERM solution, i.e. $\hat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}(1/N)\sum_{i=1}^{N}l(h,X_{i})$ . Furthermore, let $\mathcal{H}$ be a finite hypothesis class, i.e. $|\mathcal{H}|<\infty$ , with $|h\circ x|\leq c_{2}$ for all $h\in\mathcal{H}$ and $x\in\mathcal{X}$ . For any fixed $\delta\in(0,1)$ and $h^{*}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\mathbb{E}_{X\sim P}[l(h,X)]$ , we have

\displaystyle\mathbb{E}_{X\sim P}[l(\hat{h},X)]-\mathbb{E}_{X\sim P}[l(h^{*},X)]\leq 2c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{H}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8/\delta)}{N}},

with probability at least $1-\delta$ .

We now state a result from variational analysis literature (Rockafellar and Wets,, 2009) that is useful to relate minimization of integrals and the integrals of pointwise minimization under decomposable spaces.

Remark 4.

A few examples of decomposable spaces are $L^{p}(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}),\mu)$ , for any $p\geq 1$ , and $\mathcal{M}(\mathcal{S}\times\mathcal{A},\Sigma(\mathcal{S}\times\mathcal{A}))$ , the space of all $\Sigma(\mathcal{S}\times\mathcal{A})$ -measurable functions.

Lemma 5 (Rockafellar and Wets,, 2009, Theorem 14.60).

Let $\mathcal{X}$ be a space of measurable functions from $\Omega$ to $\mathbb{R}$ that is decomposable relative to a $\sigma$ -finite measure $\mu$ on the $\sigma$ -algebra $\mathcal{A}$ . Let $f:\Omega\times\mathbb{R}\to\mathbb{R}$ (finite-valued) be a normal integrand. Then, we have

\inf_{x\in\mathcal{X}}\int_{\omega\in\Omega}f(\omega,x(\omega))\mu(\mathrm{d}\omega)=\int_{\omega\in\Omega}\left(\inf_{x\in\mathbb{R}}f(\omega,x)\right)\mu(\mathrm{d}\omega).

Moreover, as long as the above infimum is finite, we have that $x^{\prime}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\int_{\omega\in\Omega}f(\omega,x(\omega))\mu(\mathrm{d}\omega)$ if and only if $x^{\prime}(\omega)\in\operatorname*{arg\,min}_{x\in\mathbb{R}}f(\omega,x)$ for $\mu$ -almost every $\omega\in\Omega$ .
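
A toy finite example may help illustrate why decomposability matters in Lemma 5. The sketch below assumes a finite $\Omega$ with weights `mu` and a hypothetical quadratic integrand $f(\omega,x)=(x-a_{\omega})^{2}+b_{\omega}$ ; it compares the integral of the pointwise minimum with the best value over all functions on $\Omega$ and over constant functions only (a class that is not decomposable). All names and numbers are illustrative assumptions.

```python
import numpy as np

def interchange_demo(a, b, mu):
    """Finite Omega with weights mu and integrand f(w, x) = (x - a[w])**2 + b[w]."""
    # Integral of the pointwise minimum: inf_x f(w, x) = b[w], attained at x = a[w].
    pointwise = float(np.sum(mu * b))
    # Over all functions on Omega, the selector x(w) = a[w] is admissible and attains it.
    x_all = a.copy()
    over_all_functions = float(np.sum(mu * ((x_all - a) ** 2 + b)))
    # Over constant functions only, the best constant is the mu-weighted mean of a.
    x_const = np.sum(mu * a) / np.sum(mu)
    over_constants = float(np.sum(mu * ((x_const - a) ** 2 + b)))
    return pointwise, over_all_functions, over_constants

a = np.array([0.0, 1.0, 3.0])
b = np.array([0.5, 0.2, 0.1])
mu = np.array([0.2, 0.5, 0.3])
print(interchange_demo(a, b, mu))
```

The first two values coincide, as the lemma predicts for a decomposable space, while restricting to constants leaves a strict gap whenever the pointwise minimizers $a_{\omega}$ differ.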

Now we state a few results that will be useful for the analysis of our finite-horizon results in this work. The following result (Song et al.,, 2023, Lemma 6) is useful under the use of bilinear model approximation. This result follows from the elliptical potential lemma (Lattimore and Szepesvári,, 2020, Lemma 19.4) for deterministic vectors.

Lemma 6 (Elliptical Potential Lemma).

Let $X_{h}(f^{1}),\cdots,X_{h}(f^{T})\in\mathbb{R}^{d}$ be a sequence of vectors with $\left\|X_{h}(f^{t})\right\|\leq B_{X}<\infty$ for all $t\leq T$ and fix $\sigma\geq B^{2}_{X}$ . Define $\Sigma_{t;h}=\sum_{\tau=1}^{t}X_{h}(f^{\tau})X_{h}(f^{\tau})^{\top}+\sigma\mathds{1}_{d\times d}$ for $t\in[T]$ . Then, the following holds: $\sum_{t=1}^{T}\|X_{h}(f^{t})\|_{\Sigma_{t-1;h}^{-1}}\leq\sqrt{2dT\log(1+({TB_{X}^{2}}/{(\sigma d)}))}.$
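As a quick numerical sanity check of Lemma 6 (an illustration added here, not part of the cited analysis), the following minimal Python sketch draws random vectors in $\mathbb{R}^{2}$ with norm at most $B_{X}$ and compares the left-hand side with the stated bound. The helper names, the choice $d=2$, the random inputs, and the convention $\Sigma_{0;h}=\sigma\mathds{1}_{d\times d}$ for the empty sum are all assumptions made for illustration.

```python
import math
import random


def inv2(m):
    """Inverse of a 2x2 matrix given as nested sequences [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def weighted_norm(x, m_inv):
    """Return sqrt(x^T M^{-1} x) for a 2-vector x."""
    y0 = m_inv[0][0] * x[0] + m_inv[0][1] * x[1]
    y1 = m_inv[1][0] * x[0] + m_inv[1][1] * x[1]
    return math.sqrt(x[0] * y0 + x[1] * y1)


def elliptical_potential(vectors, sigma):
    """Sum of ||x_t||_{Sigma_{t-1}^{-1}}, assuming Sigma_0 = sigma * I."""
    cov = [[sigma, 0.0], [0.0, sigma]]
    total = 0.0
    for x in vectors:
        total += weighted_norm(x, inv2(cov))
        # Rank-one update: Sigma_t = Sigma_{t-1} + x x^T.
        for i in range(2):
            for j in range(2):
                cov[i][j] += x[i] * x[j]
    return total


def potential_bound(d, T, B, sigma):
    """Right-hand side of Lemma 6."""
    return math.sqrt(2 * d * T * math.log(1 + T * B**2 / (sigma * d)))


# Hypothetical test inputs: T random vectors in the disc of radius B.
rng = random.Random(0)
B, T, d = 1.0, 200, 2
vectors = []
for _ in range(T):
    angle = rng.uniform(0, 2 * math.pi)
    radius = rng.uniform(0, B)
    vectors.append((radius * math.cos(angle), radius * math.sin(angle)))

sigma = B**2  # the lemma requires sigma >= B^2
lhs = elliptical_potential(vectors, sigma)
rhs = potential_bound(d, T, B, sigma)
print(lhs, rhs)
```

The check is only as informative as the sampled sequence; adversarially chosen vectors would come closer to the bound.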

We now state a result for the generalization bounds on the least-squares regression problem when the data are not necessarily i.i.d. but adapted to a stochastic process. We refer to Van Erven et al., (2015) for more statistical and online learning generalization bounds for a wider class of loss functions.

Lemma 7 (Online Least-squares Generalization Bound (Song et al.,, 2023, Lemma 3)).

Let $L,M>0$ , $\delta\in(0,1)$ , and let $\mathcal{X}$ be an input space and $\mathcal{Y}$ be a target space. Let $\mathcal{H}:\mathcal{X}\mapsto[-M,M]$ be a given real-valued hypothesis class of functions with $|\mathcal{H}|<\infty$ . Given a dataset $\mathcal{D}=\{(x_{t},y_{t})\}_{t=1}^{N}$ , denote $\widehat{h}$ as the least-squares solution, i.e. $\widehat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\sum_{t=1}^{N}(h(x_{t})-y_{t})^{2}$ . The dataset $\mathcal{D}$ is generated as $x_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $\{(x_{1},y_{1}),\dots,(x_{t-1},y_{t-1})\}$ , and $y_{t}$ is sampled via the conditional probability $p(\cdot\mid x_{t})$ as $y_{t}\sim p(\cdot\mid x_{t})=h^{*}(x_{t})+\varepsilon_{t},$ where the function $h^{*}$ satisfies approximate realizability, i.e. $\inf_{h\in\mathcal{H}}(1/N)\sum_{t=1}^{N}\mathbb{E}_{x\sim P_{t}}(h^{*}(x)-h(x))^{2}\leq\gamma,$ and $\{\varepsilon_{t}\}_{t=1}^{N}$ are independent random variables such that $\mathbb{E}[y_{t}\mid x_{t}]=h^{*}(x_{t})$ . Suppose it also holds that $\max_{t}|y_{t}|\leq L$ and $\max_{x}|h^{*}(x)|\leq M$ . Then, the least-squares solution satisfies with probability at least $1-\delta$ :

\displaystyle\sum_{t=1}^{N}\mathbb{E}_{x\sim P_{t}}(\widehat{h}(x)-h^{*}(x))^{2}

\displaystyle\leq 3\gamma N+64(L+M)^{2}\log(2|\mathcal{H}|/\delta).

Appendix C Useful Foundational Results

We provide the following result highlighting the necessary characteristics for specific examples of the Fenchel conjugate functions $\varphi^{*}$ .

Proposition 3 ( $\varphi$ -Divergence Bounds).

Let $V\in[0,V_{\max}]^{|\mathcal{S}|}$ be any value function and fix a probability distribution $P^{o}\in\Delta(\mathcal{S})$ . Define $h(y,\eta)=(\lambda\varphi^{*}\left({(\eta-y)}/{\lambda}\right)-\eta)$ . Consider the following scalar convex optimization problem: $\inf_{\eta\in\Theta\subseteq\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$ . Suppose that, for some positive constants $c_{1},c_{2},c_{3}$ , the maximum absolute value in $\Theta$ is at most $c_{3}$ , $|h(V(s),\eta)|\leq c_{1}$ for all $\eta\in\Theta$ , and $h(V(s),\eta)$ is $c_{2}$ -Lipschitz in $\eta$ . We have the following results for different forms of $\varphi$ :
(i) Let Assumption 8 hold. For TV distance i.e. $\varphi(t)=|t-1|/2$ , we have $\Theta\equiv[-\lambda/2,\lambda/2]$ , hence $c_{3}=\lambda/2$ , $c_{1}=2\lambda+V_{\max}$ , and $c_{2}=2$ .
(ii) For chi-square divergence i.e. $\varphi(t)=(t-1)^{2}$ , we have $\Theta\equiv[-\lambda,2V_{\max}+2\lambda]$ , hence $c_{3}=2V_{\max}+2\lambda$ , $c_{1}=\lambda+(2V_{\max}+4\lambda)(\frac{2V_{\max}}{4\lambda}+2)$ , and $c_{2}=(3+\frac{V_{\max}}{\lambda})$ .
(iii) For KL divergence i.e. $\varphi(t)=t\log{t}$ , we have $\Theta\equiv[\lambda,V_{\max}+\lambda]$ , hence $c_{3}=V_{\max}+\lambda$ , $c_{1}=\lambda(\exp(\frac{V_{\max}}{\lambda})-1)$ , and $c_{2}=(\exp(\frac{V_{\max}}{\lambda})+1)$ .
(iv) Fix $\alpha\in(0,1)$ . For $\alpha$ -CVaR i.e. $\varphi(t)=\mathds{1}[0,1/\alpha)$ , we have $\Theta\equiv[0,V_{\max}/(1-\alpha)]$ , hence $c_{3}=V_{\max}/(1-\alpha)$ , $c_{1}=2V_{\max}/(\alpha(1-\alpha))$ , and $c_{2}=1+\alpha^{-1}$ .

Proof.

We first prove the statement for TV distance with $\varphi(t)=|t-1|/2$ . From $\varphi$ -divergence literature (Xu^∗ et al.,, 2023), we know

\varphi^{*}(s)=\begin{cases}-\frac{1}{2}&s\leq-\frac{1}{2},\\ s&s\in[-\frac{1}{2},\frac{1}{2}],\\ +\infty&s>\frac{1}{2}.\end{cases}

Thus, we have

$\displaystyle\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$	$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda% \varphi^{*}(\frac{\eta-V(s)}{\lambda})]-\eta$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\inf_{\eta\in\mathbb{R},\frac{% \eta-\min_{s}V(s)}{\lambda}\leq\frac{1}{2}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda% \max\{\frac{\eta-V(s)}{\lambda},-\frac{1}{2}\}]-\eta$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\inf_{\eta\in\mathbb{R},\frac{% \eta}{\lambda}\leq\frac{1}{2}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda\max\{\frac{% \eta-V(s)}{\lambda},-\frac{1}{2}\}]-\eta$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{=}}\inf_{\eta\in\mathbb{R},\frac{% \eta}{\lambda}\leq\frac{1}{2}}~{}\mathbb{E}_{s\sim P^{o}}[(\eta-V(s)+\lambda/2% )_{+}]-\lambda/2-\eta$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{=}}\inf_{\eta^{\prime}\in\mathbb{R}% ,\eta^{\prime}\leq\lambda}~{}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s))_{+}% ]-\eta^{\prime}$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{=}}\inf_{0\leq\eta^{\prime}\leq% \lambda}~{}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s))_{+}]-\eta^{\prime},$	(17)

where $(a)$ follows by definition of $\varphi^{*}$ , $(b)$ by Assumption 8, $(c)$ by the fact $\max\{x,y\}=(x-y)_{+}+y$ for any $x,y\in\mathbb{R}$ , and $(d)$ follows by making the substitution $\eta=\eta^{\prime}-\lambda/2$ . Finally, for $(e)$ , notice that since $V(s)\geq 0,$ $\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s))_{+}]-\eta^{\prime}=-\eta^{\prime% }\geq 0$ holds when $\eta^{\prime}\leq 0$ . So $\inf_{\eta^{\prime}\in(-\infty,0]}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s)% )_{+}]-\eta^{\prime}=0$ is achieved at $\eta^{\prime}=0$ .

We immediately have $\Theta\equiv[-\lambda/2,\lambda/2]$ since $\eta=\eta^{\prime}-\lambda/2$ . Since $\eta\leq\lambda/2$ and $V(s)\leq V_{\max}$ , we further get $|h(V(s),\eta)|\leq 2\lambda+V_{\max}$ . For $\eta_{1},\eta_{2}\in\Theta$ , from the fact $|(x)_{+}-(y)_{+}|\leq|x-y|$ we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq 2|\eta_{1}-\eta_{2}|$ . This proves statement $(i)$ .
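To make the reduced problem (17) concrete, here is a minimal Python sketch (an illustration added here, not the authors' implementation) that evaluates the negative of (17) for a finite state space. It assumes $\min_{s}V(s)=0$, which is how step $(b)$ uses Assumption 8, and it exploits the fact that the objective is piecewise linear and convex in $\eta^{\prime}$. The function name and the toy inputs are hypothetical.

```python
def tv_regularized_value(values, probs, lam):
    """Return -inf_{0 <= eta <= lam} ( E[(eta - V)_+] - eta ), the negative of (17).

    Assumption: `values` are V(s) >= 0 with min V = 0, and `probs` is the
    nominal distribution P^o over the same finite set of states.
    """
    def objective(eta):
        return sum(p * max(eta - v, 0.0) for v, p in zip(values, probs)) - eta

    # Piecewise linear and convex in eta, so a minimizer sits at an endpoint
    # of [0, lam] or at a kink V(s) that lies inside the interval.
    candidates = [0.0, lam] + [v for v in values if 0.0 <= v <= lam]
    return -min(objective(eta) for eta in candidates)


# Hypothetical two-state example: V = (0, 2), P^o = (0.5, 0.5).
print(tv_regularized_value([0.0, 2.0], [0.5, 0.5], lam=1.0))
print(tv_regularized_value([0.0, 2.0], [0.5, 0.5], lam=3.0))
```

For this two-state example the two calls give 0.5 and 1.0, which agrees with minimizing $\mathbb{E}_{P}[V]+\lambda D_{\mathrm{TV}}(P,P^{o})$ directly over two-point distributions: the value is $(1-p)\min\{v,\lambda\}$ with $p=0.5$ and $v=2$.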

We now prove the statement for chi-square divergence with $\varphi(t)=(t-1)^{2}$ following similar steps as before. From $\varphi$ -divergence literature (Xu^∗ et al.,, 2023), we know $\varphi^{*}(s)=(s/2+1)_{+}^{2}-1.$ Thus, we have

	$\displaystyle\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$	$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda% \varphi^{*}(\frac{\eta-V(s)}{\lambda})]-\eta$
		$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda(% \frac{\eta-V(s)}{2\lambda}+1)_{+}^{2}]-\lambda-\eta$
		$\displaystyle\stackrel{{\scriptstyle(f)}}{{=}}\inf_{\eta^{\prime}\in\mathbb{R}% }~{}\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s))_{+}^{2}]+% \lambda-\eta^{\prime}$
		$\displaystyle\stackrel{{\scriptstyle(g)}}{{=}}\inf_{\eta^{\prime}\in[\lambda,2% V_{\max}+4\lambda]}~{}\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime% }-V(s))_{+}^{2}]+\lambda-\eta^{\prime},$

where $(f)$ follows by making the substitution $\eta=\eta^{\prime}-2\lambda$ . Finally, for $(g)$ , observe that the function $g(\eta^{\prime})=\frac{1}{4\lambda}\mathbb{E}_{s\sim P^{o}}[(\eta^{\prime}-V(s))_{+}^{2}]+\lambda-\eta^{\prime}$ is convex in the dual variable $\eta^{\prime}$ and that $\inf_{\eta^{\prime}\in\mathbb{R}}g(\eta^{\prime})\leq 0$ by Lagrangian duality, since $V(s)\geq 0$ . Because the first term of $g$ is non-negative, any minimizer $\eta^{\prime}_{*}$ of $g$ satisfies $\lambda-\eta^{\prime}_{*}\leq g(\eta^{\prime}_{*})\leq 0$ , i.e. $\eta^{\prime}_{*}\geq\lambda$ . When $\eta^{\prime}\geq 2V_{\max}+4\lambda$ , notice that $g(\eta^{\prime})\geq\frac{1}{4\lambda}(\eta^{\prime 2}-2(V_{\max}+2\lambda)\eta^{\prime}+4\lambda^{2})\geq\lambda>0,$ since $0\leq V(s)\leq V_{\max}$ , so no minimizer exceeds $2V_{\max}+4\lambda$ .

We immediately have $\Theta\equiv[-\lambda,2V_{\max}+2\lambda]$ since $\eta=\eta^{\prime}-2\lambda$ . Since $\eta\leq 2V_{\max}+2\lambda$ and $V(s)\geq 0$ , we further get $|h(V(s),\eta)|\leq\lambda+(2V_{\max}+4\lambda)(\frac{2V_{\max}}{4\lambda}+2)$ . For $\eta_{1},\eta_{2}\in\Theta$ , from the facts $|(x)_{+}-(y)_{+}|\leq|x-y|$ and $|(x)_{+}^{2}-(y)_{+}^{2}|=|(x)_{+}-(y)_{+}|((x)_{+}+(y)_{+})$ , we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(3+\frac{V_{\max}}{\lambda})|\eta_{1}-\eta_{2}|$ . This proves statement $(ii)$ .
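The restriction of $\eta^{\prime}$ to $[\lambda,2V_{\max}+4\lambda]$ can be checked numerically. The sketch below (an illustration under stated assumptions, not the authors' code) minimizes the convex function $g$ by ternary search on that interval. It compares the result with the closed form $\eta^{\prime}=\mathbb{E}[V]+2\lambda$ and $\inf g=\mathrm{Var}(V)/(4\lambda)-\mathbb{E}[V]$, which we derive by setting $g^{\prime}=0$ in the regime $\mathbb{E}[V]+2\lambda\geq\max_{s}V(s)$, where the positive part is inactive. The function names and inputs are hypothetical.

```python
def chi2_dual_objective(eta_p, values, probs, lam):
    """g(eta') = E[(eta' - V)_+^2] / (4 lam) + lam - eta'."""
    second = sum(p * max(eta_p - v, 0.0) ** 2 for v, p in zip(values, probs))
    return second / (4.0 * lam) + lam - eta_p


def chi2_dual_min(values, probs, lam, v_max, iters=200):
    """Minimize the convex g over [lam, 2 v_max + 4 lam] by ternary search."""
    lo, hi = lam, 2.0 * v_max + 4.0 * lam
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        g1 = chi2_dual_objective(m1, values, probs, lam)
        g2 = chi2_dual_objective(m2, values, probs, lam)
        if g1 <= g2:
            hi = m2
        else:
            lo = m1
    eta_p = 0.5 * (lo + hi)
    return eta_p, chi2_dual_objective(eta_p, values, probs, lam)


# Hypothetical example: V = (0, 1), P^o = (0.5, 0.5), lambda = 1, V_max = 1.
values, probs, lam, v_max = [0.0, 1.0], [0.5, 0.5], 1.0, 1.0
eta_star, g_star = chi2_dual_min(values, probs, lam, v_max)
mean = sum(p * v for v, p in zip(values, probs))
var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
# Closed form (valid here because mean + 2*lam >= max V): mean - var / (4 lam).
print(eta_star, -g_star, mean - var / (4.0 * lam))
```

Here the search returns $\eta^{\prime}\approx 2.5$, inside the interval $[1,6]$, and $-\inf g$ matches $\mathbb{E}[V]-\mathrm{Var}(V)/(4\lambda)=0.4375$.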

We now prove the statement for KL divergence with $\varphi(t)=t\log{t}$ following similar steps as before. From $\varphi$ -divergence literature (Xu^∗ et al.,, 2023), we know $\varphi^{*}(s)=\exp(s-1).$ Thus, we have

	$\displaystyle\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$	$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda% \varphi^{*}(\frac{\eta-V(s)}{\lambda})]-\eta$
		$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda\exp(% \frac{\eta-V(s)}{\lambda}-1)]-\eta$
		$\displaystyle\stackrel{{\scriptstyle(h)}}{{=}}\inf_{\eta^{\prime}\in\mathbb{R}% }~{}\lambda\mathbb{E}_{s\sim P^{o}}[\exp(\frac{-\eta^{\prime}-V(s)}{\lambda}-1% )]+\eta^{\prime}$
		$\displaystyle\stackrel{{\scriptstyle(j)}}{{=}}\inf_{\eta^{\prime}\in[-\lambda-% V_{\max},-\lambda]}~{}\lambda\mathbb{E}_{s\sim P^{o}}[\exp(\frac{-\eta^{\prime% }-V(s)}{\lambda}-1)]+\eta^{\prime},$

where $(h)$ follows by making the substitution $\eta=-\eta^{\prime}$ . Finally, for $(j)$ , observe that the function $g(\eta^{\prime})=\lambda\mathbb{E}_{s\sim P^{o}}[\exp(\frac{-\eta^{\prime}-V(s)}{\lambda}-1)]+\eta^{\prime}$ is convex in the dual variable $\eta^{\prime}$ . Setting its derivative to zero, the optimal $\eta^{\prime}=-\lambda+\lambda\log\mathbb{E}_{s\sim P^{o}}\exp({-V(s)}/{\lambda})$ . So $\eta^{\prime}\in[-\lambda-V_{\max},-\lambda]$ since $0\leq V(s)\leq V_{\max}$ .

We immediately have $\Theta\equiv[\lambda,V_{\max}+\lambda]$ since $\eta=-\eta^{\prime}$ . Since $\eta\leq V_{\max}+\lambda$ and $V(s)\geq 0$ , we further get $|h(V(s),\eta)|\leq\lambda(\exp(\frac{V_{\max}}{\lambda})-1)$ . For $\eta_{1},\eta_{2}\in\Theta$ , from the fact that $\exp(-x)$ is $1$ -Lipschitz for $x\geq 0$ , we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(\exp(\frac{V_{\max}}{\lambda})+1)|\eta_{1}-\eta_{2}|$ . This proves statement $(iii)$ .
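The KL case admits a closed-form minimizer, so the claim $\Theta\equiv[\lambda,V_{\max}+\lambda]$ is easy to verify numerically. The following Python sketch (added for illustration; the function names and the toy distribution are hypothetical) evaluates the dual objective with $\varphi^{*}(s)=\exp(s-1)$, computes the stationary point $\eta=\lambda-\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$ obtained from the first-order condition above, and compares the resulting minimum $\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$ with a grid search over $\Theta$.

```python
import math


def kl_dual_objective(eta, values, probs, lam):
    """E[lam * exp((eta - V)/lam - 1)] - eta, using phi*(s) = exp(s - 1)."""
    return sum(p * lam * math.exp((eta - v) / lam - 1.0)
               for v, p in zip(values, probs)) - eta


def kl_closed_form(values, probs, lam):
    """Stationary point eta = lam - lam*log E[exp(-V/lam)] and the minimum value."""
    log_mgf = math.log(sum(p * math.exp(-v / lam) for v, p in zip(values, probs)))
    return lam - lam * log_mgf, lam * log_mgf


# Hypothetical example with 0 <= V <= V_max = 2.
values, probs, lam, v_max = [0.0, 0.5, 2.0], [0.2, 0.5, 0.3], 0.7, 2.0
eta_star, dual_min = kl_closed_form(values, probs, lam)

# Grid search restricted to Theta = [lam, V_max + lam].
grid = [lam + v_max * k / 20000 for k in range(20001)]
grid_min = min(kl_dual_objective(eta, values, probs, lam) for eta in grid)
print(eta_star, dual_min, grid_min)
```

The negative of `dual_min` is the familiar soft-min value $-\lambda\log\mathbb{E}_{s\sim P^{o}}\exp(-V(s)/\lambda)$ of the KL-regularized inner problem.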

We now prove the statement for $\alpha$ -CVaR with $\varphi(t)=\mathds{1}[0,1/\alpha)$ . From $\varphi$ -divergence literature (Levy et al.,, 2020), we know $\varphi^{*}(s)=(s)_{+}/\alpha.$ Thus, we have

$\displaystyle\inf_{\eta\in\mathbb{R}}\mathbb{E}_{s\sim P^{o}}h(V(s),\eta)$	$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\mathbb{E}_{s\sim P^{o}}[\lambda% \varphi^{*}(\frac{\eta-V(s)}{\lambda})]-\eta$
	$\displaystyle=\inf_{\eta\in\mathbb{R}}~{}\frac{1}{\alpha}\mathbb{E}_{s\sim P^{% o}}[(\eta-V(s))_{+}]-\eta$
	$\displaystyle\stackrel{{\scriptstyle(k)}}{{=}}\inf_{0\leq\eta\leq V_{\max}/(1-% \alpha)}~{}\frac{1}{\alpha}\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta.$	(18)

For $(k)$ , notice that since $V(s)\geq 0,$ $(1/\alpha)\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta=-\eta\geq 0$ holds when $\eta\leq 0$ . Also, since $V(s)\leq V_{\max},$ $(1/\alpha)\mathbb{E}_{s\sim P^{o}}[(\eta-V(s))_{+}]-\eta\geq 0$ holds when $\eta\geq V_{\max}/(1-\alpha)$ .

We immediately have $\Theta\equiv[0,V_{\max}/(1-\alpha)]$ . We further get $|h(V(s),\eta)|\leq 2V_{\max}/(\alpha(1-\alpha))$ . For $\eta_{1},\eta_{2}\in\Theta$ , from the fact $|(x)_{+}-(y)_{+}|\leq|x-y|$ we have $|h(V(s),\eta_{1})-h(V(s),\eta_{2})|\leq(1+\alpha^{-1})|\eta_{1}-\eta_{2}|$ . This proves the final statement of this result. ∎
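For the $\alpha$ -CVaR case, the reduced problem (18) is again piecewise linear in $\eta$ and can be solved exactly on a finite state space. The sketch below is an illustration under our own assumptions (hypothetical function name, $V_{\max}$ taken as the largest entry of $V$, non-negative values): it evaluates the negative of (18), which coincides with the average of $V$ over its worst $\alpha$ -fraction of outcomes.

```python
def cvar_regularized_value(values, probs, alpha):
    """Return -inf_{eta in Theta} ( E[(eta - V)_+] / alpha - eta ), the negative of (18).

    Assumptions: V(s) >= 0, V_max = max(values), Theta = [0, V_max / (1 - alpha)].
    """
    v_max = max(values)

    def objective(eta):
        return sum(p * max(eta - v, 0.0) for v, p in zip(values, probs)) / alpha - eta

    # Piecewise linear and convex in eta: check the endpoints of Theta and
    # every kink V(s), all of which lie inside Theta.
    candidates = [0.0, v_max / (1.0 - alpha)] + list(values)
    return -min(objective(eta) for eta in candidates)


# Hypothetical example: V uniform on {1, 2, 3, 4} and alpha = 0.5.
print(cvar_regularized_value([1.0, 2.0, 3.0, 4.0], [0.25] * 4, alpha=0.5))
```

With $\alpha=0.5$ and $V$ uniform on $\{1,2,3,4\}$, the call returns 1.5, the mean of the two smallest values.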

We now state and prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process. This result is of independent interest for machine learning problems beyond the scope of this paper. Furthermore, this result showcases a better rate dependence on $N$ , from $\widetilde{\mathcal{O}}(1/\sqrt{N})$ to $\widetilde{\mathcal{O}}(1/{N})$ , than the classical result Lemma 4 (Shalev-Shwartz and Ben-David,, 2014). This improvement is not surprising: we refer to Van Erven et al., (2015, Theorems 7.6 & 5.4), in the i.i.d. setting, for such $\widetilde{\mathcal{O}}(1/{N})$ fast rates with bounded losses for empirical risk minimization and beyond.

Proposition 4 (Online ERM Generalization Bound).

Let $N>0$ , $\delta\in(0,1)$ , let $\mathcal{X}$ be an input space, and let $\mathcal{Y}$ be the target functional space. Let $\mathcal{H}\subseteq\mathcal{Y}$ be the given finite class of functions. Assume that for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$ for loss function $l$ we have that $|l(h(x))|\leq c$ for some positive constant $c>0$ . Given a dataset $\mathcal{D}=\{x_{i}\}_{i=1}^{N}$ , denote $\widehat{h}$ as the ERM solution, i.e. $\widehat{h}\leftarrow\operatorname*{arg\,min}_{h\in\mathcal{H}}\sum_{i=1}^{N}l% (h(x_{i}))$ . The dataset $\mathcal{D}$ is generated as $x_{t}\sim P_{t}$ from some stochastic process $P_{t}$ that depends on the history $\{x_{1},\dots,x_{t-1}\}$ , where the function $h^{*}_{t}\in\operatorname*{arg\,min}_{f\in\mathcal{Y}}\mathbb{E}_{x\sim P_{t}}% [l(f(x))]$ satisfies approximate realizability i.e.

\inf_{h\in\mathcal{H}}\frac{1}{N}\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(% h(x_{t}))-l(h^{*}_{t}(x_{t})))\leq\gamma,

and for all $x\in\mathcal{X}$ , $|l(h^{*}_{t}(x))|\leq c$ . Then, the ERM solution satisfies

\displaystyle\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}l(\widehat{h}(x_{t}))-% \sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}l(h^{*}_{t}(x_{t}))

\displaystyle\leq 3\gamma N+48c\log(2\left\lvert\mathcal{H}\right\rvert/\delta)

with probability at least $1-\delta$ .

Proof.

We adapt the proof of least-squares generalization bound (Song et al.,, 2023, Lemma 3) here for the empirical risk minimization generalization bound under online data collection. Fix any function $h\in\mathcal{H}$ . We define the random variable $Z_{t}^{h}=l(h(x_{t}))-l(h^{*}_{t}(x_{t})).$ Immediately, we note $|Z_{t}^{h}|\leq 2c$ for all $t$ . By definition of $h^{*}_{t}$ , we have a non-negative first moment of $Z_{t}^{h}$ :

\displaystyle\mathbb{E}_{P_{t}}[Z^{h}_{t}]

\displaystyle=\mathbb{E}_{x_{t}\sim P_{t}}l(h(x_{t}))-\mathbb{E}_{x_{t}\sim P_% {t}}l(h^{*}_{t}(x_{t})).

(19)

By symmetrization, assuming $l(h^{*}_{t}(x_{t}))^{2}\leq l(h(x_{t}))^{2}$ , we have that

	$\displaystyle 0\leq\mathbb{E}_{P_{t}}[(Z_{t}^{h})^{2}]$	$\displaystyle\leq\mathbb{E}_{x_{t}\sim P_{t}}[2l(h(x_{t}))^{2}-2\cdot l(h(x_{t% }))\cdot l(h^{*}_{t}(x_{t}))]$
		$\displaystyle\leq 2\|l(h(x_{t}))\|\mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t}))-l(h^{% *}_{t}(x_{t})))$
		$\displaystyle\leq 2c\cdot\mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t}))-l(h^{*}_{t}(% x_{t}))).$

Similarly assuming $l(h^{*}_{t}(x_{t}))^{2}\geq l(h(x_{t}))^{2}$ , we get $0\leq\mathbb{E}_{P_{t}}[(Z_{t}^{h})^{2}]\leq 2c\cdot\mathbb{E}_{x_{t}\sim P_{t% }}(l(h(x_{t}))-l(h^{*}_{t}(x_{t})))$ . Thus, uniformly, we have

\displaystyle 0\leq\mathbb{E}_{P_{t}}[(Z_{t}^{h})^{2}]\leq 2c\cdot\mathbb{E}_{% x_{t}\sim P_{t}}(l(h(x_{t}))-l(h^{*}_{t}(x_{t}))).

(20)

We remark that (20) is called the Bernstein condition (Van Erven et al.,, 2015, Definition 5.1) when all sampling distributions $P_{t}$ are identical. This is one of the sufficient conditions on the loss functions to get $\mathcal{O}(1/N)$ -generalization bounds for empirical risk minimization.

Now, applying Lemma 3 with $\lambda\in[0,1/(4c)]$ and $\delta>0$ , we have

	$\displaystyle\left\lvert\sum_{t=1}^{N}Z^{h}_{t}-\mathbb{E}_{P_{t}}[Z_{t}^{h}]\right\rvert$	$\displaystyle\leq\lambda\sum_{t=1}^{N}(4c\|\mathbb{E}_{P_{t}}[Z_{t}^{h}]\|+% \mathbb{E}_{P_{t}}[(Z_{t}^{h})^{2}])+\frac{\log(2/\delta)}{\lambda}$
		$\displaystyle\leq 6c\lambda\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t% }))-l(h^{*}_{t}(x_{t})))+\frac{\log(2/\delta)}{\lambda}$

with probability at least $1-\delta$ , where the last inequality uses (19) and (20). Setting $\lambda=1/(12c)$ in the above, we get for any $h\in\mathcal{H}$ , with probability at least $1-\delta$ :

\displaystyle\left\lvert\sum_{t=1}^{N}Z^{h}_{t}-\mathbb{E}_{P_{t}}[Z_{t}^{h}]% \right\rvert\leq\frac{1}{2}\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t% }))-l(h^{*}_{t}(x_{t})))+12c\log(2\left\lvert\mathcal{H}\right\rvert/\delta),

by union bound over $h\in\mathcal{H}$ . Using (19), we rearrange the above to get:

	$\displaystyle\sum_{t=1}^{N}Z_{t}^{h}\leq\frac{3}{2}\sum_{t=1}^{N}\mathbb{E}_{x% _{t}\sim P_{t}}(l(h(x_{t}))-l(h^{*}_{t}(x_{t})))+12c\log(2\left\lvert\mathcal{% H}\right\rvert/\delta)$	(21)
and
	$\displaystyle\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t}))-l(h^{*}_{t% }(x_{t})))\leq 2\sum_{t=1}^{N}Z_{t}^{h}+24c\log(2\left\lvert\mathcal{H}\right% \rvert/\delta).$	(22)

Define the function $\widetilde{h}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\sum_{t=1}^{N}% \mathbb{E}_{x_{t}\sim P_{t}}(l(h(x_{t}))-l(h^{*}_{t}(x_{t})))$ , which is independent of the dataset $\mathcal{D}$ . By (21) for $\widetilde{h}$ and the approximate realizability assumption, we get

\displaystyle\sum_{t=1}^{N}Z_{t}^{\widetilde{h}}

\displaystyle\leq\frac{3}{2}\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(\widetilde{h}(x_{t}))-l(h^{*}_{t}(x_{t})))+12c\log(2\left\lvert\mathcal{H}\right\rvert/\delta)\leq\frac{3}{2}\gamma N+12c\log(2\left\lvert\mathcal{H}\right\rvert/\delta).

By definitions of $\widetilde{h}$ and the ERM function $\widehat{h}$ , we have that

\displaystyle\sum_{t=1}^{N}Z_{t}^{\widehat{h}}

\displaystyle=\sum_{t=1}^{N}l(\widehat{h}(x_{t}))-l(h^{*}_{t}(x_{t}))\leq\sum_% {t=1}^{N}l(\widetilde{h}(x_{t}))-l(h^{*}_{t}(x_{t}))=\sum_{t=1}^{N}Z_{t}^{% \widetilde{h}}.

From the above two relations, we get

\displaystyle\sum_{t=1}^{N}Z_{t}^{\widehat{h}}

\displaystyle\leq\frac{3}{2}\gamma N+12c\log(2\left\lvert\mathcal{H}\right% \rvert/\delta).

Now, using this and using (22) for the function $\widehat{h}$ , we get

\displaystyle\sum_{t=1}^{N}\mathbb{E}_{x_{t}\sim P_{t}}(l(\widehat{h}(x_{t}))-l(h^{*}_{t}(x_{t})))

\displaystyle\leq 2\sum_{t=1}^{N}Z_{t}^{\widehat{h}}+24c\log(2\left\lvert% \mathcal{H}\right\rvert/\delta)\leq 3\gamma N+48c\log(2\left\lvert\mathcal{H}% \right\rvert/\delta),

which holds with probability at least $1-\delta$ . This completes the proof. ∎
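To visualize the rate improvement claimed above, the following Python sketch (an illustration only; the constants, class size, and confidence level are hypothetical) evaluates the right-hand side of Lemma 4 and the right-hand side of Proposition 4 divided by $N$, with $\gamma=0$, for growing sample sizes.

```python
import math


def lemma4_excess_risk(n, size_h, delta, c1, c2, c3):
    """Right-hand side of Lemma 4 (i.i.d. ERM, O(1/sqrt(N)) rate)."""
    return (2 * c2 * c3 * math.sqrt(2 * math.log(size_h) / n)
            + 5 * c1 * math.sqrt(2 * math.log(8 / delta) / n))


def prop4_average_excess_risk(n, size_h, delta, c, gamma=0.0):
    """Right-hand side of Proposition 4 divided by N (O(1/N) rate when gamma = 0)."""
    return (3 * gamma * n + 48 * c * math.log(2 * size_h / delta)) / n


# Hypothetical setting: |H| = 1000, delta = 0.05, all constants equal to 1.
for n in (10**3, 10**4, 10**5, 10**6):
    slow = lemma4_excess_risk(n, 1000, 0.05, c1=1.0, c2=1.0, c3=1.0)
    fast = prop4_average_excess_risk(n, 1000, 0.05, c=1.0)
    print(n, round(slow, 5), round(fast, 5))
```

Multiplying $N$ by 100 shrinks the Lemma 4 bound by a factor of 10 but the averaged Proposition 4 bound by a factor of 100. The two bounds are stated under different assumptions, so the comparison is only indicative.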

We now state a useful result for an infinite-horizon discounted robust $\varphi$ -regularized Markov decision process $(\mathcal{S},\mathcal{A},r,P^{o},\lambda,\gamma,\varphi,d_{0})$ . This result allows us to restrict the policy search space of our RPQ algorithm to the class of deterministic Markov policies.

Proposition 5.

The robust regularized Bellman operator $\mathcal{T}$ (3)

\displaystyle(\mathcal{T}Q)(s,a)=r(s,a)+\gamma\inf_{P_{s,a}\in\mathcal{P}_{s,a% }}\big{(}\mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q(s^{\prime},a^{% \prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})\big{)},

and the value function operator $(\mathcal{T}_{v}V)(\cdot)=\max_{a}(\mathcal{T}Q)(\cdot,a)$ are both $\gamma$ -contraction operators w.r.t sup-norm. Moreover, their respective unique fixed points $Q^{*}_{\lambda}$ and $V^{*}_{\lambda}$ , for optimal policy $\pi^{*}$ , achieve the optimal robust value $\max_{\pi}V^{\pi}_{\lambda}$ . Furthermore, the robust regularized optimal policy $\pi^{*}$ is a deterministic Markov policy satisfying $\pi^{*}(\cdot)=\operatorname*{arg\,max}_{a}Q^{*}_{\lambda}(\cdot,a)$ .

Proof.

The $\gamma$ -contraction property of both operators follows directly from the fact $\inf_{x}p(x)-\inf_{x}q(x)\leq\sup_{x}(p(x)-q(x))$ . Furthermore, this result is a direct corollary of (Yang et al.,, 2023, Proposition 3.1) and (Iyengar,, 2005, Corollary 3.1). ∎
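The contraction property in Proposition 5 can be observed on a toy problem. The sketch below is a hedged illustration rather than the RPQ algorithm: it instantiates $\varphi$ as the KL divergence, uses the soft-min closed form $\inf_{P}(\mathbb{E}_{P}[V]+\lambda D_{\mathrm{KL}}(P,P^{o}))=-\lambda\log\mathbb{E}_{P^{o}}\exp(-V/\lambda)$ (assuming the infimum ranges over all distributions absolutely continuous with respect to $P^{o}_{s,a}$), and works on a hypothetical two-state, two-action MDP whose rewards and nominal kernel are invented for the example.

```python
import math

GAMMA, LAM = 0.9, 0.5
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical rewards r(s, a) in [0, 1] and nominal kernel P^o_{s,a}.
REWARD = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 1.0, (1, 1): 0.6}
NOMINAL = {(0, 0): (0.9, 0.1), (0, 1): (0.4, 0.6),
           (1, 0): (0.5, 0.5), (1, 1): (0.1, 0.9)}


def kl_inner(values, probs, lam):
    """Soft-min closed form of inf_P E_P[V] + lam * KL(P || P^o)."""
    return -lam * math.log(sum(p * math.exp(-v / lam) for v, p in zip(values, probs)))


def bellman(q):
    """Apply the robust KL-regularized Bellman operator T to a Q-table."""
    v = [max(q[(s, a)] for a in ACTIONS) for s in STATES]
    return {(s, a): REWARD[(s, a)] + GAMMA * kl_inner(v, NOMINAL[(s, a)], LAM)
            for s in STATES for a in ACTIONS}


def sup_dist(q1, q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(q1[k] - q2[k]) for k in q1)


# Contraction check on two arbitrary Q-tables.
q_a = {k: 0.0 for k in REWARD}
q_b = {k: 5.0 * (k[0] + 1) + k[1] for k in REWARD}
ratio = sup_dist(bellman(q_a), bellman(q_b)) / sup_dist(q_a, q_b)

# Fixed-point iteration and the greedy deterministic Markov policy.
q = dict(q_a)
for _ in range(500):
    q = bellman(q)
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(ratio, q, policy)
```

The printed ratio stays below $\gamma$, and repeated application of the operator converges to a fixed point from which a greedy deterministic policy is read off.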

We now state a similar result for a finite-horizon discounted robust $\varphi$ -regularized Markov decision process $(\mathcal{S},\mathcal{A},P^{o}=(P^{o}_{h})_{h=0}^{H-1},r=(r_{h})_{h=0}^{H-1},\lambda,H,\varphi,d_{0})$ . This result allows us to restrict the policy search space of our HyTQ algorithm to the class of non-stationary deterministic Markov policies.

Proposition 6.

The robust regularized Bellman operator $\mathcal{T}$ (10) and the value function operator $\mathcal{T}_{v}$ are as follows:

	$\displaystyle(\mathcal{T}Q_{h+1})(s,a)=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P% }_{h,s,a}}\big{(}\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[\max_{a^{\prime}}Q_{h+1% }(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big{)}% \quad\text{and}$
	$\displaystyle(\mathcal{T}_{v}V_{h+1})(s)=\max_{a}\bigg{[}r_{h}(s,a)+\inf_{P_{h% ,s,a}\in\mathcal{P}_{h,s,a}}\big{(}\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[V_{h+% 1}(s^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big{)}\bigg{]}.$

The optimal robust value $V^{*}_{h,\lambda}$ satisfies the following robust dynamic programming procedure: Starting with $V^{*}_{H,\lambda}=0$ , doing backward iteration of $\mathcal{T}_{v}$ , i.e., $V^{*}_{h,\lambda}=\mathcal{T}_{v}V^{*}_{h+1,\lambda}$ , we get $V^{*}_{h,\lambda}$ for all $h\in[H]$ . Furthermore, the robust regularized optimal policy $\pi^{*}$ is a non-stationary deterministic Markov policy satisfying $\pi^{*}_{h}(\cdot)=\operatorname*{arg\,max}_{a}Q^{*}_{h,\lambda}(\cdot,a)$ for all $h\in[H]$ where

\displaystyle Q^{*}_{h,\lambda}(s,a)=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big{(}\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[V^{*}_{h+1,\lambda}(s^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big{)}.

Moreover, as $V^{*}_{H,\lambda}=0=Q^{*}_{H,\lambda}$ , it suffices to backward iterate $\mathcal{T}$ , i.e., do $Q^{*}_{h,\lambda}=\mathcal{T}Q^{*}_{h+1,\lambda}$ to get $Q^{*}_{h,\lambda}$ for all $h\in[H]$ .

Proof.

We start with the optimal robust value definition $V^{*}_{h,\lambda}=\max_{\pi}V^{\pi}_{h,\lambda}=\max_{\pi}\inf_{P\in\mathcal{P% }}V^{h,\pi}_{P,r^{\lambda}_{h}}$ . The value function claims in this statement are direct consequences of (Iyengar,, 2005, Theorem 2.1 & 2.2) and (Zhang et al.,, 2023, Theorem 2) with the reward function $r^{\lambda}_{h}$ .

It remains to prove the dynamic programming recursion for $Q^{*}$ with $\mathcal{T}$ . That is, we establish $V^{*}_{h,\lambda}(\cdot)=\max_{a}Q^{*}_{h,\lambda}(\cdot,a)$ for all $h\in[H]$ with the dynamic programming of $\mathcal{T}$ . We use induction to prove this. The base case is trivially true since $V^{*}_{H,\lambda}=0=Q^{*}_{H,\lambda}$ . By $\mathcal{T}$ , we have

	$\displaystyle Q^{*}_{h,\lambda}(s,a)$	$\displaystyle=(\mathcal{T}Q^{*}_{h+1,\lambda})(s,a)$
		$\displaystyle=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big{(}\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[\max_{a^{\prime}}Q^{*}_{h+1,\lambda}(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big{)}$
		$\displaystyle=r_{h}(s,a)+\inf_{P_{h,s,a}\in\mathcal{P}_{h,s,a}}\big{(}\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[V^{*}_{h+1,\lambda}(s^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})\big{)},$

where the last equality follows by the induction hypothesis $V^{*}_{h+1,\lambda}(\cdot)=\max_{a}Q^{*}_{h+1,\lambda}(\cdot,a)$ . Maximizing both sides over the action $a$ and using the dynamic program $V^{*}_{h,\lambda}=\mathcal{T}_{v}V^{*}_{h+1,\lambda}$ , we get $V^{*}_{h,\lambda}(\cdot)=\max_{a}Q^{*}_{h,\lambda}(\cdot,a)$ . This completes the proof of this result. ∎
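The backward recursion in Proposition 6 can be sketched in a few lines. The Python example below is an illustration under our own assumptions, not the HyTQ algorithm: it takes $\varphi$ to be the KL divergence with the soft-min closed form for the inner infimum, uses a hypothetical time-homogeneous two-state, two-action model with horizon $H=3$, starts from $V^{*}_{H,\lambda}=0$, and reads off a non-stationary greedy policy at each step.

```python
import math

H, LAM = 3, 0.5
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical rewards r_h(s, a) in [0, 1] and nominal kernel, identical for all h.
REWARD = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 1.0, (1, 1): 0.6}
NOMINAL = {(0, 0): (0.9, 0.1), (0, 1): (0.4, 0.6),
           (1, 0): (0.5, 0.5), (1, 1): (0.1, 0.9)}


def kl_inner(values, probs, lam):
    """Soft-min closed form of inf_P E_P[V] + lam * KL(P || P^o)."""
    return -lam * math.log(sum(p * math.exp(-v / lam) for v, p in zip(values, probs)))


def backward_induction():
    """Compute Q*_h for h = H-1, ..., 0 and the greedy policy at each step."""
    v_next = [0.0 for _ in STATES]  # terminal condition V*_H = 0
    q_tables, policy = [None] * H, [None] * H
    for h in reversed(range(H)):
        q_h = {(s, a): REWARD[(s, a)] + kl_inner(v_next, NOMINAL[(s, a)], LAM)
               for s in STATES for a in ACTIONS}
        policy[h] = {s: max(ACTIONS, key=lambda a: q_h[(s, a)]) for s in STATES}
        v_next = [max(q_h[(s, a)] for a in ACTIONS) for s in STATES]  # V*_h = max_a Q*_h
        q_tables[h] = q_h
    return q_tables, policy


q_tables, policy = backward_induction()
print(q_tables[0], policy)
```

At the last step $h=H-1$ the table $Q^{*}_{h,\lambda}$ reduces to the reward, and earlier steps stay within $[0,H-h]$ because the soft-min of a vector lies between its minimum and maximum.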

Appendix D Offline Robust $\varphi$ -regularized RL Results

In this section, we set $V_{\max}=1/(1-\gamma)$ whenever we use results from Proposition 3. In the following, we use constants $c_{1},c_{2},c_{3}$ from Proposition 3.

We first prove Proposition 1, which follows directly from Lemma 1.

Proof of Proposition 1.

For each $(s,a)$ , consider the optimization problem in (3)

	$\displaystyle\inf_{P_{s,a}\in\mathcal{P}_{s,a}}$	$\displaystyle\big{(}\mathbb{E}_{s^{\prime}\sim P_{s,a}}[V(s^{\prime})]+\lambda D% _{\varphi}(P_{s,a},P^{o}_{s,a})\big{)}=-\sup_{P_{s,a}\in\mathcal{P}_{s,a}}\big% {(}\mathbb{E}_{s^{\prime}\sim P_{s,a}}[-V(s^{\prime})]-\lambda D_{\varphi}(P_{% s,a},P^{o}_{s,a})\big{)}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}-\inf_{\eta^{\prime}\in\mathbb{R% }}(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}\left(\frac{-\eta% ^{\prime}-V(s^{\prime})}{\lambda}\right)]+\eta^{\prime})$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}-\inf_{\eta\in\mathbb{R}}(% \lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}\left(\frac{\eta-V(s% ^{\prime})}{\lambda}\right)]-\eta)$
		$\displaystyle\stackrel{{\scriptstyle(c)}}{{=}}-\inf_{\eta\in\Theta}(\lambda% \mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[\varphi^{*}\left(\frac{\eta-V(s^{% \prime})}{\lambda}\right)]-\eta),$

where $(a)$ follows from Lemma 1, $(b)$ by setting $\eta=-\eta^{\prime}$ , and $(c)$ by Proposition 3. This completes the proof. ∎

We now prove Proposition 2, which mainly follows from Lemma 5.

Proof of Proposition 2.

Since the conjugate function $\varphi^{*}(\cdot)$ is continuous, for each $(s,a)\in\mathcal{S}\times\mathcal{A}$ we define the function $h((s,a),\eta)=(\lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\varphi^{*}\left({(\eta-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))}/{\lambda}\right)-\eta)$ , which is continuous in $\eta$ . We observe that $h((s,a),\eta)$ is $\Sigma(\mathcal{S}\times\mathcal{A})$ -measurable in $(s,a)\in\mathcal{S}\times\mathcal{A}$ for each $\eta\in\Theta$ , where $\Theta$ is a bounded interval of the real line. The proposition now follows directly by arguments similar to those in the proof of Panaganti et al., (2022, Lemma 1). ∎

Now we state a result and provide its proof for the empirical risk minimization on the dual parameter.

Proposition 7 (Dual Optimization Error Bound).

Let $\widehat{g}_{f}$ be the dual optimization parameter from Algorithm 1 (Step 4) for the state-action value function $f$ and let $\mathcal{T}_{g}$ be as defined in (7). With probability at least $1-\delta$ , we have

\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_{f}}f\|_{1,\mu}% \leq 2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac% {2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}}.

Proof.

We adapt the proof from Panaganti et al., (2022, Lemma 6). We first fix $f\in\mathcal{F}$ and invoke a union bound over $\mathcal{F}$ at the end to handle the supremum. We recall from (8) that $\widehat{g}_{f}=\operatorname*{arg\,min}_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g;f)$ . From the robust Bellman equation, we directly obtain

	$\displaystyle\|\mathcal{T}_{\widehat{g}_{f}}f-\mathcal{T}f\|_{1,\mu}$	$\displaystyle=\gamma(\mathbb{E}_{s,a\sim\mu}\|\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))}/{\lambda})-\widehat{g}_{f}(s,a))$
		$\displaystyle\hskip 56.9055pt-\inf_{\eta\in\Theta}(\lambda\mathbb{E}_{s^{% \prime}\sim P^{o}_{s,a}}\varphi^{*}\left({(\eta-\max_{a^{\prime}}f(s^{\prime},% a^{\prime}))}/{\lambda}\right)-\eta)\|)$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\gamma(\mathbb{E}_{s,a\sim\mu}% \mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(\widehat{g}_{f}(s% ,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))}/{\lambda})-\widehat{g}_{f}(s,a))$
		$\displaystyle\hskip 28.45274pt-\mathbb{E}_{s,a\sim\mu}[\inf_{\eta\in\Theta}(% \lambda\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}\varphi^{*}\left({(\eta-\max_{a^% {\prime}}f(s^{\prime},a^{\prime}))}/{\lambda}\right)-\eta))])$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\gamma(\mathbb{E}_{s,a\sim\mu,s^% {\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(\widehat{g}_{f}(s,a)-\max_{a^{% \prime}}f(s^{\prime},a^{\prime}))}/{\lambda})-\widehat{g}_{f}(s,a))$
		$\displaystyle\hskip 28.45274pt-\inf_{g\in L^{1}(\mu)}\mathbb{E}_{s,a\sim\mu,s^% {\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}f(s^{% \prime},a^{\prime}))}/{\lambda})-g(s,a)))$
		$\displaystyle=\gamma(\mathbb{E}_{s,a\sim\mu,s^{\prime}\sim P^{o}_{s,a}}(% \lambda\varphi^{*}({(\widehat{g}_{f}(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{% \prime}))}/{\lambda})-\widehat{g}_{f}(s,a))$
		$\displaystyle\hskip 28.45274pt-\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a\sim\mu,s^% {\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}f(s^{% \prime},a^{\prime}))}/{\lambda})-g(s,a)))$
		$\displaystyle\hskip 28.45274pt+\gamma(\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a% \sim\mu,s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{% \prime}}f(s^{\prime},a^{\prime}))}/{\lambda})-g(s,a))$
		$\displaystyle\hskip 42.67912pt-\inf_{g\in L^{1}(\mu)}\mathbb{E}_{s,a\sim\mu,s^% {\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}f(s^{% \prime},a^{\prime}))}/{\lambda})-g(s,a)))$
		$\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\gamma(\mathbb{E}_{s,a\sim\mu% ,s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(\widehat{g}_{f}(s,a)-\max_{a% ^{\prime}}f(s^{\prime},a^{\prime}))}/{\lambda})-\widehat{g}_{f}(s,a))$
		$\displaystyle\hskip 28.45274pt-\inf_{g\in\mathcal{G}}\mathbb{E}_{s,a\sim\mu,s^% {\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}f(s^{% \prime},a^{\prime}))}/{\lambda})-g(s,a)))+\gamma\varepsilon_{\mathcal{G}}$
		$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}}.$

$(a)$ follows since $\inf_{g}h(g)\leq h(\widehat{g}_{f})$ . $(b)$ follows from Proposition 2. $(c)$ follows from the approximate dual realizability assumption (Assumption 3).

For $(d)$ , we consider the loss function $l(g,(s,a,s^{\prime}))=\lambda\varphi^{*}\left({(g(s,a)-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))}/{\lambda}\right)-g(s,a)$ (e.g., for the chi-square divergence, $l(g,(s,a,s^{\prime}))=[(g(s,a)+2\lambda-\max_{a^{\prime}}f(s^{\prime},a^{\prime}))^{2}_{+}]/{4\lambda}-\lambda-g(s,a)$ ) and dataset $\mathcal{D}=\{s_{i},a_{i},s_{i}^{\prime}\}_{i=1}^{N}$ . Since $f\in\mathcal{F}$ and $g\in\mathcal{G}$ , we note that $|l(g,(s,a,s^{\prime}))|\leq c_{1}$ , where the value of $c_{1}>0$ depends on the specific form of $\varphi^{*}$ as demonstrated in Proposition 3. Furthermore, take $l(g,(s,a,s^{\prime}))$ to be $c_{2}$ -Lipschitz in $g$ and $|g(s,a)|\leq c_{3}$ , since $g\in\mathcal{G}$ , for some positive constants $c_{2}$ and $c_{3}$ . Again, these constants depend on the specific form of $\varphi^{*}$ as demonstrated in Proposition 3. With these insights, we can apply the empirical risk minimization result in Lemma 4 to get $(d)$ .

With union bound, with probability at least $1-\delta$ , we finally get

\displaystyle\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_{f}% }f\|_{1,\mu}\leq 2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1% }\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}},

which concludes the proof. ∎

We next prove the least-squares generalization bound for the RPQ algorithm.

Proposition 8 (Least squares generalization bound).

Let $\widehat{f}_{g}$ be the least-squares solution from Algorithm 1 (Step 5) for the state-action value function $f$ and dual variable function $g$ . Let $\mathcal{T}_{g}$ be as defined in (7). Then, with probability at least $1-\delta$ , we have

\displaystyle\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|\mathcal{T}_{g}f-% \widehat{f}_{g}\|_{2,\mu}

\displaystyle\leq\sqrt{6\varepsilon_{\mathcal{F}}}+\sqrt{\frac{2}{(1-\gamma)^{% 2}}+18(1+\gamma c_{1})}\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}% {N}}.

Proof.

We adapt the least-squares generalization bound given in Agarwal et al., (2019, Lemma A.11) to our setting. We recall from (9) that $\widehat{f}_{g}=\operatorname*{arg\,min}_{Q\in\mathcal{F}}\widehat{L}_{\mathrm% {robQ}}(Q;f,g)$ . We first fix functions $f\in\mathcal{F}$ and $g\in\mathcal{G}$ . For any function $f^{\prime}\in\mathcal{F}$ , we define random variables $z_{i}^{f^{\prime}}$ as

\displaystyle z_{i}^{f^{\prime}}=\left(f^{\prime}(s_{i},a_{i})-y_{i}\right)^{2% }-\left((\mathcal{T}_{g}f)(s_{i},a_{i})-y_{i}\right)^{2},

where $y_{i}=r_{i}-\gamma\lambda\varphi^{*}({(g(s_{i},a_{i})-\max_{a^{\prime}}f(s_{i}% ^{\prime},a^{\prime}))}/{\lambda})+\gamma g(s_{i},a_{i})$ , and $(s_{i},a_{i},s^{\prime}_{i})\in\mathcal{D}$ with $(s_{i},a_{i})\sim\mu,s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}$ . It is straightforward to note that for a given $(s_{i},a_{i})$ , we have $\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[y_{i}]=(\mathcal{T}_{g}f)(% s_{i},a_{i})$ . We note the randomness of $z_{i}^{f^{\prime}}$ given $f,f^{\prime}\in\mathcal{F}$ and $g\in\mathcal{G}$ is from the dataset pairs $(s_{i},a_{i},s_{i}^{\prime})$ .

Since $f,f^{\prime}\in\mathcal{F}$ and $g\in\mathcal{G}$ , from Proposition 3, we have $(\mathcal{T}_{g}f)(s_{i},a_{i}),y_{i}\leq 1+\gamma c_{1}$ , where the value of $c_{1}>0$ depends on the specific form of $\varphi^{*}$ . Using this, we obtain the first moment and an upper bound for the second moment of $z_{i}^{f^{\prime}}$ as follows:

	$\displaystyle\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f^{% \prime}}]$	$\displaystyle=\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[(f^{\prime}(% s_{i},a_{i})-(\mathcal{T}_{g}f)(s_{i},a_{i}))\cdot(f^{\prime}(s_{i},a_{i})+(% \mathcal{T}_{g}f)(s_{i},a_{i})-2y_{i})]$
		$\displaystyle=(f^{\prime}(s_{i},a_{i})-(\mathcal{T}_{g}f)(s_{i},a_{i}))^{2},$
	$\displaystyle\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[(z_{i}^{f^{% \prime}})^{2}]$	$\displaystyle=\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[(f^{\prime}(% s_{i},a_{i})-(\mathcal{T}_{g}f)(s_{i},a_{i}))^{2}\cdot(f^{\prime}(s_{i},a_{i})% +(\mathcal{T}_{g}f)(s_{i},a_{i})-2y_{i})^{2}]$
		$\displaystyle=(f^{\prime}(s_{i},a_{i})-(\mathcal{T}_{g}f)(s_{i},a_{i}))^{2}% \cdot\mathbb{E}_{s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[(f^{\prime}(s_{i},a_{% i})+(\mathcal{T}_{g}f)(s_{i},a_{i})-2y_{i})^{2}]$
		$\displaystyle\leq C_{1}(f^{\prime}(s_{i},a_{i})-(\mathcal{T}_{g}f)(s_{i},a_{i}% ))^{2},$

where $C_{1}=\frac{2}{(1-\gamma)^{2}}+18(1+\gamma c_{1})$ . This immediately implies that

	$\displaystyle\mathbb{E}_{s_{i},a_{i}\sim\mu,s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[z_{i}^{f^{\prime}}]$	$\displaystyle=\left\|\mathcal{T}_{g}f-f^{\prime}\right\|^{2}_{2,\mu},$
	$\displaystyle\mathbb{E}_{s_{i},a_{i}\sim\mu,s^{\prime}_{i}\sim P^{o}_{s_{i},a_{i}}}[(z_{i}^{f^{\prime}})^{2}]$	$\displaystyle\leq C_{1}\left\|\mathcal{T}_{g}f-f^{\prime}\right\|^{2}_{2,\mu}.$

From these calculations, it is also straightforward to see that $|z_{i}^{f^{\prime}}-\mathbb{E}_{s_{i},a_{i}\sim\mu,s^{\prime}_{i}\sim P^{o}_{s% _{i},a_{i}}}[z_{i}^{f^{\prime}}]|\leq 2C_{1}$ almost surely.
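The moment identities above can be checked exactly on a toy example. The sketch below is an illustration only: it fixes one $(s_{i},a_{i})$, treats the target $y_{i}$ as a finitely supported random variable with mean $m$ playing the role of $(\mathcal{T}_{g}f)(s_{i},a_{i})$, and uses rational arithmetic. The function name and inputs are hypothetical.

```python
from fractions import Fraction as Fr


def z_moments(f_prime, targets, probs):
    """Exact moments of z = (f' - y)^2 - (m - y)^2 for a finite target distribution.

    targets[j] is a possible value of y and probs[j] its probability;
    m = E[y] stands in for (T_g f)(s_i, a_i).
    """
    m = sum(p * y for p, y in zip(probs, targets))
    z = [(f_prime - y) ** 2 - (m - y) ** 2 for y in targets]
    first = sum(p * zj for p, zj in zip(probs, z))
    second = sum(p * zj ** 2 for p, zj in zip(probs, z))
    return m, first, second


# Toy numbers (assumed for illustration).
targets = [Fr(0), Fr(1), Fr(3)]
probs = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
m, first, second = z_moments(Fr(2), targets, probs)
```

Here `first` equals $(f^{\prime}-m)^{2}$ and `second` factors as $(f^{\prime}-m)^{2}\cdot\mathbb{E}[(f^{\prime}+m-2y)^{2}]$, mirroring the two displayed computations.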

Now, using Bernstein’s inequality (Lemma 2), together with a union bound over all $f^{\prime}\in\mathcal{F}$ , with probability at least $1-\delta$ , we have

\displaystyle|\|\mathcal{T}_{g}f-f^{\prime}\|_{2,\mu}^{2}-\frac{1}{N}\sum_{i=1% }^{N}z_{i}^{f^{\prime}}|\leq\sqrt{\frac{2C_{1}\|\mathcal{T}_{g}f-f^{\prime}\|_% {2,\mu}^{2}\log(2|\mathcal{F}|/\delta)}{N}}+\frac{2C_{1}\log(2|\mathcal{F}|/% \delta)}{3N},

(23)

for all $f^{\prime}\in\mathcal{F}$ . This expression coincides with Panaganti et al., (2022, Eq.(15)). Thus, following the proof of Panaganti et al., (2022, Lemma 7), we finally get

\displaystyle\|\mathcal{T}_{g}f-\widehat{f}_{g}\|_{2,\mu}^{2}

\displaystyle\leq 6\varepsilon_{\mathcal{F}}+\frac{9C_{1}\log(4|\mathcal{F}|/% \delta)}{N}.

(24)

We note the fact $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0$ . Now, using a union bound over $f\in\mathcal{F}$ and $g\in\mathcal{G}$ , with probability at least $1-\delta$ , we finally obtain

\displaystyle\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|\mathcal{T}_{g}f-% \widehat{f}_{g}\|_{2,\mu}

\displaystyle\leq\sqrt{6\varepsilon_{\mathcal{F}}}+\sqrt{\frac{18C_{1}\log(2|% \mathcal{F}||\mathcal{G}|/\delta)}{N}}.
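One way to trace the constants in this last step, offered as a sketch rather than as the authors' exact argument, is to apply (24) with $\delta$ replaced by $\delta/(|\mathcal{F}||\mathcal{G}|)$ , so that the bound holds simultaneously for all pairs $(f,g)$ with total failure probability at most $\delta$ :

```latex
\begin{align*}
\|\mathcal{T}_{g}f-\widehat{f}_{g}\|_{2,\mu}^{2}
&\leq 6\varepsilon_{\mathcal{F}}
  +\frac{9C_{1}\log(4|\mathcal{F}|^{2}|\mathcal{G}|/\delta)}{N} \\
&\leq 6\varepsilon_{\mathcal{F}}
  +\frac{18C_{1}\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N},
\end{align*}
% second line: 4|F|^2|G|/delta <= (2|F||G|/delta)^2 whenever |G| >= 1 and delta in (0,1]
```

Taking square roots and using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ then gives the displayed bound.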

This completes the least-squares generalization bound analysis for the robust regularized Bellman updates. ∎

We are now ready to prove the main theorem.

D.1 Proof of Theorem 1

Theorem 3 (Restatement of Theorem 1).

Let Assumptions 1, 2 and 3 hold. Let $\pi_{K}$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

	$\displaystyle V^{\pi^{*}}-V^{\pi_{K}}\leq$	$\displaystyle\frac{2\gamma^{K}}{(1-\gamma)^{2}}+\frac{2\sqrt{C}}{(1-\gamma)^{2}}(2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}})$
		$\displaystyle+\frac{2\sqrt{C}}{(1-\gamma)^{2}}(\sqrt{6\varepsilon_{\mathcal{F}}}+\sqrt{\frac{2}{(1-\gamma)^{2}}+18(1+\gamma c_{1})}\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}).$

Proof.

We let $V_{k}(s)=Q_{k}(s,\pi_{k}(s))$ for every $s\in\mathcal{S}$ . Since $\pi_{k}$ is the greedy policy w.r.t. $Q_{k}$ , we also have $V_{k}(s)=Q_{k}(s,\pi_{k}(s))=\max_{a}Q_{k}(s,a)$ . We recall that $V^{*}=V^{\pi^{*}}$ and $Q^{*}=Q^{\pi^{*}}$ . We also recall from Section 2 that $Q^{\pi^{*}}$ is a fixed point of the robust Bellman operator $\mathcal{T}$ defined in (3). From Yang et al., (2023), the analogous equation holds for any stationary deterministic policy $\pi$ , that is, $Q^{\pi}$ satisfies $Q^{\pi}(s,a)=r(s,a)+\gamma\min_{P_{s,a}\ll P^{o}_{s,a}}(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[V^{\pi}(s^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})).$ We now adapt the proof of Panaganti et al., (2022, Theorem 1), using the RRBE in its primal form (3) directly instead of its dual form (4).

We first characterize the performance decomposition between $V^{\pi^{*}}$ and ${V}^{\pi_{K}}$ . We recall the initial state distribution $d_{0}$ . Since $V^{\pi^{*}}(s)\geq V^{\pi_{K}}(s)$ for any $s\in\mathcal{S}$ , we observe that

$\displaystyle 0\leq$	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}(s_{0})-{V}^{\pi_{K}}(s_{0})]=\mathbb{E}_{s_{0}\sim d_{0}}[(V^{\pi^{*}}(s_{0})-V_{K}(s_{0}))-(V^{\pi_{K}}(s_{0})-V_{K}(s_{0}))]$
	$\displaystyle=\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0})))-(Q^{\pi_{K}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0})))]$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathbb{E}_{s_{0}\sim d_{0}}[Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))+Q_{K}(s_{0},\pi_{K}(s_{0}))-Q^{\pi_{K}}(s_{0},\pi_{K}(s_{0}))]$
	$\displaystyle=\mathbb{E}_{s_{0}\sim d_{0}}[Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))+Q_{K}(s_{0},\pi_{K}(s_{0}))-Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))$
	$\displaystyle\hskip 142.26378pt+Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q^{\pi_{K}}(s_{0},\pi_{K}(s_{0}))]$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\mathbb{E}_{s_{0}\sim d_{0}}[Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))+Q_{K}(s_{0},\pi_{K}(s_{0}))-Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))$
	$\displaystyle\hskip 56.9055pt+\gamma[\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi^{*}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})}))$
	$\displaystyle\hskip 85.35826pt-\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi_{K}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})}))]]$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))|]+\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0}))|]$
	$\displaystyle\hskip 113.81102pt+\gamma\mathbb{E}_{s_{0}\sim d_{0}}\mathbb{E}_{s_{1}\sim P^{\pi_{K},\min}_{s_{0},\pi_{K}(s_{0})}}(|V^{\pi^{*}}(s_{1})-V^{\pi_{K}}(s_{1})|)$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}\sum_{h=0}^{\infty}\gamma^{h}\cdot\bigg{(}\mathbb{E}_{s\sim d_{h,\pi_{K}}}[|Q^{\pi^{*}}(s,\pi^{*}(s))-Q_{K}(s,\pi^{*}(s))|+|Q^{\pi^{*}}(s,\pi_{K}(s))-Q_{K}(s,\pi_{K}(s))|]\bigg{)},$	(25)

where $(a)$ follows from the fact that $\pi_{K}$ is the greedy policy with respect to $Q_{K}$ , $(b)$ from the Bellman equations, and $(c)$ from the following definition

P^{\pi_{K},\min}_{s,\pi_{K}(s)}\in\operatorname*{arg\,min}_{P_{s,\pi_{K}(s)}% \ll P^{o}_{s,\pi_{K}(s)}}(\mathbb{E}_{s^{\prime}\sim P_{s,\pi_{K}(s)}}[V^{\pi_% {K}}(s^{\prime})]+\lambda D_{\varphi}(P_{s,\pi_{K}(s)},P^{o}_{s,\pi_{K}(s)})).

We note that this worst-case model distribution may not be unique, in which case we pick one by an arbitrary deterministic rule. We emphasize that this model distribution is used only in the analysis and is not required by the algorithm. Finally, $(d)$ follows by telescoping over $|V^{\pi^{*}}-V^{\pi_{K}}|$ and defining a state distribution $d_{h,\pi_{K}}\in\Delta(\mathcal{S})$ , for all natural numbers $h\geq 0$ , as

d_{h,\pi_{K}}=\begin{cases}d_{0}&\text{if $h=0$},\\ P^{\pi_{K},\min}_{s^{\prime},\pi_{K}(s^{\prime})}&\text{otherwise, with }s^{% \prime}\sim d_{h-1,\pi_{K}}.\end{cases}

We note that such state distribution proof ideas are commonly used in the offline RL literature (Agarwal et al.,, 2019; Panaganti et al.,, 2022; Bruns-Smith and Zhou,, 2023; Zhang et al.,, 2023).
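For intuition, the recursive definition of $d_{h,\pi_{K}}$ can be written out for a finite state space. The sketch below is illustrative only: `kernel[s][a]` is a hypothetical list of next-state probabilities standing in for the transition model used in the definition (the worst-case model in this proof), and `policy[s]` is the deterministic action $\pi_{K}(s)$.

```python
def propagate_state_distribution(d0, kernel, policy, horizon):
    """Return [d_0, ..., d_horizon], where
    d_h(s') = sum_s d_{h-1}(s) * kernel[s][policy[s]][s']."""
    num_states = len(d0)
    dists = [list(d0)]
    for _ in range(horizon):
        prev = dists[-1]
        nxt = [0.0] * num_states
        for s, mass in enumerate(prev):
            row = kernel[s][policy[s]]  # next-state distribution under the policy's action
            for s_next, p in enumerate(row):
                nxt[s_next] += mass * p
        dists.append(nxt)
    return dists


# Toy two-state, two-action example (numbers assumed for illustration).
kernel = [[[0.5, 0.5], [0.0, 1.0]],
          [[0.0, 1.0], [1.0, 0.0]]]
dists = propagate_state_distribution([1.0, 0.0], kernel, policy=[0, 0], horizon=2)
```

Each $d_{h}$ is again a probability distribution over states, which is what allows the concentrability assumption to be applied to $d_{h,\pi_{K}}\circ\pi^{*}$ and $d_{h,\pi_{K}}\circ\pi_{K}$.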

For (25), with the $\nu$ -norm notation, i.e., $\|f\|_{p,\nu}=(\mathbb{E}_{s,a\sim\nu}|f(s,a)|^{p})^{1/p}$ for any $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$ , we have

\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[{V}^{\pi^{*}}]-\mathbb{E}_{s_{0}\sim d% _{0}}[V^{\pi_{K}}]\leq\sum_{h=0}^{\infty}\gamma^{h}\bigg{(}\|Q^{\pi^{*}}-Q_{K}% \|_{1,d_{h,\pi_{K}}\circ\pi^{*}}+\|Q^{\pi^{*}}-Q_{K}\|_{1,d_{h,\pi_{K}}\circ% \pi_{K}}\bigg{)},

(26)

where the state-action distributions are $d_{h,\pi_{K}}\circ\pi^{*}(s,a)\propto d_{h,\pi_{K}}(s)\mathds{1}(a=\pi^{*}(s))$ and $d_{h,\pi_{K}}\circ\pi_{K}(s,a)\propto d_{h,\pi_{K}}(s)\mathds{1}(a=\pi_{K}(s))$ . We now analyze the above two terms treating either $d_{h,\pi_{K}}\circ\pi^{*}$ or $d_{h,\pi_{K}}\circ\pi_{K}$ as a state-action distribution $\nu$ satisfying Assumption 1. First, considering any $s,a\sim\nu$ satisfying $Q^{\pi^{*}}(s,a)\geq Q_{K}(s,a)$ we have

$\displaystyle 0\leq$	$\displaystyle Q^{\pi^{*}}(s,a)-Q_{K}(s,a)\leq Q^{\pi^{*}}(s,a)-\mathcal{T}Q_{K-1}(s,a)+|\mathcal{T}Q_{K-1}(s,a)-Q_{K}(s,a)|$
	$\displaystyle\leq Q^{\pi^{*}}(s,a)-\mathcal{T}Q_{K-1}(s,a)+\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\nu}$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{\leq}}Q^{\pi^{*}}(s,a)-\mathcal{T}Q_{K-1}(s,a)+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(f)}}{{=}}\gamma[\min_{P_{s,a}\ll P^{o}_{s,a}}(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a}))$
	$\displaystyle\hskip 85.35826pt-\min_{P_{s,a}\ll P^{o}_{s,a}}(\mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a}))]$
	$\displaystyle\hskip 170.71652pt+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(g)}}{{\leq}}\gamma(\mathbb{E}_{s^{\prime}\sim P^{Q_{K-1},\min}_{s,a}}(\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})-\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime})))+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(h)}}{{\leq}}\gamma(\mathbb{E}_{s^{\prime}\sim P^{Q_{K-1},\min}_{s,a}}\max_{a^{\prime}}|Q^{\pi^{*}}(s^{\prime},a^{\prime})-Q_{K-1}(s^{\prime},a^{\prime})|)+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu},$	(27)

where $(e)$ follows by the concentrability assumption (Assumption 1), $(f)$ from the Bellman equation and the definition of the operator $\mathcal{T}$ , and $(g)$ follows, similarly to step $(c)$ , from the following definition

P^{Q_{K-1},\min}_{s,a}\in\operatorname*{arg\,min}_{P_{s,a}\ll P^{o}_{s,a}}(% \mathbb{E}_{s^{\prime}\sim P_{s,a}}[\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{% \prime})]+\lambda D_{\varphi}(P_{s,a},P^{o}_{s,a})).

We again emphasize that this model distribution is analysis-specific and we just pick one by an arbitrary deterministic rule since it may not be unique. $(h)$ follows by the fact $|\sup_{x}p(x)-\sup_{x}q(x)|\leq\sup_{x}|p(x)-q(x)|$ . Now, by replacing $P^{Q_{K-1},\min}$ with $P^{Q^{\pi^{*}},\min}$ in step $(g)$ and repeating the steps for any $s,a\sim\nu$ satisfying $Q^{\pi^{*}}(s,a)\leq Q_{K}(s,a)$ , we get

\displaystyle 0\leq Q_{K}(s,a)-Q^{\pi^{*}}(s,a)\leq\gamma(\mathbb{E}_{s^{% \prime}\sim P^{Q^{\pi^{*}},\min}_{s,a}}\max_{a^{\prime}}|Q^{\pi^{*}}(s^{\prime% },a^{\prime})-Q_{K-1}(s^{\prime},a^{\prime})|)+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_% {K}\|_{1,\mu}.

(28)
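To make the worst-case model distributions used in (27) and (28) concrete, consider the KL divergence as one instance of $D_{\varphi}$ (an assumption made here for illustration). For KL, the minimizer of $\mathbb{E}_{P}[V]+\lambda D_{\mathrm{KL}}(P,P^{o})$ is the exponentially tilted distribution $P(s^{\prime})\propto P^{o}(s^{\prime})e^{-V(s^{\prime})/\lambda}$ , a standard fact. The sketch below computes it for a finite next-state space; the function names are hypothetical.

```python
import math


def kl_worst_case(p_nominal, values, lam):
    # Exponential tilting: P(s') proportional to P^o(s') * exp(-V(s') / lam).
    shift = min(values)  # subtract the minimum value for numerical stability
    weights = [p * math.exp(-(v - shift) / lam) for p, v in zip(p_nominal, values)]
    total = sum(weights)
    p_min = [w / total for w in weights]
    # Optimal objective: -lam * log E_{P^o}[exp(-V / lam)].
    optimal_value = shift - lam * math.log(total)
    return p_min, optimal_value


def regularized_objective(p, p_nominal, values, lam):
    # E_P[V] + lam * KL(P || P^o), with the convention 0 * log 0 = 0.
    expectation = sum(pi * v for pi, v in zip(p, values))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_nominal) if pi > 0.0)
    return expectation + lam * kl
```

As in the proof, such a distribution is only an analysis device; the algorithm itself never needs to compute it.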

We immediately note that both $P^{Q_{K-1},\min}_{s,a}$ and $P^{Q^{\pi^{*}},\min}_{s,a}$ satisfy $D_{\varphi}(P^{Q_{K-1},\min}_{s,a},P^{o}_{s,a})\leq 1/(\lambda(1-\gamma))$ and $D_{\varphi}(P^{Q^{\pi^{*}},\min}_{s,a},P^{o}_{s,a})\leq 1/(\lambda(1-\gamma))$ , which follows from their definitions and the facts $Q_{K-1}\in\mathcal{F}$ and $\|Q^{\pi^{*}}\|_{\infty}\leq 1/(1-\gamma)$ . Define the state-action probability distribution $\nu^{\prime}$ as, for any $s^{\prime},a^{\prime}$ ,

	$\displaystyle\nu^{\prime}(s^{\prime},a^{\prime})$	$\displaystyle=\sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^{*}}(s,a)>Q_{K}(s,a)\}P^{Q_{K-1},\min}_{s,a}(s^{\prime})\mathds{1}\{a^{\prime}=\operatorname*{arg\,max}_{b}|Q^{\pi^{*}}(s^{\prime},b)-Q_{K-1}(s^{\prime},b)|\}$
		$\displaystyle\hskip 28.45274pt+\sum_{s,a}\nu(s,a)\mathds{1}\{Q^{\pi^{*}}(s,a)\leq Q_{K}(s,a)\}P^{Q^{\pi^{*}},\min}_{s,a}(s^{\prime})\mathds{1}\{a^{\prime}=\operatorname*{arg\,max}_{b}|Q^{\pi^{*}}(s^{\prime},b)-Q_{K-1}(s^{\prime},b)|\}.$

Now, we can combine (27)-(28) as follows

	$\displaystyle\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu}$	$\displaystyle\leq\gamma\|Q^{\pi^{*}}-Q_{K-1}\|_{1,\nu^{\prime}}+\sqrt{C}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
		$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\gamma\|Q^{\pi^{*}}-Q_{K-1}\|_{1,\nu^{\prime}}+\sqrt{C}\|\mathcal{T}_{g_{K-1}}Q_{K-1}-Q_{K}\|_{2,\mu}+\sqrt{C}\|\mathcal{T}Q_{K-1}-\mathcal{T}_{g_{K-1}}Q_{K-1}\|_{1,\mu},$

where $(i)$ uses the fact $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$ .

Now, by recursion until iteration 0, we get

$\displaystyle\|$	$\displaystyle Q^{\pi^{*}}-Q_{K}\|_{1,\nu}\leq\gamma^{K}\sup_{\bar{\nu}}\|Q^{\pi^{*}}-Q_{0}\|_{1,\bar{\nu}}+\sqrt{C}\sum_{t=0}^{K-1}\gamma^{t}\|\mathcal{T}Q_{K-1-t}-\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}\|_{1,\mu}$
	$\displaystyle\hskip 113.81102pt+\sqrt{C}\sum_{t=0}^{K-1}\gamma^{t}\|\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}-Q_{K-t}\|_{2,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(j)}}{{\leq}}\frac{\gamma^{K}}{1-\gamma}+\sqrt{C}\sum_{t=0}^{K-1}\gamma^{t}\|\mathcal{T}Q_{K-1-t}-\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}\|_{1,\mu}$
	$\displaystyle\hskip 113.81102pt+\sqrt{C}\sum_{t=0}^{K-1}\gamma^{t}\|\mathcal{T}_{g_{K-1-t}}Q_{K-1-t}-Q_{K-t}\|_{2,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(k)}}{{\leq}}\frac{\gamma^{K}}{1-\gamma}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_{f}}f\|_{1,\mu}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}_{\widehat{g}_{f}}f-\widehat{f}_{\widehat{g}_{f}}\|_{2,\mu}$
	$\displaystyle\leq\frac{\gamma^{K}}{1-\gamma}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\|\mathcal{T}f-\mathcal{T}_{\widehat{g}_{f}}f\|_{1,\mu}+\frac{\sqrt{C}}{1-\gamma}\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|\mathcal{T}_{g}f-\widehat{f}_{g}\|_{2,\mu}.$	(29)

Here, $(j)$ follows since $|Q^{\pi^{*}}(s,a)|\leq 1/(1-\gamma)$ and $Q_{0}(s,a)=0$ , and $(k)$ follows since $\widehat{g}_{f}$ is the dual variable function from the algorithm for the state-action value function $f$ , and $\widehat{f}_{g}$ is the least-squares solution from the algorithm for the pair of state-action value function $f$ and dual variable function $g$ .
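The recursion behind (29) has the generic form $e_{k}\leq\gamma e_{k-1}+b_{k-1}$ , where $e_{k}$ stands for the error at iteration $k$ and $b_{k-1}$ for the per-iteration approximation terms. The sketch below, with hypothetical names and toy numbers, unrolls this recursion (taken with equality) and compares it with the closed form $\gamma^{K}e_{0}+\sum_{t=0}^{K-1}\gamma^{t}b_{K-1-t}$ .

```python
def unroll_recursion(e0, b, gamma):
    # Iterate e_k = gamma * e_{k-1} + b[k-1] for k = 1, ..., K with K = len(b).
    e = e0
    for bk in b:
        e = gamma * e + bk
    return e


def closed_form(e0, b, gamma):
    # gamma^K * e0 + sum_{t=0}^{K-1} gamma^t * b[K-1-t].
    K = len(b)
    return gamma ** K * e0 + sum(gamma ** t * b[K - 1 - t] for t in range(K))


def geometric_bound(e0, b, gamma):
    # Bound each b[k] by max(b) and the geometric sum by 1 / (1 - gamma).
    return gamma ** len(b) * e0 + max(b) / (1.0 - gamma)
```

The last function corresponds to step $(k)$ , where each per-iteration term is replaced by a supremum and the geometric sum is bounded by $1/(1-\gamma)$ .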

Now, using Proposition 7 and Proposition 8 to bound (29), and then combining it with (26), completes the proof of this theorem. ∎

D.2 Specialized Result for TV $\varphi$ -divergence

We now state and prove the improved (in terms of assumptions) result for TV $\varphi$ -divergence.

Assumption 9 (Concentrability).

There exists a finite constant $C_{\mathrm{tv}}>0$ such that for any $\nu\in\{d_{\pi,P^{o}}\}\subseteq\Delta(\mathcal{S}\times\mathcal{A})$ for any policy $\pi$ (can be non-stationary as well), we have $\left\|\nu/\mu\right\|_{\infty}\leq\sqrt{C_{\mathrm{tv}}}$ .

Assumption 10 (Fail-state).

There is a fail state $s_{f}$ such that $r(s_{f},a)=0$ and $P_{s_{f},a}(s_{f})=1$ , for all $a\in\mathcal{A}$ and $P\in\mathcal{P}$ satisfying $D_{\mathrm{TV}}(P_{s^{\prime},a^{\prime}},P^{o}_{s^{\prime},a^{\prime}})\leq% \max\{1,1/(\lambda(1-\gamma))\}$ for all $s^{\prime},a^{\prime}$ .

Theorem 4.

Let Assumptions 9, 2, 3 and 10 hold. Let $\pi_{K}$ be the RPQ algorithm policy after $K$ iterations. Then, for any $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

	$\displaystyle V^{\pi^{*}}-V^{\pi_{K}}\leq$	$\displaystyle\frac{2\gamma^{K}}{(1-\gamma)^{2}}+\frac{2\sqrt{C_{\mathrm{tv}}}}{(1-\gamma)^{2}}(2\gamma c_{2}c_{3}\sqrt{\frac{2\log(|\mathcal{G}|)}{N}}+5c_{1}\sqrt{\frac{2\log(8|\mathcal{F}|/\delta)}{N}}+\gamma\varepsilon_{\mathcal{G}})$
		$\displaystyle+\frac{2\sqrt{C_{\mathrm{tv}}}}{(1-\gamma)^{2}}(\sqrt{6\varepsilon_{\mathcal{F}}}+\sqrt{\frac{2}{(1-\gamma)^{2}}+18(1+\gamma c_{1})}\sqrt{\frac{18\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}),$

with $c_{1}=2\lambda+(1/(1-\gamma)),c_{2}=2,c_{3}=\lambda/2$ .

Proof.

Under Assumption 10, we can now use the total variation RRBE in its dual form (4). We again start by characterizing the performance decomposition between $V^{\pi^{*}}$ and ${V}^{\pi_{K}}$ ; the proof largely follows the proofs of Theorem 1 and Panaganti et al., (2022, Theorem 1). In particular, for all $\pi$ and $Q\in\mathcal{F}$ , from (17) we have

	$\displaystyle Q^{\pi}(s,a)$	$\displaystyle=r(s,a)-\inf_{\eta\in[0,\lambda]}~{}(\mathbb{E}_{s^{\prime}\sim P% ^{o}_{s,a}}[(\eta-V^{\pi}(s^{\prime}))_{+}]-\eta)\text{ and}$		(30)
	$\displaystyle(\mathcal{T}Q)(s,a)$	$\displaystyle=r(s,a)-\inf_{\eta\in[0,\lambda]}~{}(\mathbb{E}_{s^{\prime}\sim P% ^{o}_{s,a}}[(\eta-\max_{a^{\prime}}Q(s^{\prime},a^{\prime}))_{+}]-\eta).$
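The inner infimum over $\eta$ in (30) is a one-dimensional convex problem. As a minimal sketch for a finite next-state space (hypothetical names, not the paper's implementation), the objective $\eta\mapsto\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}[(\eta-V(s^{\prime}))_{+}]-\eta$ is convex and piecewise linear, so over $[0,\lambda]$ it suffices to evaluate it at the endpoints and at the kinks $\eta=V(s^{\prime})$ that fall inside the interval.

```python
def tv_dual_objective(eta, p_nominal, values):
    # E_{s' ~ P^o}[(eta - V(s'))_+] - eta for a finite next-state space.
    return sum(p * max(eta - v, 0.0) for p, v in zip(p_nominal, values)) - eta


def tv_dual_infimum(p_nominal, values, lam):
    # Convex piecewise-linear objective: the minimum over [0, lam] is attained
    # at an endpoint or at a kink eta = V(s') lying in the interval.
    candidates = [0.0, lam] + [v for v in values if 0.0 <= v <= lam]
    return min(tv_dual_objective(eta, p_nominal, values) for eta in candidates)
```

In a learning setting the expectation is not available in closed form, which is why the algorithm works with a dual variable function fitted by empirical risk minimization rather than an exact search like this one.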

We recall the initial state distribution $d_{0}$ . Since $V^{\pi^{*}}(s)\geq V^{\pi_{K}}(s)$ for any $s\in\mathcal{S}$ , we begin with step $(b)$ in the proof of Theorem 1:

$\displaystyle 0\leq$	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}(s_{0})-{V}^{\pi_{K}}(s_{% 0})]$
	$\displaystyle\leq\mathbb{E}_{s_{0}\sim d_{0}}[Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))+Q_{K}(s_{0},\pi_{K}(s_{0}))-Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))$
	$\displaystyle\hskip 56.9055pt+\gamma[\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi^{*}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})}))$
	$\displaystyle\hskip 85.35826pt-\min_{P_{s_{0},\pi_{K}(s_{0})}\ll P^{o}_{s_{0},\pi_{K}(s_{0})}}(\mathbb{E}_{s_{1}\sim P_{s_{0},\pi_{K}(s_{0})}}[V^{\pi_{K}}(s_{1})]+\lambda D_{\varphi}(P_{s_{0},\pi_{K}(s_{0})},P^{o}_{s_{0},\pi_{K}(s_{0})}))]]$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))|]+\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0}))|]$
	$\displaystyle\hskip 113.81102pt+\gamma\mathbb{E}_{s_{0}\sim d_{0}}\sup_{\eta}(\mathbb{E}_{s_{1}\sim P^{o}_{s_{0},\pi_{K}(s_{0})}}((\eta-V^{\pi_{K}}(s_{1}))_{+}-(\eta-V^{\pi^{*}}(s_{1}))_{+}))$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))-Q_{K}(s_{0},\pi^{*}(s_{0}))|]+\mathbb{E}_{s_{0}\sim d_{0}}[|Q^{\pi^{*}}(s_{0},\pi_{K}(s_{0}))-Q_{K}(s_{0},\pi_{K}(s_{0}))|]$
	$\displaystyle\hskip 113.81102pt+\gamma\mathbb{E}_{s_{0}\sim d_{0}}\mathbb{E}_{s_{1}\sim P^{o}_{s_{0},\pi_{K}(s_{0})}}(|V^{\pi^{*}}(s_{1})-V^{\pi_{K}}(s_{1})|)$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\sum_{h=0}^{\infty}\gamma^{h}\times\bigg{(}\mathbb{E}_{s\sim d_{h,\pi_{K}}}[|Q^{\pi^{*}}(s,\pi^{*}(s))-Q_{K}(s,\pi^{*}(s))|+|Q^{\pi^{*}}(s,\pi_{K}(s))-Q_{K}(s,\pi_{K}(s))|]\bigg{)},$	(31)

where $(a)$ follows from (30) and the fact $|\sup_{x}f(x)-\sup_{x}g(x)|\leq\sup_{x}|f(x)-g(x)|$ , and $(b)$ follows from the facts $(x)_{+}-(y)_{+}\leq(x-y)_{+}$ and $(x)_{+}\leq|x|$ for any $x,y\in\mathbb{R}$ . Importantly, the expectation in step $(b)$ is taken under the nominal model $P^{o}$ , unlike step $(c)$ in the proof of Theorem 1, where it is taken under the worst-case model $P^{\pi_{K},\min}$ . This is what allows us to improve the concentrability assumption in the subsequent analysis. Finally, $(c)$ follows by telescoping over $|V^{\pi^{*}}-V^{\pi_{K}}|$ and defining a new state distribution $d_{h,\pi_{K}}\in\Delta(\mathcal{S})$ , for all natural numbers $h\geq 0$ , as

d_{h,\pi_{K}}=\begin{cases}d_{0}&\text{if $h=0$},\\ P^{o}_{s^{\prime},\pi_{K}(s^{\prime})}&\text{otherwise, with }s^{\prime}\sim d% _{h-1,\pi_{K}}.\end{cases}

For (31), with the $\nu$ -norm notation, i.e., $\|f\|_{p,\nu}=(\mathbb{E}_{s,a\sim\nu}|f(s,a)|^{p})^{1/p}$ for any $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$ , we have

	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[{V}^{\pi^{*}}]-\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi_{K}}]$	$\displaystyle\leq\sum_{h=0}^{\infty}\gamma^{h}\bigg{(}\|Q^{\pi^{*}}-Q_{K}\|_{1,d_{h,\pi_{K}}\circ\pi^{*}}+\|Q^{\pi^{*}}-Q_{K}\|_{1,d_{h,\pi_{K}}\circ\pi_{K}}\bigg{)},$
		$\displaystyle\leq\sum_{h=0}^{\infty}\gamma^{h}(2\sup_{\nu}\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu}),$		(32)

where the second inequality follows since both $d_{h,\pi_{K}}\circ\pi^{*}$ and $d_{h,\pi_{K}}\circ\pi_{K}$ satisfy Assumption 9. We now analyze the summand in (32):

	$\displaystyle\|Q^{\pi^{*}}-Q_{K}\|_{1,\nu}\leq\|Q^{\pi^{*}}-\mathcal{T}Q_{K-1}\|_{1,\nu}+\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\nu}$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}\|Q^{\pi^{*}}-\mathcal{T}Q_{K-1}\|_{1,\nu}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle=(\mathbb{E}_{s,a\sim\nu}|Q^{\pi^{*}}(s,a)-\mathcal{T}Q_{K-1}(s,a)|)+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{\leq}}(\mathbb{E}_{s,a\sim\nu}\gamma\sup_{\eta}|\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}((\eta-\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime}))_{+}-(\eta-\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime}))_{+})|)$
	$\displaystyle\hskip 170.71652pt+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(f)}}{{\leq}}\gamma(\mathbb{E}_{s,a\sim\nu}|\mathbb{E}_{s^{\prime}\sim P^{o}_{s,a}}(\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})-\max_{a^{\prime}}Q_{K-1}(s^{\prime},a^{\prime}))_{+}|)+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(g)}}{{\leq}}\gamma(\mathbb{E}_{s,a\sim\nu}\mathbb{E}_{s^{\prime}\sim{P}^{o}_{s,a}}\max_{a^{\prime}}|Q^{\pi^{*}}(s^{\prime},a^{\prime})-Q_{K-1}(s^{\prime},a^{\prime})|)+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(h)}}{{\leq}}\gamma\|Q^{\pi^{*}}-Q_{K-1}\|_{1,\nu^{\prime}}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-Q_{K}\|_{1,\mu}$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\gamma\|Q^{\pi^{*}}-Q_{K-1}\|_{1,\nu^{\prime}}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}_{g_{K-1}}Q_{K-1}-Q_{K}\|_{2,\mu}+\sqrt{C_{\mathrm{tv}}}\|\mathcal{T}Q_{K-1}-\mathcal{T}_{g_{K-1}}Q_{K-1}\|_{1,\mu},$

where $(d)$ follows by Assumption 9, $(e)$ from Eq. 30 and the fact $|\sup_{x}p(x)-\sup_{x}q(x)|\leq\sup_{x}|p(x)-q(x)|$ , $(f)$ from the fact $|(x)_{+}-(y)_{+}|\leq|(x-y)_{+}|$ , $(g)$ follows by Jensen’s inequality and by the facts $|\sup_{x}p(x)-\sup_{x}q(x)|\leq\sup_{x}|p(x)-q(x)|$ and $(x)_{+}\leq|x|$ , $(h)$ follows by defining the distribution $\nu^{\prime}$ as $\nu^{\prime}(s^{\prime},a^{\prime})=\sum_{s,a}\nu(s,a){P}^{o}_{s,a}(s^{\prime}% )\mathds{1}\{a^{\prime}=\operatorname*{arg\,max}_{b}|Q^{\pi^{*}}(s^{\prime},b)% -Q_{K-1}(s^{\prime},b)|\}$ , and $(i)$ using the fact that $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$ . The rest of the proof follows similarly as in the proof of Theorem 1. ∎

Appendix E Hybrid Robust $\varphi$ -regularized RL Results

In this section, we set $V_{\max}=H$ whenever we use results from Proposition 3. We remark that we have attempted to optimize the absolute constants inside $\log$ factors of the performance guarantees. In the following, we use constants $c_{1},c_{2},c_{3}$ from Proposition 3.

Now we provide an extension of Proposition 7 using Proposition 4 when the data comes from adaptive sampling.

Proposition 9 (Online Dual Optimization Error Bound).

Fix $\delta\in(0,1)$ . For $k\in\{0,1,\cdots,K-1\}$ , $h\in\{0,1,\cdots,H-1\}$ , let $g^{k}_{h}$ be the dual optimization function from Algorithm 2 (Step 4) for the state-action value function $Q^{k}_{h+1}$ using samples in the dataset $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{k-1}_{h}\}$ . Let $\mathcal{T}_{g}$ be as defined in (14) and let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$ . Then, with probability at least $1-\delta$ , we have

	$\displaystyle\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,\mu_{h}}\leq\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48c_{1}\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)=\Delta_{\mathrm{dual,off}}\quad\text{and}$
	$\displaystyle\sum_{\tau=0}^{k-1}\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,d_{h}^{\pi_{\tau}}}\leq\frac{1}{m_{\mathrm{on}}}\left(3\varepsilon_{\mathcal{G}}N+48c_{1}\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)=\Delta_{\mathrm{dual,on}}.$

Proof.

Fix $k\in\{0,1,\cdots,K-1\}$ , $h\in\{0,1,\cdots,H-1\}$ , $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ . The algorithm solves for $g^{k}_{h}$ in the empirical risk minimization step as:

\displaystyle g^{k}_{h}=\operatorname*{arg\,min}_{g\in\mathcal{G}_{h}}\widehat% {L}_{\mathrm{dual}}(g;Q^{k}_{h+1},\mathcal{D}),

where dataset $\mathcal{D}=\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i\leq N}$ with $N=m_{\mathrm{off}}+k\cdot m_{\mathrm{on}}$ . The first $m_{\mathrm{off}}$ samples in $\mathcal{D}$ are $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{% \mu}_{h}$ (recall that these are generated by the offline state-action distribution $\mu_{h}$ ), the next $m_{\mathrm{on}}$ samples are $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}% +m_{\mathrm{on}}}=\mathcal{D}^{0}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{0}}$ ), and so on where the samples $\{(s^{i}_{h},a^{i}_{h},s^{i}_{h+1})\}_{i=m_{\mathrm{off}}+\tau\cdot m_{\mathrm% {on}}+1}^{m_{\mathrm{off}}+(\tau+1)m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{\tau}}$ ) for all $\tau\leq k-1$ . We first have the following from step (b) in the proof of Proposition 7:

	$\displaystyle m_{\mathrm{off}}$	$\displaystyle\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,\mu_{h}}+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,d_{h}^{\pi_{\tau}}}$
		$\displaystyle=m_{\mathrm{off}}[\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{k}_{h}(s,a))$
		$\displaystyle\hskip 28.45274pt-\inf_{g\in L^{1}(\mu_{h})}\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g(s,a))]$
		$\displaystyle\hskip 28.45274pt+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}[\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{k}_{h}(s,a))$
		$\displaystyle\hskip 28.45274pt-\inf_{g\in L^{1}(d_{h}^{\pi_{\tau}})}\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g(s,a))]$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}m_{\mathrm{off}}[\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{k}_{h}(s,a))$
		$\displaystyle\hskip 28.45274pt-\mathbb{E}_{s,a\sim\mu_{h},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{*}_{-1}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{*}_{-1}(s,a))]$
		$\displaystyle\hskip 28.45274pt+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}[\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{k}_{h}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{k}_{h}(s,a))$
		$\displaystyle\hskip 28.45274pt-\mathbb{E}_{s,a\sim d_{h}^{\pi_{\tau}},s^{\prime}\sim P^{o}_{s,a}}(\lambda\varphi^{*}({(g^{*}_{\tau}(s,a)-\max_{a^{\prime}}Q^{k}_{h+1}(s^{\prime},a^{\prime}))}/{\lambda})-g^{*}_{\tau}(s,a))]$
		$\displaystyle=\sum_{i=1}^{m_{\mathrm{off}}}\mathbb{E}_{s^{i}_{h},a^{i}_{h}\sim\mu_{h},s^{i}_{h+1}\sim P^{o}_{s^{i}_{h},a^{i}_{h}}}[(\lambda\varphi^{*}({(g^{k}_{h}(s^{i}_{h},a^{i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{k}_{h}(s^{i}_{h},a^{i}_{h}))$
		$\displaystyle\hskip 28.45274pt-(\lambda\varphi^{*}({(g^{*}_{-1}(s^{i}_{h},a^{i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{*}_{-1}(s^{i}_{h},a^{i}_{h}))]$
		$\displaystyle\hskip 28.45274pt+\sum_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}\mathbb{E}_{s^{i}_{h},a^{i}_{h}\sim d_{h}^{\pi_{0}},s^{i}_{h+1}\sim P^{o}_{s^{i}_{h},a^{i}_{h}}}[(\lambda\varphi^{*}({(g^{k}_{h}(s^{i}_{h},a^{i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{k}_{h}(s^{i}_{h},a^{i}_{h}))$
		$\displaystyle\hskip 56.9055pt-(\lambda\varphi^{*}({(g^{*}_{0}(s^{i}_{h},a^{i}_{h})-\max_{a^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})-g^{*}_{0}(s^{i}_{h},a^{i}_{h}))]$
		$\displaystyle\hskip 56.9055pt+\cdots$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}3\varepsilon_{\mathcal{G}}N+48c_{1}\log(2|\mathcal{G}||\mathcal{F}|/\delta),$

where $(a)$ follows by defining the corresponding true solutions $g^{*}_{\tau}$ for all $\tau\in\{-1,0,1,\cdots,k-1\}$ . For $(b)$ , with the empirical risk minimization solution $g^{k}_{h}$ , we use Proposition 4 with $c=c_{1}$ (where $c_{1}$ is the constant from Proposition 3, which depends on $H$ and $\lambda$ ), noting that $g^{k}_{h}\in\mathcal{G}_{h}$ and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ with sizes $|\mathcal{G}_{h}|\leq|\mathcal{G}|$ and $|\mathcal{F}_{h+1}|\leq|\mathcal{F}|$ under the union bound. Taking a union bound over $k\in\{0,1,\cdots,K-1\}$ and $h\in\{0,1,\cdots,H-1\}$ , and bounding each term separately, completes the proof. ∎

Now we provide an extension of Proposition 8 using Lemma 7 when the data comes from adaptive sampling.

Proposition 10 (Online Least-squares Generalization Bound).

Fix $\delta\in(0,1)$ . For $k\in\{0,1,\cdots,K-1\}$ , $h\in\{0,1,\cdots,H-1\}$ , let $Q^{k}_{h}$ be the least-squares solution from Algorithm 2 (Step 5) for the state-action value function $Q^{k}_{h+1}$ and dual variable function $g^{k}_{h}$ using samples in the dataset $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{k-1}_{h}\}$ . Let $\mathcal{T}_{g}$ be as defined in (14) and let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$ . Then, with probability at least $1-\delta$ , we have

	$\displaystyle\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu_{h}}\leq\frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+c_{1}+H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)=\Delta_{\mathrm{rQ,off}}\quad\text{and}$
	$\displaystyle\sqrt{\sum_{\tau=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d_{h}^{\pi_{\tau}}}^{2}}\leq\frac{1}{\sqrt{m_{\mathrm{on}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+c_{1}+H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)=\Delta_{\mathrm{rQ,on}}.$

Proof.

We adapt the proof of Song et al., (2023, Lemma 7) here. Fix $k\in\{0,1,\cdots,K-1\}$ , $h\in\{0,1,\cdots,H-1\}$ , $g^{k}_{h}\in\mathcal{G}_{h}$ , and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ . The algorithm solves for $Q^{k}_{h}$ in the least-squares regression step as:

\displaystyle Q^{k}_{h}=\operatorname*{arg\,min}_{Q\in\mathcal{F}_{h}}\widehat% {L}_{\mathrm{robQ}}(Q;Q^{k}_{h+1},g^{k}_{h},\mathcal{D}),

where dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i\leq N}$ with $N=m_{\mathrm{off}}+k\cdot m_{\mathrm{on}}$ and

\displaystyle x_{i}=(s^{i}_{h},a^{i}_{h})\qquad\text{and}\qquad y_{i}=r_{h}(s^% {i}_{h},a^{i}_{h})-\lambda\varphi^{*}({(g^{k}_{h}(s^{i}_{h},a^{i}_{h})-\max_{a% ^{\prime}}Q^{k}_{h+1}(s^{i}_{h+1},a^{\prime}))}/{\lambda})+g^{k}_{h}(s^{i}_{h}% ,a^{i}_{h}).

The first $m_{\mathrm{off}}$ samples in $\mathcal{D}$ are $\{(x_{i},y_{i})\}_{i\leq m_{\mathrm{off}}}=\mathcal{D}^{\mu}_{h}$ (recall that these are generated by the offline state-action distribution $\mu_{h}$ ), the next $m_{\mathrm{on}}$ samples are $\{(x_{i},y_{i})\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}=% \mathcal{D}^{0}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{0}}$ ), and so on where the samples $\{(x_{i},y_{i})\}_{i=m_{\mathrm{off}}+\tau\cdot m_{\mathrm{on}}+1}^{m_{\mathrm% {off}}+(\tau+1)m_{\mathrm{on}}}=\mathcal{D}^{\tau}_{h}$ (recall that these are generated by the state-action distribution $d_{h}^{\pi_{\tau}}$ ) for all $\tau\leq k-1$ .
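As a minimal sketch of this data layout (not part of the HyTQ algorithm statement), the Python snippet below maps a 1-based sample index to its data block and forms the regression target in the total variation case, where the target mirrors the form of $\mathcal{T}_{g}$ used later in the proof of Theorem 2, namely $y=r-(g-\max_{a^{\prime}}Q(s^{\prime},a^{\prime}))_{+}+g$ . The function names `block_of` and `tv_target` are hypothetical and introduced only for illustration.

```python
def block_of(i, m_off, m_on):
    """Return "mu" if the 1-based sample index i lies in the offline block,
    otherwise the online round tau whose block D^tau_h contains it."""
    if i < 1:
        raise ValueError("sample indices are 1-based")
    if i <= m_off:
        return "mu"
    # Block tau covers indices m_off + tau*m_on + 1, ..., m_off + (tau+1)*m_on.
    return (i - m_off - 1) // m_on


def tv_target(r, g, next_q_values):
    """Regression target y = r - (g - max_a' Q(s', a'))_+ + g.
    Assumption: total variation case, matching the form of T_g in the proof."""
    return r - max(g - max(next_q_values), 0.0) + g


# Example: 3 offline samples followed by online rounds of 2 samples each.
layout = [block_of(i, m_off=3, m_on=2) for i in range(1, 8)]
```

Here `layout` lists the block label of each of the first seven samples, so the offline block comes first and each online round occupies a contiguous run of `m_on` indices.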

To apply Lemma 7, we first note that for any sample $(x,y)$ in $\mathcal{D}$ with $x=(s_{h},a_{h})$ and $y=(r_{h}(s_{h},a_{h})-\lambda\varphi^{*}({(g^{k}_{h}(s_{h},a_{h})-\max_{a^{\prime}\in\mathcal{A}_{h+1}}Q^{k}_{h+1}(s_{h+1},a^{\prime}))}/{\lambda})+g^{k}_{h}(s_{h},a_{h}))$ , there exists some $f_{h+1}\in\mathcal{F}_{h+1}$ by Assumption 5 such that the following holds:

	$\displaystyle\mathbb{E}[y\mid x]$	$\displaystyle=\mathbb{E}_{s_{h+1}\sim P^{o}_{h,s_{h},a_{h}}}(r_{h}(s_{h},a_{h}% )-\lambda\varphi^{*}({(g^{k}_{h}(s_{h},a_{h})-\max_{a^{\prime}\in\mathcal{A}_{% h+1}}Q^{k}_{h+1}(s_{h+1},a^{\prime}))}/{\lambda})+g^{k}_{h}(s_{h},a_{h}))$
		$\displaystyle=\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}(s_{h},a_{h})\leq f_{h+1}(s_{h% },a_{h}).$

We also note that, for any sample in $\mathcal{D}$ , $|y|\leq 1+c_{1}$ (where $c_{1}$ is the constant from Proposition 3, dependent on $H$ and $\lambda$ ) and $f_{h+1}(s,a)\leq H$ for all $s,a$ . With these observations, applying Lemma 7, we get that the least-squares regression solution $Q^{k}_{h}$ satisfies

\displaystyle\sum_{i=1}^{N}\mathbb{E}[(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}(x_{i% })-Q^{k}_{h}(x_{i}))^{2}\mid\mathcal{D}]

\displaystyle\leq 3\varepsilon_{\mathcal{F},\mathrm{r}}N+64(1+c_{1}+H)^{2}\log% (2|\mathcal{G}||\mathcal{F}|/\delta)

with probability at least $1-\delta$ , since $g^{k}_{h}\in\mathcal{G}_{h}$ and $Q^{k}_{h+1}\in\mathcal{F}_{h+1}$ with sizes $|\mathcal{G}_{h}|\leq|\mathcal{G}|$ and $|\mathcal{F}_{h+1}|\leq|\mathcal{F}|$ under the union bound. Recall the samples in $\mathcal{D}^{\mu}_{h}$ are independently and identically drawn from the offline distribution $\mu_{h}$ , and the samples in $\mathcal{D}^{\tau}_{h}$ are independently and identically drawn from the state-action distribution $d_{h}^{\pi_{\tau}}$ . Thus we can further write this as

\displaystyle m_{\mathrm{off}}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu_{h}}^{2}+m_{\mathrm{on}}\sum_{\tau=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d_{h}^{\pi_{\tau}}}^{2}

\displaystyle\leq 3\varepsilon_{\mathcal{F},\mathrm{r}}N+64(1+c_{1}+H)^{2}\log% (2|\mathcal{G}||\mathcal{F}|/\delta).

Taking a union bound over $k\in\{0,1,\cdots,K-1\}$ , $h\in\{0,1,\cdots,H-1\}$ , bounding each term separately, and using the fact $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ , completes the proof. ∎
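To get a feel for the scale of the two bounds in Proposition 10, the following sketch evaluates their common right-hand side numerically. The helper name `delta_rq` and all numerical values are illustrative assumptions; the constant $c_{1}=2\lambda+H$ is taken as it is used later in the proof of Theorem 2.

```python
import math


def delta_rq(m, eps_f, n_total, c1, horizon, num_rounds, size_g, size_f, delta):
    """Right-hand side of Proposition 10 for a block of m samples per round.
    Use m = m_off for Delta_{rQ,off} and m = m_on for Delta_{rQ,on}."""
    log_term = math.log(2 * horizon * num_rounds * size_g * size_f / delta)
    bias = math.sqrt(3 * eps_f * n_total)
    deviation = 8 * (1 + c1 + horizon) * math.sqrt(log_term)
    return (bias + deviation) / math.sqrt(m)


# Illustrative (assumed) values: K rounds, m_on = 1, m_off = K, N = m_off + K * m_on.
K, H, lam = 100, 5, 1.0
c1 = 2 * lam + H
N = K + K * 1
delta_on = delta_rq(1, 0.0, N, c1, H, K, 10, 10, 0.1)
delta_off = delta_rq(K, 0.0, N, c1, H, K, 10, 10, 0.1)
```

With these choices the offline bound is smaller than the on-policy bound by exactly the factor $\sqrt{m_{\mathrm{off}}/m_{\mathrm{on}}}=\sqrt{K}$ , since the two right-hand sides differ only in the leading $1/\sqrt{m}$ factor.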

We are now ready to prove the main theorem.

E.1 Proof of Theorem 2

Theorem 5 (Restatement of Theorem 2).

Let Assumptions 4, 5, 6, 7 and 8 hold and fix any $\delta\in(0,1)$ . Then, HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$ satisfy

	$\displaystyle\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq$	$\displaystyle\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+% \varepsilon_{\mathcal{G}})K^{5/2}H)$
		$\displaystyle+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))})$

with probability at least $1-\delta$ .

Proof.

We let $V^{k}_{h}(s)=Q^{k}_{h}(s,\pi_{k}(s))$ for every $s,h$ . Since $\pi_{k}$ is the greedy policy w.r.t. $Q^{k}$ , we also have $V^{k}_{h}(s)=Q^{k}_{h}(s,\pi_{k}(s))=\max_{a}Q^{k}_{h}(s,a)$ . We recall that $V^{*}=V^{\pi^{*}}$ and $Q^{*}=Q^{\pi^{*}}$ . We also note that, by (Zhang et al.,, 2023), for any stationary Markov policy $\pi$ the function $Q^{\pi}$ satisfies $Q^{\pi}_{h}(s,a)=r_{h}(s,a)+\gamma\min_{P_{h,s,a}\ll P^{o}_{h,s,a}}(\mathbb{E}_{s^{\prime}\sim P_{h,s,a}}[V^{\pi}_{h+1}(s^{\prime})]+\lambda D_{\varphi}(P_{h,s,a},P^{o}_{h,s,a})).$ We can now further use the dual form (4) under Assumption 8, that is, for all $\pi$ and $f_{h+1}\in\mathcal{F}_{h+1}$ ,

$\displaystyle Q^{\pi}_{h}(s,a)$	$\displaystyle=r_{h}(s,a)-\inf_{\eta\in[0,\lambda]}~{}(\mathbb{E}_{s^{\prime}% \sim P^{o}_{h,s,a}}[(\eta-V^{\pi}_{h+1}(s^{\prime}))_{+}]-\eta),\text{ and}$	(33)
$\displaystyle(\mathcal{T}f_{h+1})(s,a)$	$\displaystyle=r_{h}(s,a)-\inf_{\eta\in[0,\lambda]}~{}(\mathbb{E}_{s^{\prime}% \sim P^{o}_{h,s,a}}[(\eta-\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime}))_{+}% ]-\eta)$
$\displaystyle(\mathcal{T}_{g_{h}}f_{h+1})(s,a)$	$\displaystyle=r_{h}(s,a)-\mathbb{E}_{s^{\prime}\sim P^{o}_{h,s,a}}[(g_{h}(s,a)% -\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime}))_{+}]+g_{h}(s,a).$
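As a minimal numerical sketch of the two operators above, assume a finite set of next states with nominal probabilities `probs` and values `values` standing in for $\max_{a^{\prime}}f_{h+1}(s^{\prime},a^{\prime})$ . The function names are hypothetical. Because $\eta\mapsto\mathbb{E}[(\eta-v)_{+}]-\eta$ is convex and piecewise linear with kinks at the values, the infimum over $[0,\lambda]$ is attained at an endpoint or at a kink, so a finite search suffices in this toy setting.

```python
def dual_objective(eta, probs, values):
    """E[(eta - v)_+] - eta under the nominal next-state distribution."""
    return sum(p * max(eta - v, 0.0) for p, v in zip(probs, values)) - eta


def robust_backup(r, probs, values, lam):
    """(T f)(s, a) = r - inf_{eta in [0, lam]} (E[(eta - v)_+] - eta).
    The objective is convex piecewise linear in eta, so it is enough to
    check the endpoints and the kinks that fall inside [0, lam]."""
    candidates = [0.0, lam] + [v for v in values if 0.0 <= v <= lam]
    return r - min(dual_objective(eta, probs, values) for eta in candidates)


def fixed_dual_backup(r, probs, values, g):
    """(T_g f)(s, a) = r - E[(g - v)_+] + g for a fixed dual value g."""
    return r - dual_objective(g, probs, values)


# Toy example (assumed numbers): three next states, lambda = 2.
probs, values, lam = [0.5, 0.3, 0.2], [0.0, 1.0, 3.0], 2.0
t_f = robust_backup(0.0, probs, values, lam)
t_g = fixed_dual_backup(0.0, probs, values, g=1.0)
```

In this example `t_f` is approximately 0.7 and `t_g` is approximately 0.5. More generally, for any fixed $g\in[0,\lambda]$ the value of $\mathcal{T}_{g}f$ cannot exceed $\mathcal{T}f$ , because $\mathcal{T}f$ optimizes over the dual variable.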

We first characterize the performance decomposition between $V^{\pi^{*}}_{0}$ and ${V}^{\pi_{k}}_{0}$ . We recall the initial state distribution $d_{0}$ . Since $V^{\pi^{*}}(s)\geq V^{\pi_{k}}(s)$ for any $s\in\mathcal{S}$ , we observe that

$\displaystyle 0\leq$	$\displaystyle\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]=\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(V^{\pi^{*}}_{0}(s_{0})-V^{k}_{0}(s_{0}))-(V^{\pi_{k}}_{0}(s_{0})-V^{k}_{0}(s_{0}))]$
	$\displaystyle=\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))-(Q^{\pi_{k}}_{0}(s_{0},\pi_{k}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))]$
	$\displaystyle\leq\underbrace{\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]}_{(I)}+\underbrace{\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}_{0}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_{k}}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]}_{(II)}.$	(34)

We rewrite the state-action distribution $d^{h,\pi}_{P^{o}}$ , dropping $P^{o}$ , as $d^{\pi}_{h}$ for simplicity. Letting $d^{\pi}_{h}$ also denote a state distribution ( $\Delta(\mathcal{S})$ ), we can write it as, for all $h$ ,

d^{\pi}_{h}=\begin{cases}d_{0}&\text{if $h=0$},\\ P^{o}_{h,s^{\prime},a^{\prime}}&\text{otherwise, with }s^{\prime}\sim d^{\pi}_% {h-1},a^{\prime}\sim\pi_{h}(s^{\prime}).\end{cases}

(35)

We now analyze one term in $(I)$ of (34), starting from the facts that $\pi_{k}$ is the greedy policy with respect to $Q^{k}$ and that the function $(x)_{+}$ is non-decreasing in $x\in\mathbb{R}$ :

	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]\leq\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(Q^{\pi^{*}}_{0}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(Q^{\pi^{*}}_{0}(s_{0},a_{0})-\mathcal{T}Q^{k}_{1}(s_{0},a_{0}))_{+}]+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}(\sup_{\eta}(\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}[(\eta-\max_{a^{\prime}}Q^{k}_{1}(s_{1},a^{\prime}))_{+}-(\eta-\max_{a^{\prime}}Q^{\pi^{*}}_{1}(s_{1},a^{\prime}))_{+}]))_{+}$
	$\displaystyle\hskip 85.35826pt+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}(\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}(\max_{a^{\prime}}Q^{\pi^{*}}_{1}(s_{1},a^{\prime})-\max_{a^{\prime}}Q^{k}_{1}(s_{1},a^{\prime}))_{+})_{+}+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}\mathbb{E}_{s_{1}\sim P^{o}_{0,s_{0},a_{0}}}(Q^{\pi^{*}}_{1}(s_{1},\pi^{*}(s_{1}))-Q^{k}_{1}(s_{1},\pi_{k}(s_{1})))_{+}+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}]$
	$\displaystyle=\mathbb{E}_{s_{1}\sim d^{\pi^{*}}_{1}}[(Q^{\pi^{*}}_{1}(s_{1},\pi^{*}(s_{1}))-Q^{k}_{1}(s_{1},\pi_{k}(s_{1})))_{+}]+\mathbb{E}_{s_{0},a_{0}\sim d^{\pi^{*}}_{0}}[(\mathcal{T}Q^{k}_{1}(s_{0},a_{0})-Q^{k}_{0}(s_{0},a_{0}))_{+}],$		(36)

where $(a)$ follows by the triangle inequality for the $(\cdot)_{+}$ operation, i.e., $(x+y)_{+}\leq(x)_{+}+(y)_{+}$ , $(b)$ from the Bellman equation, the definition of the operator $\mathcal{T}$ , and the fact $\inf_{x}p(x)-\inf_{x}q(x)\leq\sup_{x}(p(x)-q(x))$ , $(c)$ from the fact $(x)_{+}-(y)_{+}\leq(x-y)_{+}$ for any $x,y\in\mathbb{R}$ , and $(d)$ from Jensen’s inequality and the definitions of the policies $\pi^{*}$ and $\pi_{k}$ . Now, applying this argument recursively over the horizon to the first term in (36), we get

	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}_{0}(s_{0},\pi^{*}(s_{0}))-Q^{k}_{0}(s_{0},\pi_{k}(s_{0})))_{+}]$
	$\displaystyle\leq\mathbb{E}_{s_{H}\sim d_{H}}[(Q^{\pi^{*}}_{H}(s_{H},\pi^{*}(s_{H}))-Q^{k}_{H}(s_{H},\pi_{k}(s_{H})))_{+}]+\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(\mathcal{T}Q^{k}_{h+1}(s,a)-Q^{k}_{h}(s,a))_{+}]$
	$\displaystyle\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(% \mathcal{T}Q^{k}_{h+1}(s,a)-Q^{k}_{h}(s,a))_{+}],$		(37)

where the last inequality holds since $V^{\pi}_{H}(s_{H})=0$ for all $\pi$ and $Q^{k}_{H}(s_{H},\pi_{k}(s_{H}))=0$ .
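The three elementary facts used in steps $(a)$ to $(c)$ of (36) can be spot-checked numerically. The sketch below is only a sanity check on random inputs, not a proof, and the function names `pos` and `check_facts` are hypothetical.

```python
import random


def pos(x):
    """The positive part (x)_+ = max(x, 0)."""
    return max(x, 0.0)


def check_facts(trials=1000, seed=0):
    """Spot-check the elementary inequalities behind steps (a)-(c) of (36)."""
    rng = random.Random(seed)
    tol = 1e-12
    for _ in range(trials):
        x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
        # Step (a): (x + y)_+ <= (x)_+ + (y)_+.
        assert pos(x + y) <= pos(x) + pos(y) + tol
        # Step (c): (x)_+ - (y)_+ <= (x - y)_+.
        assert pos(x) - pos(y) <= pos(x - y) + tol
        # Step (b): inf p - inf q <= sup (p - q), here on a finite grid.
        p = [rng.uniform(-5, 5) for _ in range(6)]
        q = [rng.uniform(-5, 5) for _ in range(6)]
        assert min(p) - min(q) <= max(pi - qi for pi, qi in zip(p, q)) + tol
    return True
```

Calling `check_facts()` raises an `AssertionError` if any of the three inequalities fails on a sampled input and returns `True` otherwise.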

Recall

C(\pi^{*})=\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{% \pi^{*}}_{h}}[(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a))_{+}]}{\sum_{h=0}^{H-1}% \mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}.

Now, using (37) in $(I)$ of (34), the following holds with probability at least $1-\delta/2$ :

$\displaystyle\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{\pi^{*}}(s_{0},\pi^{*}(s_{0}))$	$\displaystyle-Q^{k}(s_{0},\pi_{k}(s_{0})))_{+}]\leq\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(\mathcal{T}Q^{k}_{h+1}(s,a)-Q^{k}_{h}(s,a))_{+}]$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{\leq}}\sum_{k=0}^{K-1}C(\pi^{*})\sum_{h=0}^{H-1}\|\mathcal{T}Q^{k}_{h+1}-Q^{k}_{h}\|_{1,\mu_{h}}$
	$\displaystyle\stackrel{{\scriptstyle(f)}}{{\leq}}\sum_{k=0}^{K-1}C(\pi^{*})\sum_{h=0}^{H-1}(\|\mathcal{T}Q^{k}_{h+1}-\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}\|_{1,\mu_{h}}+\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,\mu_{h}})$
	$\displaystyle\stackrel{{\scriptstyle(g)}}{{\leq}}KHC(\pi^{*})(\Delta_{\mathrm{dual,off}}+\Delta_{\mathrm{rQ,off}}),$	(38)

where $(e)$ follows from definition of $C(\pi^{*})$ in Assumption 4, $(f)$ from triangle inequality and the fact $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$ , and $(g)$ follows from Propositions 9 and 10.

For $(II)$ , firstly we note $\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_{k}}(s_{0},% \pi_{k}(s_{0})))_{+}]=\mathbb{E}_{s_{0},a_{0}\sim d^{\pi_{k}}_{0}}[(Q^{k}(s_{0% },a_{0})-Q^{\pi_{k}}(s_{0},a_{0}))_{+}]$ . So, following the same analysis as in $(I)$ , we get

	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_% {k}}(s_{0},\pi_{k}(s_{0})))_{+}]\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi% _{k}}_{h}}[(Q^{k}_{h}(s,a)-\mathcal{T}Q^{k}_{h+1}(s,a))_{+}]$
	$\displaystyle\leq\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[(Q^{k}_{% h}(s,a)-(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1})(s,a))_{+}+((\mathcal{T}_{g^{k}_{h% }}Q^{k}_{h+1})(s,a)-(\mathcal{T}Q^{k}_{h+1})(s,a))_{+}],$		(39)

where the last inequality follows by triangle inequality for $(\cdot)_{+}$ operation.

Now, using (39) in $(II)$ of (34), we have

	$\displaystyle\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}(s_{0},\pi_{k}% (s_{0}))-Q^{\pi_{k}}(s_{0},\pi_{k}(s_{0})))_{+}]\leq$
	$\displaystyle\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{k}}_{% h}}[(Q^{k}_{h}(s,a)-(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1})(s,a))_{+}]+\sum_{k=0}% ^{K-1}\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[((\mathcal{T}_{g^{k% }_{h}}Q^{k}_{h+1})(s,a)-\mathcal{T}Q^{k}_{h+1}(s,a))_{+}].$		(40)

Recall the bilinear model from Assumption 7: $\mathbb{E}_{d^{\pi^{f}}_{h}}[(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle\right\rvert$ .

Analyzing the first part of (40), the following holds with probability at least $1-\delta/2$ :

$\displaystyle\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}$	$\displaystyle\mathbb{E}_{d^{\pi_{k}}_{h}}[(Q^{k}_{h}-\mathcal{T}_{g^{k}_{h}}Q^% {k}_{h+1})_{+}]\stackrel{{\scriptstyle(h)}}{{=}}\sum_{k=0}^{K-1}\sum_{h=0}^{H-% 1}\left\lvert\left\langle X_{h}(Q^{k}),W^{\mathrm{q}}_{h}(Q^{k},g^{k})\right% \rangle\right\rvert$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\|W^{\mathrm{q}}_{h}(Q^{k},g^{k})\|_{\Sigma_{k-1;h}}$
	$\displaystyle=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{(W^{\mathrm{q}}_{h}(Q^{k},g^{k}))^{\top}\Sigma_{k-1;h}W^{\mathrm{q}}_{h}(Q^{k},g^{k})}$
	$\displaystyle=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{(W^{\mathrm{q}}_{h}(Q^{k},g^{k}))^{\top}(\sum_{i=0}^{k-1}X_{h}(Q^{i})X_{h}(Q^{i})^{\top}+\sigma\mathds{1})W^{\mathrm{q}}_{h}(Q^{k},g^{k})}$
	$\displaystyle=\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\left\lvert\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle\right\rvert^{2}+\sigma\left\|W^{\mathrm{q}}_{h}(Q^{k},g^{k})\right\|^{2}}$
	$\displaystyle\stackrel{{\scriptstyle(j)}}{{\leq}}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\left\lvert\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle\right\rvert^{2}+\sigma B_{W}^{2}}$
	$\displaystyle\stackrel{{\scriptstyle(k)}}{{\leq}}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}\sqrt{\sum_{i=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}^{2}+\sigma B_{W}^{2}}$
	$\displaystyle\stackrel{{\scriptstyle(l)}}{{\leq}}\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}(\sqrt{\sum_{i=0}^{k-1}\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}^{2}}+\sqrt{\sigma B_{W}^{2}})$
	$\displaystyle\stackrel{{\scriptstyle(m)}}{{\leq}}(\Delta_{\mathrm{rQ,on}}+\sqrt{\sigma B_{W}^{2}})\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}$
	$\displaystyle\stackrel{{\scriptstyle(n)}}{{\leq}}(\Delta_{\mathrm{rQ,on}}+B_{X% }B_{W})\sqrt{2dH^{2}\log(1+\frac{K}{d})K},$	(41)

where $(h)$ follows from Assumption 7, $(i)$ from matrix Cauchy-Schwarz inequality, $(j)$ from Assumption 7, and $(k)$ by Assumption 7 with $\|\cdot\|_{1,d^{\pi_{i}}_{h}}\leq\|\cdot\|_{2,d^{\pi_{i}}_{h}}$ :

\displaystyle|\left\langle W^{\mathrm{q}}_{h}(Q^{k},g^{k}),X_{h}(Q^{i})\right\rangle|

\displaystyle=\mathbb{E}_{s,a\sim d^{\pi_{i}}_{h}}[(Q^{k}_{h}(s,a)-(\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1})(s,a))_{+}]\leq\|\mathcal{T}_{g^{k}_{h}}Q^{k}_{h+1}-Q^{k}_{h}\|_{2,d^{\pi_{i}}_{h}}.

Finally, $(l)$ follows by the fact $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ , $(m)$ follows from Proposition 10, and $(n)$ follows from Lemma 6.
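Step $(n)$ rests on an elliptical-potential-type bound on $\sum_{k}\|X_{h}(Q^{k})\|_{\Sigma_{k-1;h}^{-1}}$ for a fixed $h$ . Lemma 6 is not reproduced in this section, so the sketch below only checks the standard form of that bound for a single $h$ , under the assumptions (ours, for illustration) that $\|x_{k}\|_{2}\leq B_{X}=1$ and $\sigma=B_{X}^{2}=1$ . The function names are hypothetical, and the inverse covariance is maintained with the Sherman-Morrison rank-one update.

```python
import math
import random


def potential_sum(xs, sigma):
    """Return sum_k ||x_k||_{Sigma_{k-1}^{-1}}, where
    Sigma_{k-1} = sigma * I + sum_{i<k} x_i x_i^T."""
    d = len(xs[0])
    inv = [[(1.0 / sigma if i == j else 0.0) for j in range(d)] for i in range(d)]
    total = 0.0
    for x in xs:
        u = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # Sigma^{-1} x
        quad = sum(x[i] * u[i] for i in range(d))  # x^T Sigma^{-1} x
        total += math.sqrt(quad)
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x).
        for i in range(d):
            for j in range(d):
                inv[i][j] -= u[i] * u[j] / (1.0 + quad)
    return total


def random_features(num, d, seed=0):
    """Random vectors rescaled so that each has Euclidean norm at most 1."""
    rng = random.Random(seed)
    out = []
    for _ in range(num):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        out.append([c / max(1.0, norm) for c in v])
    return out


d, K = 3, 200
total = potential_sum(random_features(K, d), sigma=1.0)
bound = math.sqrt(2 * d * K * math.log(1 + K / d))
```

Under the stated assumptions `total` does not exceed `bound`, which matches the $\sqrt{2d\log(1+K/d)K}$ factor in (41) for one step $h$ ; summing over $h\in\{0,\cdots,H-1\}$ gives the additional factor $H$ , that is, $H^{2}$ inside the square root.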

Now recall bilinear model from Assumption 7: $\mathbb{E}_{d^{\pi^{f}}_{h}}[(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h+1})_{% +}]=\left\lvert\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle\right\rvert$ . Following analysis above in (41) for the second part of (40) using Assumption 7 and Proposition 9, the following holds with probability at least $1-\delta/2$ :

\displaystyle\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}

\displaystyle\mathbb{E}_{s,a\sim d^{\pi_{k}}_{h}}[(\mathcal{T}_{g^{k}_{h}}Q^{k% }_{h+1}-\mathcal{T}Q^{k}_{h+1})_{+}]\leq(\Delta_{\mathrm{dual,on}}+B_{X}B_{W})% \sqrt{2dH^{2}\log(1+\frac{K}{d})K}.

(42)

Now combining Eqs. 41 and 42 with (40) we have

	$\displaystyle\sum_{k=0}^{K-1}$	$\displaystyle\mathbb{E}_{s_{0}\sim d_{0}}[(Q^{k}(s_{0},\pi_{k}(s_{0}))-Q^{\pi_{k}}(s_{0},\pi_{k}(s_{0})))_{+}]$
		$\displaystyle\leq(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on}}+2B_{X}B_{W% })\sqrt{2dH^{2}\log(1+\frac{K}{d})K},$

with probability at least $1-\delta$ . Finally, we combine this and (38) with (34):

	$\displaystyle 0$	$\displaystyle\leq\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})(\Delta_{\mathrm{dual,off}}+\Delta_{\mathrm{rQ,off}})+$
		$\displaystyle\hskip 85.35826pt(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on% }}+2B_{X}B_{W})\sqrt{2dH^{2}\log(1+\frac{K}{d})K}.$

Let $N=m_{\mathrm{off}}+K\cdot m_{\mathrm{on}}$ . Using offline bounds from Propositions 9 and 10 with $c_{1}=2\lambda+H$ from Proposition 3, we have:

	$\displaystyle 0$	$\displaystyle\leq\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})\cdot$
		$\displaystyle(\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)+\frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right))$
		$\displaystyle+(\Delta_{\mathrm{dual,on}}+\Delta_{\mathrm{rQ,on}}+2B_{X}B_{W})% \sqrt{2dH^{2}\log(1+\frac{K}{d})K}.$

Now using on-policy bounds from Propositions 9 and 10 with $c_{1}=2\lambda+H$ from Proposition 3, we have:

	$\displaystyle 0$	$\displaystyle\leq\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]\leq KHC(\pi^{*})\cdot$
		$\displaystyle(\frac{1}{m_{\mathrm{off}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)+\frac{1}{\sqrt{m_{\mathrm{off}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right))$
		$\displaystyle+(\frac{1}{m_{\mathrm{on}}}\left(3\varepsilon_{\mathcal{G}}N+48(2\lambda+H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)\right)$
		$\displaystyle\hskip 42.67912pt+\frac{1}{\sqrt{m_{\mathrm{on}}}}\left(\sqrt{3\varepsilon_{\mathcal{F},\mathrm{r}}N}+8(1+2\lambda+2H)\sqrt{\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)}\right)+2B_{X}B_{W})\cdot\sqrt{2dH^{2}\log(1+\frac{K}{d})K}.$

Finally, setting $m_{\mathrm{on}}=1$ and $m_{\mathrm{off}}=K$ and keeping only the higher-order terms, we get

	$\displaystyle 0\leq$	$\displaystyle\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0% })-V^{\pi_{k}}_{0}(s_{0})]$
		$\displaystyle\leq\sqrt{K}HC(\pi^{*})(6(\varepsilon_{\mathcal{G}}+\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}})K^{2}+(8+112\lambda+64H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta))$
		$\displaystyle+(6(\varepsilon_{\mathcal{G}}+\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}})K^{2}+(8+112\lambda+64H)\log(2HK|\mathcal{G}||\mathcal{F}|/\delta)+2B_{X}B_{W})\cdot\sqrt{2dH^{2}\log(1+\frac{K}{d})K}$
		$\displaystyle\leq\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{5/2}H)+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}).$

The proof is now complete. ∎

E.2 HyTQ Algorithm Specialized Results

In this section we specialize our main result Theorem 2 for different bilinear model classes and also provide an equivalent sample complexity guarantee in the offline robust RL setting.

Before we move ahead, we showcase an important property of our robust transfer coefficient $C(\pi)$ for any fixed policy. Fixing a nominal model $P^{o}$ , the transfer coefficient considers the distribution shift w.r.t the data-generating distribution along the general function class which the algorithm uses. It is in fact smaller than the existing density ratio based concentrability assumption (Assumption 9). We state this result in the following lemma.

Lemma 8.

For any policy $\pi$ and offline distribution $\mu$ , we have $C(\pi)\leq\sup_{h,s,a}{d^{\pi}_{h}(s,a)}/{\mu_{h}(s,a)}.$

Proof.

By definition in Assumption 4, we get that

	$\displaystyle C(\pi)$	$\displaystyle=\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi}_{h}}[(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a))_{+}]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}$
		$\displaystyle\leq\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\max_{f\in\mathcal{F},h\in[H]}\frac{\mathbb{E}_{s,a\sim d^{\pi}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}\leq\sup_{h,s,a}\frac{d^{\pi}_{h}(s,a)}{\mu_{h}(s,a)},$

where $(a)$ follows from the mediant inequality, i.e., $\sum_{h}a_{h}/\sum_{h}b_{h}\leq\max_{h}a_{h}/b_{h}$ for $a_{h}\geq 0$ and $b_{h}>0$ . ∎
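The two inequalities in the proof of Lemma 8 can be illustrated on a small finite example. The sketch below uses made-up distributions and Bellman-error magnitudes for a single step $h$ ; the function names are hypothetical, and positive denominators are assumed throughout.

```python
def ratio_of_sums(nums, dens):
    """(sum_h a_h) / (sum_h b_h)."""
    return sum(nums) / sum(dens)


def max_ratio(nums, dens):
    """max_h a_h / b_h, assuming every b_h > 0."""
    return max(n / d for n, d in zip(nums, dens))


def expected_abs_error(dist, abs_err):
    """E_{(s,a) ~ dist}[|error(s,a)|] over a finite state-action set."""
    return sum(p * e for p, e in zip(dist, abs_err))


# Assumed toy data over three state-action pairs at a single step h.
d_pi = [0.5, 0.3, 0.2]      # occupancy of the comparator policy
mu = [0.25, 0.25, 0.5]      # offline data distribution
abs_err = [1.0, 0.2, 0.4]   # |T f_{h+1} - f_h| at each pair

transfer_ratio = expected_abs_error(d_pi, abs_err) / expected_abs_error(mu, abs_err)
density_ratio = max_ratio(d_pi, mu)
```

Here `transfer_ratio` is about 1.28 while `density_ratio` is 2, so the error-weighted ratio is strictly smaller than the worst-case density ratio in this example, consistent with the lemma.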

Remark 5.

The concentrability assumption (Assumption 9) is in fact the same as the non-robust RL concentrability assumption (Munos and Szepesvári,, 2008; Chen and Jiang,, 2019). We make two important points here. Firstly, our transfer coefficient is larger than the transfer coefficient of (Song et al.,, 2023, Definition 1), which follows from the fact $\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}$ . Secondly, our transfer coefficient is not directly comparable with the $\ell_{2}$ -norm version of the transfer coefficient (Xie et al.,, 2021, Definition 1). It is an interesting open question for future research to investigate minimax lower bound guarantees w.r.t. different transfer coefficients for both non-robust and robust RL problems.

We now define a bilinear model called Low Occupancy Complexity (Du et al.,, 2021, Definition 4.7). The nominal model $P^{o}$ and realizable function class $\mathcal{F}$ have low occupancy complexity if, for each $h\in[H]$ , there exist a (possibly unknown to the learner) feature map $\psi=(\psi_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{Y})$ , where $\mathcal{Y}$ is a Hilbert space, and a (possibly unknown to the learner) map $\nu_{h}:\mathcal{F}\mapsto\mathcal{Y}$ such that for all $f\in\mathcal{F}$ , with greedy policy $\pi^{f}$ w.r.t. $f$ , and all $(s,a)$ we have

\displaystyle d^{h,\pi^{f}}_{P^{o}}(s,a)=\langle\psi_{h}(s,a),\nu_{h}(f)\rangle.

(43)

We make the following assumption on the offline data-generating distribution (or policy by slight notational override for convenience).

Assumption 11.

Consider the Low Occupancy Complexity model (bilinear model) on $\mathcal{Y}=\mathbb{R}^{d}$ . Let the offline data distribution $\mu=\{\mu_{h}\}_{h\in[H]}$ satisfy a low rank structure, i.e. $\mu_{h}(s,a)=\langle\psi_{h}(s,a),\nu_{h}(f^{\mathrm{off}})\rangle=\sum_{i\in[% d]}\psi_{h,i}(s,a)\nu_{h,i}(f^{\mathrm{off}})$ , for some $f^{\mathrm{off}}\in\mathcal{F}$ .

We now specialize our main result, Theorem 2, to the Low Occupancy Complexity (43) bilinear model.

Corollary 3 (Cumulative Suboptimality of Theorem 2 in Low Occupancy Complexity (43) bilinear model).

Consider the Low Occupancy Complexity (43) bilinear model. Let Assumptions 4, 5, 6 and 8 hold and fix any $\delta\in(0,1)$ . Then, HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$ satisfy

	$\displaystyle\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq$	$\displaystyle\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+% \varepsilon_{\mathcal{G}})K^{5/2}H)$
		$\displaystyle+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))})$
		$\displaystyle+\widetilde{\mathcal{O}}(\sqrt{dKH^{4}}\max_{f\in\mathcal{F}}\|\nu_{h}(f)\|_{2}\|\sum_{s,a}\psi_{h}(s,a)\|_{2}\sqrt{\log(1+(K/d))})$

with probability at least $1-\delta$ . Now, consider the offline data distribution as in Assumption 11 with perfect robust Bellman completeness, i.e. $\varepsilon_{\mathcal{F},\mathrm{r}}=0=\varepsilon_{\mathcal{G}}$ . We have $C(\pi^{*})\leq\sup_{h,i\in[d]}({\nu_{h,i}^{*}}/{\nu_{h,i}(f^{\mathrm{off}})}).$

Proof.

Using the Low Occupancy Complexity (43) bilinear model, we have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h% +1})_{+}]=\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle$ , where

X_{h}(f)=\nu_{h}(f),\qquad W^{\mathrm{d}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}% \times\mathcal{A}}\psi_{h}(s,a)((\mathcal{T}_{g_{h}}f_{h+1})(s,a)-(\mathcal{T}% f_{h+1})(s,a))_{+}.

We also have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}]={% \left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle}$ , where

\qquad W^{\mathrm{q}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi% _{h}(s,a)(f_{h}(s,a)-(\mathcal{T}_{g_{h}}f_{h+1})(s,a))_{+}.

Furthermore, we set $B_{X}=\max_{f\in\mathcal{F}}\|\nu_{h}(f)\|_{2}$ . Since $\mathcal{F}$ is realizable and $\mathcal{T}_{g}$ is complete, we set $B_{W}=H\|\sum_{s,a}\psi_{h}(s,a)\|_{2}$ . Then the result directly follows by Theorem 2.

For the second statement, first note that the occupancy $d^{\pi^{*}}_{h}$ is low-rank as well since we assume perfect Bellman completeness. Following the proof of Lemma 8 we get

	$\displaystyle C(\pi^{*})$	$\displaystyle=\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[(\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a))_{+}]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}$
		$\displaystyle\leq\max_{f\in\mathcal{F}}\frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\max_{f\in\mathcal{F},h\in[H]}\frac{\mathbb{E}_{s,a\sim d^{\pi^{*}}_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}{\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]}$
		$\displaystyle\leq\sup_{h,s,a}\frac{d^{\pi^{*}}_{h}(s,a)}{\mu_{h}(s,a)}\stackrel{{\scriptstyle(b)}}{{\leq}}\sup_{h,i\in[d]}\frac{\nu_{h,i}^{*}}{\nu_{h,i}(f^{\mathrm{off}})},$

where $(a)$ and $(b)$ follow from the mediant inequality. This completes the proof. ∎

We now define a bilinear model called Low-rank Feature Selection Model (Du et al.,, 2021, Definition A.1). The nominal model $P^{o}$ is a low-rank feature selection model if it satisfies $P^{o}_{h,s,a}(s^{\prime})=\langle\theta_{h}(s,a),\psi_{h}(s^{\prime})\rangle$ , for each $h\in[H]$ and all $(s,a,s^{\prime})$ , with a (possibly unknown to the learner) map $\theta=(\theta_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{Y})$ and a (possibly unknown to the learner) map $\psi_{h}:\mathcal{S}\mapsto\mathcal{Y}$ , where $\mathcal{Y}$ is a Hilbert space.

This model specializes to the kernel MDP model when the map $\theta$ is known to the learner (Jin et al., 2021a, Definition 30). This model also specializes to the low-rank MDP model when $\mathcal{Y}=\mathbb{R}^{d}$ (Huang et al.,, 2023, Assumption 1) and furthermore to the linear MDP model when the map $\theta$ is also known to the learner (Du et al.,, 2021, Definition A.4).

We make the following assumption on the offline data-generating distribution (or policy by slight notational override for convenience).

Assumption 12.

Consider the Low-rank MDP Model (bilinear model). Let the offline data distribution $\mu=\{\mu_{h}\}_{h\in[H]}$ satisfy $\max_{h,s,a}{\pi^{*}_{h}(a|s)}/{\mu_{h}(a|s)}\leq\alpha$ and suppose that $\mu$ is induced by the nominal model, i.e. $\mu_{0}(s)=d_{0}(s)$ (starting state distribution) and $\mu_{h}(s)=\mathbb{E}_{s^{\prime},a^{\prime}\sim\mu_{h-1}}P^{o}_{h-1,s^{\prime% },a^{\prime}}(s)$ for any $h\geq 1$ . Furthermore, suppose that $\mu$ satisfies that the feature covariance matrix $\Sigma_{\mu_{h-1},\theta}=\mathbb{E}_{s,a\sim\mu_{h-1}}[\theta_{h}(s,a)\theta_% {h}(s,a)^{\top}]$ is invertible for all $h\in[H]$ and $\mathbb{E}_{s,a\sim\mu_{h}}[|\mathcal{T}f_{h+1}(s,a)-f_{h}(s,a)|]\geq 1$ for at least one $h\in[H]$ and all $f\in\mathcal{F}$ .

We now specialize our main result, Theorem 2, to the Low-rank Feature Selection Model (bilinear model).

Corollary 4 (Cumulative Suboptimality of Theorem 2 in Low-rank Feature Selection Model (bilinear model)).

Consider the Low-rank Feature Selection Model (bilinear model). Let Assumptions 4, 5, 6 and 8 hold and fix any $\delta\in(0,1)$ . Then, HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$ satisfy

	$\displaystyle\sum_{k=0}^{K-1}(V^{\pi^{*}}-V^{\pi_{k}})\leq$	$\displaystyle\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+% \varepsilon_{\mathcal{G}})K^{5/2}H)$
		$\displaystyle+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dKH^{2}}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))})$
		$\displaystyle+\widetilde{\mathcal{O}}(\sqrt{dKH^{4}}\|\sum_{s,a}\theta_{h}(s,a)\|_{2}\|\sum_{s}\psi_{h}(s)\|_{2}\sqrt{\log(1+(K/d))})$

with probability at least $1-\delta$ . Now, consider the offline data distribution as in Assumption 12 with a low-rank MDP model. We have

\displaystyle C(\pi^{*})

\displaystyle\leq\sqrt{2\alpha H}\sum_{h=1}^{H}\mathbb{E}_{s,a\sim d^{h-1,\pi^% {*}}_{P^{o}}}\left\|\theta_{h}(s,a)\right\|_{\Sigma_{\mu_{h-1},\theta}^{-1}}+% \sqrt{\alpha}.

Proof.

We first begin with establishing a Q-value-dependent linearity property for the state-action-visitation measure $d^{h,\pi^{f}}_{P^{o}}(s,a)$ . To do this, we adapt the proof of Huang et al., (2023, Lemma 17) here. We start by writing the state-visitation measure by recalling Eq. 35 here:

	$\displaystyle d^{h,\pi^{f}}_{P^{o}}(s_{h})$	$\displaystyle=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}P^{o}_{h,s,a}(s_{h})\pi^{f}_{h-1}(a|s)d^{h-1,\pi^{f}}_{P^{o}}(s)$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\langle\theta_{h}(s,a),\psi_{h}(s_{h})\rangle\pi^{f}_{h-1}(a|s)d^{h-1,\pi^{f}}_{P^{o}}(s)$
		$\displaystyle=\langle\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\theta_{h}(s,a)\pi^{f}_{h-1}(a|s)d^{h-1,\pi^{f}}_{P^{o}}(s),\psi_{h}(s_{h})\rangle=\langle\psi_{h}(s_{h}),\nu_{h,\pi^{f}}(f)\rangle,$

where $(a)$ follows by the low-rank feature selection model definition, and the last equality follows by taking a functional $\nu_{h,\pi^{f}}(f)=\sum_{s,a}\theta_{h}(s,a)\pi^{f}_{h-1}(a|s)d^{h-1,\pi^{f}}_% {P^{o}}(s)$ . Since we consider the finite action space with possibly large state space setting for our results, the state-action visitation measure for the deterministic non-stationary policy $\pi^{f}$ is now given by $d^{h,\pi^{f}}_{P^{o}}(s_{h},a_{h})=\langle\psi^{\prime}_{h,\pi^{f}}(s_{h},a_{h% }),\nu_{h,\pi^{f}}(f)\rangle$ with $\psi^{\prime}_{h,\pi^{f}}(s_{h},a_{h})=C\psi_{h}(s_{h})1\{a_{h}=\pi^{f}_{h}(s)\}$ for features $\psi^{\prime}_{h,\pi^{f}}:\mathcal{S}\times\mathcal{A}\to\mathcal{Y}$ . Here $C>0$ is a normalizing constant such that the state-action visitation measure is a probability measure.
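The linearity of the state-visitation measure derived above can be checked on a toy low-rank model. The sketch below assumes two states, two actions, and a two-dimensional factorization in which $\theta_{h}(s,a)$ is a distribution over latent factors and $\psi_{h}(\cdot)$ stacks the next-state distributions of those factors; all numbers and function names are illustrative assumptions.

```python
def transition(theta, psi, s, a, s_next):
    """P^o(s_next | s, a) = <theta(s, a), psi(s_next)>."""
    return sum(t * p for t, p in zip(theta[(s, a)], psi[s_next]))


def visitation_direct(theta, psi, policy, d_prev, s_next):
    """d^h(s_next) = sum_{s,a} P^o(s_next | s, a) pi(a | s) d^{h-1}(s)."""
    return sum(transition(theta, psi, s, a, s_next) * prob * d_prev[s]
               for (s, a), prob in policy.items())


def nu(theta, policy, d_prev):
    """nu = sum_{s,a} theta(s, a) pi(a | s) d^{h-1}(s)."""
    dim = len(next(iter(theta.values())))
    out = [0.0] * dim
    for (s, a), prob in policy.items():
        for i in range(dim):
            out[i] += theta[(s, a)][i] * prob * d_prev[s]
    return out


def visitation_low_rank(psi, nu_vec, s_next):
    """d^h(s_next) = <psi(s_next), nu>."""
    return sum(p * n for p, n in zip(psi[s_next], nu_vec))


# Assumed toy model: theta(s, a) in R^2 and psi(s') in R^2.
THETA = {(0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5], (1, 0): [0.2, 0.8], (1, 1): [0.0, 1.0]}
PSI = {0: [0.9, 0.1], 1: [0.1, 0.9]}
# Deterministic policy written as pi(a | s): action 1 in state 0, action 0 in state 1.
POLICY = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
D_PREV = [0.6, 0.4]

NU = nu(THETA, POLICY, D_PREV)
direct = [visitation_direct(THETA, PSI, POLICY, D_PREV, s) for s in (0, 1)]
low_rank = [visitation_low_rank(PSI, NU, s) for s in (0, 1)]
```

Both `direct` and `low_rank` evaluate to approximately `[0.404, 0.596]`, so the one-step recursion and the inner-product form $\langle\psi_{h}(s_{h}),\nu_{h,\pi^{f}}(f)\rangle$ agree on this example.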

We now have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h+1})_{+}]=\left\langle X_{h}(f),W^{\mathrm{d}}_{h}(f,g)\right\rangle$, where

X_{h}(f)=\nu_{h,\pi^{f}}(f),\qquad W^{\mathrm{d}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi^{\prime}_{h,\pi^{f}}(s,a)((\mathcal{T}_{g_{h}}f_{h+1})(s,a)-(\mathcal{T}f_{h+1})(s,a))_{+}.

We also have $\mathbb{E}_{d^{h,\pi^{f}}_{P^{o}}}[(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}]=\left\langle X_{h}(f),W^{\mathrm{q}}_{h}(f,g)\right\rangle$, where

W^{\mathrm{q}}_{h}(f,g)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\psi^{\prime}_{h,\pi^{f}}(s,a)(f_{h}(s,a)-(\mathcal{T}_{g_{h}}f_{h+1})(s,a))_{+}.
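Both identities are instances of the same linearity step, spelled out here for a generic nonnegative residual $r(s,a)$. The symbol $r$ is a placeholder introduced only for this remark; the step uses nothing beyond the factorization $d^{h,\pi^{f}}_{P^{o}}(s,a)=\langle\psi^{\prime}_{h,\pi^{f}}(s,a),\nu_{h,\pi^{f}}(f)\rangle$ stated above.

```latex
\begin{align*}
\mathbb{E}_{(s,a)\sim d^{h,\pi^{f}}_{P^{o}}}[r(s,a)]
 &= \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} d^{h,\pi^{f}}_{P^{o}}(s,a)\, r(s,a) \\
 &= \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}
    \langle \psi^{\prime}_{h,\pi^{f}}(s,a), \nu_{h,\pi^{f}}(f) \rangle\, r(s,a) \\
 &= \left\langle \nu_{h,\pi^{f}}(f),
    \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \psi^{\prime}_{h,\pi^{f}}(s,a)\, r(s,a) \right\rangle .
\end{align*}
```

Taking $r=(\mathcal{T}_{g_{h}}f_{h+1}-\mathcal{T}f_{h+1})_{+}$ gives $W^{\mathrm{d}}_{h}(f,g)$, and taking $r=(f_{h}-\mathcal{T}_{g_{h}}f_{h+1})_{+}$ gives $W^{\mathrm{q}}_{h}(f,g)$.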

Furthermore, we set

\max_{f\in\mathcal{F}}\|\nu_{h,\pi^{f}}(f)\|_{2}=\max_{f\in\mathcal{F}}\left\|\sum_{s,a}\theta_{h}(s,a)\pi^{f}_{h-1}(a|s)d^{h-1,\pi^{f}}_{P^{o}}(s)\right\|_{2}\leq\left\|\sum_{s,a}\theta_{h}(s,a)\right\|_{2}=B_{X}.

Since $\mathcal{F}$ is realizable and $\mathcal{T}_{g}$ is complete for all $g\in\mathcal{G}$ , we set

H\left\|\sum_{s,a}\psi^{\prime}_{h,\pi^{f}}(s,a)\right\|_{2}=HC\left\|\sum_{s,a}\psi_{h}(s)1\{a=\pi^{f}_{h}(s)\}\right\|_{2}\leq HC\left\|\sum_{s}\psi_{h}(s)\right\|_{2}=B_{W}.

Then the first result follows directly from Theorem 2. The last statement, the bound on our transfer coefficient $C(\pi^{*})$, follows the proof of Song et al. (2023, Lemma 13), using the facts that $(x-y)^{2}\leq|x-y||x+y|$ for $x,y\geq 0$ and $\|f_{h}\|_{\infty}\leq H$ for all $h\in[H]$. This completes the proof. ∎

We now extend our main result, Theorem 2, to state a sample complexity bound that allows comparison with the offline+online RL setting.

Corollary 5 (Offline+Online RL Sample Complexity of the HyTQ algorithm).

Let Assumptions 4, 5, 6, 7 and 8 hold. Fix any $\delta\in(0,1)$ and any $\varepsilon>0$, and let $N_{\mathrm{tot}}$ be the total number of sample tuples used by the HyTQ algorithm. Then the uniform policy $\widehat{\pi}$ (the uniform convex combination of the HyTQ algorithm policies $\{\pi_{k}\}_{k\in[K]}$) satisfies, with probability at least $1-\delta$,

V^{\pi^{*}}-V^{\widehat{\pi}}\leq\varepsilon,\quad\text{ if }N\geq N_{\mathrm{tot}}=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^{*}))^{2},1\}dH^{3}(\lambda+H)^{2}}{\varepsilon^{2}}\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)\right).

Proof.

The proof follows from Theorem 2 via a standard online-to-batch conversion (Shalev-Shwartz and Ben-David, 2014, Theorem 14.8 & Chapter 21). Define the policy $\widehat{\pi}=\text{Uniform}\{\pi_{0},\dots,\pi_{K-1}\}$. From Theorem 2, we get

	$\displaystyle 0\leq\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\widehat{\pi}}_{0}(s_{0})]=\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\pi_{k}}_{0}(s_{0})]$
	$\displaystyle\leq\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{3/2}H)+\widetilde{\mathcal{O}}(\max\{C(\pi^{*}),1\}\sqrt{dH^{2}/K}(\lambda+H)\log(HK|\mathcal{F}||\mathcal{G}|/\delta)\sqrt{\log(1+(K/d))}).$

Recall that our algorithm uses $m_{\mathrm{off}}H$ offline samples and $m_{\mathrm{on}}HK$ on-policy samples in the datasets $\{\mathcal{D}^{\mu}_{h},\mathcal{D}^{0}_{h},\cdots,\mathcal{D}^{K-1}_{h}\}$ for all $h\in[H]$. Since we set $m_{\mathrm{on}}=1$ and $m_{\mathrm{off}}=K$, the total number of offline and on-policy samples is $KH+HK=2HK$.

Fix any $\varepsilon>0$. For the approximation errors $\varepsilon_{\mathcal{F},\mathrm{r}},\varepsilon_{\mathcal{G}}$, we first assume that there exists $K_{1}=\widetilde{\mathcal{O}}(H^{4})$ such that $\mathcal{O}((\sqrt{\varepsilon_{\mathcal{F},\mathrm{r}}}+\varepsilon_{\mathcal{G}})K^{3/2}H)\leq\varepsilon/2$ for all $K\geq K_{1}$. Let

K_{2}=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^{*}))^{2},1\}dH^{2}(\lambda+H)^{2}}{\varepsilon^{2}}\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)\right).

Then, for $K\geq K_{1}+K_{2}$, we have $\mathbb{E}_{s_{0}\sim d_{0}}[V^{\pi^{*}}_{0}(s_{0})-V^{\widehat{\pi}}_{0}(s_{0})]\leq\varepsilon$ with probability at least $1-\delta$. Thus, the total number of samples is at least $N_{\mathrm{tot}}$:

N_{\mathrm{tot}}=2H(K_{1}+K_{2})=\widetilde{\mathcal{O}}\left(\frac{\max\{(C(\pi^{*}))^{2},1\}dH^{3}(\lambda+H)^{2}}{\varepsilon^{2}}\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)\right).

This completes the proof. ∎
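To see how the budget in Corollary 5 scales, the following sketch evaluates $K_{2}$ and $N_{\mathrm{tot}}=2H(K_{1}+K_{2})$ numerically. It is an illustration only, not part of the HyTQ algorithm: the function name `hytq_sample_budget`, the leading constant `c`, the input values, and the treatment of the logarithmic factor as exactly $\log^{2}(H|\mathcal{F}||\mathcal{G}|/\delta)$ are assumptions, since $\widetilde{\mathcal{O}}(\cdot)$ hides constants and polylogarithmic terms.

```python
import math


def hytq_sample_budget(transfer_coeff, d, H, lam, eps, delta,
                       log_class_size, K1=0, c=1.0):
    """Illustrative evaluation of K_2 and N_tot = 2H(K_1 + K_2).

    `c` stands in for the constant hidden by the O-tilde notation and
    `log_class_size` is log(|F||G|); both are assumptions of this sketch.
    """
    # log(H |F| |G| / delta), assembled from logs to avoid overflow.
    log_term = math.log(H) + log_class_size + math.log(1.0 / delta)
    K2 = (c * max(transfer_coeff ** 2, 1.0) * d * H ** 2 * (lam + H) ** 2
          * log_term ** 2 / eps ** 2)
    K2 = math.ceil(K2)
    # m_off * H = K * H offline tuples plus m_on * H * K = H * K on-policy tuples.
    n_tot = 2 * H * (K1 + K2)
    return K2, n_tot


# Hypothetical inputs, chosen only to exercise the formula.
common = dict(transfer_coeff=2.0, d=10, H=5, lam=1.0, delta=0.05,
              log_class_size=math.log(1e6))
K2, n_tot = hytq_sample_budget(eps=0.1, **common)
K2_half, n_tot_half = hytq_sample_budget(eps=0.05, **common)
```

Halving $\varepsilon$ roughly quadruples both `K2` and `n_tot`, reflecting the $1/\varepsilon^{2}$ dependence, while the extra factor of $H$ in $N_{\mathrm{tot}}$ relative to $K_{2}$ comes from collecting one tuple per step $h\in[H]$.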

