Model-Free Robust $\phi$-Divergence Reinforcement Learning
Using Both Offline and Online Data
Kishan Panaganti, Adam Wierman, Eric Mazumdar
Computing + Mathematical Sciences Department, California Institute of Technology
Emails:{kpb, adamw, mazumdar}@caltech.edu
Abstract
The robust $\phi$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings.
This work makes two important contributions.
First, we propose a model-free algorithm called Robust $\phi$-regularized fitted Q-iteration (RPQ) for learning an $\epsilon$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with a robust exploratory requirement) on the nominal model.
To the best of our knowledge, we provide the first unified analysis for a class of $\phi$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation.
Second, we introduce the hybrid robust $\phi$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q).
To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $\phi$-regularized reinforcement learning framework.
Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrarily large state spaces.
To appear in the proceedings of the International Conference on Machine Learning (ICML) 2024.
Keywords: Robust reinforcement learning, model uncertainty, general function approximation
Online Reinforcement Learning (RL) agents learn through online interactions and exploration in environments and have been shown to perform well in structured domains such as Chess and Go (Silver et al., 2018), fast chip placements in semiconductors (Mirhoseini et al., 2021), fast transform computations in mathematics (Fawzi et al., 2022), and more. However, online RL agents (Botvinick et al., 2019) are known to suffer from sample inefficiency due to complex exploration strategies in sophisticated environments.
To overcome this, learning from available historical data has been studied using offline RL protocols (Levine et al., 2020). However, offline RL agents suffer from the out-of-data-distribution issue (Yang et al., 2021; Robey et al., 2020) due to the lack of online exploration.
Recent work (Song et al., 2023) proposes another learning setting called hybrid RL that makes the best of both offline and online RL worlds. In particular, hybrid RL agents have access to both offline data (to reduce exploration overhead) and online interaction with the environment (to mitigate the out-of-data-distribution issue).
All three of these approaches (online, offline, and hybrid RL) require training environments (simulators) that closely represent real-world environments. However, time-varying real-world environments (Maraun, 2016), sensor degradations (Chen et al., 1996), and other adversarial disturbances in practice (Pioch et al., 2009) mean that even high-fidelity simulators are not enough (Schmidt et al., 2015; Shah et al., 2018).
RL agents are known to fail due to these mismatches between training and testing environments (Sünderhauf et al., 2018; Lesort et al., 2020). As a result, robust RL (Mankowitz et al., 2020; Panaganti and Kalathil, 2021a) has received increasing attention for its potential to alleviate mismatches between the simulator and real-world environments.
Robust RL agents are built using the robust Markov Decision Process (RMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005) framework. In this framework, the goal is to find an optimal policy that is robust, i.e., performs uniformly well across a set of models (transition probability functions). This is formulated via a max-min problem, and the set of models is typically constructed around a simulator model (transition probability function) with some notion of divergence or distance function. We refer to the simulator model as any nominal model that is provided to RL agents.
The RMDP framework in RL is identical to the Distributionally Robust Optimization (DRO) framework in supervised learning (Duchi and Namkoong, 2018; Chen et al., 2020).
Similar to RMDP, DRO is a min-max problem aiming to minimize a loss function uniformly over the set of distributions constructed around the training distribution of the input space.
However, developing model-free algorithms for DRO problems with general $\phi$-divergences (see Eq. (1)) is known to be hard (Namkoong and Duchi, 2016) due to their inherent non-linear and multi-level optimization structure.
Additionally, developing model-free robust RL agents is also challenging (Iyengar, 2005; Duchi and Namkoong, 2018) for high-dimensional sequential decision-making systems under general function approximation.
To overcome this issue, in this work, we develop robust RL agents for the RRMDP framework, which is an equivalent alternative form of RMDP. A natural $\phi$-divergence regularization of the RMDP problem gives rise to the RRMDP framework, introduced under different names by Yang et al. (2023) and Zhang et al. (2023). It is built upon the penalized DRO problem (Levy et al., 2020; Jin et al., 2021b), that is, the $\phi$-divergence regularization version of the DRO problem.
In particular, we focus on developing an offline robust RL algorithm for a class of $\phi$-divergences under the RRMDP framework with arbitrarily large state spaces, using only offline data with general function approximation. Towards this, as the first main contribution, we propose the Robust $\phi$-regularized fitted Q-iteration (RPQ) model-free algorithm and provide its performance guarantee for a class of $\phi$-divergences with a unified analysis. We refer to algorithms as model-free if they do not explicitly estimate the underlying nominal model.
We address the following important (suboptimality and sample complexity) questions: What is the rate of the suboptimality gap achieved between the optimal robust value and the value of the RPQ policy? How many offline data samples from the nominal model are required to learn an $\epsilon$-optimal robust policy?
We discuss challenges and present these results in Section 2.
Table 1: Comparison of model-free $\phi$-divergence robust RL algorithms.
In the algorithm-type column, Fitted Q-Iteration (FQI) uses least-squares regression and Q-Learning (QL) uses stochastic approximation updates.
In the data coverage column, uniform-policy stipulates a data-generating policy that covers the entire state-action space, all-policy is where the data-generating policy should cover the state-action space covered by all non-stationary policies, and single-policy is where it covers the state-action space covered by the optimal robust policy, on the nominal model. ∗ denotes that the coverage should include all the models in robust sets designed by the divergences in the robust column.
The dataset type column mentions the type of dataset collected with a data-generating policy for training corresponding algorithms where offline indicates i.i.d. historical dataset on the nominal model, offline Markov indicates Markovian dataset induced on the nominal model, and online non-Markov indicates a history dependent dataset as a collection of Markovian datasets induced on the nominal model by a set of learned policies.
Finally, the suboptimality column is the statistical upper bound for the difference between the optimal robust value and the robust value achieved by the algorithm.
Here is either or effective horizon factors.
is the robustness radius parameter in RMDPs and is the robustness penalization parameter in RRMDPs, which are inversely related (Yang et al.,, 2023, Theorem 3.1). is some function on that varies according to different -divergences.
is the dataset size used by algorithms. † The bound of HyTQ is not directly comparable with others in terms of since the non-stationary finite-horizon setting requires multiplicity in dataset size.
is the minimal positive value of data generating stationary distribution , i.e. .
and are two function representations, and is the state-action space.
In this work, we also develop and study a novel hybrid robust RL algorithm under the RRMDP framework using both offline data and online interactions with the nominal model.
This is our second main contribution, motivated by the fact that hybrid RL overcomes the out-of-data-distribution issue in offline RL.
Towards this, we propose the Hybrid robust Total-variation-regularized Q-iteration algorithm and provide its performance guarantee under improved assumptions.
Notably, the offline data-generating distribution must only cover the distribution that the optimal robust policy samples out on the nominal model, whereas before we needed it to cover any distribution uniformly. This is how online interactions help mitigate the out-of-data-distribution issue of offline RL and offline robust RL.
We now address the cumulative suboptimality question in addition to sample complexity: What is the rate of cumulative suboptimality gap achieved between the optimal robust value and the value of HyTQ iteration policies?
We discuss challenges and present these results in Section 3.
Related Work. Among all the previous works that provide model-free methods, here we only mention the ones closest to ours. We discuss more related works in Appendix A.
Panaganti et al. (2022) proposed a Q-iteration offline robust RL algorithm in the RMDP framework only for the total variation $\phi$-divergence.
Bruns-Smith and Zhou (2023) proposed a Q-iteration offline robust RL algorithm in the RMDP framework to solve causal inference under unobserved confounders.
Zhou et al. (2023) proposed an actor-critic robust RL algorithm in RMDP for integral probability metrics.
Zhang et al. (2023) proposed a Q-iteration offline robust RL algorithm in the RRMDP framework only for the Kullback-Leibler $\phi$-divergence.
Blanchet et al. (2023) proposed specialized robust RL algorithms for the total variation and Kullback-Leibler $\phi$-divergences, offering unified analyses for linear, kernel, and factored function approximation models under the finite state-action setting.
Another line of work (Liu et al., 2022; Liang et al., 2023; Wang et al., 2023a; Wang et al., 2023b; Yang et al., 2023) provides model-free robust RL algorithms based on classical Q-learning methods in finite state-action spaces.
We provide more insightful comparisons in Table 1. To the best of our knowledge, this is the first work that addresses a wide class of robust RL problems (like the general $\phi$-divergence) with arbitrarily large state spaces using general function approximation under mild assumptions (like the robust Bellman error transfer coefficient).
Notation.
We use the equality sign (=) for pointwise equality in vectors and matrices. For any , let .
For any vector $x$ and positive semidefinite matrix $A$, the squared matrix norm is $\|x\|_A^2 = x^\top A x$.
The set of probability distributions over , with cardinality , is denoted as , and its power set sigma algebra as .
For any function that takes as input, define the expectation w.r.t. the dataset (or empirical expectation) as .
For any positive integer $M$, the set $[M]$ denotes $\{1, 2, \dots, M\}$.
Define and norms as and .
$P \ll Q$ denotes that a probability distribution $P$ is absolutely continuous w.r.t. a probability distribution $Q$.
We use to ignore universal constants less than and to ignore universal constants less than and the polylog terms depending on problem parameters.
We start with preliminaries and the problem formulation.
Infinite-Horizon Markov Decision Process:
An infinite-horizon discounted Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, d_0)$ where $\mathcal{S}$ is a countably large state-space, $\mathcal{A}$ is a finite set of actions, $r$ is a known stochastic reward function, $P$ is a probability transition function describing an environment, $\gamma \in (0, 1)$ is a discount factor, and $d_0$ is the starting state distribution.
A stationary (stochastic) policy specifies a distribution over actions in each state. We denote the transition dynamic distribution at state-action as .
For convenience, we write and assume it is deterministic as in RL literature (Agarwal et al.,, 2019) since the performance guarantee will be identical up to a constant factor.
The value function of a policy is starting at state and for all . Similarly, we define an action-value function of a policy as
Each policy induces a discounted occupancy density over state-action pairs defined as , where denotes the visitation probability of state-action pair at time step , starting at and following on the model . The optimal policy achieves the maximum value of any policy .
Offline Reinforcement Learning:
The goal of offline RL on MDP is to learn a good policy (a policy with a high ) based only on the offline dataset.
An offline dataset is a historical and fixed dataset of interactions , where and the pairs are independently and identically generated according to a data distribution . For convenience, also denotes the offline/behavior policy that generates .
One classical offline RL algorithm with general function approximation capabilities with provable performance guarantees is Fitted Q-Iteration (FQI) (Szepesvári and Munos,, 2005; Chen and Jiang,, 2019; Liu et al.,, 2020). A function class (e.g., neural networks, kernel functions, linear functions, etc) represents -value functions of MDP . At each iteration, given and , FQI does the following least-square regression for the approximate squared Bellman error: , where .
In this regression step, FQI aims to find the optimal action-value by approximating the non-robust squared Bellman error () using offline data with function approximation .
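The regression step can be sketched in the tabular case, where least-squares regression over a one-hot function class reduces to averaging the regression targets per state-action pair. This is a minimal sketch, not the cited algorithm in full generality: the function name `fitted_q_iteration`, the sample layout `(s, a, r, s_next)`, and the zero value assigned to unvisited pairs are assumptions made for illustration.

```python
import numpy as np

def fitted_q_iteration(data, n_states, n_actions, gamma=0.9, n_iters=50):
    """Tabular FQI sketch: data is a list of (s, a, r, s_next) samples."""
    s, a, r, s_next = (np.asarray(x) for x in zip(*data))
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Regression targets built from the current iterate: r + gamma * max_a' q(s', a').
        targets = r + gamma * q[s_next].max(axis=1)
        sums = np.zeros_like(q)
        counts = np.zeros_like(q)
        np.add.at(sums, (s, a), targets)
        np.add.at(counts, (s, a), 1.0)
        # With one-hot features, the least-squares solution is the per-cell mean target.
        # Unvisited (s, a) pairs keep the value 0 (an assumption of this sketch).
        q = np.divide(sums, counts, out=np.zeros_like(q), where=counts > 0)
    return q

# Toy deterministic dataset: action 0 in state 0 pays reward 1 and stays in state 0.
toy_data = [(0, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 0.0, 1)]
q_hat = fitted_q_iteration(toy_data, n_states=2, n_actions=2, gamma=0.5, n_iters=60)
```

With a richer function class (linear, kernel, neural), only the line that solves the regression changes; the target construction stays the same.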
Finally, for some starting state , the performance guarantee of an algorithm policy is given by bounding the suboptimality quantity .
Infinite-Horizon Robust -Regularized Markov Decision Process: Let be the nominal model, that is, a probability transition function describing a training environment. An infinite-horizon discounted Robust -Regularized Markov Decision Process (RRMDP) tuple where is a robustness parameter and is a convex function. The robust regularized reward function is defined as for any state-action pairs and any such that . Here is the -divergence (Csiszár,, 1967) defined as
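$$D_\phi(P \,\|\, Q) = \int \phi\!\left(\frac{dP}{dQ}\right) dQ \tag{1}$$
for two probability distributions $P$ and $Q$ with $P \ll Q$,
where $\phi$ is convex on $\mathbb{R}$ and differentiable on $\mathbb{R}_+$, satisfying $\phi(1) = 0$ and $\phi(t) = +\infty$ for $t < 0$. Examples of $\phi$-divergence include Total Variation (TV), Kullback-Leibler (KL), chi-square, Conditional Value at Risk (CVaR), and more (cf. Proposition 3).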
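For distributions on a finite set, the divergence reduces to the sum $\sum_x Q(x)\,\phi(P(x)/Q(x))$, which is easy to evaluate directly. The sketch below does this for three of the listed examples. The generator functions use common normalizations ($|t-1|/2$ for TV, $t\log t - t + 1$ for KL, $(t-1)^2$ for chi-square); these are assumptions of this illustration and may differ by scaling from the paper's Proposition 3.

```python
import numpy as np

def phi_divergence(p, q, phi):
    """D_phi(p || q) = sum_x q(x) * phi(p(x) / q(x)) for discrete p << q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    support = q > 0
    if np.any(p[~support] > 0):
        raise ValueError("p must be absolutely continuous w.r.t. q")
    ratio = p[support] / q[support]
    return float(np.sum(q[support] * phi(ratio)))

def phi_tv(t):
    # Total variation generator.
    return 0.5 * np.abs(t - 1.0)

def phi_kl(t):
    # KL generator t*log(t) - t + 1, with the convention 0*log(0) = 0.
    safe = np.where(t > 0, t, 1.0)
    return t * np.log(safe) - t + 1.0

def phi_chi2(t):
    # Chi-square generator.
    return (t - 1.0) ** 2

p = [0.5, 0.5]
q = [0.25, 0.75]
tv = phi_divergence(p, q, phi_tv)
kl = phi_divergence(p, q, phi_kl)
chi2 = phi_divergence(p, q, phi_chi2)
```

Each generator vanishes at $t = 1$, so the divergence of a distribution from itself is zero in all three cases.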
The robust regularized value function of a policy is defined as
(2)
where and . By definition, for any , it follows that .
The optimal robust regularized value function is (similarly we can design ), and is the robust regularized optimal policy that achieves this optimal value.
For convenience, we denote () as ().
We note that satisfies the -rectangularity condition (Iyengar,, 2005) by definition. This is a sufficient condition for the optimization problem in (2) to be tractable. It also enables the existence of a deterministic policy for (Yang et al.,, 2023). We formally mention this in Proposition5.
For any policy , denote as the expected total reward with as initial state distribution.
Denote the robust regularized Bellman operator as
(3)
Since is a contraction (Yang et al.,, 2023), the robust Q-iteration (RQI) converges to . We get the robust optimal policy as .
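Robust Q-iteration can be illustrated on a small tabular problem. The sketch below assumes the operator has the common form $r(s,a) + \gamma \inf_{P \ll P^o_{s,a}} \{\mathbb{E}_P[V] + \lambda D_\phi(P \,\|\, P^o_{s,a})\}$ with $V(s) = \max_a Q(s,a)$, and specializes to the KL divergence, where the inner infimum has the well-known closed form $-\lambda \log \mathbb{E}_{P^o}[e^{-V/\lambda}]$. The function names and the array layout `p0[s, a, s']` are hypothetical.

```python
import numpy as np

def kl_regularized_backup(v, p0, lam):
    """inf_P { E_P[v] + lam * KL(P || p0) } = -lam * log E_{p0}[exp(-v / lam)].

    v has shape (S,), p0 has shape (S, A, S); the result has shape (S, A).
    A global shift keeps the exponentials bounded by 1.
    """
    shifted = -v / lam
    m = shifted.max()
    return -lam * (m + np.log(np.sum(p0 * np.exp(shifted - m), axis=-1)))

def robust_q_iteration(r, p0, gamma, lam, n_iters=200):
    """Repeatedly apply the (assumed) KL-regularized robust Bellman operator."""
    q = np.zeros_like(r)
    for _ in range(n_iters):
        v = q.max(axis=1)
        q = r + gamma * kl_regularized_backup(v, p0, lam)
    return q

def greedy_policy(q):
    # The robust policy is greedy with respect to the final iterate.
    return q.argmax(axis=1)

rng = np.random.default_rng(0)
n_s, n_a = 4, 2
p_nominal = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # shape (S, A, S)
rewards = rng.uniform(size=(n_s, n_a))
q_robust = robust_q_iteration(rewards, p_nominal, gamma=0.9, lam=1.0)
pi_robust = greedy_policy(q_robust)
```

By Jensen's inequality the KL backup never exceeds the nominal expectation, so the robust action-values are bounded above by the non-robust ones, and they approach the non-robust values as `lam` grows.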
2.1 Problem Conceptualization
In this section, we study the offline infinite-horizon robust $\phi$-regularized RL (R3L) problem, acquiring useful insights to construct our algorithm (Algorithm 1) in the next section.
The goal here is to learn a good robust policy (a policy with a high ) based on the offline dataset.
We start by noting one key challenge in the estimation of the robust regularized Bellman operator (3): One may require many offline datasets from each to achieve our offline R3L goal.
In this work, we use the penalized Distributionally Robust Optimization (DRO) tool (Sinha et al.,, 2017; Levy et al.,, 2020; Jin et al., 2021b, ) to not require such unrealistic existence of offline datasets. In particular, as in non-robust offline RL, we only rely on the offline dataset generated on the nominal model by an offline policy .
This statement is justified via the following proposition.
Proposition 1.
Consider a robust -regularized MDP. For any , the robust regularized Bellman operator (3) can be equivalently written as
(4)
where and is some bounded real line which depends on .
A proof of this proposition is given in Appendix D and follows from Levy et al. (2020, Section A.1.2). We refer to (4) as the robust regularized Bellman dual operator. Observing the sole dependence on the nominal model in (4), one can come up with estimators for data-driven approaches that naturally depend only on the offline dataset.
We remark that we consider a class of $\phi$-divergences satisfying the conditions in Proposition 3 for all the results in this paper.
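As a worked illustration of such a dual form, consider the KL case. The derivation assumes the inner problem has the standard penalized-DRO form $\inf_{P \ll P^o}\{\mathbb{E}_P[V] + \lambda D_{\mathrm{KL}}(P \,\|\, P^o)\}$ with generator $\phi(t) = t\log t - t + 1$; the symbols $V$, $\eta$, and $\phi^*$ (the convex conjugate of $\phi$) are notation chosen for this sketch and may differ from the paper's.

```latex
\begin{align*}
\phi^*(u) &= \sup_{t \ge 0} \{ ut - t\log t + t - 1 \} = e^{u} - 1, \\
\inf_{P \ll P^o} \big\{ \mathbb{E}_{P}[V] + \lambda D_{\mathrm{KL}}(P \,\|\, P^o) \big\}
  &= \sup_{\eta \in \mathbb{R}} \Big\{ \eta - \lambda\, \mathbb{E}_{P^o}\big[ \phi^*\big( (\eta - V)/\lambda \big) \big] \Big\} \\
  &= \sup_{\eta \in \mathbb{R}} \Big\{ \eta + \lambda - \lambda\, \mathbb{E}_{P^o}\big[ e^{(\eta - V)/\lambda} \big] \Big\}.
\end{align*}
% First-order condition in eta: E_{P^o}[ exp((eta - V)/lambda) ] = 1, hence
\begin{align*}
\eta^\star = -\lambda \log \mathbb{E}_{P^o}\big[ e^{-V/\lambda} \big],
\qquad
\text{optimal value} = \eta^\star + \lambda - \lambda \cdot 1 = \eta^\star .
\end{align*}
```

Only expectations under the nominal model appear on the right-hand side, which is what makes estimation from nominal-model data possible.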
We now remark on a natural first attempt at performing the squared Bellman error least-square regression, like FQI, on the robust regularized Bellman dual operator (4). Observe that the true Bellman error involves solving an inner convex minimization problem in (4) for every . Since we are in a countably large state space regime, it is infeasible to devise approximations to this true squared Bellman error. In addition, we have to also enable general function architecture for action-values.
To alleviate this challenging task, we now turn our attention to the inner convex minimization problem in the robust regularized Bellman dual operator (4). Due to the -rectangularity assumption, we note that the ’s are not correlated across all . With this note, for every , we can replace in (4) with a dual-variable function . Thus, intuitively, multiple point-wise minimizations can be replaced by a single dual-variable functional minimization over the function space of . We formalize this intuition using variational functional analysis (Rockafellar and Wets,, 2009) for a countably large state space regime in the following.
We denote as the set of all absolutely integrable functions defined on the probability (measure) space with , the data generating distribution, as the -finite probability measure. To elucidate, is the set of all functions such that is finite. We set considering the inner minimization in (4).
Fixing any given function , we define the loss function , for all , as
(5)
We state the result for single dual-variable functional minimization intuition we developed in the previous paragraph. We also note one variant of this result appears in the distributionally robust RL work (Panaganti et al.,, 2022).
Proposition 2.
Let be the loss function defined in (5). Then, for any function , we have
(6)
We provide a proof in Appendix D, which relies on Rockafellar and Wets (2009, Theorem 14.60).
For any given and , we define an operator , for all , as
(7)
This operator is useful in view of Propositions1 and 2. To see this, we first define for any action-value function . Now, by taking an expectation w.r.t the data generating distribution on (4), we observe by utilizing (6). Due to this observation, in the following subsection, we develop an algorithm by approximating both the optimal dual-variable function of optimal robust value and the robust squared Bellman error () using offline data .
Panaganti et al. (2022) similarly conceptualized their total variation $\phi$-divergence robust RL algorithm. Here, Proposition 1 enables us to conceptualize for general $\phi$-divergence.
2.2 Robust $\phi$-regularized fitted Q-iteration
In this section, we formally propose our algorithm based on the tools developed so far. Our proposed algorithm is called Robust -regularized fitted Q-iteration (RPQ) Algorithm and is summarized in Algorithm1. We first discuss the inputs to our algorithm. As mentioned above, we only use the offline dataset , generated according to a data distribution on the nominal model . We also consider two general function classes and representing action-value functions and dual-variable functions, respectively.
We now define useful approximation quantities for and . For given , the empirical loss function of the true loss Eq.5 on is
(8)
For given , the empirical squared robust regularized Bellman error on is
(9)
We start with an initial action-value function and execute the following two steps for iterations.
At iteration of the algorithm with input , as a first step, we compute a dual-variable function through the empirical risk minimization approach, that is, we solve (Line 4 of Algorithm1).
As a second step, given inputs and , we compute the next iterate through the least-squares regression method, that is, we solve (Line 5 of Algorithm1).
After iterations, we extract the greedy policy from (Line 7 of Algorithm1).
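The two steps per iteration can be sketched for tabular function classes. This is an illustration under several assumptions that are not taken from the paper: the per-sample dual loss is taken to be $\lambda\,\phi^*((g(s,a) - \max_{a'} f(s',a'))/\lambda) - g(s,a)$, the regression target is the reward minus $\gamma$ times that loss, the divergence is KL (so $\phi^*(u) = e^u - 1$), rewards lie in $[0, 1]$, and the dual-variable class is a grid of constants per state-action pair. All function and parameter names are hypothetical.

```python
import numpy as np

def conj_kl(u):
    """Convex conjugate of the KL generator t*log(t) - t + 1."""
    return np.exp(u) - 1.0

def rpq_tabular(data, n_states, n_actions, gamma, lam,
                conj=conj_kl, n_iters=50, n_grid=401):
    """Tabular sketch of the two-step iteration on samples (s, a, r, s_next)."""
    s, a, r, s_next = (np.asarray(x) for x in zip(*data))
    # Candidate dual values; [0, 1/(1-gamma)] covers the value range for rewards in [0, 1].
    eta = np.linspace(0.0, 1.0 / (1.0 - gamma), n_grid)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v_next = q[s_next].max(axis=1)
        # Per-sample dual loss for every candidate dual value, shape (N, n_grid).
        loss = lam * conj((eta[None, :] - v_next[:, None]) / lam) - eta[None, :]
        q_new = np.zeros_like(q)
        for si in range(n_states):
            for ai in range(n_actions):
                mask = (s == si) & (a == ai)
                if not mask.any():
                    continue  # unvisited pairs keep value 0 in this sketch
                # Step 1: empirical risk minimization over the dual variable.
                best = loss[mask].mean(axis=0).argmin()
                # Step 2: least squares on targets; in the tabular case this is the mean.
                q_new[si, ai] = np.mean(r[mask] - gamma * loss[mask, best])
        q = q_new
    return q

def greedy(q):
    # Policy extraction after the final iteration.
    return q.argmax(axis=1)

samples = [(0, 0, 1.0, 0), (0, 0, 1.0, 1), (1, 0, 0.0, 1)]
q_rpq = rpq_tabular(samples, n_states=2, n_actions=1, gamma=0.5, lam=1.0)
```

With general function classes, the grid search in step 1 would be replaced by minimizing the empirical dual loss over the dual-variable class, and the cell mean in step 2 by a regression over the action-value class.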
2.3 Performance Guarantee: Suboptimality
We now discuss the performance guarantee of our RPQ Algorithm. In particular, we characterize how close the robust regularized value function of our RPQ Algorithm is to the optimal robust regularized value function. We first mention all the assumptions about the data generating distribution and the representation power of and before we present our main results.
Assumption 1(Concentrability).
There exists a finite constant such that for any policy and satisfying for all (both can be non-stationary), we have .
Assumption 1 stipulates the support set of the data generating distribution , i.e. , to cover the union of all support sets of the distributions , leading to a robust exploratory behavior.
This assumption is widely used in the offline RL literature (Munos, 2003; Agarwal et al., 2019; Chen and Jiang, 2019; Wang et al., 2021; Xie et al., 2021) in different forms. We adapt this assumption from the robust offline RL literature (Panaganti et al., 2022; Zhang et al., 2023).
Let be some small positive constant. For any , we have for the data generating distribution .
We note that Assumption2 holds trivially if is closed under , that is, for any and , if it holds that , then .
This assumption has been widely used in different forms in the non-robust offline RL literature (Agarwal et al., 2019; Wang et al., 2021; Xie et al., 2021) and robust offline RL literature (Panaganti et al., 2022; Bruns-Smith and Zhou, 2023; Zhang et al., 2023).
Assumption 3(Approximate Dual Realizability).
For all , there exists a uniform constant such that .
Assumption3 holds trivially if for any (since ). This assumption has been used in earlier robust offline RL literature (Panaganti et al.,, 2022; Bruns-Smith and Zhou,, 2023).
Now we state our main theoretical result on the performance of the RPQ algorithm. In Appendix D, we restate the result including the constant factors.
Theorem 1.
Let Assumptions 1, 2, and 3 hold.
Let be problem-dependent constants for . Let be the RPQ algorithm policy after iterations. Then, for any , with probability at least , we have
Theorem 1 states that the RPQ algorithm is approximately optimal.
This theorem also gives the sample complexity guarantee for finding an -suboptimal policy w.r.t. the optimal policy .
To see this, by neglecting the first term due to inevitable function class approximation errors, for we get with probability at least for any fixed .
Remark 1.
Note that the guarantee for the TV case in Theorem1 requires making another assumption on the existence of a fail-state (Panaganti et al.,, 2022, Lemma 3), Assumption8 replacing with .
However, we specialize Theorem 1 for the TV case by relaxing Assumption 1 to get the same guarantee, which we present in Appendix D. In particular, we relax Assumption 1 to the non-robust offline RL concentrability assumption (Foster et al., 2022), i.e., we only need the distribution to be in the collection of discounted state-action occupancies on the nominal model.
In this section, we provide a hybrid robust $\phi$-regularized RL protocol to overcome the out-of-data-distribution issue in offline robust RL.
As in Song et al. (2023), we reformulate the problem in the finite-horizon setting to use its backward induction feature that enables RPQ iterates to run in each episode.
We again start by discussing preliminaries and the problem formulation.
Finite-Horizon Markov Decision Process:
A finite-horizon Markov Decision Process (MDP) is
, where is the horizon length, for any , is a known deterministic reward function and is the transition probability function at time .
A non-stationary (stochastic) policy where . We denote the transition dynamic distribution at time and state-action as .
Given , we define the state and action value functions in the usual manner: starting at state and , and starting at state-action and .
Given , occupancy measure over state-action pairs .
We write to denote an optimal non-stationary deterministic policy, which maximizes .
Hybrid Reinforcement Learning:
The goal of hybrid RL on MDP is to learn a good policy based on adaptive datasets consisting of both offline datasets and on-policy datasets.
Given timestep , offline dataset is generated by with the pairs i.i.d. sampled by offline data distribution. For convenience, also denotes the offline policy that generates .
Given timestep , on-policy dataset is generated by and for all the previously learned policies by the algorithm.
Song et al. (2023) propose the Hybrid Q-learning (HyQ) algorithm with general function approximation capabilities and provable guarantees for hybrid RL.
The HyQ algorithm (c.f. Song et al., (2023, Algorithm 1)) is quite straightforward: For each iteration , do backward induction of the FQI algorithm on timesteps using the adaptive datasets described above.
Finally, for some starting state , the performance guarantee of algorithm policies is given by bounding the cumulative suboptimality quantity .
We note the total adaptive dataset size is to provide comparable results with offline RL.
Finite-Horizon Robust -Regularized Markov Decision Process: Again, let be the nominal model. A finite-horizon discounted Robust -Regularized Markov Decision Process (RRMDP) tuple where is a robustness parameter and is as before. For , the robust regularized reward function is .
For , the robust regularized value function of a policy is defined as
where and .
By definition, for any , it follows that .
For , the optimal robust regularized value function is , and is the robust regularized optimal policy that achieves this optimal value.
For convenience, we denote () as () for all .
We again note that, for each , satisfies the -rectangularity condition (Iyengar,, 2005) by definition. It enables the existence of a non-stationary deterministic policy for (Zhang et al.,, 2023). We formalize this in Proposition6.
We denote as the expected total reward.
For convenience, we let for any . For any , denote the robust regularized Bellman operator as
(10)
As , doing backward iteration of , i.e., the robust dynamic programming , we get for all . For each timestep , we also get the robust optimal policy as .
3.1 Problem Conceptualization
In this section, we study the hybrid finite-horizon robust TV-regularized RL problem, acquiring the necessary insights to construct our algorithm (Algorithm 2) in the next section.
We conceptualize for general $\phi$-divergence, but only propose our algorithm for the total variation $\phi$-divergence.
The goal here is to learn a good robust policy based on adaptive datasets consisting of both offline datasets and on-policy datasets. We start by noting a direct consequence of Proposition 1 due to similar inner minimization problems in both the infinite-horizon (3) and finite-horizon (10) operators.
Corollary 1.
For any and , the robust regularized Bellman operator (10) can be equivalently written as
(11)
where and is some bounded real line that depends on .
As in Section 2, this dual reformulation enables us to use the datasets from only the nominal model for estimating the robust regularized operator in its primal form (10).
We start by recalling the philosophy of the HyQ algorithm (Song et al., 2023) to use the FQI algorithm for adaptive datasets.
We do the same for our hybrid finite-horizon robust $\phi$-regularized RL problem here.
For each , we need to estimate the true Bellman error using offline dataset from and the on-policy dataset from by the learned policies from the algorithm. We remark that the out-of-data-distribution issue appears when we only have access to the offline dataset to estimate the summation term above, which depends on .
As discussed in Section2, the true Bellman error itself involves solving an inner convex minimization problem in (11) for every and that is challenging for countably large state setting.
To alleviate this challenging task, we again utilize the functional minimization Proposition2 developed in Section2. For any , we denote the set of admissible distributions of nominal model as . Now we redefine dual loss for any , as
(12)
We state a direct consequence of Proposition2 here.
Corollary 2.
Let be the loss function defined in (12). Fix and consider any policy . Then, for any function and any , we have
(13)
For any given and , we redefine operator for all , as
(14)
We have all the necessary tools now.
In the following subsection, we develop an algorithm that naturally extends our RPQ algorithm using adaptive datasets.
3.2 Hybrid Robust regularized Q-iteration
In this section, we propose our algorithm based on the tools developed so far. Our proposed algorithm is called the Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q) algorithm, summarized in Algorithm 2. The total variation $\phi$-divergence (1) is defined with $\phi(t) = |t - 1|/2$. The inputs to this algorithm are the offline dataset and two general function classes.
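To build intuition for total variation regularization on a finite next-state set, the inner problem $\inf_{P \ll P^o}\{\mathbb{E}_P[V] + \lambda\,\mathrm{TV}(P, P^o)\}$ with $\mathrm{TV}(P, P^o) = \tfrac{1}{2}\sum_x |P(x) - P^o(x)|$ admits a simple closed form: moving a unit of mass to the lowest-value state in the support costs $\lambda$, so it pays only for states whose value exceeds the minimum by more than $\lambda$. The sketch below implements this clipped expectation. It is a finite-support illustration supplied here, not the paper's dual form (which relies on a fail-state assumption), and the function name is hypothetical.

```python
import numpy as np

def tv_regularized_backup(v, p0, lam):
    """inf over P << p0 of E_P[v] + lam * TV(P, p0), for a finite state set.

    Closed form: E_{p0}[ min(v, v_min + lam) ], where v_min is the smallest
    value on the support of p0.
    """
    v, p0 = np.asarray(v, float), np.asarray(p0, float)
    v_min = v[p0 > 0].min()
    return float(np.sum(p0 * np.minimum(v, v_min + lam)))

values = np.array([0.0, 10.0])
nominal = np.array([0.5, 0.5])
small_penalty = tv_regularized_backup(values, nominal, lam=1.0)   # all mass moves to state 0
large_penalty = tv_regularized_backup(values, nominal, lam=20.0)  # moving mass is too costly
```

For a large penalty the backup coincides with the nominal expectation, and for a small penalty it approaches the worst-case value on the support.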
For any , and represent action-value functions and dual-variable functions at , respectively.
We redefine, using (17), the empirical dual loss and the robust empirical squared robust regularized Bellman error for dataset as
(15)
(16)
3.3 Cumulative Suboptimality Guarantee
We now discuss the performance guarantee in terms of the cumulative suboptimality of our HyTQ algorithm. We first state all the assumptions before we present our main result and add a brief discussion. We provide a detailed discussion in Section 4.
Assumption 4(Robust Bellman Error Transfer Coefficient).
Let be the offline data generating distribution. For any , there exists a small positive constant for the optimal policy that satisfies
We develop this assumption from non-robust offline RL work (Song et al.,, 2023).
Assumption 5(Approximate Value Realizability and Robust Bellman Completeness).
Let be small constant. For any and , we have for all . Furthermore, for any , we have .
Assumption 6(Approximate Dual Realizability).
Let be some small positive constant. For any and , we have , for all .
We adapt these two enhanced realizability assumptions from the non-robust offline RL literature (Xie et al.,, 2021; Foster et al.,, 2022; Song et al.,, 2023) to our problem.
The assumptions in Section2 are not directly comparable, but for the sake of exposition, let be the same across .
First, note that Assumption3 with all-policy concentrability (Assumption1) is equivalent to Assumption6. Second, Assumption2 implies . Now again, with all-policy concentrability (Assumption1), it is the approximate value realizability (Assumption5). We know non-robust offline RL is hard (Foster et al.,, 2022) with just realizability and all-policy concentrability. As robust RL is at least as hard as its non-robust counterpart (Panaganti and Kalathil,, 2022), we also assume Bellman completeness in Assumption5.
Assumption 7(Bilinear Models).
Consider any and . Let be greedy policy w.r.t .
There exists an unknown feature mapping and two unknown weight mappings with and
such that both
and hold.
We adapt this problem architecture assumption on with and for our setting from a series of non-robust online RL works (Jin et al., 2021a; Du et al., 2021).
Assumption 8 (Fail-state).
There is a fail state for all , such that and , for all and satisfying for all .
This assumption enables us to ground the value of such ’s at to zero, which helps us obtain a tight duality (cf. (17)) without having to know the minimum value across large . There are approximations to this in the literature (Wang and Zou, 2022), but we adopt this less restrictive assumption from Panaganti et al., (2022) for convenience.
Now we state our main theoretical result on the performance of the HyTQ algorithm. The proof is presented in Appendix E.
Theorem 2.
Let Assumptions 4, 5, 6, 7, and 8 hold. Fix any . Then, the policies produced by the HyTQ algorithm satisfy
with probability at least .
Remark 2.
We specialize this result for two bilinear model examples, the linear occupancy complexity model (Du et al., 2021, Definition 4.7) and the low-rank feature selection model (Du et al., 2021, Definition A.1), in Section E.2.
We also specialize this result, using the standard online-to-batch conversion (Shalev-Shwartz and Ben-David, 2014) for the uniform policy over the HyTQ policies, to provide the sample complexity
in Section E.2.
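The online-to-batch step can be sketched as follows. The notation is assumed here for illustration and is not taken from the theorem statement: $K$ denotes the number of HyTQ iterations, $\pi_1,\dots,\pi_K$ the learned policies, $V^{\pi}$ the robust regularized value of a policy $\pi$ under the initial state distribution, and $\mathcal{R}(K,\delta)$ any high-probability bound on the cumulative suboptimality, such as the one in Theorem 2.

```latex
% Assumed notation: the uniform policy draws k uniformly from {1,...,K}
% and then follows \pi_k for the whole episode.
\begin{align*}
\sum_{k=1}^{K} \bigl( V^{\pi^*} - V^{\pi_k} \bigr) \le \mathcal{R}(K,\delta)
\quad\Longrightarrow\quad
V^{\pi^*} - \mathbb{E}_{k \sim \mathrm{Unif}[K]}\bigl[ V^{\pi_k} \bigr]
= \frac{1}{K} \sum_{k=1}^{K} \bigl( V^{\pi^*} - V^{\pi_k} \bigr)
\le \frac{\mathcal{R}(K,\delta)}{K}.
\end{align*}
```

Hence any $K$ with $\mathcal{R}(K,\delta)/K \le \epsilon$ suffices for the uniform policy to be $\epsilon$-optimal. For a bound of the hypothetical form $C\sqrt{K}$, this gives $K \ge C^2/\epsilon^2$.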
4 Theoretical Discussions and Final Remarks
In this section, we first discuss the proof ideas for our results, focusing on the assumptions and their improvements. Next, we compare our results with the most relevant ones from the robust RL literature; Table 1 should be used as a reference. Finally, we discuss the bilinear model architecture in detail, as ours is the first work to consider it in the robust RL setting with general function approximation for the value and dual functions.
Discussions on Proof Sketch:
We first discuss our RPQ algorithm (Algorithm 1) result. We note that the concentrability assumption (Assumption 1) requires the data-generating policy to be robust exploratory. That is, it covers the state-action occupancy induced by any policy and any transition model in the -divergence set.
We reiterate the proof idea of the suboptimality result (Panaganti et al., 2022, Theorem 1) of the RFQI algorithm (Panaganti et al., 2022, Algorithm 1). We highlight the most important differences with Panaganti et al., (2022); Zhang et al., (2023) here. Firstly, we generalize the robust performance lemma ( at Eq. 26) for any general -divergence problem.
Secondly, we identify that it is hard to come up with a unified analysis for general -divergences in the robust RL setting via the dual reformulation of the distributionally robust optimization problem (Duchi and Namkoong, 2018, Proposition 1). Thus, a direct extension of the results in Panaganti et al., (2022) is hard for general -divergences. Through the RPQ analysis, we show that it is indeed possible to get a unified analysis for the robust RL problem using the RRMDP framework.
Thirdly, we show the generalization bounds for the empirical risk minimization (Proposition 7) and least squares (Proposition 8) estimators for general -divergences with unified results.
With these three points, equipped with the more general robust exploratory concentrability (Assumption 1), we obtain a unified suboptimality result (Theorem 1) for the RPQ algorithm under general -divergences.
We now discuss our HyTQ algorithm (Algorithm 2) result. We immediately make an important note here. The concentrability assumption improvement is two-fold: from all-policy concentrability (Assumption 9) to single-policy concentrability, and then to the robust Bellman error transfer coefficient (Assumption 4) via Lemma 8.
We refer to Foster et al., (2022); Song et al., (2023) for further discussion on such concentrability assumption improvements and their tightness in non-robust offline RL. We leave tightening these assumptions in the robust RL setting to future work.
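To make the ordering of these coverage notions concrete, the following is a minimal tabular sketch rather than the paper's definitions. It assumes a simplified single-step setting in which occupancies and the offline distribution are probability vectors over state-action pairs, the concentrability coefficient is the largest density ratio, and the transfer coefficient is the ratio of the on-policy average error to the root-mean-square error under the offline distribution. The function names and the toy numbers are hypothetical.

```python
import math

def density_ratio(d, mu):
    """Largest ratio d(s,a)/mu(s,a); assumes mu > 0 wherever d > 0."""
    return max(x / m for x, m in zip(d, mu) if x > 0)

def transfer_ratio(d, mu, err):
    """|E_d[err]| / sqrt(E_mu[err^2]) for one fixed, nonzero error vector."""
    num = abs(sum(x * e for x, e in zip(d, err)))
    den = math.sqrt(sum(m * e * e for m, e in zip(mu, err)))
    return num / den

# Toy problem with three state-action pairs (all numbers are illustrative).
mu = (0.4, 0.4, 0.2)        # offline data distribution
d_star = (0.5, 0.5, 0.0)    # occupancy of the (assumed) optimal policy
d_other = (0.0, 0.1, 0.9)   # occupancy of some other policy

# Coverage of the optimal policy only versus coverage of every policy.
single_policy = density_ratio(d_star, mu)
all_policy = max(density_ratio(d, mu) for d in (d_star, d_other))

# Transfer-style coefficient: worst case over a small set of candidate errors.
errors = [(1.0, -0.5, 2.0), (0.3, 0.3, -4.0), (-1.0, -1.0, 0.0)]
transfer = max(transfer_ratio(d_star, mu, e) for e in errors)

print(all_policy, single_policy, transfer)
```

By Jensen's inequality and a change of measure, the transfer ratio never exceeds the square root of the single-policy density ratio, which in turn never exceeds its all-policy counterpart. This is the direction of relaxation described above.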
We execute a tighter analysis in our HyTQ algorithm result (Theorem 2) compared to our RPQ algorithm result specialized to the TV -divergence (Theorem 4). We summarize the steps as follows:
Step : We carefully arrive at the following robust performance lemma (cf. Eqs. 37 and 39) for each algorithm iteration : We highlight that the first summand here depends on the samples from the state-action occupancy of the optimal robust policy, whereas the second summand is w.r.t. the learned HyTQ policies. It is now intuitive to connect the first summand with the offline samples and the second with the online samples.
Step : With the intuition gathered above, we first note that the history-dependent dataset, collected by different policies (the offline data-generating policy and the learned HyTQ policies) on the nominal model, warrants more sophisticated generalization bounds for the empirical risk minimization and least squares estimators. We prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process in Appendix C. This result is also applicable to machine learning problems outside the scope of this paper.
Finally, equipped with the transfer coefficient (Assumption 4) and bilinear model (Assumption 7) assumptions for the nominal model , we formally show generalization bounds for the empirical risk minimization and least squares estimators in Propositions 9 and 10, respectively.
We complete the proof by combining these two steps.
Remark 3.
Our RPQ and HyTQ algorithms are computationally tractable because they use two tractable estimators: empirical risk minimization (Steps 4 & 9, resp.) over the general function class , and least squares (Steps 5 & 10, resp.) over the general function class . This two-step estimator update avoids the complexity of solving the inner problem for each state-action pair (which leads to scaling issues for high-dimensional problems) in the original robust Bellman operators (Eqs. 3 and 10).
To the best of our knowledge, no purely online or purely offline robust RL algorithms are known to be tractable in this sense, except for other robust Q-iteration and actor-critic methods (discussed in Table 1) and except under much stronger coverage conditions (like single-policy and uniform) in the tabular setting.
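The two-step structure in this remark can be sketched on a toy tabular problem. This is only an illustration under stated assumptions, not a reproduction of the losses in Eqs. 15 and 16: it assumes a total-variation-style dual loss of the form mean((eta - V(s'))_+ - eta) with the dual value eta restricted to [0, lam], uses lookup tables in place of the general function classes, and introduces the hypothetical names `two_step_update`, `lam`, and `grid_size`.

```python
from collections import defaultdict

def two_step_update(data, q, actions, lam, gamma, grid_size=51):
    """One fitted-Q style update from nominal-model transitions.

    data: list of (s, a, r, s_next) tuples.
    q:    dict mapping (s, a) to the current Q estimate (missing entries are 0).
    """
    def v(s):
        return max(q.get((s, a), 0.0) for a in actions)

    # Group the samples by state-action pair, storing (reward, next-state value).
    groups = defaultdict(list)
    for s, a, r, s_next in data:
        groups[(s, a)].append((r, v(s_next)))

    etas = [lam * i / (grid_size - 1) for i in range(grid_size)]

    def dual_loss(eta, samples):
        return sum(max(eta - vn, 0.0) - eta for _, vn in samples) / len(samples)

    g, q_new = {}, {}
    for sa, samples in groups.items():
        # Step 1 (empirical risk minimization): pick the dual value on the grid.
        g[sa] = min(etas, key=lambda eta: dual_loss(eta, samples))
        # Step 2 (least squares): in the tabular case the least-squares fit
        # is the sample mean of the regression targets.
        targets = [r - gamma * (max(g[sa] - vn, 0.0) - g[sa]) for r, vn in samples]
        q_new[sa] = sum(targets) / len(targets)
    return g, q_new

# Toy usage: one state-action pair observed with two different next states.
actions = [0, 1]
q = {(1, 0): 2.0, (1, 1): 5.0, (2, 0): 0.5}
data = [(0, 0, 1.0, 1), (0, 0, 1.0, 2)]
g, q_new = two_step_update(data, q, actions, lam=1.0, gamma=0.9)
print(g[(0, 0)], q_new[(0, 0)])
```

Neither step searches over transition models: the inner minimization is replaced by a scalar dual value per state-action pair, which is then plugged into an ordinary regression target.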
Theoretical Guarantee Discussions:
In the suboptimality result (Theorem 1) for the RPQ algorithm (Algorithm 1), we only mention the leading statistical bound with a problem-dependent (on -divergence) constant . We provide the exact constants pertaining to different -divergences in Theorem 3, a restatement of Theorem 1. Furthermore, the constants in Theorem 3 take different values for the different -divergences provided in Proposition 3. Similarly, for the suboptimality result (Theorem 2) of the HyTQ algorithm (Algorithm 2), we provide a more detailed bound in the restatement in Theorem 5.
In the following, we compare our suboptimality results with relevant prior works. But first, we make an important note on , the robustness radius parameter in RMDPs, and , the robustness penalization parameter in RRMDPs, mentioned briefly in Table 1.
Levy et al., (2020) and Yang et al., (2023) establish that the regularized and constrained versions of DRO and robust MDP problems, respectively, are equivalent by connecting their respective ( and ) robustness parameters. Moreover, both observe rigorously that and are inversely related. This is intuitively true, as and both yield the non-robust solutions on the nominal model and as and both yield the conservative solutions considering the entire probability simplex for the transition dynamics.
However, it is an interesting open problem to establish an exact analytical relation between the robustness parameters and . We leave this to future research as it is out of the scope of this work.
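The two limiting regimes of the penalization parameter can be checked numerically on a toy example. The sketch below assumes a penalized inner problem of the form inf over P of E_P[v] + lam * TV(P, P_nominal), with total variation taken as half the L1 distance and a nominal distribution with full support. Under these assumptions the infimum equals E_nominal[min(v, min(v) + lam)]. The names `tv_penalized_value`, `brute_force`, and `lam` are hypothetical.

```python
def tv_penalized_value(p_nominal, v, lam):
    """Closed form of inf_P E_P[v] + lam * TV(P, p_nominal), TV = half L1 distance."""
    v_min = min(v)
    return sum(p * min(x, v_min + lam) for p, x in zip(p_nominal, v))

def brute_force(p_nominal, v, lam, steps=200):
    """Grid search over the 3-state probability simplex (sanity check only)."""
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p = (i / steps, j / steps, (steps - i - j) / steps)
            tv = 0.5 * sum(abs(a - b) for a, b in zip(p, p_nominal))
            best = min(best, sum(a * x for a, x in zip(p, v)) + lam * tv)
    return best

p_nominal = (0.2, 0.3, 0.5)
v = (0.0, 1.0, 3.0)

conservative = tv_penalized_value(p_nominal, v, 0.0)    # no penalty: worst case
non_robust = tv_penalized_value(p_nominal, v, 100.0)    # large penalty: nominal mean
middle = tv_penalized_value(p_nominal, v, 1.5)
check = brute_force(p_nominal, v, 1.5)

print(conservative, middle, non_robust, check)
```

With no penalty the value collapses to the smallest entry of `v` (the most conservative solution over the whole simplex), and with a large penalty it returns the nominal expectation, matching the intuition above.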
Here we specialize our result (Theorem 3) for the chi-square -divergence R3L problem. We get the suboptimality for the RPQ algorithm as , where we have only presented the higher-order terms. The suboptimality of Algorithm 2 in Yang et al., (2023, Theorem 5.1) for the chi-square -divergence is stated for as , where is described in Table 1. We use the typical equivalence from the RL literature to compare these two results in the tabular setting with the generative/simulator modeling assumption: function approximation classes with full dimension yield (Panaganti et al., 2022), and uniform support data sampling yields and (Shi et al., 2023). Now our result with reduces to and their result (Yang et al., 2023) reduces to . Two comments warrant attention here. Firstly, compared to a model-based robust regularized algorithm (robust value iteration using empirical estimates of the nominal model ) (Yang et al., 2023, Theorem 3.2), our suboptimality bound is worse by the factors and . We leave it to future work to fine-tune the analysis and get optimal rates. Secondly, their result (Yang et al., 2023, Theorem 5.1) exhibits inferior performance compared to ours in all parameters, but we do want to note that they make a first attempt to give suboptimality bounds for a stochastic approximation-based algorithm. The dependence on is typically known to be bad when using stochastic approximation as the technical tool (Chen et al., 2022), and Yang et al., (2023, Discussion on Page 16) conjecture that the Polyak-averaging technique can improve their suboptimality bound rate to .
Here we discuss and compare our result for the total variation -divergence setting. As mentioned in Remark 1, we have a specialized result in Section D.2 for the total variation -divergence.
We get the suboptimality result (Theorem 4) for the RPQ algorithm as , where we again have only presented the higher-order terms. Panaganti et al., (2022, Theorem 1), mentioned in Table 1, also exhibits the same suboptimality guarantee, replacing with . As we noted before, (the robustness radius parameter in RMDPs) and (the robustness penalization parameter in RRMDPs) are inversely related, and for the TV -divergence we observe a straightforward relation between the two as .
Using the earlier arguments for a tabular setting bound, our result further reduces to . Comparing this to the minimax lower bound (Shi et al., 2023, Theorem 2), our suboptimality bound is worse by the factors and .
Nevertheless, we push the boundaries by providing a novel suboptimality guarantee for the robust RL problem in the hybrid RL setting. Furthermore, as mentioned earlier in Remark 2, we provide the offline+online robust RL suboptimality guarantee in Appendix E.
We also remark that the HyTQ algorithm can be proposed under the RMDP setting with a similar suboptimality guarantee due to the similarity of the dual Bellman equations under the TV -divergence for RMDPs and RRMDPs (cf. Eq. 33 and Xu∗ et al., (2023, Lemma 8)). For the sake of consistency and novelty, we present our results solely for the RRMDP setting.
As mentioned earlier, the concentrability assumption improvement is two-fold (Lemma 8): from all-policy concentrability (Assumption 9) to single-policy concentrability, and then to the transfer coefficient.
This result is the first of its kind, and there are not yet any lower bounds in the robust RL setting to compare it against.
Under similar transfer coefficient, Bellman completeness, and bilinear model assumptions, the HyTQ algorithm sample complexity (Corollary 5) is comparable to that of a non-robust RL algorithm (Song et al., 2023), i.e., .
We leave the development of minimax rates and optimal algorithm guarantees to future work.
Here we specialize our result (Theorem 3) for the KL -divergence R3L problem. We get the suboptimality for RPQ as , where we have only presented the higher-order terms. Using the earlier arguments for a tabular setting bound, our result with again reduces to . Zhang et al., (2023, Theorem 5), mentioned in Table 1, also exhibits the same suboptimality guarantee.
Two remarks are in order here.
Firstly, we remark that our RPQ algorithm and its theoretical guarantee are unified across a class of -divergences, whereas Zhang et al., (2023, Algorithm 1) is specialized for the KL -divergence. This supports our first main contribution discussed in Section 1.
Secondly, we remark that the robust regularized Bellman operator (Eq. 3) for the KL -divergence has a special form due to the existence of an analytical worst-case transition model. This leads to a special structure in the form of an exponential robust Bellman operator in a Q-value-variant space. This special structure helps avoid the dual variable function update (Step 4) in the RPQ algorithm and the factor in the suboptimality guarantee. We choose not to include this specialized result in this work (unlike for the TV -divergence, for which we did so in Section D.2) and directly point to Zhang et al., (2023). We do highlight an important note on this choice. The above-mentioned special structure forces us to get online samples from all the transition kernels (cf. Assumption 1), which is unrealistic in practice, to achieve an improvement in the hybrid robust RL setting.
We leave the development of such improved algorithm guarantees in the hybrid robust RL setting for other -divergences to future work.
Discussion of Bilinear Models in the Hybrid Robust RL setting:
We emphasize that while our bilinear model for the HyTQ algorithm is specialized to the low occupancy complexity model (i.e., the occupancy measures themselves have a low-rank structure) and the low-rank feature selection model (i.e., the nominal model has a low-rank structure) in Section E.2, the function classes (Q-value representations) and (dual-value representations) can be arbitrary, potentially nonlinear function classes (neural tangent kernels, neural networks, etc.).
Thus, even in the tabular setting with a large state space (e.g. ) for the bilinear model, our suboptimality bounds only scale with the complexity of the function classes and , which can be considerably lower than . Examples include linear function approximators (e.g. linear feature dimension ), RKHS approximators with low-dimensional features, neural tangent kernels with low effective neural net dimension, and more. Moreover, our work solves the robust RL problem with more nuances, which is at least as hard as the non-robust RL problem. Thus, given the emerging status of research on robust RL in the general function approximation setting, we believe it is currently out of scope for this work to cover more general bilinear model classes (Du et al., 2021). Nevertheless, our initial findings for robust RL by the HyTQ algorithm in the hybrid learning setting reveal the hardness of finding larger model classes for RRMDPs with general -divergences.
We conclude this section with an exciting future research direction that remains unsolved in this paper: solving the hybrid robust RL problem for general -divergences. In this work, we noticed while building hybrid learning for robust RL that one would require online samples from the worst-case model (i.e., the model that solves the inner problem in the robust Bellman operator, Eq. 10) for general -divergences, because the current analyses depend on the bilinear models. We use the dual reformulation for the total variation -divergence and provide current results supporting the HyTQ algorithm. We remark that using the same approach for other general -divergences, we get exponential dependence on the horizon factor. This warrants more sophisticated algorithm designs for the hybrid robust RL problem under general -divergences.
5 Conclusion
In this work, we presented two robust RL algorithms. We proposed the Robust -divergence-fitted Q-iteration algorithm for general -divergences in the offline RL setting. We provided performance guarantees with a unified analysis for all -divergences with arbitrarily large state spaces using function approximation. To mitigate the out-of-data-distribution issue by improving the assumptions on data generation, we proposed a novel framework called hybrid robust RL that uses both offline and online interactions. We proposed the Total-variation-divergence Q-iteration algorithm in this framework with an accompanying guarantee. We provided theoretical guarantees in terms of suboptimality and sample complexity for both the offline and the offline+online robust RL settings. We also rigorously specialized our results to different -divergences and different bilinear modeling assumptions. Finally, we provided detailed comparisons with relevant prior works and discussed interesting future directions in the field of robust reinforcement learning.
Acknowledgment
KP acknowledges support from the ‘PIMCO Postdoctoral Fellow in Data Science’ fellowship at the California Institute of Technology.
This work acknowledges support from NSF CNS-2146814, CPS-2136197, CNS-2106403, NGSDI-2105648, and funding from the Resnick Institute.
EM acknowledges support from NSF award 2240110.
We thank several anonymous ICML 2024 reviewers for their constructive comments on an earlier draft of this paper.
References
Agarwal et al., (2019)
Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019).
Reinforcement learning: Theory and algorithms.
CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.
Antos et al., (2008)
Antos, A., Szepesvári, C., and Munos, R. (2008).
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path.
Machine Learning, 71(1):89–129.
Bertsimas et al., (2018)
Bertsimas, D., Gupta, V., and Kallus, N. (2018).
Data-driven robust optimization.
Math. Program., 167(2):235–292.
Blanchet et al., (2019)
Blanchet, J., Kang, Y., and Murthy, K. (2019).
Robust wasserstein profile inference and applications to machine learning.
Journal of Applied Probability, 56(3):830–857.
Blanchet et al., (2023)
Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023).
Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage.
Advances in Neural Information Processing Systems, 36.
Botvinick et al., (2019)
Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019).
Reinforcement learning, fast and slow.
Trends in cognitive sciences, 23(5):408–422.
Bruns-Smith and Zhou, (2023)
Bruns-Smith, D. and Zhou, A. (2023).
Robust fitted-q-evaluation and iteration under sequentially exogenous unobserved confounders.
arXiv preprint arXiv:2302.00662.
Chen and Jiang, (2019)
Chen, J. and Jiang, N. (2019).
Information-theoretic considerations in batch reinforcement learning.
In International Conference on Machine Learning, pages 1042–1051.
Chen et al., (1996)
Chen, J., Patton, R. J., and Zhang, H.-Y. (1996).
Design of unknown input observers and robust fault detection filters.
International Journal of control, 63(1):85–105.
Chen et al., (2020)
Chen, R., Paschalidis, I. C., et al. (2020).
Distributionally robust learning.
Foundations and Trends® in Optimization, 4(1-2):1–243.
Chen et al., (2022)
Chen, Z., Khodadadian, S., and Maguluri, S. T. (2022).
Finite-sample analysis of off-policy natural actor–critic with linear function approximation.
IEEE Control Systems Letters, 6:2611–2616.
Corporation, (2021)
Corporation, N. (2021).
Closing the sim2real gap with nvidia isaac sim and nvidia isaac replicator.
Csiszár, (1967)
Csiszár, I. (1967).
Information-type measures of difference of probability distributions and indirect observation.
Studia Scientiarum Mathematicarum Hungarica, 2:229–318.
Du et al., (2021)
Du, S., Kakade, S., Lee, J., Lovett, S., Mahajan, G., Sun, W., and Wang, R. (2021).
Bilinear classes: A structural framework for provable generalization in rl.
In International Conference on Machine Learning, pages 2826–2836.
Duchi and Namkoong, (2018)
Duchi, J. and Namkoong, H. (2018).
Learning models with uniform performance via distributionally robust optimization.
arXiv preprint arXiv:1810.08750.
Farahmand et al., (2010)
Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010).
Error propagation for approximate policy and value iteration.
Advances in Neural Information Processing Systems, 23.
Fawzi et al., (2022)
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. (2022).
Discovering faster matrix multiplication algorithms with reinforcement learning.
Nature, 610(7930):47–53.
Foster et al., (2022)
Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. (2022).
Offline reinforcement learning: Fundamental barriers for value function approximation.
arXiv preprint arXiv:2111.10919.
Fujimoto and Gu, (2021)
Fujimoto, S. and Gu, S. S. (2021).
A minimalist approach to offline reinforcement learning.
Advances in neural information processing systems, 34:20132–20145.
Fujimoto et al., (2019)
Fujimoto, S., Meger, D., and Precup, D. (2019).
Off-policy deep reinforcement learning without exploration.
In International Conference on Machine Learning, pages 2052–2062.
Gao and Kleywegt, (2022)
Gao, R. and Kleywegt, A. (2022).
Distributionally robust stochastic optimization with wasserstein distance.
Mathematics of Operations Research.
Huang et al., (2023)
Huang, A., Chen, J., and Jiang, N. (2023).
Reinforcement learning in low-rank mdps with density features.
In International Conference on Machine Learning, pages 13710–13752.
Iyengar, (2005)
Iyengar, G. N. (2005).
Robust dynamic programming.
Mathematics of Operations Research, 30(2):257–280.
Jin et al., (2021a)
Jin, C., Liu, Q., and Miryoosefi, S. (2021a).
Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms.
Advances in neural information processing systems, 34:13406–13418.
Jin et al., (2021b)
Jin, J., Zhang, B., Wang, H., and Wang, L. (2021b).
Non-convex distributionally robust optimization: Non-asymptotic analysis.
Advances in Neural Information Processing Systems, 34:2771–2782.
Kostrikov et al., (2021)
Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. (2021).
Offline reinforcement learning with fisher divergence critic regularization.
In International Conference on Machine Learning, pages 5774–5783.
Kumar et al., (2019)
Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019).
Stabilizing off-policy q-learning via bootstrapping error reduction.
In Advances in Neural Information Processing Systems, pages 11784–11794.
Kumar et al., (2020)
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020).
Conservative q-learning for offline reinforcement learning.
Advances in Neural Information Processing Systems, 33:1179–1191.
Lange et al., (2012)
Lange, S., Gabel, T., and Riedmiller, M. (2012).
Batch reinforcement learning.
In Reinforcement learning, pages 45–73. Springer.
Lattimore and Szepesvári, (2020)
Lattimore, T. and Szepesvári, C. (2020).
Bandit algorithms.
Cambridge University Press.
Lesort et al., (2020)
Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., and Díaz-Rodríguez, N. (2020).
Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges.
Information fusion, 58:52–68.
Levine et al., (2020)
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020).
Offline reinforcement learning: Tutorial, review, and perspectives on open problems.
arXiv preprint arXiv:2005.01643.
Levy et al., (2020)
Levy, D., Carmon, Y., Duchi, J. C., and Sidford, A. (2020).
Large-scale methods for distributionally robust optimization.
Advances in Neural Information Processing Systems, 33:8847–8860.
Liang et al., (2023)
Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023).
Single-trajectory distributionally robust reinforcement learning.
arXiv preprint arXiv:2301.11721.
Liu et al., (2020)
Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020).
Provably good batch off-policy reinforcement learning without great exploration.
In Neural Information Processing Systems.
Liu et al., (2022)
Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022).
Distributionally robust Q-learning.
In International Conference on Machine Learning, pages 13623–13643.
Mankowitz et al., (2020)
Mankowitz, D. J., Levine, N., Jeong, R., Abdolmaleki, A., Springenberg, J. T., Shi, Y., Kay, J., Hester, T., Mann, T., and Riedmiller, M. (2020).
Robust reinforcement learning for continuous control with model misspecification.
In International Conference on Learning Representations.
Mannor et al., (2016)
Mannor, S., Mebel, O., and Xu, H. (2016).
Robust mdps with k-rectangular uncertainty.
Mathematics of Operations Research, 41(4):1484–1509.
Maraun, (2016)
Maraun, D. (2016).
Bias correcting climate change simulations-a critical review.
Current Climate Change Reports, 2:211–220.
Mirhoseini et al., (2021)
Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W., Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Nazi, A., et al. (2021).
A graph placement methodology for fast chip design.
Nature, 594(7862):207–212.
Munos, (2003)
Munos, R. (2003).
Error bounds for approximate policy iteration.
In ICML, volume 3, pages 560–567.
Munos, (2007)
Munos, R. (2007).
Performance bounds in l_p-norm for approximate value iteration.
SIAM journal on control and optimization, 46(2):541–561.
Munos and Szepesvári, (2008)
Munos, R. and Szepesvári, C. (2008).
Finite-time bounds for fitted value iteration.
Journal of Machine Learning Research, 9(27):815–857.
Namkoong and Duchi, (2016)
Namkoong, H. and Duchi, J. C. (2016).
Stochastic gradient methods for distributionally robust optimization with f-divergences.
Advances in neural information processing systems, 29.
Nilim and El Ghaoui, (2005)
Nilim, A. and El Ghaoui, L. (2005).
Robust control of Markov decision processes with uncertain transition matrices.
Operations Research, 53(5):780–798.
Panaganti, (2023)
Panaganti, K. (2023).
Robust Reinforcement Learning: Theory and Algorithms.
PhD thesis, Texas A&M University.
Panaganti and Kalathil, (2021a)
Panaganti, K. and Kalathil, D. (2021a).
Robust reinforcement learning using least squares policy iteration with provable performance guarantees.
In International Conference on Machine Learning (ICML), pages 511–520.
Panaganti and Kalathil, (2021b)
Panaganti, K. and Kalathil, D. (2021b).
Sample complexity of model-based robust reinforcement learning.
In 2021 60th IEEE Conference on Decision and Control (CDC), pages 2240–2245.
Panaganti and Kalathil, (2022)
Panaganti, K. and Kalathil, D. (2022).
Sample complexity of robust reinforcement learning with a generative model.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 9582–9602.
Panaganti et al., (2022)
Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2022).
Robust reinforcement learning using offline data.
Advances in Neural Information Processing Systems (NeurIPS).
Panaganti et al., (2023a)
Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023a).
Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage.
arXiv preprint arXiv:2310.18434.
Panaganti et al., (2023b)
Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. (2023b).
Distributionally robust behavioral cloning for robust imitation learning.
In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 1342–1347.
Pioch et al., (2009)
Pioch, N. J., Melhuish, J., Seidel, A., Santos Jr, E., Li, D., and Gorniak, M. (2009).
Adversarial intent modeling using embedded simulation and temporal bayesian knowledge bases.
In Modeling and Simulation for Military Operations IV, volume 7348, pages 115–126.
Robey et al., (2020)
Robey, A., Hassani, H., and Pappas, G. J. (2020).
Model-based robust deep learning: Generalizing to natural, out-of-distribution data.
arXiv preprint arXiv:2005.10247.
Rockafellar and Wets, (2009)
Rockafellar, R. T. and Wets, R. J.-B. (2009).
Variational analysis, volume 317.
Springer Science & Business Media.
Russel and Petrik, (2019)
Russel, R. H. and Petrik, M. (2019).
Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps.
Advances in Neural Information Processing Systems.
Scherrer et al., (2015)
Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015).
Approximate modified policy iteration and its application to the game of tetris.
J. Mach. Learn. Res., 16(49):1629–1676.
Schmidt et al., (2015)
Schmidt, T., Hertkorn, K., Newcombe, R., Marton, Z., Suppa, M., and Fox, D. (2015).
Depth-based tracking with physical constraints for robot manipulation.
In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 119–126.
Schulman et al., (2015)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015).
Trust region policy optimization.
In International conference on machine learning, pages 1889–1897.
Schulman et al., (2017)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017).
Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347.
Shah et al., (2018)
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018).
Airsim: High-fidelity visual and physical simulation for autonomous vehicles.
In Field and Service Robotics: Results of the 11th International Conference, pages 621–635. Springer.
Shalev-Shwartz and Ben-David, (2014)
Shalev-Shwartz, S. and Ben-David, S. (2014).
Understanding machine learning: From theory to algorithms.
Cambridge university press.
Shapiro, (2017)
Shapiro, A. (2017).
Distributionally robust stochastic programming.
SIAM Journal on Optimization, 27(4):2258–2275.
Shi and Chi, (2022)
Shi, L. and Chi, Y. (2022).
Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.
arXiv preprint arXiv:2208.05767.
Shi et al., (2023)
Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M., and Chi, Y. (2023).
The curious price of distributional robustness in reinforcement learning with a generative model.
Advances in Neural Information Processing Systems, 36.
Silver et al., (2018)
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018).
A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.
Science, 362(6419):1140–1144.
Sinha et al., (2017)
Sinha, A., Namkoong, H., and Duchi, J. C. (2017).
Certifiable distributional robustness with principled adversarial training.
arXiv preprint arXiv:1710.10571.
Song et al., (2023)
Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. (2023).
Hybrid rl: Using both offline and online data can make rl efficient.
In The Eleventh International Conference on Learning Representations.
Sünderhauf et al., (2018)
Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., et al. (2018).
The limits and potentials of deep learning for robotics.
The International journal of robotics research, 37(4-5):405–420.
Szepesvári and Munos, (2005)
Szepesvári, C. and Munos, R. (2005).
Finite time bounds for sampling based fitted value iteration.
In Proceedings of the 22nd international conference on Machine learning, pages 880–887.
Van Erven et al., (2015)
Van Erven, T., Grunwald, P., Mehta, N. A., Reid, M., Williamson, R., et al. (2015).
Fast rates in statistical and online learning.
JMLR.
Vershynin, (2018)
Vershynin, R. (2018).
High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47.
Cambridge University press.
Wang et al., (2021)
Wang, R., Foster, D., and Kakade, S. M. (2021).
What are the statistical limits of offline RL with linear function approximation?
In International Conference on Learning Representations.
Wang et al., (2023a)
Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023a).
A finite sample complexity bound for distributionally robust q-learning.
In International Conference on Artificial Intelligence and Statistics, pages 3370–3398.
Wang et al., (2023b)
Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b).
Sample complexity of variance-reduced distributionally robust q-learning.
arXiv preprint arXiv:2305.18420.
Wang et al., (2023c)
Wang, Y., Hu, Y., Xiong, J., and Zou, S. (2023c).
Achieving minimax optimal sample complexity of offline reinforcement learning: A dro-based approach.
arXiv preprint arXiv:2305.13289v2.
Wang and Zou, (2021)
Wang, Y. and Zou, S. (2021).
Online robust reinforcement learning with model uncertainty.
Advances in Neural Information Processing Systems, 34:7193–7206.
Wang and Zou, (2022)
Wang, Y. and Zou, S. (2022).
Policy gradient method for robust reinforcement learning.
In International Conference on Machine Learning, pages 23484–23526.
Wiesemann et al., (2013)
Wiesemann, W., Kuhn, D., and Rustem, B. (2013).
Robust Markov decision processes.
Mathematics of Operations Research, 38(1):153–183.
Xie et al., (2021)
Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021).
Bellman-consistent pessimism for offline reinforcement learning.
Advances in neural information processing systems, 34.
Xu and Mannor, (2010)
Xu, H. and Mannor, S. (2010).
Distributionally robust Markov decision processes.
In Advances in Neural Information Processing Systems, pages 2505–2513.
Xu∗ et al., (2023)
Xu∗, Z., Panaganti∗, K., and Kalathil, D. (2023).
Improved sample complexity bounds for distributionally robust reinforcement learning.
In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
Yang et al., (2021)
Yang, J., Zhou, K., Li, Y., and Liu, Z. (2021).
Generalized out-of-distribution detection: A survey.
arXiv preprint arXiv:2110.11334.
Yang et al., (2023)
Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023).
Avoiding model estimation in robust markov decision processes with a generative model.
arXiv preprint arXiv:2302.01248.
Yu and Xu, (2015)
Yu, P. and Xu, H. (2015).
Distributionally robust counterpart in Markov decision processes.
IEEE Transactions on Automatic Control, 61(9):2538–2543.
Zhang et al., (2023)
Zhang, R., Hu, Y., and Li, N. (2023).
Regularized robust mdps and risk-sensitive mdps: Equivalence, policy gradient, and sample complexity.
arXiv preprint arXiv:2306.11626.
Zhou et al., (2023)
Zhou, R., Liu, T., Cheng, M., Kalathil, D., Kumar, P., and Tian, C. (2023).
Natural actor-critic for robust reinforcement learning with function approximation.
In Thirty-seventh Conference on Neural Information Processing Systems.
☕ ☕ Supplementary Materials ☕ ☕
Appendix A Related Works ☕
Offline RL:
Offline RL tackles the problem of learning an optimal policy using a minimal amount of offline/historical data collected according to a behavior policy (Lange et al.,, 2012; Levine et al.,, 2020). Because of limited offline data quality and the lack of access to simulators or world models for exploration, the offline RL problem suffers from the out-of-distribution challenge (Robey et al.,, 2020; Yang et al.,, 2021).
Many works (Fujimoto et al.,, 2019; Kumar et al.,, 2019, 2020; Fujimoto and Gu,, 2021; Kostrikov et al.,, 2021) have introduced deep offline RL algorithms aimed at alleviating the out-of-distribution issue by some variants of trust-region optimization (Schulman et al.,, 2015, 2017). The earliest and most promising theoretical investigations into model-free offline RL methodologies relied on the assumption of uniformly bounded concentrability such as the approximate modified policy iteration (AMPI) algorithm (Scherrer et al.,, 2015) and fitted Q-iteration (FQI) (Munos and Szepesvári,, 2008) algorithm. This assumption mandates that the ratio of the state-action occupancy distribution induced by any policy to the data generating distribution remains uniformly bounded across all states and actions (Munos,, 2007; Antos et al.,, 2008; Munos and Szepesvári,, 2008; Farahmand et al.,, 2010; Chen and Jiang,, 2019).
This makes offline RL particularly challenging (Foster et al.,, 2022) and there have been efforts to understand the limits of this setting.
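The uniformly bounded concentrability condition described above can be illustrated on a small tabular problem. The following is a minimal sketch, assuming a randomly generated nominal model, a fixed deterministic policy, and a uniform data-generating distribution; the function names `discounted_occupancy` and `concentrability` are hypothetical and do not come from the cited works. It computes the largest ratio between a policy's discounted state-action occupancy and the data distribution for one policy only, whereas the assumption requires the bound to hold for every policy.

```python
import numpy as np

def discounted_occupancy(P, pi, rho, gamma, horizon=2000):
    """Discounted state-action occupancy of a stationary policy.

    P: (S, A, S) transition tensor, pi: (S, A) policy, rho: (S,) initial distribution.
    The infinite sum is truncated at `horizon` steps.
    """
    S, A, _ = P.shape
    state_dist = rho.copy()
    d_state = np.zeros(S)
    for t in range(horizon):
        d_state += (1.0 - gamma) * gamma**t * state_dist
        # Next-state distribution: sum over (s, a) of state_dist[s] * pi[s, a] * P[s, a, s'].
        state_dist = np.einsum("s,sa,sat->t", state_dist, pi, P)
    return d_state[:, None] * pi

def concentrability(d_pi, mu):
    """Largest ratio d_pi(s, a) / mu(s, a) over the support of d_pi."""
    support = d_pi > 0
    if np.any(mu[support] == 0):
        return np.inf  # the data distribution misses a visited state-action pair
    return float(np.max(d_pi[support] / mu[support]))

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # illustrative nominal model, shape (S, A, S)
rho = np.ones(S) / S
pi = np.zeros((S, A))
pi[:, 0] = 1.0                              # deterministic policy: always action 0
mu = np.ones((S, A)) / (S * A)              # uniform data-generating distribution

d_pi = discounted_occupancy(P, pi, rho, gamma)
C = concentrability(d_pi, mu)
print(f"concentrability coefficient for this policy: {C:.3f}")
```

With a uniform data distribution the ratio is at most the number of state-action pairs; it becomes infinite as soon as the data distribution assigns zero mass to a pair that the policy visits.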
Robust RL:
The robust Markov decision process framework (Nilim and El Ghaoui,, 2005; Iyengar,, 2005) tackles the challenge of formulating a policy resilient to model discrepancies between training and testing environments. The robust reinforcement learning problem pursues this objective in the data-driven domain. Deploying simplistic RL policies (Corporation,, 2021) can lead to catastrophic outcomes when faced with evident disparities in models.
The optimization techniques and analyses in robust RL draw inspiration from the distributionally robust optimization (DRO) toolkit in supervised learning (Duchi and Namkoong,, 2018; Shapiro,, 2017; Gao and Kleywegt,, 2022; Bertsimas et al.,, 2018; Namkoong and Duchi,, 2016; Blanchet et al.,, 2019).
Many heuristic works (Xu and Mannor,, 2010; Wiesemann et al.,, 2013; Yu and Xu,, 2015; Mannor et al.,, 2016; Russel and Petrik,, 2019) show that robust RL is valuable in scenarios where the simulator model differs from the real-world model.
Many recent works address fundamental issues of RMDPs, giving a concrete theoretical understanding in terms of sample complexity (Panaganti and Kalathil, 2021b; Panaganti and Kalathil,, 2022; Xu∗ et al.,, 2023; Shi and Chi,, 2022; Shi et al.,, 2023). Many works (Panaganti and Kalathil, 2021a; Wang and Zou,, 2021; Panaganti and Kalathil,, 2022) devise model-free online and offline robust RL algorithms employing general function approximation to handle potentially infinite state spaces. Recent work (Panaganti et al., 2023b) introduces distributional robustness in the imitation learning setting.
There have been works (Panaganti,, 2023; Panaganti et al., 2023a; Wang et al., 2023c) connecting robust RL with offline RL by linking notions of robustness and pessimism.
Appendix B Useful Technical Results ☕☕
We state the following result from the penalized distributionally robust optimization literature (Levy et al.,, 2020).
Fix any . Let be independent and identically distributed random variables with finite second moment. Assume that , for all . Then we have with probability at least :
We now state a useful concentration inequality when the samples are not necessarily i.i.d. but adapted to a stochastic process.
Lemma 3(Freedman’s Inequality (Song et al.,, 2023, Lemma 14)).
Let be a sequence of -bounded real valued random variables where from some stochastic process that depends on the history . Then, for any and , we have with probability at least :
We now state a result for the generalization bounds on empirical risk minimization (ERM) (Shalev-Shwartz and Ben-David,, 2014).
Lemma 4(ERM Generalization Bound (Panaganti et al.,, 2022, Lemma 3)).
Let be the data generating distribution on the space and let be a given hypothesis class of functions. Assume that for all and for loss function we have that for some positive constant and is -Lipschitz in . Given a dataset , generated independently from , denote as the ERM solution, i.e. . Furthermore, let be a finite hypothesis class, i.e. , with for all and .
For any fixed and , we have
with probability at least .
We now state a result from variational analysis literature (Rockafellar and Wets,, 2009) that is useful to relate minimization of integrals and the integrals of pointwise minimization under decomposable spaces.
Remark 4.
A few examples of decomposable spaces are , for any , and , the space of all -measurable functions.
Lemma 5(Rockafellar and Wets,, 2009, Theorem 14.60).
Let be a space of measurable functions from to that is decomposable relative to a -finite measure on the -algebra . Let (finite-valued) be a normal integrand. Then, we have
Moreover, as long as the above infimum is finite, we have that
if and only if for -almost everywhere.
Now we state a few results that will be useful for the analysis of our finite-horizon results in this work.
The following result (Song et al.,, 2023, Lemma 6) is useful under the use of bilinear model approximation. This result follows from the elliptical potential lemma (Lattimore and Szepesvári,, 2020, Lemma 19.4) for deterministic vectors.
Lemma 6(Elliptical Potential Lemma).
Let be a sequence of vectors with for all and fix . Define for . Then, the following holds:
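As a numerical illustration, the sketch below checks the elliptical potential bound in its standard form (Lattimore and Szepesvári,, 2020, Lemma 19.4): the sum of the truncated squared potentials is at most twice the log-determinant ratio, which is in turn at most $2d\log(1 + TB^2/(d\lambda))$. The dimension, horizon, regularizer, and norm bound are illustrative assumptions, and the form of the lemma used in this paper (Song et al.,, 2023, Lemma 6) may differ from the one checked here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, B = 4, 500, 1.0, 1.0   # illustrative dimension, horizon, regularizer, norm bound

Sigma = lam * np.eye(d)           # Sigma_0 = lam * I
potential_sum = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x *= B * rng.uniform() / np.linalg.norm(x)   # rescale so that ||x|| <= B
    # Truncated squared potential min(1, x^T Sigma_{t-1}^{-1} x), using the previous matrix.
    potential_sum += min(1.0, float(x @ np.linalg.solve(Sigma, x)))
    Sigma += np.outer(x, x)                      # Sigma_t = Sigma_{t-1} + x x^T

log_det_ratio = np.linalg.slogdet(Sigma)[1] - d * np.log(lam)
bound = 2 * d * np.log(1 + T * B**2 / (d * lam))

print(f"sum of potentials   : {potential_sum:.3f}")
print(f"2 * log-det ratio   : {2 * log_det_ratio:.3f}")
print(f"2 d log(1 + TB^2/dλ): {bound:.3f}")
```

The potential sum grows only logarithmically in the number of vectors, which is what keeps the online part of the regret analysis under control.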
We now state a result for the generalization bounds on the least-squares regression problem when the data are not necessarily i.i.d. but adapted to a stochastic process. We refer to Van Erven et al., (2015) for more statistical and online learning generalization bounds for a wider class of loss functions.
Let , , and let be an input space and be a target space . Let be a given real-valued hypothesis class of functions with . Given a dataset , denote as the least square solution, i.e. .
The dataset is generated as from some stochastic process that depends on the history , and is sampled via the conditional probability as
where the function satisfies approximate realizability i.e. and are independent random variables such that . Suppose it also holds and . Then, the least square solution satisfies with probability at least :
Appendix C Useful Foundational Results ☕☕☕
We provide the following result highlighting the necessary characteristics for specific examples of the Fenchel conjugate functions .
Proposition 3(-Divergence Bounds).
Let be any value function and fix a probability distribution . Define . Consider the following scalar convex optimization problem: . Let the maximum absolute value in be less than or equal to , let for all , and let be -Lipschitz in ; hold for some positive constants .
We have the following results for different forms of :
(i) Let Assumption 8 hold. For TV distance i.e. , we have , hence , , and .
(ii) For chi-square divergence i.e. , we have , hence , , and .
(iii) For KL divergence i.e. , we have , hence , , and .
(iv) Fix . For -CVaR i.e. , we have , hence , , and .
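For the KL case, the regularized inner problem has a well-known closed form (the Gibbs variational principle): $\inf_{P}\{\mathbb{E}_{P}[v] + \lambda\,\mathrm{KL}(P\,\|\,P^o)\} = -\lambda \log \mathbb{E}_{P^o}[\exp(-v/\lambda)]$. The following minimal sketch checks this identity numerically on a finite support. The symbols $v$, $P^o$, and $\lambda$ are placeholders for a value vector, a nominal distribution, and the regularization weight, and the function names are hypothetical; the sketch only illustrates the closed form and does not reproduce the scalar dual problem or the constants of this proposition.

```python
import numpy as np

def kl_regularized_value(v, p0, lam):
    """Closed form of inf_P { E_P[v] + lam * KL(P || p0) } = -lam * log E_{p0}[exp(-v / lam)]."""
    z = -v / lam
    zmax = z.max()  # shift for numerical stability
    return -lam * (zmax + np.log(np.sum(p0 * np.exp(z - zmax))))

def kl_objective(p, v, p0, lam):
    """Primal objective E_p[v] + lam * KL(p || p0) for a candidate distribution p."""
    mask = p > 0
    return float(p @ v + lam * np.sum(p[mask] * np.log(p[mask] / p0[mask])))

rng = np.random.default_rng(1)
n, lam = 6, 2.0
v = rng.uniform(0, 10, size=n)     # illustrative value vector
p0 = rng.dirichlet(np.ones(n))     # illustrative nominal distribution with full support

closed_form = kl_regularized_value(v, p0, lam)

# The minimizer is the exponentially tilted (Gibbs) distribution p*(s) ∝ p0(s) exp(-v(s) / lam).
p_star = p0 * np.exp(-v / lam)
p_star /= p_star.sum()
attained = kl_objective(p_star, v, p0, lam)

# No randomly drawn distribution should do better than the closed form.
random_objs = [kl_objective(rng.dirichlet(np.ones(n)), v, p0, lam) for _ in range(1000)]

print(f"closed form: {closed_form:.6f}, attained at p*: {attained:.6f}")
print(f"best of 1000 random candidates: {min(random_objs):.6f}")
```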
Proof.
We first prove the statement for TV distance with . From -divergence literature (Xu∗ et al.,, 2023), we know
Thus, we have
(17)
where follows by definition of , by Assumption8, by the fact for any , and follows by making the substitution . Finally, for , notice that since holds when . So is achieved at .
We immediately have since . Since and , we further get . For , from the fact we have . This proves statement .
We now prove the statement for chi-square divergence with following similar steps as before. From -divergence literature (Xu∗ et al.,, 2023), we know
Thus, we have
where follows by making the substitution . Finally, for , observe that the function is convex in the dual variable and since it is a Lagrangian dual variable. Since where is any solution of . When , notice that since .
We immediately have since . Since and , we further get . For , from the facts and , we have . This proves statement .
We now prove the statement for KL divergence with following similar steps as before. From -divergence literature (Xu∗ et al.,, 2023), we know
Thus, we have
where follows by making the substitution . Finally, for , observe that the function is convex in the dual variable since it is a Lagrangian dual variable. From calculus, the optimal . So since .
We immediately have since . Since and , we further get . For , from the fact is -Lipschitz for , we have . This proves statement .
We now prove the statement for -CVaR with . From -divergence literature (Levy et al.,, 2020), we know
Thus, we have
(18)
For , notice that since holds when . Also, since holds when .
We immediately have . We further get . For , from the fact we have .
This proves the final statement of this result.
∎
We now state and prove a generalization bound for empirical risk minimization when the data are not necessarily i.i.d. but adapted to a stochastic process. This result may be of independent interest for machine learning problems beyond the scope of this paper.
Furthermore, this result showcases better rate dependence on , from to , than the classical result Lemma 4 (Shalev-Shwartz and Ben-David,, 2014).
This result is not surprising: in the i.i.d. setting, Van Erven et al., (2015, Theorems 7.6 & 5.4) establish such fast rates for empirical risk minimization with bounded losses and beyond.
Proposition 4(Online ERM Generalization Bound).
Let , , let be an input space, and let be the target functional space. Let be the given finite class of functions. Assume that for all and for loss function we have that for some positive constant . Given a dataset , denote as the ERM solution, i.e. .
The dataset is generated as from some stochastic process that depends on the history , where the function satisfies approximate realizability i.e.
and for all , .
Then, the ERM solution satisfies
with probability at least .
Proof.
We adapt the proof of least-squares generalization bound (Song et al.,, 2023, Lemma 3) here for the empirical risk minimization generalization bound under online data collection.
Fix any function . We define the random variable Immediately, we note for all .
By definition of , we have a non-negative first moment of :
(19)
By symmetrization, assuming , we have that
Similarly assuming , we get . Thus, uniformly, we have
(20)
We remark that (20) is called Bernstein condition (Van Erven et al.,, 2015, Definition 5.1) when all sampling distributions ’s are identical. This is one of the sufficient conditions on the loss functions to get -generalization bounds for empirical risk minimization.
with probability at least , where the last inequality uses (19) and (20). Setting in the above, we get, for any , with probability at least :
by union bound over . Using (19), we rearrange the above to get:
(21)
and
(22)
Define the function , which is independent of the dataset . By (21) for and the approximate realizability assumption, we get
By definitions of and the ERM function , we have that
From the above two relations, we get
Now, using this and using (22) for the function , we get
which holds with probability at least . This completes the proof.
∎
We now state a useful result for an infinite-horizon discounted robust -regularized Markov decision process . This result lets the RPQ algorithm restrict its policy search space to the class of deterministic Markov policies.
Proposition 5.
The robust regularized Bellman operator and the value function operator are both -contraction operators w.r.t. the sup-norm. Moreover, their respective unique fixed points and , for optimal policy , achieve the optimal robust value . Furthermore, the robust regularized optimal policy is a deterministic Markov policy satisfying .
Proof.
The -contraction property of both operators directly follows from the fact . Furthermore, this result is a direct corollary of (Yang et al.,, 2023, Proposition 3.1) and (Iyengar,, 2005, Corollary 3.1).
∎
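The contraction and fixed-point claims can be checked numerically on a small tabular example. The following is a minimal sketch, assuming KL regularization (whose inner problem has the closed form $-\lambda \log \mathbb{E}_{P^o}[\exp(-V/\lambda)]$) and a randomly generated nominal model; the function names `soft_min` and `robust_bellman` and all numerical values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_min(v, p0, lam):
    """KL-regularized worst case: inf_P { E_P[v] + lam * KL(P || p0) }."""
    z = -v / lam
    zmax = z.max()
    return -lam * (zmax + np.log(np.sum(p0 * np.exp(z - zmax))))

def robust_bellman(Q, r, P0, gamma, lam):
    """One application of a tabular robust KL-regularized Bellman operator."""
    S, A = r.shape
    V = Q.max(axis=1)  # greedy value of the next state
    TQ = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            TQ[s, a] = r[s, a] + gamma * soft_min(V, P0[s, a], lam)
    return TQ

rng = np.random.default_rng(2)
S, A, gamma, lam = 5, 3, 0.9, 1.0
r = rng.uniform(size=(S, A))                  # rewards in [0, 1]
P0 = rng.dirichlet(np.ones(S), size=(S, A))   # nominal model, shape (S, A, S)

# Contraction check on a random pair of Q-functions.
Q1 = rng.uniform(0, 10, size=(S, A))
Q2 = rng.uniform(0, 10, size=(S, A))
lhs = np.abs(robust_bellman(Q1, r, P0, gamma, lam) - robust_bellman(Q2, r, P0, gamma, lam)).max()
rhs = gamma * np.abs(Q1 - Q2).max()

# Fixed-point iteration and the resulting deterministic greedy policy.
Q = np.zeros((S, A))
for _ in range(400):
    Q = robust_bellman(Q, r, P0, gamma, lam)
residual = np.abs(robust_bellman(Q, r, P0, gamma, lam) - Q).max()
greedy_policy = Q.argmax(axis=1)

print(f"||TQ1 - TQ2|| = {lhs:.4f} <= gamma * ||Q1 - Q2|| = {rhs:.4f}")
print(f"fixed-point residual: {residual:.2e}, greedy policy: {greedy_policy}")
```

The contraction holds because the regularized worst-case expectation is monotone and shifts by a constant when a constant is added to its argument, so it is non-expansive in the sup-norm.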
We now state a similar result for a finite-horizon discounted robust -regularized Markov decision process .
This result lets the HyTQ algorithm restrict its policy search space to the class of non-stationary deterministic Markov policies.
Proposition 6.
The robust regularized Bellman operator (10) and the value function operator are as follows:
The optimal robust value satisfies the following robust dynamic programming procedure: Starting with , doing backward iteration of , i.e., , we get for all .
Furthermore, the robust regularized optimal policy is a non-stationary deterministic Markov policy satisfying for all where
Moreover, as , it suffices to backward iterate , i.e., do to get for all .
Proof.
We start with the optimal robust value definition . The value function claims in this statement are direct consequences of (Iyengar,, 2005, Theorem 2.1 & 2.2) and (Zhang et al.,, 2023, Theorem 2) with the reward function .
It remains to prove dynamic programming with . That is, we establish for all with the dynamic programming of .
We use induction to prove this. The base case is trivially true since . By , we have
where the last equality follows by the induction hypothesis . Maximizing both sides over the action and using the dynamic program , we get . This completes the proof of this result.
∎
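The backward iteration in this proposition can be sketched for a tabular problem. The sketch below is a minimal illustration under stated assumptions: total variation is taken as $\mathrm{TV}(P, P^o) = \tfrac{1}{2}\|P - P^o\|_1$, the adversary may place mass on any state, the horizon is undiscounted, and the rewards and nominal model are randomly generated. Under these assumptions the inner problem has the elementary closed form $\mathbb{E}_{P^o}[\min(v, \min_s v(s) + \lambda)]$, since moving a unit of mass to the lowest-value state costs $\lambda$. The function names are hypothetical.

```python
import numpy as np

def tv_regularized_value(v, p0, lam):
    """inf_P { E_P[v] + lam * TV(P, p0) } with TV = 0.5 * ||P - p0||_1.

    Mass at a state is either kept (cost v) or moved to the lowest-value state (cost min v + lam).
    """
    return float(p0 @ np.minimum(v, v.min() + lam))

def robust_backward_induction(r, P0, lam):
    """Backward iteration of the robust regularized Q-function over a finite horizon.

    r: (H, S, A) rewards, P0: (H, S, A, S) nominal transitions.
    """
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # terminal value V[H] = 0
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                Q[h, s, a] = r[h, s, a] + tv_regularized_value(V[h + 1], P0[h, s, a], lam)
        V[h] = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)  # non-stationary deterministic Markov policy, shape (H, S)
    return Q, V, policy

rng = np.random.default_rng(3)
H, S, A = 4, 5, 2
r = rng.uniform(size=(H, S, A))
P0 = rng.dirichlet(np.ones(S), size=(H, S, A))

Q_rob, V_rob, pi_rob = robust_backward_induction(r, P0, lam=0.5)
# With lam at least the value range, moving mass never pays off and the nominal values are recovered.
Q_nom, V_nom, pi_nom = robust_backward_induction(r, P0, lam=float(H))

print("robust values at h=0 :", np.round(V_rob[0], 3))
print("nominal values at h=0:", np.round(V_nom[0], 3))
```

A smaller regularization weight gives the adversary cheaper deviations, so the robust values are never larger than the nominal ones.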
Appendix D Offline Robust -regularized RL Results ☕☕☕
In this section, we set whenever we use results from Proposition 3. In the following, we use constants from Proposition 3.
Since the conjugate function is continuous, define a continuous function in for each . We observe in is -measurable for each , where is a bounded real line.
This lemma now directly follows by similar arguments in the proof of Panaganti et al., (2022, Lemma 1).
∎
Now we state a result and provide its proof for the empirical risk minimization on the dual parameter.
Proposition 7(Dual Optimization Error Bound).
Let be the dual optimization parameter from Algorithm1 (Step 4) for the state-action value function and let be as defined in (7). With probability at least , we have
Proof.
We adapt the proof from Panaganti et al., (2022, Lemma 6).
We first fix . We will also invoke union bound for the supremum here. We recall from (8) that . From the robust Bellman equation, we directly obtain
follows since . follows from Proposition2. follows from the approximate dual realizability assumption (Assumption 3).
For , we consider the loss function (e.g., ) and dataset . Since and , we note that , where the value of depends on specific forms of as demonstrated in Proposition 3.
Furthermore, take to be -Lipschitz in and , since , for some positive constants and . Again, these constants depend on specific forms of as demonstrated in Proposition3.
With these insights, we can apply the empirical risk minimization result in Lemma 4 to get .
With union bound, with probability at least , we finally get
which concludes the proof.
∎
We next prove the least-squares generalization bound for the RPQ algorithm.
Let be the least-squares solution from Algorithm1 (Step 5) for the state-action value function and dual variable function . Let be as defined in (7). Then, with probability at least , we have
Proof.
We adapt the least-squares generalization bound given in Agarwal et al., (2019, Lemma A.11) to our setting. We recall from (9) that . We first fix functions and . For any function , we define random variables as
where , and with . It is straightforward to note that for a given , we have . We note the randomness of given and is from the dataset pairs .
Since and , from Proposition 3, we write both , where the value of depends on specific forms of .
Using this, we obtain the first moment and an upper-bound for the second moment of as follows:
where . This immediately implies that
From these calculations, it is also straightforward to see that almost surely.
Now, using Bernstein’s inequality (Lemma 2), together with a union bound over all , with probability at least , we have
(23)
for all .
This expression coincides with Panaganti et al., (2022, Eq.(15)). Thus, following the proof of Panaganti et al., (2022, Lemma 7), we finally get
(24)
We note a fact . Now, using union bound for and , with probability at least , we finally obtain
This completes the least-squares generalization bound analysis for the robust regularized Bellman updates.
∎
Let Assumptions 1, 2 and 3 hold. Let be the RPQ algorithm policy after iterations. Then, for any , with probability at least , we have
Proof.
We let for every . Since is the greedy policy w.r.t , we also have . We recall that and . We also recall from Section2 that is a fixed-point of the robust Bellman operator defined in (3). We also note that the same holds true for any stationary deterministic policy from Yang et al., (2023) that satisfies
We now adapt the proof of Panaganti et al., (2022, Theorem 1) using the RRBE in its primal form (3) directly instead of its dual form (4).
We first characterize the performance decomposition between and . We recall the initial state distribution . Since for any , we observe that
(25)
where follows from the fact that is the greedy policy with respect to , from the Bellman equations, and from the following definition
We note that this worst-case model distribution can be non-unique and we just pick one by an arbitrary deterministic rule. We emphasize that this model distribution is used only in the analysis and is not required by the algorithm.
Finally, follows with telescoping over by defining a state distribution , for all natural numbers , as
We note that such state distribution proof ideas are commonly used in the offline RL literature (Agarwal et al.,, 2019; Panaganti et al.,, 2022; Bruns-Smith and Zhou,, 2023; Zhang et al.,, 2023).
For (25), with the -norm notation i.e. for any , we have
(26)
where the state-action distributions are and .
We now analyze the above two terms treating either or as a state-action distribution satisfying Assumption 1. First, considering any satisfying we have
(27)
where follows by the concentrability assumption (Assumption 1), from Bellman equation, operator ,
follows, similarly as step , from the following definition
We again emphasize that this model distribution is analysis-specific and we just pick one by an arbitrary deterministic rule since it may not be unique. follows by the fact . Now, by replacing with in step and repeating the steps for any satisfying , we get
(28)
We immediately note that both and satisfy and , which follows by their definition and the facts , .
Define the state-action probability distribution as, for any ,
where follows since , and follows since is the dual variable function from the algorithm for the state-action value function and as the least squares solution from the algorithm for the state-action value function and dual variable function pair.
Now, using Lemma 7 and Lemma 8 to bound (29), and then combining it with (26), completes the proof of this theorem.
∎
D.2 Specialized Result for TV -divergence ☕☕☕
We now state and prove the improved (in terms of assumptions) result for TV -divergence.
Assumption 9(Concentrability).
There exists a finite constant such that for any for any policy (can be non-stationary as well), we have .
Assumption 10(Fail-state).
There is a fail state such that and , for all and satisfying for all .
Theorem 4.
Let Assumptions9, 2, 3 and 10 hold. Let be the RPQ algorithm policy after iterations. Then, for any , with probability at least , we have
with .
Proof.
We can now further use the dual form (4) under Assumption 10. We again start by characterizing the performance decomposition between and . This proof largely follows the proofs of Theorem 1 and Panaganti et al., (2022, Theorem 1).
In particular, we use the total variation RRBE in its dual form (4) under Assumption 10 in this proof. That is, for all and , from (17) we have
(30)
We recall the initial state distribution . Since for any , we begin with step in Theorem1:
(31)
where follows from (30) and the fact , follows from the facts and for any .
We make an important note here in step regarding the dependence on the nominal model distribution unlike in step in the proof of Theorem1. This important step helps us improve the concentrability assumption in further analysis.
Finally, follows with telescoping over by defining a new state distribution , for all natural numbers , as
For (31), with the -norm notation i.e. for any , we have
(32)
where the second inequality follows since both and satisfy Assumption 9. We now analyze the summand in (26):
where follows by Assumption 9, from Eq.30 and the fact ,
from the fact ,
follows by Jensen’s inequality and by the facts and , follows by defining the distribution as , and using the fact that .
The rest of the proof follows similarly as in the proof of Theorem1.
∎
Appendix E Hybrid Robust -regularized RL Results ☕☕☕☕
In this section, we set whenever we use results from Proposition 3. We remark that we have attempted to optimize the absolute constants inside factors of the performance guarantees. In the following, we use constants from Proposition 3.
Now we provide an extension of Proposition 7 using Proposition 4 when the data comes from adaptive sampling.
Fix . For , , let be the dual optimization function from Algorithm2 (Step 4) for the state-action value function using samples in the dataset .
Let be as defined in (14) and let . Then, with probability at least , we have
Proof.
Fix , , . The algorithm solves for in the empirical risk minimization step as:
where dataset with .
The first samples in are (recall that these are generated by the offline state-action distribution ), the next samples are (recall that these are generated by the state-action distribution ), and so on where the samples (recall that these are generated by the state-action distribution ) for all .
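The ordering of samples described above can be sketched as follows. This is a minimal illustration, assuming the hybrid dataset at iteration `k` is the offline samples followed by the online samples gathered in iterations `0` through `k-1`; the function name `hybrid_dataset`, the tuple layout, and the batch sizes are hypothetical and are not taken from Algorithm 2.

```python
def hybrid_dataset(offline, online_batches, k):
    """Offline samples first, then the online samples collected in iterations 0..k-1."""
    data = list(offline)
    for batch in online_batches[:k]:
        data.extend(batch)
    return data

# Hypothetical (state, action, next_state) tuples.
offline = [(0, 1, 2), (2, 0, 1), (1, 1, 0)]        # drawn i.i.d. from the offline distribution
online_batches = [[(0, 0, 1)], [(1, 0, 2)]]        # one batch per past online policy

data_k2 = hybrid_dataset(offline, online_batches, k=2)
print(len(data_k2), data_k2[:3] == offline)
```

Only the offline prefix is i.i.d.; each online batch depends on the policies computed from earlier data, which is why the analysis uses concentration results for adapted sequences.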
We first have the following from step (b) in the proof of Proposition7:
where follows by defining the corresponding true solutions for all .
For with the empirical risk minimization solution , we use Proposition4 by setting (with , constant dependent on and , from Proposition3) and
since with sizes and under the union bound.
Taking a union bound over , , and bounding each term separately, completes the proof.
∎
Now we provide an extension of Proposition 8 using Lemma 7 when the data comes from adaptive sampling.
Fix . For , , let be the least-squares solution from Algorithm2 (Step 5) for the state-action value function and dual variable function using samples in the dataset .
Let be as defined in (14) and let . Then, with probability at least , we have
Proof.
We adapt the proof of Song et al., (2023, Lemma 7) here.
Fix , , , and . The algorithm solves for in the least-squares regression step as:
where dataset with and
The first samples in are (recall that these are generated by the offline state-action distribution ), the next samples are (recall that these are generated by the state-action distribution ), and so on where the samples (recall that these are generated by the state-action distribution ) for all .
For using Lemma7, we first note for any sample in with and , there exists some by Assumption5 such that the following holds:
We also note for any sample in , (with , constant dependent on and , from Proposition3) and for all .
With these notes, applying Lemma7, we get that the least square regression solution satisfies
with probability at least , since and with sizes and under the union bound.
Recall the samples in are independently and identically drawn from the offline distribution , and the samples in are independently and identically drawn from the state-action distribution . Thus we can further write as
Taking a union bound over , , bounding each term separately, and using the fact , completes the proof.
∎
Let Assumptions4, 5, 6, 7 and 8 hold and fix any . Then, HyTQ algorithm policies satisfy
with probability at least .
Proof.
We let for every . Since is the greedy policy w.r.t , we also have . We recall that and . We also note that the same holds true for any stationary Markov policy from (Zhang et al.,, 2023) that satisfies We can now further use the dual form (4) under Assumption 8, that is, for all and ,
(33)
We first characterize the performance decomposition between and . We recall the initial state distribution . Since for any , we observe that
(34)
We rewrite the state-action distribution , dropping , as for simplicity. Letting also denote a state distribution (), we can write it as, for all ,
(35)
Analyzing one term in of (34) starting with the facts that is the greedy policy with respect to and function is non-decreasing in :
(36)
where follows by triangle inequality for operation, from Bellman equation, operator , and the fact ,
from the fact for any ,
follows by Jensen’s inequality and by definitions of policies and . Now, recursively applying this method for first term over horizon in (36) we get
(37)
where the last inequality holds since for all and .
Recall
Now, using (37) in of (34), the following holds with probability at least :
(38)
where follows from definition of in Assumption4, from triangle inequality and the fact , and follows from Propositions9 and 10.
For , firstly we note . So, following the same analysis as in , we get
(39)
where the last inequality follows by triangle inequality for operation.
Finally, follows by the fact , follows from Proposition10, and follows from Lemma6.
Now recall bilinear model from Assumption7: .
Following the analysis above in (41) for the second part of (40) using Assumption 7 and Proposition 9, the following holds with probability at least :
Finally, choosing higher order terms by setting and , we get
The proof is now complete.
∎
E.2 HyTQ Algorithm Specialized Results ☕☕☕
In this section we specialize our main result Theorem 2 for different bilinear model classes and also provide an equivalent sample complexity guarantee in the offline robust RL setting.
Before we move ahead, we showcase an important property of our robust transfer coefficient for any fixed policy. Fixing a nominal model , the transfer coefficient considers the distribution shift w.r.t. the data-generating distribution along the general function class which the algorithm uses. It is in fact smaller than the existing density-ratio-based concentrability coefficient (Assumption 9). We state this result in the following lemma.
The concentrability assumption (Assumption9) is in fact the same non-robust RL concentrability assumption (Munos and Szepesvári,, 2008; Chen and Jiang,, 2019).
We make two important points here. Firstly, our transfer coefficient is larger than the transfer coefficient (Song et al.,, 2023, Definition 1) using the fact . Secondly, our transfer coefficient is not directly comparable with the ℓ2-norm version of the transfer coefficient (Xie et al.,, 2021, Definition 1). It is an interesting open question for future research to investigate minimax lower-bound guarantees w.r.t. different transfer coefficients for both non-robust and robust RL problems.
We now define a bilinear model called Low Occupancy Complexity (Du et al.,, 2021, Definition 4.7).
The nominal model and realizable function class have low occupancy complexity w.r.t., for each , a (possibly unknown to the learner) feature map , where is a Hilbert space, and w.r.t. a (possibly unknown to the learner) map such that for all , with greedy policy w.r.t. , and we have
(43)
We make the following assumption on the offline data-generating distribution (or policy by slight notational override for convenience).
Assumption 11.
Consider the Low Occupancy Complexity model (bilinear model) on . Let the offline data distribution satisfy a low rank structure, i.e. , for some .
Now we extend our main result Theorem2 in this next result specializing to the Low Occupancy Complexity (43) bilinear model.
Corollary 3(Cumulative Suboptimality of Theorem2 in Low Occupancy Complexity (43) bilinear model).
Consider the Low Occupancy Complexity (43) bilinear model. Let Assumptions4, 5, 6 and 8 hold and fix any . Then, HyTQ algorithm policies satisfy
with probability at least . Now, consider the offline data distribution as in Assumption11 with perfect robust Bellman completeness, i.e. . We have
Proof.
Using the Low Occupancy Complexity (43) bilinear model, we have , where
We also have , where
Furthermore, we set . Since is realizable and is complete, we set . Then the result directly follows by Theorem2.
For the second statement, first note that the occupancy is low-rank as well since we assume perfect Bellman completeness.
Following the proof of Lemma8 we get
where follows from the Mediant inequality.
This completes the proof.
∎
We now define a bilinear model called Low-rank Feature Selection Model (Du et al.,, 2021, Definition A.1).
The nominal model is a low-rank feature selection model if it satisfies , for each and all , with a (possibly unknown to the learner) map and a (possibly unknown to the learner) map , where is a Hilbert space.
This model specializes to the kernel MDP model when the map is known to the learner (Jin et al., 2021a, , Definition 30).
This model also specializes to the low-rank MDP model when (Huang et al.,, 2023, Assumption 1) and furthermore to linear MDP model when the map is also known to the learner (Du et al.,, 2021, Definition A.4).
We make the following assumption on the offline data-generating distribution (or policy by slight notational override for convenience).
Assumption 12.
Consider the Low-rank MDP Model (bilinear model). Let the offline data distribution satisfy and suppose that is induced by the nominal model, i.e. (starting state distribution) and for any . Furthermore, suppose that satisfies that the feature covariance matrix is invertible for all and for at least one and all .
We now specialize our main result, Theorem 2, to the Low-rank Feature Selection Model (bilinear model).
Corollary 4 (Cumulative Suboptimality of Theorem 2 in the Low-rank Feature Selection Model (bilinear model)).
Consider the Low-rank Feature Selection Model (bilinear model). Let Assumptions 4, 5, 6 and 8 hold and fix any .
Then, the HyTQ algorithm policies satisfy
with probability at least . Now, consider the offline data distribution as in Assumption 12 with a low-rank MDP model. We have
Proof.
We begin by establishing a Q-value-dependent linearity property for the state-action-visitation measure . To do this, we adapt the proof of Huang et al. (2023, Lemma 17).
We start by writing the state-visitation measure, recalling Eq. 35:
where follows by the low-rank feature selection model definition, and the last equality follows by taking a functional .
Since our results consider a finite action space with a possibly large state space, the state-action visitation measure for the deterministic non-stationary policy is now given by with for features . Here is a normalizing constant such that the state-action visitation measure is a probability measure.
We now have , where
We also have , where
Furthermore, we set
Since is realizable and is complete for all , we set
Then the first result follows directly from Theorem 2. Following the proof of Song et al. (2023, Lemma 13) for our transfer coefficient , with the facts for and for all , the last statement for follows. This completes the proof.
∎
We now use our main result, Theorem 2, to derive a sample complexity bound for comparison with the offline+online RL setting.
Corollary 5 (Offline+Online RL Sample Complexity of the HyTQ algorithm).
Let Assumptions 4, 5, 6, 7 and 8 hold. Fix any and any , and let be the total number of sample tuples used in the HyTQ algorithm. Then, the uniform policy (uniform convex combination) of the HyTQ algorithm policies satisfies,
with probability at least ,
Proof.
The proof follows from Theorem 2 using a standard online-to-batch conversion (Shalev-Shwartz and Ben-David, 2014, Theorem 14.8 & Chapter 21). Define the policy . From Theorem 2, we get
We recall that our algorithm uses number of offline samples and number of on-policy samples in the datasets for all . Since we set and , the total number of offline and on-policy samples is .
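The online-to-batch step used in this proof can be illustrated numerically. The sketch below assumes the uniform policy is realized by drawing one of the learned policies uniformly at random at the start of an episode and following it throughout, so that its value is the average of the individual values and its suboptimality equals the cumulative suboptimality divided by the number of policies. The number of policies and all values below are made-up illustrations, not results from the paper.

```python
def uniform_mixture_value(policy_values):
    """Value of the policy that picks one base policy uniformly at random."""
    return sum(policy_values) / len(policy_values)


# Hypothetical robust values of four learned policies and of the optimum.
optimal_value = 1.0
policy_values = [0.25, 0.5, 0.75, 1.0]

# Cumulative suboptimality: the quantity a regret-style bound controls.
cumulative_subopt = sum(optimal_value - v for v in policy_values)

# Suboptimality of the uniform mixture policy.
mixture_subopt = optimal_value - uniform_mixture_value(policy_values)

# Online-to-batch identity: mixture gap = cumulative gap / number of policies.
identity_holds = mixture_subopt == cumulative_subopt / len(policy_values)
```

Under this identity, any bound on the cumulative suboptimality that grows sublinearly in the number of policies translates into a vanishing suboptimality for the uniform policy.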
Fix any .
For approximations , we first assume there exists such that for all .
Let
Then, for , we have with probability at least . So, the total number of samples is at least :