
Sample-Efficient Reinforcement Learning in the Presence of

Exogenous Information
Yonathan Efroni, Dylan J. Foster, Dipendra Misra, Akshay Krishnamurthy, and John Langford

Microsoft Research NYC

arXiv:2206.04282v1 [cs.LG] 9 Jun 2022

Abstract
In real-world reinforcement learning applications the learner’s observation space is ubiquitously high-
dimensional with both relevant and irrelevant information about the task at hand. Learning from high-
dimensional observations has been the subject of extensive investigation in supervised learning and
statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even
in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning,
the Exogenous Markov Decision Process (ExoMDP), in which the state space admits an (unknown)
factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous)
component; the exogenous component is independent of the learner’s actions, but evolves in an arbitrary,
temporally correlated fashion. We provide a new algorithm, ExoRL, which learns a near-optimal policy
with sample complexity polynomial in the size of the endogenous component and nearly independent of
the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf
algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible
in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation
going forward.

1 Introduction
Most applications of machine learning and statistics involve complex inputs such as images or text, which may
contain spurious information for the task at hand. A traditional approach to this problem is to use feature
engineering to identify relevant information, but this requires significant domain expertise, and can lead to
poor performance if relevant information is missed. As an alternative, representation learning and feature
selection methodologies developed over the last several decades address these issues, and enable practitioners
to directly operate on complex, high-dimensional inputs with minimal domain knowledge. In the context of
supervised learning and statistical estimation, these methods are particularly well-understood (Hastie et al.,
2015; Wainwright, 2019) and—in some cases—can be shown to provably identify relevant information for the
task at hand in the presence of a vast amount of irrelevant or spurious features. As such, these approaches
have emerged as the methods of choice for many practitioners.
Complex, high-dimensional inputs are also ubiquitous in Reinforcement Learning (RL) applications. However,
due to the interactive, multi-step nature of the RL problem, naive extensions of representation learning
techniques from supervised learning do not seem adequate. Empirically, this can be seen in the brittleness
of deep RL algorithms and the large body of work on stabilizing these methods (Gelada et al., 2019; Zhang
et al., 2020). Theoretically, this can be seen by the prevalence of strong function approximation assumptions
that preclude introducing spurious features (Wang et al., 2021; Weisz et al., 2021). As a result, developing
representation learning methodology for RL is a central topic of investigation.
Recently, a line of theoretical work has developed structural conditions under which RL with complex
inputs is statistically tractable (Jiang et al., 2017; Jin et al., 2021; Du et al., 2021; Foster et al., 2021), along
with a complementary set of algorithms for addressing these problems via representation learning (Du et al.,
2019; Misra et al., 2020; Agarwal et al., 2020; Misra et al., 2021; Uehara et al., 2021). While these works
provide some clarity into the challenges of high-dimensionality in RL, the models considered do not allow
for spurious, temporally correlated information (e.g., exogenous information that evolves over time through
a complex dynamical system). On the other hand, this structure is common in applications; for example,
when a human is navigating a forest trail, the flight of birds in the sky is temporally correlated, but irrelevant
for the human’s decision making. Motivated by the success of high-dimensional statistics in developing and
understanding feature selection methods for supervised learning, we ask:
Can we develop provably efficient algorithms for RL in the presence of a large number of dynamic, yet
irrelevant features?
Efroni et al. (2021b) initiated the study of this question in a rich-observation setting with function approxi-
mation. However, their results require deterministic dynamics, and their approach crucially uses determinism
to sidestep many challenges that arise in the presence of exogenous information.

Our contributions. In this paper, we take a step back from the function approximation setting considered
by Efroni et al. (2021b), and introduce a simplified problem setting in which to study representation learning
and exploration with high-dimensional, exogenous information. Our model, the Exogenous Markov Decision
Process or ExoMDP, involves a discrete d-dimensional state space (with each dimension taking values in
{1, . . . , S}) in which an unknown subset of k ≪ d dimensions of the state can be controlled by the agent’s
actions. The remaining d − k state variables are irrelevant for the agent’s task, but may exhibit complex
temporal structure.
Our main result is a new algorithm, ExoRL, that learns a policy which is (i) near-optimal and (ii) does not
depend on the exogenous and irrelevant factors, while requiring only poly(S^k, log(d)) trajectories. Here, the
dominant S^k term represents the size of the controllable (or, endogenous) state space, and the log(d) term
represents the price incurred for feature selection (analogous to guarantees for sparse regression (Hastie et al.,
2015; Wainwright, 2019)). Our result represents a doubly-exponential improvement over naive application
of existing tabular RL methods to the ExoMDP setting, which results in poly(S^d) sample complexity. Our
algorithm and analysis involve many new ideas for addressing exogenous noise, and we believe our work may
serve as a building block for addressing these issues in more practical settings.

2 Overview of Results
In this section we introduce the ExoMDP setting and give an overview of our algorithmic results, highlighting
the key challenges they overcome. Before proceeding, we formally describe the basic RL setup we consider.

Markov decision processes. We consider a finite-horizon Markov decision process (MDP) defined by
the tuple M = (S, A, T, R, H, d_1), in which S is the state space, A is the action space, T : S × A → ∆(S) is
the transition operator, R : S × A → [0, 1] is the reward function, H ∈ N is the horizon, and d_1 ∈ ∆(S) is
the initial state distribution. Given a non-stationary policy π = (π_1, …, π_H), where π_h : S → A, an episode
in the MDP M proceeds as follows, beginning from s_1 ∼ d_1: For h = 1, …, H: a_h = π_h(s_h), r_h = R(s_h, a_h),
and s_{h+1} ∼ T(· | s_h, a_h). We let E_π[·] and P_π(·) denote the expectation and probability for the trajectory
(s_1, a_1, r_1), …, (s_H, a_H, r_H) when π is executed, respectively, and define J(π) = E_π[∑_{h=1}^H r_h] as the expected reward of π.
The objective of the learner is to learn an ǫ-optimal policy online: Given N episodes in which to execute a policy and
observe the resulting trajectory, find a policy π̂ such that J(π̂) ≥ max_{π∈Π_NS} J(π) − ǫ, where Π_NS denotes
the set of all non-stationary policies π = (π_1, …, π_H).
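As a concrete illustration of this protocol (not part of the paper's development), the following sketch estimates J(π) by Monte Carlo rollouts; the env.reset()/env.step() interface and the list-of-functions policy representation are assumptions made for the example.

```python
def estimate_return(env, policy, horizon, num_episodes=1000):
    """Monte Carlo estimate of J(pi) = E_pi[sum_h r_h].
    `policy` is a list of per-step maps s -> a; `env` exposes a hypothetical
    reset() -> s_1 and step(a) -> (s_next, r) interface."""
    total = 0.0
    for _ in range(num_episodes):
        s = env.reset()                 # s_1 ~ d_1
        for h in range(horizon):
            a = policy[h](s)            # a_h = pi_h(s_h)
            s, r = env.step(a)          # r_h = R(s_h, a_h), s_{h+1} ~ T(. | s_h, a_h)
            total += r
    return total / num_episodes
```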

2.1 The Exogenous MDP (ExoMDP) Setting


The ExoMDP is a Markov decision process in which the state space factorizes into an endogenous component
that is (potentially) affected by the learner’s actions, and an exogenous component that is independent of
the learner’s actions, but evolves in an arbitrary, temporally correlated fashion. Formally, given a parameter
d ∈ N (the number of factors), the state space S takes the form S = ⊗_{i=1}^d S_i, so that each state s ∈ S has
the form s = (s1 , . . . , sd ), with si ∈ Si ; we refer to Si (equivalently, i) as the ith factor. We take I⋆ ⊂ [d] to
represent the endogenous factors and I⋆c := [d] \ I⋆ to represent the exogenous factors, which are unknown
to the learner. Letting s[I] := (si )i∈I , we assume the dynamics and rewards factorize across the endogenous
and exogenous components as follows:

T (s′ | s, a) = Ten (s′ [I⋆ ] | s [I⋆ ] , a) · Tex (s′ [I⋆c ] | s [I⋆c ]),
R(s, a) = Ren (s [I⋆ ] , a), (1)
d1 (s) = d1,en (s [I⋆ ]) · d1,ex (s [I⋆c ]) ,

for all s, s′ ∈ S and a ∈ A. That is, the endogenous factors I⋆ are (potentially) affected by the agent’s actions
and are sufficient to model the reward, while the exogenous factors I⋆c evolve independently of the learner’s
actions and do not influence the reward.
In this paper, we focus on a finite-state/action (tabular) variant of the ExoMDP setting in which Si = [S]
and A = [A], with S ∈ N representing the number of states per factor and A ∈ N representing the number
of actions. We assume that |I⋆ | ≤ k, where k ≪ d is a known upper bound on the number of endogenous
factors.¹ In the absence of the structure in Eq. (1), this is a generic tabular RL problem with |S| = S^d,
and the optimal sample complexity scales as poly(S^d, A, H, ǫ^{-1}) (Azar et al., 2017), which has exponential
dependence on the number of factors d. On the other hand, if I⋆ were known a priori, applying off-the-shelf
algorithms for tabular RL to the endogenous subset of the state space would lead to sample complexity
poly(S^k, A, H, ǫ^{-1}) (Azar et al., 2017; Jin et al., 2018; Zanette and Brunskill, 2019; Kaufmann et al., 2021),
which is independent of d and offers a significant improvement when k ≪ d. This motivates us to ask: With
no prior knowledge, can we learn an ǫ-optimal policy for the ExoMDP with sample complexity polynomial in
S^k and sublinear in d?
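To make the factorized structure of Eq. (1) concrete, here is a minimal simulator sketch for a toy tabular ExoMDP. Everything below (the sizes, the randomly generated tables T_en, T_ex, R_en, and the helper names) is a hypothetical illustration rather than the paper's construction; it only demonstrates that the endogenous component evolves as a function of (s[I⋆], a) while the exogenous component evolves on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ExoMDP instance; all sizes and tables below are hypothetical illustrations.
d, k, S, A = 4, 2, 3, 2
I_star = [0, 1]            # endogenous factors (unknown to the learner)
I_star_c = [2, 3]          # exogenous factors

def flatten(vals):
    """Encode a tuple of per-factor values (each in [S]) as a single base-S index."""
    idx = 0
    for v in vals:
        idx = idx * S + v
    return idx

def unflatten(idx, n):
    """Inverse of flatten for n factors."""
    vals = []
    for _ in range(n):
        vals.append(idx % S)
        idx //= S
    return list(reversed(vals))

def random_rows(num_rows, num_cols):
    """A row-stochastic matrix: each row is a distribution over num_cols outcomes."""
    M = rng.random((num_rows, num_cols))
    return M / M.sum(axis=1, keepdims=True)

T_en = random_rows(S**k * A, S**k)          # T_en(. | s[I_star], a)
T_ex = random_rows(S**(d - k), S**(d - k))  # T_ex(. | s[I_star^c]), action-independent
R_en = rng.random((S**k, A))                # R(s, a) = R_en(s[I_star], a)

def step(s, a):
    """One transition following Eq. (1): the two components evolve independently."""
    s_en = flatten([s[i] for i in I_star])
    s_ex = flatten([s[i] for i in I_star_c])
    next_en = unflatten(rng.choice(S**k, p=T_en[s_en * A + a]), k)
    next_ex = unflatten(rng.choice(S**(d - k), p=T_ex[s_ex]), d - k)
    s_next = [0] * d
    for i, v in zip(I_star, next_en):
        s_next[i] = v
    for i, v in zip(I_star_c, next_ex):
        s_next[i] = v
    return s_next, R_en[s_en, a]
```

An episode would then be generated by drawing the endogenous and exogenous parts of s_1 from d_{1,en} and d_{1,ex} and calling step repeatedly.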

2.2 Challenges of RL in the Presence of Exogenous Information


Sample-efficient learning in the absence of prior knowledge poses significant algorithmic challenges.
(C1) Hardness of identifying endogenous factors. In general, the endogenous factors may not be identifiable
(that is, multiple choices for I⋆ may obey the structure in Eq. (1)). Even when I⋆ is identifiable,
certifying whether a particular factor i ∈ [d] is exogenous can be statistically intractable (e.g., if the
effect of the agent’s action on the state component si is small relative to ǫ).
(C2) Necessity of exploration. The agent’s action might have a large effect on an endogenous factor i ∈ I⋆ ,
but only in a particular state s ∈ S that requires deliberate planning to reach. As such, any approach
that attempts to recover the endogenous factors must be interleaved with exploration, resulting in a
chicken-and-egg problem. “Test-then-explore” approaches do not suffice.
(C3) Entanglement of endogenous and exogenous factors. The factorized dynamics in (1) lead to a number
of useful structural properties for ExoMDPs, such as factorization of state occupancy measures (cf.
Appendix B). However, these properties generally only hold for policies that act on the endogenous
portion of the state. When an agent executes a policy whose actions depend on the exogenous state
factors, the evolution of the endogenous and exogenous components becomes entangled. This entangle-
ment makes it difficult to apply supervised learning or estimation methods to extract information from
trajectories gathered from such policies, and can lead to error amplification. As a result, significant
care is required in gathering data.

Failure of existing algorithms. Existing RL techniques do not appear to be sufficient to address the
challenges above and generally have sample complexity requirements scaling with Ω(d) or worse. For example,
tabular methods do not exploit factored structure, resulting in Ω(S^d) sample complexity, and we can show
that complexity measures like the Bellman rank (Jiang et al., 2017) and its variants scale as Ω(d), so they do
not lead to sample-efficient learning guarantees. Moreover, algorithms for factored MDPs (e.g., Rosenberg
and Mansour (2020)) obtain guarantees that depend on sparsity in the transition operator, but this operator
1 Extending our results to settings in which different factors have different sizes (i.e., Si = [Si ]) is straightforward.
is dense in the ExoMDP setting, leading to sample complexity that is exponential in d. See further discussion
in Section 5 and Appendix B.1.

2.3 Main Result


We present a new algorithm, ExoRL, which learns a near-optimal policy for the ExoMDP with sample
complexity polynomial in the number of endogenous states and logarithmic in the number of exogenous
components. Following previous approaches to representation learning in RL (Du et al., 2019; Misra et al.,
2020; Agarwal et al., 2020), our results depend on a reachability parameter.
Definition 2.1. The endogenous state space is η-reachable if for all h ∈ [H] and s[I⋆] ∈ S[I⋆], either

    max_{π∈Π_NS} P_π(s_h[I⋆] = s[I⋆]) ≥ η,    or    max_{π∈Π_NS} P_π(s_h[I⋆] = s[I⋆]) = 0.

Crucially, this notion of reachability considers only the endogenous portion of the state space, not the full
state space. We assume access to a lower bound η on the optimal reachability parameter.
Our main result is as follows.
Theorem 4.1 (informal). With high probability, ExoRL learns an ǫ-optimal policy for the ExoMDP using
poly(S^k, A, H, log(d)) · (ǫ^{-2} + η^{-2}) trajectories.
This constitutes a doubly-exponential improvement over the S^d sample complexity for naive tabular RL
in terms of dependence on the number of factors d, and it provides an RL analogue of sparsity-dependent
guarantees in high-dimensional statistics (Hastie et al., 2015; Wainwright, 2019). Importantly, the result
does not require any statistical assumptions beyond the factored structure in Eq. (1) and reachability (for
example, we do not require deterministic dynamics). Beyond polynomial factors, the dependence on the size
of the state space cannot be improved further.

2.4 Our Approach: Exploration with a Certifiably Endogenous Policy Cover


ExoRL is built upon the notion of an endogenous policy cover. Define an endogenous policy as follows.
Definition 2.2 (Endogenous policy). A policy π = (π1 , . . . , πH ) is endogenous if it acts only on the endoge-
nous component of the state space: For all h ∈ [H] and s ∈ S, we have πh (s) = πh (s[I⋆ ]).
An endogenous policy cover is a (small) collection of endogenous policies that ensure each state is reached
with near-maximal probability.
Definition 2.3 (Endogenous policy cover). A set of non-stationary policies Ψ is an endogenous (ǫ-approximate)
policy cover for timestep h if:
1. For all s ∈ S, maxψ∈Ψ Pψ (sh [I⋆ ] = s[I⋆ ]) ≥ maxπ∈ΠNS Pπ (sh [I⋆ ] = s[I⋆ ]) − ǫ.
2. The set Ψ contains only endogenous policies.
While the coverage property of Definition 2.3 is stated in terms of occupancy measures for the endogenous
portion of the state space, the factored structure of the ExoMDP implies that this yields a cover for the
entire state space (cf. Appendix B.2):

    max_{ψ∈Ψ} P_ψ(s_h = s) ≥ max_{π∈Π_NS} P_π(s_h = s) − ǫ,    ∀s ∈ S.

In particular, even though |S| = S^d, this guarantees that for each timestep h, there exists a small endogenous
policy cover with |Ψ| ≤ S^k. ExoRL constructs such a policy cover and uses it for sample-efficient exploration
in two phases. First, in Phase I (OSSR), the algorithm builds the policy cover in a manner guaranteeing
endogeneity; this accounts for the majority of the algorithm design and analysis effort. Then, in Phase II
(ExoPSDP), the algorithm uses the policy cover to optimize rewards.

Finding a certifiably endogenous policy cover: OSSR. The main component of ExoRL is an algorithm,
OSSR, which iteratively learns a sequence of endogenous policy covers Ψ(1) , . . . , Ψ(H) with

    max_{ψ∈Ψ^(h)} P_ψ(s_h[I⋆] = s[I⋆]) ≥ max_{π∈Π_NS} P_π(s_h[I⋆] = s[I⋆]) − ǫ

for all s[I⋆ ] ∈ S[I⋆ ]. For each h ∈ [H], given the policy covers Ψ(1) , . . . , Ψ(h−1) for preceding timesteps, OSSR
builds the policy cover Ψ(h) using a novel statistical test. The test constructs a factor set I ⊂ [d] which is
(i) endogenous, in the sense that I ⊂ I⋆ , yet (ii) ensures sufficient coverage, in the sense that there exists
a near-optimal policy cover operating only on s[I]. The analysis of this test relies on a unique structural
property of the ExoMDP setting called the restriction lemma (Lemma B.2), which provides a mechanism to
“regularize” the factor set under consideration toward endogeneity in a data-driven fashion.
This approach circumvents challenges (C1) and (C2): It does not rely on explicit identification of the
endogenous factors and instead iteratively builds a subset of factors that is certifiably endogenous, but
nonetheless sufficient to explore. Endogeneity of the resulting policy cover Ψ(h) ensures the success of
subsequent tests at rounds h + 1, . . . , H, and circumvents the issue of entanglement raised in challenge (C3).
To summarize, the following guarantee constitutes our main technical result.
Theorem 3.1 (informal). With high probability, OSSR finds an endogenous η/2-approximate policy cover
using poly(S^k, A, H, log(d)) · η^{-2} trajectories.

2.5 Organization
The remainder of the paper is organized as follows. In Section 3, we introduce the OSSR algorithm, highlight
the key algorithm design techniques and analysis ideas, and state its formal guarantee (Theorem 3.1) for
finding a policy cover. Building on this result, in Section 4 we introduce the ExoRL algorithm, and provide
the main sample complexity guarantee for RL in ExoMDPs (Theorem 4.1). We close with discussion of
additional related work (Section 5) and open problems (Section 6).

2.6 Preliminaries
We let Π denote the set of all one-step policies π : S → A. We use the term t → h policy to refer to a
non-stationary policy π = (πt , . . . , πh ) defined over a subset of timesteps t ≤ h.
For a non-stationary policy π ∈ Π_NS, we define the state-action and state value functions Q_h^π(s, a) :=
E_π[∑_{h′=h}^H r_{h′} | s_h = s, a_h = a] and V_h^π(s) := Q_h^π(s, π_h(s)). We denote the expected value of a policy π
from timestep t to h by V_{t,h}(π) := E_π[∑_{t′=t}^h r_{t′}]. We adopt the shorthand d_h(s ; π) := P_π(s_h = s) for the
induced state occupancy measure. Likewise, for I ⊆ [d], we define d_h(s[I] ; π) := P_π(s_h[I] = s[I]).
For algorithm design purposes, we consider mixture policies of the form µ ∈ Π_mix := ∆(Π_NS). To run
a mixture policy µ ∈ Π_mix, we sample π ∼ µ, then execute π for an entire episode. We further denote by
Π_mix[I] := ∆(Π_NS[I]) the set of mixture policies over the policy set Π_NS[I], where Π_NS[I] denotes the set
of policies that act on the factor set I. We let E_µ[·] and P_µ(·) denote the expectation and probability under
this process, and we define J(µ) = E_{π∼µ}[J(π)] = E_µ[∑_{h=1}^H r_h] and d_h(s ; µ) := P_µ(s_h = s) analogously.
We say that µ ∈ Π_mix is endogenous if it is supported over endogenous policies in Π_NS. Finally, for µ ∈ Π_mix
and π ∈ Π we let µ ◦_t π be the policy that follows µ for the first t − 1 timesteps and switches to π at the
t-th timestep. For sets of policies Ψ_1 and Ψ_2 we let Ψ_1 ◦_t Ψ_2 := {ψ_1 ◦_t ψ_2 | ψ_1 ∈ Ψ_1, ψ_2 ∈ Ψ_2}.
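The composition and mixture operations above can be summarized by a small sketch; the dict/list-based policy representation is an assumption made for illustration rather than the paper's formalism.

```python
import random

def compose(pi1, pi2, t):
    """pi1 o_t pi2: follow pi1 at steps h < t and pi2 at steps h >= t.
    Policies are dicts mapping the (1-indexed) step h to a map s -> a; pi2 only
    needs to be defined at the steps where it is actually executed."""
    return lambda h: pi1[h] if h < t else pi2[h]

def run_mixture(mixture, run_episode):
    """A mixture policy mu is a list of (weight, policy) pairs. To run mu, draw a
    single member policy pi ~ mu and execute it for the entire episode."""
    weights = [w for w, _ in mixture]
    policies = [p for _, p in mixture]
    pi = random.choices(policies, weights=weights, k=1)[0]
    return run_episode(pi)
```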

ExoMDP notation. Recall that for a factor set I ⊆ [d], we define I c := [d] \ I as the complement, and
define s [I] := (si )i∈I and S [I] := ⊗i∈I Si as the corresponding components of the state and state space.
We make frequent use of the fact that for any pair of factors I1 and I2 with I = I1 ∪ I2 and I1 ∩ I2 = ∅,
any state s[I] ∈ S[I] can be uniquely split as s[I] = (s[I1 ], s[I2 ]), with s[I1 ] ∈ S[I1 ] and s[I2 ] ∈ S[I2 ]. We
use a canonical ordering when indexing with factor sets.

Any factor set I ⊆ [d] can be written as I = (I ∩ I⋆ )∪(I ∩ I⋆c ). We denote these intersections by Ien := I ∩I⋆
and Iex := I ∩ I⋆c , which represent the endogenous and exogenous components of I.
We say that a policy π acts on a factor set I if it selects actions as a measurable function of S[I]. We let
Π[I] denote the set of all one-step policies π : S[I] → A that act on I, and let ΠNS [I] denote the set of all
non-stationary policies that act on I.
Lastly, if I ⊆ I⋆^c, i.e., the factor set I consists only of exogenous factors, we omit the dependence of the
occupancy measure on the policy π and write d_h(s[I] ; π) = d_h(s[I]). Indeed, for any π, π′ ∈ Π_NS it holds that
d_h(s[I] ; π) = d_h(s[I] ; π′), so the occupancy measure of s[I] does not depend on the policy.

Collections of factor sets. For a factor set I ⊆ [d], we let I≤k (I) := {I ′ ⊆ [d] | I ⊆ I ′ , |I ′ | ≤ k}
denote a collection of all factor sets of size at most k that contain I, and analogously define Ik (I) :=
{I ′ ⊆ [d] | I ⊆ I ′ , |I ′ | = k}. We adopt the shorthand I≤k := I≤k (∅) and Ik := Ik (∅). With some
abuse of notation, for a given collection of factor sets I , we define Π[I ] := ∪I∈I Π[I] as the set of all
possible policies induced by factors in I .
We define [N ] := {1, 2, · · · , N }. Unf(X ) denotes the uniform distribution over a finite set X .
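For illustration, the collections I_{≤k}(I) and the restriction s[I] can be enumerated directly; the helper names below are hypothetical.

```python
from itertools import combinations

def factor_sets_leq_k(d, k, base=()):
    """Enumerate I_{<=k}(base): all factor sets I' with base ⊆ I' ⊆ [d] and |I'| <= k."""
    base = frozenset(base)
    remaining = [i for i in range(d) if i not in base]
    for extra in range(k - len(base) + 1):
        for combo in combinations(remaining, extra):
            yield base | frozenset(combo)

def restrict(s, I):
    """s[I]: the state s (a tuple over all d factors) restricted to the factors in I,
    in the canonical (sorted) order."""
    return tuple(s[i] for i in sorted(I))

# |I_{<=k}| = sum_{k' <= k} C(d, k'); for d = 10, k = 2 this is 1 + 10 + 45 = 56.
assert sum(1 for _ in factor_sets_leq_k(10, 2)) == 56
```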

3 Learning a Near-Optimal Endogenous Policy Cover: OSSR


In this section, we present the first of our main algorithms, OSSR (Algorithm 8), which performs reward-free
exploration to construct an endogenous policy cover for the ExoMDP. OSSR constitutes the main algorithmic
component of ExoRL, and we believe it is of independent interest.
OSSR is a forward-backward algorithm. For each layer h ∈ [H], given previous policy covers Ψ(1) , . . . , Ψ(h−1) ,
the algorithm constructs an endogenous policy cover Ψ(h) in a backwards fashion. Backward steps proceed
from t = h − 1, . . . , 1, with each step consisting of (i) an optimization phase, in which we find a (potentially
large) collection of policies for choosing actions at step t that lead to good coverage for all possible target
factors sets I at layer h, and (ii) a selection phase, in which we narrow the collection of policies from the
first phase down to a small set of policies that act on a single (endogenous) factor set I, yet still ensure
coverage for all states at step h.
Instead of directly diving into OSSR, we build up to the algorithm through two warm-up exercises:
• In Section 3.1, we consider a simplified version of OSSR (OSSR.OneStep, or Algorithm 1) which com-
putes an endogenous policy cover under the assumption that (i) H = 2, and (ii) certain occupancy
measures for the underlying ExoMDP can be computed exactly.
• Building on this result, in Section 3.2 we provide another simplified algorithm (OSSR.Exact, or Algo-
rithm 2) which computes an endogenous policy cover for general H, but still requires exact access to
certain occupancy measures for the ExoMDP.
Finally, in Section 3.3 we present the full OSSR algorithm and its main sample complexity guarantee.

3.1 Warm-Up I: Finding an Endogenous Policy Cover with Exact Queries (H = 2)
Algorithm 1 presents OSSR.OneStep, a simplified version of OSSR that computes a (small) endogenous policy
cover for horizon two, assuming exact access to the state occupancies d2 (s ; π). This algorithm highlights
the mechanism through which OSSR is able to simultaneously ensure both endogeneity and coverage.
OSSR.OneStep learns an endogenous policy cover in two phases. In the optimization phase (Lines 1 and 2)
the algorithm computes a partial policy cover Γ[J ] for each factor set J ∈ I≤k , which ensures that for all
state factor values s[J ] ∈ S[J ] there exists a policy πs[J ] ∈ Γ[J ] which maximizes the probability to reach
the state factor value s[J ] at the 2nd timestep.

Algorithm 1 OSSR.OneStep: Optimization-Selection State Refinement for ExoMDPs with H = 2
Phase I: Optimization
1: Find a factor set Ĩ ∈ I_{≤k} with minimal cardinality such that for all J ∈ I_{≤k} and s[J] ∈ S[J],

       max_{π∈Π[I_{≤k}]} d_2(s[J] ; π) = max_{π∈Π[Ĩ]} d_2(s[J] ; π).

2: For all J ∈ I_{≤k}, define π_{s[J]} = argmax_{π∈Π[Ĩ]} d_2(s[J] ; π) for each s[J] ∈ S[J], then set

       Γ[J] := {π_{s[J]} : s[J] ∈ S[J]}.

Phase II: Selection
3: Find a factor set Î ∈ I_{≤k} with minimal cardinality such that for all J ∈ I_{≤k} and s[J] ∈ S[J],

       max_{π∈Π[I_{≤k}]} d_2(s[J] ; π) = d_2(s[J] ; π_{s[J∩Î]}).

4: return Î, Γ[Î]

All of the partial policy covers are induced by a single factor set Ĩ; the existence of such a factor set is guaranteed
by Property 3.2. We show that by regularizing by cardinality, Ĩ is guaranteed to be endogenous, and so the
policy covers (Γ[J])_{J∈I_{≤k}} are endogenous as well.
At this point, the only issue is size: The set ∪_{J∈I_{≤k}} Γ[J] is an exact policy cover for h = 2 (in the sense
of Definition 2.3), but its size scales as Ω(d^k),² which makes it unsuitable for exploration. To address this
issue, the selection phase (Line 3) identifies a single endogenous factor set Î such that Γ[Î] is an endogenous
policy cover (note that choosing Γ[I⋆] would suffice, but I⋆ is not known to the learner). Since |Γ[Î]| ≤ S^k
by construction, this yields a small policy cover as desired.
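The following sketch mirrors the optimization-selection structure of Algorithm 1 under the assumption that exact-occupancy oracles are available (computing them would require knowledge of the MDP, so this is purely illustrative); the oracle names best_value, best_policy, and occ2 are hypothetical stand-ins.

```python
from itertools import combinations, product

def ossr_one_step(d, k, S, best_value, best_policy, occ2):
    """Sketch of the two phases of OSSR.OneStep (H = 2). Assumed oracles:
      best_value(s_J, J, I)  = max over pi in Pi[I] of d_2(s[J] = s_J ; pi)
      best_policy(s_J, J, I) = a policy attaining that maximum
      occ2(s_J, J, pi)       = d_2(s[J] = s_J ; pi)
    Factor sets are frozensets of indices in range(d); a value of s[J] is a tuple
    ordered consistently with sorted(J)."""
    candidates = [frozenset(c) for r in range(k + 1)
                  for c in combinations(range(d), r)]
    values = lambda J: product(range(S), repeat=len(J))

    def unrestricted(s_J, J):
        # Maximum over all policies acting on at most k factors.
        return max(best_value(s_J, J, I) for I in candidates)

    # Phase I (optimization): minimal-cardinality I_tilde whose policy class Pi[I_tilde]
    # attains the unrestricted maximum occupancy for every target (J, s[J]).
    I_tilde = min((I for I in candidates
                   if all(best_value(s_J, J, I) == unrestricted(s_J, J)
                          for J in candidates for s_J in values(J))),
                  key=len)
    Gamma = {J: {s_J: best_policy(s_J, J, I_tilde) for s_J in values(J)}
             for J in candidates}

    # Phase II (selection): minimal-cardinality I_hat such that the single collection
    # Gamma[I_hat], indexed via s[J ∩ I_hat], already attains the maximum everywhere.
    def covered(I_hat, J, s_J):
        s_restricted = tuple(v for i, v in zip(sorted(J), s_J) if i in I_hat)
        pi = Gamma[J & I_hat][s_restricted]
        return occ2(s_J, J, pi) == unrestricted(s_J, J)

    I_hat = min((I for I in candidates
                 if all(covered(I, J, s_J)
                        for J in candidates for s_J in values(J))),
                key=len)
    return I_hat, Gamma[I_hat]
```

With exact oracles, picking the minimal-cardinality feasible set in each phase is exactly the regularization discussed above; the sample-based algorithm described later replaces the equality checks with error-tolerant comparisons.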

Proposition 3.1. The pair (Î, Γ[Î]) returned by OSSR.OneStep has the property that (i) Î is endogenous
(i.e., Î ⊆ I⋆), and (ii) Γ[Î] is an endogenous policy cover for h = 2: For all s ∈ S,

    max_{π∈Π} d_2(s[I⋆] ; π) = d_2(s[I⋆] ; π_{s[Î]}),    where π_{s[Î]} ∈ Γ[Î].

The ExoMDP transition structure further implies that max_{π∈Π} d_2(s ; π) = d_2(s ; π_{s[Î]}) for all s ∈ S.

Proof of Proposition 3.1. We begin by highlighting two useful structural properties of the ExoMDP;
both properties are specializations of more general results, Lemmas B.1 and B.2 (Appendix B).
Property 3.1 (Decoupling for endogenous policies). For any endogenous policy π, we have d_2(s[I] ; π) =
d_2(s[I_en] ; π) · d_2(s[I_ex]) for all I ⊆ [d] and s ∈ S.
Property 3.2 (Restriction lemma). For all factor sets I and J, we have

    max_{π∈Π[I]} d_2(s[J] ; π) = max_{π∈Π[I_en]} d_2(s[J] ; π)    ∀s[J] ∈ S[J]. (2)

Property 3.2 is perhaps the most critical structural result used by our algorithms. It implies that
max_{π∈Π} d_2(s[J] ; π) = max_{π∈Π[I⋆]} d_2(s[J] ; π), which in turn implies that the optimization and selection
phases of Algorithm 1 are feasible (since we can show that I⋆ is a valid choice). If Ĩ and Î are endogenous,
then since Î ⊂ I⋆, the selection rule ensures that Γ[Î] is a policy cover for S[I⋆] (by choosing J = I⋆ in
Line 3 and since Î ∩ I⋆ = Î). We next show that both Ĩ and Î are endogenous.
² The set Π[Ĩ] also gives a policy cover, but it is even larger.

Claim 1: Ĩ is endogenous. Observe that for any (potentially non-endogenous) factor set Ĩ = Ĩ_en ∪ Ĩ_ex,
Property 3.2 implies that for all J ∈ I_{≤k} and s[J] ∈ S[J],

    max_{π∈Π[Ĩ]} d_2(s[J] ; π) = max_{π∈Π[Ĩ_en]} d_2(s[J] ; π).

For any factor set Ĩ that satisfies the constraints in Line 1 but has Ĩ_ex ≠ ∅, we can further reduce the
cardinality without violating the constraints, so the minimum-cardinality solution is endogenous.
Claim 2: Î is endogenous. Consider a (potentially non-endogenous) factor set Î = Î_en ∪ Î_ex. If Î satisfies
the constraint in Line 3, then for all J ∈ I_{≤k} and s ∈ S, since J_en = J ∩ I⋆ ∈ I_{≤k},

    max_{π∈Π[I_{≤k}]} d_2(s[J_en] ; π) = d_2(s[J_en] ; π_{s[J_en∩Î]}) = d_2(s[J_en] ; π_{s[J_en∩Î_en]}). (3)

Next, using Property 3.2 and Property 3.1, we have

    max_{π∈Π[I_{≤k}]} d_2(s[J] ; π) = max_{π∈Π[I⋆]} d_2(s[J] ; π) = max_{π∈Π[I⋆]} d_2(s[J_en] ; π) · d_2(s[J_ex]).

As a result, since π_{s[J_en∩Î_en]} satisfies

    max_{π∈Π[I⋆]} d_2(s[J_en] ; π) = d_2(s[J_en] ; π_{s[J_en∩Î_en]})

and it is an endogenous policy, we have

    max_{π∈Π[I_{≤k}]} d_2(s[J] ; π) = d_2(s[J_en] ; π_{s[J_en∩Î_en]}) · d_2(s[J_ex])
                                    = d_2(s[J] ; π_{s[J_en∩Î_en]}) = d_2(s[J] ; π_{s[J∩Î_en]}),

where the second relation holds by Property 3.1, which applies since π_{s[J_en∩Î_en]} is an endogenous policy, and
the third relation holds since J_en ∩ Î_en = J ∩ Î_en.
Thus, Î_en satisfies the constraint in Line 3, and if Î_ex ≠ ∅, we can reduce the cardinality while keeping the
constraints satisfied, so the minimum-cardinality solution is endogenous.

3.2 Warm-Up II: Finding an Endogenous Policy Cover with Exact Occupancies
(H ≥ 2)
Algorithm 2 describes OSSR.Exact, which extends the OSSR.OneStep method to handle ExoMDPs with
general horizon (rather than H = 2), but still requires exact access to occupancy measures. When invoked
with a layer h, OSSR.Exacth takes as input a sequence of endogenous policy covers Ψ(1) , . . . , Ψ(h−1) for layers
1, . . . , h − 1 and uses them to compute an endogenous policy cover Ψ(h) for layer h. The algorithm constructs
Ψ(h) in a backwards fashion based on the dynamic programming principle. To describe the approach in detail,
we use the notation of t → h policy cover.
Definition 3.1. For h ∈ [H] and t < h, a set of non-stationary policies Ψ is said to be an (ǫ-approximate)
t → h policy cover with respect to a roll-in policy µ ∈ Π_mix if for all s ∈ S,

    max_{ψ∈Ψ} d_h(s[I⋆] ; µ ◦_t ψ) ≥ max_{π∈Π_NS} d_h(s[I⋆] ; µ ◦_t π) − ǫ.

If all policies in Ψ are endogenous, we say that Ψ is endogenous.


OSSR.Exact_h performs a series of “backward” steps t = h − 1, …, 1. In each step t, the algorithm rolls in
with the mixture policy µ^(t) := Unf(Ψ^(t)) and constructs a t → h policy cover Ψ^(t,h) with respect to µ^(t).
Ψ^(t,h) acts on an endogenous factor set I^(t,h) (with I^(t,h) ⊇ I^(t+1,h) ⊇ · · · ⊇ I^(h,h) = ∅), and is built from
Algorithm 2 OSSR.Exact_h: Optimization-Selection State Refinement with Exact Occupancies
1: require: Timestep h ∈ [H], policy covers {Ψ^(t)}_{t=1}^{h−1} for steps 1, …, h − 1.
2: initialize: I^(h,h) ← ∅ and Ψ^(h,h) ← ∅.
3: for t = h − 1, …, 1 do
   Phase I: Optimization
4:   Let µ^(t) := Unf(Ψ^(t)).
5:   Find Ĩ ∈ I_{≤k} with minimal cardinality such that for all J ∈ I_{≤k}(I^(t+1,h)) and s[J] ∈ S[J],

         max_{π∈Π[I_{≤k}]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) = max_{π∈Π[Ĩ]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}).

     // Beginning from any state at layer t, π^(t)_{s[J]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]} maximizes the probability that s_h[J] = s[J].
6:   For each factor set J ∈ I_{≤k}(I^(t+1,h)) and s[J] ∈ S[J], let

         π_{s[J]} ∈ argmax_{π∈Π[Ĩ]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}),

     and define Γ^(t)[J] := {π_{s[J]} : s[J] ∈ S[J]}.
   Phase II: Selection
7:   Find Î ∈ I_{≤k}(I^(t+1,h)) with minimal cardinality such that for all J ∈ I_{≤k}(I^(t+1,h)) and s[J] ∈ S[J],

         max_{π∈Π[I_{≤k}]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) = d_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩Î]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}).

   Policy composition
8:   Let I^(t,h) ← Î, and for each s[I^(t,h)] ∈ S[I^(t,h)] define

         ψ^(t,h)_{s[I^(t,h)]} := π^(t)_{s[I^(t,h)]} ◦_t ψ^(t+1,h)_{s[I^(t+1,h)]}.

     // Recall that π^(t)_{s[I^(t,h)]} ∈ Γ^(t)[I^(t,h)] and ψ^(t+1,h)_{s[I^(t+1,h)]} ∈ Ψ^(t+1,h).
9:   Let Ψ^(t,h) ← {ψ^(t,h)_{s[I^(t,h)]} : s[I^(t,h)] ∈ S[I^(t,h)]}.
10: return Ψ^(h) := Ψ^(1,h)

the next-step policy cover Ψ^(t+1,h) via dynamic programming. In particular, the algorithm searches for a
collection of endogenous “one-step” policies for choosing the action at time t that, when carefully composed
with the (t + 1) → h policy cover Ψ^(t+1,h), result in a t → h policy cover. The algorithm ensures that the
factor set I^(t,h) (upon which Ψ^(t,h) acts) is endogenous using optimization and selection phases analogous
to those in OSSR.OneStep.
In more detail, OSSR.Exact_h satisfies the following invariants for 1 ≤ t ≤ h − 1.
(i) I^(h,h) ⊆ · · · ⊆ I^(t,h) ⊆ · · · ⊆ I⋆. (“state refinement”)
(ii) The set Ψ^(t,h) is an endogenous t → h policy cover with respect to µ^(t) = Unf(Ψ^(t)):

    d_h(s[I⋆] ; µ^(t) ◦_t ψ^(t,h)_{s[I^(t,h)]}) = max_{π∈Π_NS} d_h(s[I⋆] ; µ^(t) ◦_t π),    ∀s[I⋆] ∈ S[I⋆].

This implies that Ψ^(h) := Ψ^(1,h) is an endogenous policy cover for layer h (Definition 2.3). In what follows
we show how OSSR.Exact_h uses dynamic programming to satisfy these invariants.

Dynamic programming. Consider step t < h − 1, and suppose that (I^(t+1,h), Ψ^(t+1,h)) satisfies invariants
(i) and (ii). Because µ^(t+1) uniformly covers all states in layer t + 1 (recall Ψ^(1), …, Ψ^(h−1) are policy covers),
the policy ψ^(t+1,h)_{s[I^(t+1,h)]} maximizes the probability that s_h[I⋆] = s[I⋆], starting from any state in layer t + 1.
Hence, the Bellman optimality principle implies that to find a t → h policy maximizing this probability, it
suffices to use the policy π^(t) ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}, where π^(t) solves the one-step problem:

    π^(t) ∈ argmax_{π∈Π[I⋆]} d_h(s[I⋆] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}). (4)

At first glance, it is not apparent whether this observation is useful, because the endogenous factor set
I⋆ is not known to the learner, which prevents one from directly solving the optimization problem in Eq.
(4). Fortunately, we can tackle this problem using a generalization of the optimization-selection approach
of OSSR.OneStep. First, in the optimization phase (Line 5 and Line 6), we compute a collection of one-step
policy covers (Γ^(t)[J])_{J∈I_{≤k}(I^(t+1,h))}, where Γ^(t)[J] consists of the policies that solve Eq. (4) with I⋆
replaced by J, for all possible choices of state s[J] ∈ S[J]. Then, in the selection phase (Line 7), we
find a single factor set I^(t,h) ⊇ I^(t+1,h) such that Γ^(t)[I^(t,h)] provides good coverage (in the sense of Eq. (4)) for
all factor sets J ∈ I_{≤k}(I^(t+1,h)) simultaneously. Both steps ensure endogeneity by penalizing cardinality
in the same fashion as OSSR.OneStep. The success of this approach critically relies on the assumption that
the preceding policy covers Ψ^(1), …, Ψ^(h−1) are endogenous, which ensures that the occupancy measures
induced by µ^(1), …, µ^(h−1) factorize (due to independence of the endogenous and exogenous state factors).
To summarize:
Proposition 3.2. If Ψ^(1), …, Ψ^(h−1) are endogenous policy covers for layers 1, …, h − 1, then the set Ψ^(h)
returned by OSSR.Exact_h is an endogenous policy cover for layer h, and has |Ψ^(h)| ≤ S^k.
We do not prove this result directly, and instead refer the reader to the proof of Theorem 3.1, which proves
the sample-based version of the result using the same reasoning.

3.3 OSSR: Overview and Main Result


The full version of the OSSR algorithm (OSSR^{ǫ,δ}_h) is given in Algorithm 8 (deferred to Appendix G due
to space constraints). The algorithm follows the same template as OSSR.Exact: For each h ∈ [H], given
policy covers Ψ(1) , . . . , Ψ(h−1) , the algorithm builds a policy cover Ψ(h) for layer h in a backwards fashion
using dynamic programming. There are two differences from the exact algorithm. First, since the MDP is
unknown, the algorithm estimates the relevant occupancy measures for each backwards step using Monte
Carlo rollouts. Second, the optimization and selection phases from OSSR.Exact are replaced by error-tolerant
variants given by subroutines EndoPolicyOptimization and EndoFactorSelection (Algorithm 5 in Appendix D and
Algorithm 6 in Appendix E, respectively).
Briefly, the EndoPolicyOptimization and EndoFactorSelection subroutines are based on approximate versions of
the constraints used in the optimization and selection phase for OSSR.Exact (Line 5 and Line 7 of Algorithm 2),
but ensuring endogeneity of the resulting factors is more challenging due to approximation errors, and it no
longer suffices to simply search for the factor set with minimum cardinality. Instead, we search for factor
sets that satisfy approximate versions of Line 5 and Line 7 with an additive regularization term based on
cardinality. We show that as long as this penalty is carefully chosen as a function of the statistical error in
the occupancy estimates, the resulting factor sets will be endogenous with high probability.
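A simplified sketch of the two sample-based ingredients, empirical occupancy estimation from Monte Carlo rollouts and cardinality-penalized selection, is given below. The helper names and the exact form of the penalty are illustrative stand-ins, not the EndoPolicyOptimization/EndoFactorSelection subroutines themselves.

```python
from collections import Counter

def estimate_occupancy(rollouts, h, I):
    """Empirical estimate of d_h(s[I] ; pi) from trajectories gathered with pi.
    Each rollout is a list of states (rollout[h-1] is the state at timestep h), and
    each state is a tuple over the d factors."""
    n = len(rollouts)
    counts = Counter(tuple(traj[h - 1][i] for i in sorted(I)) for traj in rollouts)
    return {s_I: c / n for s_I, c in counts.items()}

def select_penalized(candidates, deficiency, penalty_per_factor):
    """Pick the factor set minimizing (estimated coverage deficiency) + penalty * |I|.
    This is a simplified stand-in for the cardinality regularization described above:
    if the penalty dominates the statistical error of the occupancy estimates, small
    (endogenous) factor sets win the comparison with high probability."""
    return min(candidates, key=lambda I: deficiency(I) + penalty_per_factor * len(I))
```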
The main guarantee for Algorithm 8 is as follows.
Theorem 3.1 (Sample complexity of OSSR). Suppose that OSSR^{ǫ,δ}_h is invoked with {Ψ^(t)}_{t=1}^{h−1}, where each
Ψ^(t) is an endogenous, η/2-approximate policy cover for layer t. Then with probability at least 1 − δ, the set
Ψ^(h) returned by OSSR^{ǫ,δ}_h is an endogenous ǫ-approximate policy cover for layer h, and has |Ψ^(h)| ≤ S^k. The
algorithm uses at most O(A S^{4k} H^2 k^3 log(dSAH/δ) · ǫ^{-2}) episodes.
By iterating the process Ψ^(h) ← OSSR^{η/2,δ}_h({Ψ^(t)}_{t=1}^{h−1}), we obtain a policy cover for every layer.

Algorithm 3 ExoRL: RL in the Presence of Exogenous Information
require: precision parameter ǫ > 0, reachability parameter η > 0, failure probability δ ∈ (0, 1).
initialize: Ψ(1) = ∅.
for h = 2, 3, · · · , H do
    Ψ^(h) ← OSSR^{η/2,δ}_h({Ψ^(t)}_{t=1}^{h−1}).    // Learn policy cover via OSSR (Algorithm 8 in Appendix G).
π̂ ← ExoPSDP^{ǫ,δ}({Ψ^(h)}_{h=1}^H).    // Apply ExoPSDP (Algorithm 7 in Appendix F) to optimize rewards.
return π̂

4 Main Result: Sample-Efficient RL in the Presence of Exogenous Information
In this section we provide our main algorithm, ExoRL (Algorithm 3). ExoRL first applies OSSR iteratively
to learn an endogenous, η/2-approximate policy cover for each layer, then applies a novel variant of the
classical Policy Search by Dynamic Programming method of (Bagnell et al., 2004) (ExoPSDP), which uses
the covers to optimize rewards; the original PSDP method cannot be applied to the ExoMDP setting as-is
due to subtle statistical issues (cf. Appendix F for background). The main guarantee for ExoRL is as follows;
see Appendix H for a proof and overview of analysis techniques.
Theorem 4.1 (Sample complexity of ExoRL). ExoRL, when invoked with parameters ǫ ∈ (0, 1) and δ ∈ (0, 1),
returns an ǫ-optimal policy with probability at least 1 − δ, and does so using at most

    O(A S^{3k} H^2 (S^k + H^2) k^3 log(dSAH/δ) · (ǫ^{-2} + η^{-2}))

episodes.
Recall that S^k = |S[I⋆]| may be thought of as the cardinality of the endogenous state space, so, up to polynomial
factors, logarithmic dependence on d, and dependence on the reachability parameter η, the sample complexity
of ExoRL matches the optimal sample complexity when I⋆ is known in advance.
Remark 4.1 (Computational Complexity of ExoRL). The runtime for ExoRL scales with ∑_{k′=0}^{k} (d choose k′) = Θ(d^k)
due to brute-force enumeration over factor sets of size at most k. While this improves over the S^d runtime
required to run a tabular RL algorithm over the full state space, an interesting question that remains is
whether the runtime can be improved to O(d^c) for some constant c independent of k.

5 Related Work
In this section we highlight additional related work not already covered by our discussion.

Reinforcement learning with exogenous information. The ExoMDP setting is a special case of the
Exogenous Block MDP (EX-BMDP) setting introduced by Efroni et al. (2021b), who initiated the study
of sample-efficient reinforcement learning with temporally correlated exogenous information. In particular,
one can view the ExoMDP as an EX-BMDP with S as the observation space and S[I⋆ ] as the latent state
space, and with the set Φ := {s ↦ s[I] | |I| ≤ k} as the class of decoders. Efroni et al. (2021b) provide
an EX-BMDP algorithm whose sample complexity scales with the size of the latent state space and with
log|Φ|, which translates to poly(S k , log(d)) sample complexity for the ExoMDP setting, but the algorithm
requires that the endogenous state space has deterministic transitions and initial state. The motivation for
the present work was to take a step back and provide a simplified testbed in which to study the problem of
learning with stochastic transitions, as well as other refined issues (e.g., minimax rates). Also related to this
line of research is Efroni et al. (2021a), which considers a linear control setting with exogenous observations.
Unlike our work, Efroni et al. (2021a) assumes that the inherent system noise induces sufficient exploration,
and hence does not address the exploration problem.
Empirical works that aim to filter exogenous noise in deep RL include Pathak et al. (2017); Zhang et al.
(2020); Gelada et al. (2019), but these methods do not come with theoretical guarantees.

Tabular reinforcement learning. As discussed earlier, existing approaches to tabular reinforcement
learning (Azar et al., 2017; Jin et al., 2018; Zanette and Brunskill, 2019; Kaufmann et al., 2021) incur Ω(S^d)
sample complexity if applied to the ExoMDP setting naively. One can improve this sample complexity to
poly(S^k, d^k, A, H) using a simple reduction. This falls short of the poly(S^k, A, H, log(d)) sample complexity
our algorithms obtain; we sketch the reduction for completeness.
• For each I ⊆ [d] with |I| ≤ k, run any optimal tabular RL algorithm with precision parameter ǫ over
the state space S[I], and let πI be the resulting policy.
• Evaluate each policy πI to precision ǫ using Monte-Carlo rollouts, and take the best one.

The first phase has poly(S^k, A, H) sample complexity for each set I, and there are at most (d choose k) = O(d^k)
subsets. The algorithm that runs on S[I⋆] will succeed in finding an ǫ-optimal policy with high probability,
so the policy returned in the second phase will be at least 2ǫ-optimal.
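A sketch of this reduction, with hypothetical solve_tabular_rl and evaluate helpers standing in for an off-the-shelf tabular algorithm and Monte Carlo policy evaluation:

```python
def reduction_via_enumeration(factor_sets, solve_tabular_rl, evaluate):
    """Sketch of the poly(S^k, d^k, A, H) reduction sketched above. The helpers are
    hypothetical: solve_tabular_rl(I) runs an off-the-shelf tabular algorithm on the
    restricted state space S[I]; evaluate(pi) is a Monte Carlo estimate of J(pi)."""
    candidates = []
    for I in factor_sets:                        # O(d^k) factor sets of size <= k
        pi_I = solve_tabular_rl(I)               # eps-optimal w.h.p. when I = I_star
        candidates.append((evaluate(pi_I), pi_I))
    return max(candidates, key=lambda pair: pair[0])[1]
```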

Factored Markov decision processes. The ExoMDP setting is related to the Factored MDP model (Kearns
and Koller, 1999). Factored MDPs assume a factored state space whose transition dynamics obey the following
structure:

    T(s′ | s, a) = ∏_{i=1}^d T_i(s′[i] | s[pt(i)], a)    for all s, s′ ∈ S^d, a ∈ A,

where pt : [d] → 2^{[d]} is a parent function and T_i : S^{|pt(i)|} × A → ∆(S) is the transition distribution of the
i-th factor. Many algorithms have been proposed for Factored MDPs, including for the setting where the
parent function is unknown (Strehl et al., 2009; Diuk et al., 2009; Hallak et al., 2015; Guo and Brunskill,
2017; Rosenberg and Mansour, 2020; Misra et al., 2021). These algorithms assume that the parent factor
size is bounded, i.e., |pt(i)| ≤ κ for all i ∈ [d], and their sample complexity typically scales with O(|S|^{cκ}) for
a numerical constant c. The ExoMDP setting cannot be solved using off-the-shelf Factored MDP algorithms
for two reasons. First, we do not assume that each factor evolves independently of the other factors given the
previous state and action. Second, the size of the parent set for an exogenous factor can be as large as d − k.
Therefore, even if the factors were evolving independently, applying off-the-shelf Factored MDP algorithms
would lead to sample complexity exponential in d.

6 Conclusion
We have introduced the ExoMDP setting and provided ExoRL, the first algorithm for sample-efficient rein-
forcement learning in stochastic systems with high-dimensional, exogenous information. Going forward, we
believe that the ExoMDP setting will serve as a useful testbed to understand refined aspects of learning with
exogenous information. Natural questions we hope to see addressed include:
• Minimax rates. While our results provide polynomial sample complexity, it remains to understand the
precise minimax rate for the ExoMDP as a function of S^k, H, and so on. Additionally, either removing
the dependence on the reachability parameter or establishing a lower bound demonstrating its necessity
is an issue that deserves further investigation.
• Computation. Both ExoRL and OSSR rely on brute-force enumeration over subsets, which results in
Ω(d^k) runtime. While this provides an improvement over naive tabular RL, it remains to be seen whether
it is possible to develop an algorithm with runtime O(d^c), where c > 0 is a constant independent of k.
• Regret. Naively lifting our ǫ-PAC results to regret yields T^{2/3}-type dependence on the time horizon
T. Developing algorithms with √T-type regret will require new techniques.
• Parameter-free algorithms. The OSSR algorithm requires an upper bound on |I⋆| and a lower bound
on η. It is relatively straightforward to remove the need for these quantities when the value of the optimal
policy (max_π J(π)) is known, by an application of the doubling trick. However, developing truly
parameter-free algorithms is an interesting direction.

Finally, the problem of learning in the ExoMDP model is related to the notion of out-of-distribution gen-
eralization and learning in the presence of acausal features (Peters et al., 2016; Arjovsky et al., 2019; Kim
et al., 2019; Wald et al., 2021). It would be interesting to explore these connections in more detail. Beyond
these questions, we hope that our techniques will find further use beyond the tabular setting.

References
Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural complexity and
representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 2020.
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv
preprint arXiv:1907.02893, 2019.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement
learning. In International Conference on Machine Learning, 2017.
J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic
programming. In Advances in Neural Information Processing Systems, 2004.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory
of independence. Oxford University Press, 2013.
Carlos Diuk, Lihong Li, and Bethany R Leffler. The adaptive k-meteorologists problem and its application
to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual
International Conference on Machine Learning, pages 249–256, 2009.
Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang.
Bilinear classes: A structural framework for provable generalization in RL. In International Conference
on Machine Learning, pages 2826–2836. PMLR, 2021.
Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford.
Provably efficient RL with rich observations via latent state decoding. In International Conference on
Machine Learning, 2019.
Yonathan Efroni, Sham Kakade, Akshay Krishnamurthy, and Cyril Zhang. Sparsity in partially controllable
linear systems. arXiv preprint arXiv:2110.06150, 2021a.
Yonathan Efroni, Dipendra Misra, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Provable
RL with exogenous distractors via multistep inverse dynamics. arXiv preprint arXiv:2110.08847, 2021b.
Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive
decision making. arXiv preprint arXiv:2112.13487, 2021.
Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. DeepMDP: Learn-
ing continuous latent space models for representation learning. In International Conference on Machine
Learning, 2019.
Zhaohan Daniel Guo and Emma Brunskill. Sample efficient feature selection for factored MDPs. arXiv
preprint arXiv:1703.03454, 2017.
Assaf Hallak, François Schnitzler, Timothy Mann, and Shie Mannor. Off-policy model-based learning under
unknown factored dynamics. In International Conference on Machine Learning, pages 711–719. PMLR,
2015.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity. Monographs
on statistics and applied probability, 143:143, 2015.
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual
decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine
Learning, 2017.

13
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In
Advances in Neural Information Processing Systems, 2018.
Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems,
and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings
of the 19th International Conference on Machine Learning, 2002.
Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and
Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR,
2021.
Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In International
Joint Conference on Artificial Intelligence, volume 16, pages 740–747, 1999.
Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, and Junmo Kim. Learning not to learn: Training
deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 9012–9020, 2019.
Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction
and provably efficient rich-observation reinforcement learning. In International conference on machine
learning, pages 6961–6971. PMLR, 2020.
Dipendra Misra, Qinghua Liu, Chi Jin, and John Langford. Provable rich observation reinforcement learning
with combinatorial latent states. In International Conference on Learning Representations, 2021.
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-
supervised prediction. In International Conference on Machine Learning, 2017.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction:
identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 78(5):947–1012, 2016.
Aviv Rosenberg and Yishay Mansour. Oracle-efficient reinforcement learning in factored MDPs with unknown
structure. arXiv preprint arXiv:2009.05986, 2020.
Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs: PAC analysis.
Journal of Machine Learning Research, 10(11), 2009.
Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in
low-rank MDPs. arXiv:2110.04652, 2021.
Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge
University Press, 2019.
Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization.
Advances in Neural Information Processing Systems, 34, 2021.
Ruosong Wang, Dean Foster, and Sham M Kakade. What are the statistical limits of offline RL with linear
function approximation? In International Conference on Learning Representations, 2021.
Gellért Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs
with linearly-realizable optimal action-value functions. In Algorithmic Learning Theory, pages 1237–1264.
PMLR, 2021.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning
without domain knowledge using value function bounds. In International Conference on Machine Learning,
pages 7304–7312. PMLR, 2019.
Amy Zhang, Rowan Thomas McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant
representations for reinforcement learning without reconstruction. In International Conference on Learning
Representations, 2020.

Contents of Appendix

Part I: Preliminaries

A Supporting Lemmas
  A.1 Reinforcement Learning
  A.2 Probability
  A.3 Analysis
  A.4 I_{≤k}(I) is a π-System

B Structural Results for ExoMDPs
  B.1 Bellman Rank for the ExoMDP Setting
  B.2 Structural Results for State Occupancies
  B.3 Structural Results for Value Functions

C Noise-Tolerant Search over Endogenous Factors: Algorithmic Template

Part II: Omitted Subroutines

D Finding a Near-Optimal Endogenous Policy: EndoPolicyOptimization
  D.1 Description of EndoPolicyOptimization
  D.2 Proof of Theorem D.1

E Selecting Endogenous Factors with Strong Coverage: EndoFactorSelection
  E.1 Description of EndoFactorSelection
  E.2 Proof of Theorem E.1

F PSDP with Exogenous Information: ExoPSDP
  F.1 Description of ExoPSDP
  F.2 Proof of Theorem F.1
  F.3 Computational Complexity of ExoPSDP
  F.4 Application of EndoPolicyOptimization within ExoPSDP

Part III: Additional Details and Proofs for Main Results

G OSSR Description and Proof of Theorem 3.1
  G.1 OSSR: Algorithm Overview
  G.2 Proof of Theorem 3.1
  G.3 Proof of Theorem G.1 (Success of State Refinement Step)
  G.4 Application of EndoPolicyOptimization in OSSR

H Proof of Theorem 4.1 (Correctness of ExoRL)
  H.1 Computational Complexity of ExoRL

Organization and Notation
The appendix contains three parts, Part I, Part II, and Part III.

Part I: Preliminaries. In Part I we provide basic technical results used in our analysis. Appendix A con-
tains technical lemmas for reinforcement learning (Appendix A.1), concentration inequalities (Appendix A.2),
and basic analysis tools (Appendix A.3). In Appendix A.4, we provide a simple, yet useful result which shows
that the collection I≤k (I) is a π-system for any factor set I with |I| ≤ k.
In Appendix B we present structural results for the ExoMDP model. We begin by establishing a negative
result (Appendix B.1) which shows that the Bellman rank (Jiang et al., 2017) of the ExoMDP model scales
with the number of exogenous factors. In Appendix B.2 and Appendix B.3, we prove key structural results for
the ExoMDP model, including a decoupling property (Lemma B.1) and restriction lemma (Lemma B.2) for
occupancy measures, a restriction lemma for endogenous rewards (Lemma B.7), and a performance difference
lemma for endogenous policies (Lemma B.6).
In Appendix C, we present an algorithmic template, AbstractFactorSearch, which forms the basis for the
subroutines in OSSR.
Notation used throughout the main paper and appendix is collected in Table 1.

Part II: Omitted subroutines. In Part II, we describe and analyze subroutines used by OSSR and
ExoRL. Appendix D presents and analyzes the EndoPolicyOptimization subroutine used in OSSR and ExoPSDP.
Appendix E presents and analyzes the EndoFactorSelection subroutine used in OSSR. Finally, Appendix F
presents and analyzes the ExoPSDP algorithm, which is used by ExoRL.

Part III: Additional details and proofs for main results. In Part III, we present our main results
and their proofs. In Appendix G, we present and analyze the full version of the OSSR algorithm, and in
Appendix H, we combine the results for OSSR and ExoPSDP to establish the main sample complexity bound
for ExoRL.

Notation Meaning
I an ordered set of factors (a set of distinct elements from [d]).
I≤k (I) {J ⊆ [d] : J ⊇ I, |J | ≤ k}.
Ik (I) {J ⊆ [d] : J ⊇ I, |J | = k}.
I≤k {J ⊆ [d] : |J | ≤ k}, or equivalently, I≤k = I≤k (∅) .
Ik {J ⊆ [d] : |J | = k}, or equivalently, Ik = Ik (∅).
Π[I] the set of policies that depend only on the factors specified in I.
Π[I ] the union of the set of policies ∪I∈I Π[I].
I⋆ the set of endogenous factors.
I⋆c the set of exogenous factors.
S [I] the set of states induced by the factors in I.
s [I] the state s restricted to the set of factors I.
V1π value of a policy π measured with respect to an initial distribution.
Vhπ (s) value of a policy
hPπ measured
i from state s at timestep h
h
Vt,h Vt,h (π) := Eπ t′ =t rt .
Qπh (s, a) Q-function for a policy π measured from state s at timestep h.
dh (s[I]; π) shorthand for Pπ (sh [I] = s[I]).
dh (s[I] | st [I ′ ] = s[I ′ ]; π) shorthand for Pπ (sh [I] = s[I] | st [I ′ ] = s[I ′ ]).
π1 ◦t π2 Policy that executes π1 until step t − 1 and executes π2 from then on.
Ien For a set of factors I, Ien := I ∩ I⋆ .
Iex For a set of factors I, Iex := I ∩ I⋆c .

Table 1: Summary of notation.
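As an informal illustration of the factor-set notation in Table 1 (our own sketch, not part of the formal development), the following Python snippet enumerates the collection I≤k(I) and restricts a factored state s to a factor set; the helper names factor_sets_leq_k and restrict, and the 0-indexed representation of [d], are our own conventions.

from itertools import combinations

def factor_sets_leq_k(I, k, d):
    # I_{<=k}(I): all J with I ⊆ J ⊆ [d] and |J| <= k.
    base = frozenset(I)
    rest = [i for i in range(d) if i not in base]
    for extra_size in range(0, k - len(base) + 1):
        for extra in combinations(rest, extra_size):
            yield base | frozenset(extra)

def restrict(s, I):
    # s[I]: the state s (a tuple of d factor values) restricted to the factors in I.
    return tuple(s[i] for i in sorted(I))

# Example with d = 4 factors and I = {0}: I_{<=2}({0}) = {{0}, {0,1}, {0,2}, {0,3}}.
print(sorted(map(sorted, factor_sets_leq_k({0}, 2, 4))))
print(restrict(('a', 'b', 'c', 'd'), {0, 2}))  # -> ('a', 'c')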

Part I

Preliminaries
A Supporting Lemmas
A.1 Reinforcement Learning
Lemma A.1 (Performance difference lemma (Kakade and Langford (2002), Lemma 6.1)). Consider a fixed
MDP M = (S, A, T, R, H, µ). For any pair of policies π, π′ ∈ Π_NS,

J(π) − J(π′) = E_π[ Σ_{t=1}^{H} Q_t^{π′}(s_t, π_t(s_t)) − Q_t^{π′}(s_t, π′_t(s_t)) ].

Lemma A.2 (Density ratio bound for policy cover). Let Ψ be an endogenous ǫ-approximate policy cover
for timestep t and µ(t) := Unf (Ψ). Then, for any s[I⋆ ] ∈ S [I⋆ ] such that maxπ∈ΠNS [I⋆ ] dt (s[I⋆ ] ; π) ≥ 2ǫ, it
holds that
max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) / d_t(s[I⋆]; µ^(t)) ≤ 2S^k.

Proof of Lemma A.2. Fix s[I⋆ ] ∈ S [I⋆ ]. Since Ψ is an endogenous ǫ-approximate policy cover, there
exists ψs[I⋆ ] ∈ Ψ such that

max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) ≤ d_t(s[I⋆]; ψ_{s[I⋆]}) + ǫ.   (5)

Thus, we have that

max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) / d_t(s[I⋆]; µ^(t))
 (a) ≤ S^k · max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) / Σ_{s′[I⋆]∈S[I⋆]} d_t(s[I⋆]; ψ_{s′[I⋆]})
 (b) ≤ S^k · max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) / d_t(s[I⋆]; ψ_{s[I⋆]})
 (c) ≤ S^k · max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) / ( max_{π∈Π_NS[I⋆]} d_t(s[I⋆]; π) − ǫ ).

Here, (a) holds because µ^(t) = Unf(Ψ) with |Ψ| ≤ S^k, (b) holds because d_t(s[I⋆]; ψ_{s′[I⋆]}) ≥ 0 for all ψ_{s′[I⋆]} ∈ Ψ, and (c) holds by Eq. (5). Finally, since x/(x − ǫ) ≤ 2 for x ≥ 2ǫ, we conclude the proof.

A.2 Probability
Lemma A.3 (Bernstein's Inequality (e.g., Boucheron et al. (2013))). Let X_1, ..., X_N be a sequence of i.i.d. random variables with E[X_i] = µ, E[(X_i − µ)^2] = σ^2, and |X_i − µ| ≤ C almost surely. Then for all δ ∈ (0, 1),

P( | (1/N) Σ_{i=1}^{N} (X_i − µ) | ≥ sqrt( 2σ^2 log(2/δ) / N ) + C log(2/δ) / N ) ≤ δ.

Lemma A.4 (Union bound for sequences). Let {G_t}_{t=1}^{h} be a sequence of events. If P(G_t | ∩_{t′=1}^{t−1} G_{t′}) ≥ 1 − δ for all t ∈ [h], then P(∩_{t=1}^{h} G_t) ≥ 1 − hδ.

Proof of Lemma A.4. We prove the claim by induction. The base case h = 1 holds by assumption. Now, suppose the claim holds for some h′ < h, that is,

P(∩_{t=1}^{h′} G_t) ≥ 1 − h′δ.

By Bayes' rule, we have that

P(∩_{t=1}^{h′+1} G_t) = P(G_{h′+1} | ∩_{t=1}^{h′} G_t) · P(∩_{t=1}^{h′} G_t)
 (a) ≥ P(G_{h′+1} | ∩_{t=1}^{h′} G_t) · (1 − h′δ)
 (b) ≥ (1 − δ)(1 − h′δ)
 ≥ 1 − (h′ + 1)δ,

where (a) holds by the induction hypothesis and (b) holds by the assumption of the lemma. This proves the induction step and concludes the proof.

A.2.1 Concentration for Occupancy Measures


Definition A.1 (ǫ-approximate occupancy measure collection). Let D̂ = { d̂_h(· ; π) | π ∈ Π } be a set of occupancy measures for timestep h. We say that D̂ is ǫ-approximate with respect to (Π, I , h) if for all π ∈ Π, I ∈ I , and s[I] ∈ S[I] it holds that

| d̂_h(s_h[I] = s[I] ; π) − d_h(s_h[I] = s[I] ; π) | ≤ ǫ.

In the following lemma, we bound the sample complexity required to compute a set of ǫ-approximate occu-
pancy measures with respect to (µ ◦ Π ◦ Ψ, I , h), where µ is a fixed policy, Π is a set of 1-step policies, and
Ψ is a set of non-stationary policies. The proof follows from a simple application of Bernstein’s inequality
and a union bound.
Lemma A.5 (Sample complexity for ǫ-approximate occupancy measures). Let t, h ∈ N with t ≤ h be given.
Fix a mixture policy µ ∈ Πmix , a collection Γ ⊆ Π of 1-step policies, a set Ψ ⊆ ΠNS , and a collection of
factors I . Assume the following bounds hold:
1. |Ψ| ≤ S^k.
2. |Γ| ≤ O(d^k A^{S^k}).
3. |I | ≤ O(d^k).
4. For any I ∈ I it holds that |S[I]| ≤ S^k.
Consider the dataset Z^N_{t,h} = {(s_{t,n}, a_{t,n}, ψ_n, s_{h,n})}_{n=1}^{N} generated by the following process:
• Execute µ(t) := Unf(Ψ(t) ) up to layer t (resulting in state st,n ).
• Sample action at,n ∼ Unf(A) and play it, transitioning to st+1,n in the process.
• Sample ψn(t+1,h) ∼ Unf(Ψ(t+1,h) ) and execute it from layers t + 1 to h (resulting in sh,n ).
Define a collection of empirical occupancies

D̂ = { d̂_h(· ; µ ◦_t π ◦_{t+1} ψ^(t+1,h)) | π ∈ Γ, ψ^(t+1,h) ∈ Ψ },

where d̂_h(· ; µ ◦_t π ◦_{t+1} ψ^(t+1,h)) is given by (see also Line 5 in Algorithm 8)

d̂_h(s ; µ ◦_t π ◦_{t+1} ψ^(t+1,h)) = (1/N) Σ_{n=1}^{N} 1{a_{t,n} = π(s_{t,n}), ψ_n^(t+1,h) = ψ^(t+1,h), s_{h,n} = s} / ( (1/|A|) · (1/|Ψ|) ).   (6)

Then, whenever N = Ω( A S^{2k} k log(dSA/δ) / ǫ^2 ) trajectories are collected, with probability at least 1 − δ it holds that D̂ is ǫ-approximate with respect to (µ ◦_t Γ ◦_{t+1} Ψ, I , h).


Proof of Lemma A.5. Denote by ρ the policy that generates the data Z^N_{t,h}. Fix π ∈ Γ, ψ ∈ Ψ, I ∈ I , s[I] ∈ S[I]. It holds that

d̂_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ)
 (a) = Σ_{s[I^c]∈S[I^c]} ( d̂_h(s ; µ ◦_t π ◦_{t+1} ψ) − d_h(s ; µ ◦_t π ◦_{t+1} ψ) )
 = (1/N) Σ_{n=1}^{N} 1{a_{t,n} = π(s_{t,n}), ψ_n = ψ, s_{h,n}[I] = s[I]} / ( (1/|A|)·(1/|Ψ|) ) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ)
 = (1/N) Σ_{n=1}^{N} ( X_n(π, ψ, s_h[I]) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) ),

where

X_n(π, ψ, s_h[I]) := 1{a_{t,n} = π(s_{t,n}), ψ_n = ψ, s_{h,n}[I] = s[I]} / ( (1/|A|)·(1/|Ψ|) ).

Note that (a) holds by definition: both d̂_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) and d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) are given by marginalizing over all state factors in I^c. Observe that the estimator X_n is unbiased and bounded almost surely:
Eρ [Xn (π, ψ, s [I])] = dh (s [I] ; µ ◦t π ◦t+1 ψ), and 0 ≤ Xn (π, ψ, s [I]) ≤ A |Ψ| . (7)

As a result, we can control the quality of approximation of dbh (s [I] ; µ ◦t π ◦t+1 ψ) using Bernstein’s
inequality (Lemma A.3). First, observe that the variance of each term in the sum can be bounded as follows:
σ^2 := E_ρ[ ( X_n(π, ψ, s[I]) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) )^2 ]
 (a) ≤ E_ρ[ X_n(π, ψ, s[I])^2 ]
 (b) ≤ A|Ψ| · E_ρ[ X_n(π, ψ, s[I]) ]
 (c) = A|Ψ| · d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ)
 ≤ A|Ψ|.   (8)
Here (a) holds since dh (s [I] ; µ ◦t π ◦t+1 ψ) ≥ 0, (b) holds since 0 ≤ Xn (π, ψ, s [I]) ≤ A|Ψ|, and (c) holds by
Eq. (7). As a result, using Bernstein’s inequality, we have that for any fixed π ∈ Γ, ψ ∈ Ψ, I ∈ I , s [I] ∈ S [I],
with probability at least 1 − δ,

| d̂_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) |
 (a) ≤ O( sqrt( σ^2 log(1/δ) / N ) + A|Ψ| log(1/δ) / N )
 (b) ≤ O( sqrt( A|Ψ| log(1/δ) / N ) + A|Ψ| log(1/δ) / N ),

where (a) holds by Lemma A.3 and (b) holds by Eq. (8). Setting N = Θ( A|Ψ| log(1/δ) / ǫ^2 ) and using that ǫ^2 ≤ ǫ for ǫ ∈ (0, 1), we find that

| d̂_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) − d_h(s[I] ; µ ◦_t π ◦_{t+1} ψ) | ≤ O(ǫ + ǫ^2) ≤ ǫ.

Finally, taking a union bound over all π ∈ Γ, ψ ∈ Ψ, I ∈ I , s [I] ∈ S [I] and using assumptions (1) − (4), we
conclude the proof.
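For concreteness, the following Python sketch computes the importance-weighted empirical occupancies of Eq. (6) from a dataset collected as above; the trajectory format and helper names are our own assumptions, and the sketch covers only the estimator, not the data-collection process.

from collections import defaultdict

def empirical_occupancies(data, policies, psis, num_actions):
    # data: list of (s_t, a_t, psi_id, s_h) tuples, one per trajectory, collected as in Lemma A.5.
    # policies: dict mapping a policy id to a one-step policy pi: s_t -> action.
    # Returns d_hat[(pi_id, psi_id)][s_h], the estimate in Eq. (6).
    N = len(data)
    weight = num_actions * len(psis)  # equals 1 / ((1/|A|) * (1/|Psi|))
    d_hat = defaultdict(lambda: defaultdict(float))
    for (s_t, a_t, psi_id, s_h) in data:
        for pi_id, pi in policies.items():
            if pi(s_t) == a_t:  # indicator 1{a_{t,n} = pi(s_{t,n})}; keying on psi_id handles 1{psi_n = psi}
                d_hat[(pi_id, psi_id)][s_h] += weight / N
    return d_hat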

A.3 Analysis
The following elementary result shows that if two functions f̂, f : X → R are pointwise close, then any approximate optimizer for f̂ is an approximate optimizer for f.
Lemma A.6. Let X be a compact set, and let f, f̂ : X → R be such that

||f̂ − f||_∞ := max_{x∈X} |f̂(x) − f(x)| ≤ ǫ.

Then, for any ǫ′ > 0, the following results hold:
1. If max_{x∈X} f̂(x) > min_{x∈X} f̂(x) + ǫ′, then max_{x∈X} f(x) > min_{x∈X} f(x) + ǫ′ − 2ǫ.
2. If max_{x∈X} f̂(x) ≤ min_{x∈X} f̂(x) + ǫ′, then max_{x∈X} f(x) ≤ min_{x∈X} f(x) + ǫ′ + 2ǫ.
3. For any x̂ ∈ X, if max_{x∈X} f̂(x) > f̂(x̂) + ǫ′, then max_{x∈X} f(x) > f(x̂) + ǫ′ − 2ǫ.
4. For any x̂ ∈ X, if max_{x∈X} f̂(x) ≤ f̂(x̂) + ǫ′, then max_{x∈X} f(x) ≤ f(x̂) + ǫ′ + 2ǫ.
Proof of Lemma A.6. Denote the maximizer and minimizer of f by

x_{min,f} := argmin_{x∈X} f(x),  x_{max,f} := argmax_{x∈X} f(x),

and denote the maximizer and minimizer of f̂ by

x_{min,f̂} := argmin_{x∈X} f̂(x),  x_{max,f̂} := argmax_{x∈X} f̂(x).

Note that these points exist by compactness of X.

Observe that the following relations hold by the assumption that ||f̂ − f||_∞ ≤ ǫ:

max_{x∈X} f̂(x) = f̂(x_{max,f̂}) ≤ f(x_{max,f̂}) + ǫ ≤ max_{x∈X} f(x) + ǫ,   (9)
min_{x∈X} f̂(x) = f̂(x_{min,f̂}) ≥ f(x_{min,f̂}) − ǫ ≥ min_{x∈X} f(x) − ǫ,   (10)
max_{x∈X} f̂(x) ≥ f̂(x_{max,f}) ≥ f(x_{max,f}) − ǫ = max_{x∈X} f(x) − ǫ,   (11)
min_{x∈X} f̂(x) ≤ f̂(x_{min,f}) ≤ f(x_{min,f}) + ǫ = min_{x∈X} f(x) + ǫ.   (12)

Proof of the first claim. Combining relations Eq. (9) and Eq. (10) and rearranging, we have

max_{x∈X} f̂(x) > min_{x∈X} f̂(x) + ǫ′ =⇒ max_{x∈X} f(x) > min_{x∈X} f(x) + ǫ′ − 2ǫ.

Proof of the second claim. Combining relations Eq. (11) and Eq. (12) and rearranging, we have

max_{x∈X} f̂(x) ≤ min_{x∈X} f̂(x) + ǫ′ =⇒ max_{x∈X} f(x) ≤ min_{x∈X} f(x) + ǫ′ + 2ǫ.

Proof of the third claim. By Eq. (9) and the assumption that ||f − f̂||_∞ ≤ ǫ, we have

max_{x∈X} f̂(x) > f̂(x̂) + ǫ′ =⇒ max_{x∈X} f(x) > f(x̂) + ǫ′ − 2ǫ.

Proof of the fourth claim. By Eq. (11) and the assumption that ||f − f̂||_∞ ≤ ǫ, we have

max_{x∈X} f̂(x) ≤ f̂(x̂) + ǫ′ =⇒ max_{x∈X} f(x) ≤ f(x̂) + ǫ′ + 2ǫ.

Lemma A.7 (Equivalence of Maximizers for Scaled Positive Functions). Let X, Y, and A be finite sets. Let f : X × A → R and g : Y → R_+, and let P be a probability measure over X × Y. Let Π_{X×Y} and Π_X be the sets of all mappings from X × Y to A and from X to A, respectively. Then,

max_{π∈Π_{X×Y}} E_{x,y∼P}[ f(x, π(x, y)) g(y) ] = max_{π∈Π_X} E_{x,y∼P}[ f(x, π(x)) g(y) ].

Proof of Lemma A.7. By the skolemization lemma (Lemma A.9), we can exchange maximization and expectation by writing

max_{π∈Π_{X×Y}} E_{x,y∼P}[ f(x, π(x, y)) g(y) ] = E_{x,y∼P}[ max_{a∈A} ( f(x, a) g(y) ) ].   (13)

Let π⋆_f ∈ Π_X be defined via π⋆_f(x) ∈ argmax_{a∈A} f(x, a). Observe that for any (x, y) ∈ X × Y it holds that

max_{a∈A} ( f(x, a) g(y) ) (a) = g(y) max_{a∈A} f(x, a) = g(y) f(x, π⋆_f(x)),   (14)

where (a) holds because g(y) ≥ 0. Plugging Eq. (14) back into Eq. (13) we find that

max_{π∈Π_{X×Y}} E_{x,y∼P}[ f(x, π(x, y)) g(y) ] (a) = E_{x,y∼P}[ f(x, π⋆_f(x)) g(y) ] (b) ≤ max_{π∈Π_X} E_{x,y∼P}[ f(x, π(x)) g(y) ],   (15)

where (a) holds by Eq. (14), and (b) holds since π⋆_f ∈ Π_X. Finally, observe that we trivially have

max_{π∈Π_{X×Y}} E_{x,y∼P}[ f(x, π(x, y)) g(y) ] ≥ max_{π∈Π_X} E_{x,y∼P}[ f(x, π(x)) g(y) ],   (16)

since Π_X ⊆ Π_{X×Y}. Combining Eq. (15) and Eq. (16) yields the result.
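As a quick numerical illustration of Lemma A.7 (our own toy example, not part of the proof), the following Python snippet enumerates all deterministic policies on a small instance and checks that, because g ≥ 0, policies that also observe y gain no advantage over policies that observe only x.

import itertools, random

random.seed(0)
X, Y, A = list(range(3)), list(range(2)), list(range(2))
f = {(x, a): random.uniform(-1, 1) for x in X for a in A}
g = {y: random.uniform(0, 1) for y in Y}                      # nonnegative, as required
P = {(x, y): 1.0 / (len(X) * len(Y)) for x in X for y in Y}   # uniform measure

def all_policies(domain, actions):
    # All deterministic mappings from `domain` to `actions`.
    for choice in itertools.product(actions, repeat=len(domain)):
        yield dict(zip(domain, choice))

joint_domain = [(x, y) for x in X for y in Y]
best_joint = max(sum(P[x, y] * f[x, pi[(x, y)]] * g[y] for x in X for y in Y)
                 for pi in all_policies(joint_domain, A))
best_x_only = max(sum(P[x, y] * f[x, pi[x]] * g[y] for x in X for y in Y)
                  for pi in all_policies(X, A))
assert abs(best_joint - best_x_only) < 1e-12  # equality, as asserted by Lemma A.7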

Lemma A.8. Let k, k_1, k_2 ∈ N satisfying 1 ≤ k_2 ≤ k_1 − 1 ≤ k be given. Then, for all ǫ > 0,

(1 + 1/k)^{k−k_1} ǫ + ǫ/3k < (1 + 1/k)^{k−k_2} ǫ.

This further implies that (1 + 1/k)^{k−k_1} cǫ + ǫ/3k < (1 + 1/k)^{k−k_2} cǫ for all c ≥ 1.

Proof of Lemma A.8. We prove the result by explicitly bounding the difference:

(1 + 1/k)^{k−k_1} ǫ + ǫ/3k − (1 + 1/k)^{k−k_2} ǫ = ( (1 + 1/k)^{k_2−k_1} − 1 ) (1 + 1/k)^{k−k_2} ǫ + ǫ/3k
 (a) ≤ ( (1 + 1/k)^{−1} − 1 ) (1 + 1/k)^{k−k_2} ǫ + ǫ/3k
 = − (1 + 1/k)^{k−k_2} ǫ/(1 + k) + ǫ/3k
 (b) ≤ −ǫ/(1 + k) + ǫ/3k.

Here, relation (a) holds since k_2 − k_1 ≤ −1 and (1 + 1/k) ≥ 1, and relation (b) holds since k − k_2 ≥ 1, which implies that (1 + 1/k)^{k−k_2} ≥ 1. Observe that 3k > 1 + k for k ≥ 1, which implies that

−ǫ/(1 + k) + ǫ/3k < 0

for ǫ > 0. Thus, under the assumptions of the lemma, we have that (1 + 1/k)^{k−k_1} ǫ + ǫ/3k − (1 + 1/k)^{k−k_2} ǫ < 0, which implies that

(1 + 1/k)^{k−k_1} ǫ + ǫ/3k < (1 + 1/k)^{k−k_2} ǫ.

The following result is standard, so we omit the proof.


Lemma A.9 (Skolemization). Let S and A be finite sets and Π be the set of mappings from S to A. Then
for any function f : S × A → R, maxπ∈Π E[f (s, π(s))] = E[maxa f (s, a)].

A.4 I≤k (I) is a π-System
We now prove that I≤k(I) is a π-system (that is, a set system that is closed under intersection). Importantly, this implies that if I⋆ ∈ I≤k(I), then for any I ∈ I≤k(I), I⋆ ∩ I := Ien ∈ I≤k(I). This fact is used repeatedly in the design and analysis of OSSR in Section 3.3.
Lemma A.10 (I≤k (I) is a π system). For any I ∈ I≤k , I≤k (I) is a π-system:
1. I≤k (I) is non-empty.
2. For any I1 , I2 ∈ I≤k (I), we have I1 ∩ I2 ∈ I≤k (I).
Proof of Lemma A.10. Since I ∈ I≤k , we have |I| ≤ k. Furthermore, it trivially holds that I ⊆ I.
Thus, I ∈ I≤k (I), which implies that I≤k (I) is non-empty.
We now prove the second claim. By definition, every J ∈ I≤k (I) has I ⊆ J . Thus, for any I1 , I2 ∈ I≤k (I),

I ⊆ I1 ∩ I2 . (17)

Furthermore, since both |I1| ≤ k and |I2| ≤ k, we have

|I1 ∩ I2| ≤ min{|I1|, |I2|} ≤ k.   (18)

Combining Eq. (17) and Eq. (18) implies that I1 ∩ I2 ∈ I≤k (I).

B Structural Results for ExoMDPs


B.1 Bellman Rank for the ExoMDP Setting
In this section we show that in general, the ExoMDP setting does not admit low Bellman rank (Jiang et al.,
2017), which is a standard structural complexity measure that enables tractable reinforcement learning in
large state spaces. We expect that similar arguments apply for the related complexity measures (Jin et al.,
2021; Du et al., 2021) and other variations. We note that Efroni et al. (2021b) showed that the more general
Exogenous Block MDP model does not admit low Bellman rank. Here, we show that the same conclusion
holds for the specialized ExoMDP model.
Recall that Bellman rank is a complexity measure that depends on the underlying MDP and on a class
of action-value functions F used to approximate Q⋆ . For a policy π, denote the average Bellman error of
function f ∈ F by

E_h(π, f) := E_{s_h∼π, a_h∼π_f}[ f(s_h, a_h) − r_h − f(s_{h+1}, π_f(s_{h+1})) ].

With ΠF := {πf : f ∈ F } we define Eh (ΠF , F ) = {Eh (π, f )}π∈ΠF ,f ∈F as the matrix of Bellman residuals
indexed by policies and value functions. The Bellman rank is defined as maxh rank(Eh (ΠF , F )).
Proposition B.1. For every d = 2^i with i ∈ N, there exists (i) an ExoMDP with S = 3, A = 2, H = 2, d exogenous factors and 1 endogenous factor, and (ii) a function class F containing d functions, one of which is Q⋆ and the rest of which induce policies that are 1/8-suboptimal, such that the Bellman rank is at least d − 1.
Proof. We construct an ExoMDP with H = 2, A = {1, 2} (so that A = 2), a single endogenous factor with values in {1, 2, 3}, and d binary exogenous factors with values in {0, 1}.
Let ei ∈ Rd denote the ith standard basis element. We take the first factor to be endogenous, and construct
the initial distribution, transition dynamics, and rewards as follows:
• d1 = Unif({(1, ei )}i∈[d] ).
• T ((2, ei ) | (1, ei ), 1) = 1, and T ((3, ei ) | (1, ei ), 2) = 1.
• R((2, ei ), ·) = 1/2, and R((3, ei ), ·) = 3/4.

There is only a single, terminal action at states (2, ei ), (3, ei ), which we suppress from the notation. It is
straightforward to verify that this is an ExoMDP. Note that the optimal policy takes action 2 at the initial
state, and we have V ⋆ = 3/4.
We first construct the class F. Since d is a power of 2, there exist subsets A_1, ..., A_{d−1} ⊂ [d] such that:3

∀j ∈ [d−1]: |A_j| = d/2,  ∀j ≠ k ∈ [d−1]: |A_j ∩ A_k| = d/4.

We define F = {f_0, f_1, ..., f_{d−1}}, with f_0 = Q⋆ and each f_j associated with subset A_j as follows:

f_j((1, e_i), 2) = 3/4,  f_j((3, e_i), ·) = 3/4,
f_j((1, e_i), 1) = 1{i ∈ A_j},  f_j((2, e_i), ·) = 1{i ∈ A_j}.

Observe that since there is no reward at the first timestep, each function has zero Bellman error there (that is, E_1(π_{f_i}, f_j) = 0 for all i, j ∈ {0, ..., d−1}). On the other hand, for j, k ∈ [d−1] we have

E_2(π_{f_j}, f_k) = (1/d) Σ_{i=1}^{d} [ 1{i ∈ A_j}( f_k((2, e_i), ·) − 1/2 ) + 1{i ∉ A_j}( f_k((3, e_i), ·) − 3/4 ) ]
 = (1/d) Σ_{i=1}^{d} 1{i ∈ A_j}( f_k((2, e_i), ·) − 1/2 )
 = (1/d) Σ_{i=1}^{d} [ 1{i ∈ A_j ∩ A_k}(1 − 1/2) + 1{i ∈ A_j ∩ Ā_k}(0 − 1/2) ]
 = (1/4) · 1{j = k},

where we have used that |A_j ∩ A_k| = |A_j ∩ Ā_k| = d/4 when j ≠ k, while |A_j ∩ A_j| = d/2. This shows that we can embed a scaled (d − 1) × (d − 1) identity matrix in E_2(Π_F, F), so we have rank(E_2(Π_F, F)) ≥ d − 1.
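As a sanity check of this construction (our own illustration, not part of the proof), the following Python sketch builds A_1, ..., A_{d−1} from the rows of a Sylvester-type Walsh-Hadamard matrix, as suggested in the footnote, and verifies the intersection properties together with the resulting layer-2 Bellman error matrix.

import itertools
import numpy as np

def walsh_subsets(d):
    # A_j = columns where row j of a d x d Walsh-Hadamard matrix equals +1 (rows 1..d-1).
    H = np.array([[1]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return [set(np.where(H[j] == 1)[0]) for j in range(1, d)]

d = 8
subsets = walsh_subsets(d)
assert all(len(A) == d // 2 for A in subsets)
assert all(len(A & B) == d // 4 for A, B in itertools.combinations(subsets, 2))

# E_2(pi_{f_j}, f_k) = (1/d) * sum_i 1{i in A_j} * (1{i in A_k} - 1/2), as in the display above.
E2 = np.array([[sum((i in Ak) - 0.5 for i in Aj) / d for Ak in subsets] for Aj in subsets])
assert np.allclose(E2, 0.25 * np.eye(d - 1))  # a scaled identity matrix, hence rank d - 1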

B.2 Structural Results for State Occupancies


In this section we provide structural results concerning the state occupancy measures in the ExoMDP model.
These results refine certain results derived for the more general EX-BMDP model in Efroni et al. (2021b).
For the first result, we adopt the shorthand

dπh (s[I]) := dh (s[I] ; π) := Pπ (sh [I] = s[I]).

Lemma B.1 (Decoupling of state occupancy measures). Fix t, h ∈ [H] such that t ≤ h. Let π ∈ ΠNS [I⋆ ] be
an endogenous policy and let I be any factor set. Then for any s′ [I] ∈ S [I] and s ∈ S, a ∈ A the following
claims hold.
1. dπh (s′ [I] | st = s, at = a) = dπh (s′ [Ien ] | st [I⋆ ] = s[I⋆ ], at = a) · dh (s′ [Iex ] | st [I⋆c ] = s[I⋆c ]).
2. dπh (s′ [I] | st = s) = dπh (s′ [Ien ] | st [I⋆ ] = s[I⋆ ]) · dh (s′ [Iex ] | st [I⋆c ] = s[I⋆c ]).
3. For any endogenous mixture policy µ ∈ Πmix [I⋆ ] and factor set I,

dµh (s[I]) = dµh (s[Ien ]) · dh (s[Iex ]).

Hence, the random variables (sh [Ien ], sh [Iex ]) are independent under µ.
Proof of Lemma B.1. The proof follows a simple backwards induction argument.

Proof of Claims 1 and 2. We prove the two claims by induction on t′ = h − 1, .., t.


Footnote 3: This can be seen by associating the sets with rows of a Walsh matrix.

Base case: t′ = h − 1. The base case holds as an immediate consequence of the ExoMDP structure. In more detail, we have the following results.
1. Claim 1.
d^π_h(s′[I] | s_{h−1} = s, a_{h−1} = a)
 = Σ_{s′[I^c]∈S[I^c]} T(s′ | s, a)
 = Σ_{s′[I⋆\Ien]∈S[I⋆\Ien]} Σ_{s′[I⋆^c\Iex]∈S[I⋆^c\Iex]} T(s′[I⋆] | s[I⋆], a) T(s′[I⋆^c] | s[I⋆^c])
 = ( Σ_{s′[I⋆\Ien]∈S[I⋆\Ien]} T(s′[I⋆] | s[I⋆], a) ) ( Σ_{s′[I⋆^c\Iex]∈S[I⋆^c\Iex]} T(s′[I⋆^c] | s[I⋆^c]) )
 = d^π_h(s′[Ien] | s_{h−1}[I⋆] = s[I⋆], a_{h−1} = a) · d_h(s′[Iex] | s_{h−1}[I⋆^c] = s[I⋆^c]).   (19)
2. Claim 2.
d^π_h(s′[I] | s_{h−1} = s)
 (a) = Σ_{a∈A} d^π_h(s′[I] | s_{h−1} = s, a_{h−1} = a) π_{h−1}(a | s[I⋆])
 (b) = Σ_{a∈A} d^π_h(s′[Ien] | s_{h−1}[I⋆] = s[I⋆], a_{h−1} = a) d_h(s′[Iex] | s_{h−1}[I⋆^c] = s[I⋆^c]) π_{h−1}(a | s[I⋆])
 = d_h(s′[Iex] | s_{h−1}[I⋆^c] = s[I⋆^c]) Σ_{a∈A} d^π_h(s′[Ien] | s_{h−1}[I⋆] = s[I⋆], a_{h−1} = a) π_{h−1}(a | s[I⋆])
 (c) = d_h(s′[Iex] | s_{h−1}[I⋆^c] = s[I⋆^c]) · d^π_h(s′[Ien] | s_{h−1}[I⋆] = s[I⋆]).
Here (a) holds by Bayes' rule and because π ∈ Π[I⋆] is an endogenous policy, (b) holds by Eq. (19), and (c) holds by Bayes' rule and the law of total probability.

Induction step. Fix t′ < h − 1 and assume the induction hypothesis holds for t′ + 1.
1. Claim 1.
d^π_h(s′[I] | s_{t′} = s, a_{t′} = a)
 = Σ_{s̄∈S} d^π_h(s′[I] | s_{t′+1} = s̄) P(s_{t′+1} = s̄ | s_{t′} = s, a_{t′} = a)
 (a) = Σ_{s̄∈S} d^π_h(s′[I] | s_{t′+1} = s̄) T(s̄[I⋆] | s[I⋆], a) T(s̄[I⋆^c] | s[I⋆^c])
 (b) = ( Σ_{s̄[I⋆]∈S[I⋆]} d^π_h(s′[Ien] | s_{t′+1}[I⋆] = s̄[I⋆]) T(s̄[I⋆] | s[I⋆], a) ) × ( Σ_{s̄[I⋆^c]∈S[I⋆^c]} d_h(s′[Iex] | s_{t′+1}[I⋆^c] = s̄[I⋆^c]) T(s̄[I⋆^c] | s[I⋆^c]) )
 = d^π_h(s′[Ien] | s_{t′}[I⋆] = s[I⋆], a_{t′} = a) · d_h(s′[Iex] | s_{t′}[I⋆^c] = s[I⋆^c]),   (20)
where (a) holds by the ExoMDP model assumption (Section 2), and (b) holds by the induction hypothesis.
2. Claim 2.
d^π_h(s′[I] | s_{t′} = s)
 (a) = Σ_{a∈A} d^π_h(s′[I] | s_{t′} = s, a_{t′} = a) π_{t′}(a | s[I⋆])
 (b) = Σ_{a∈A} d^π_h(s′[Ien] | s_{t′}[I⋆] = s[I⋆], a_{t′} = a) d_h(s′[Iex] | s_{t′}[I⋆^c] = s[I⋆^c]) π_{t′}(a | s[I⋆])
 = d_h(s′[Iex] | s_{t′}[I⋆^c] = s[I⋆^c]) Σ_{a∈A} d^π_h(s′[Ien] | s_{t′}[I⋆] = s[I⋆], a_{t′} = a) π_{t′}(a | s[I⋆])
 (c) = d_h(s′[Iex] | s_{t′}[I⋆^c] = s[I⋆^c]) · d^π_h(s′[Ien] | s_{t′}[I⋆] = s[I⋆]).
Here (a) holds by Bayes' rule and because π ∈ Π[I⋆] is an endogenous policy, (b) holds by Eq. (20), and (c) holds by Bayes' rule and the law of total probability.
This proves the induction step and both claims.

Proof of Claim 3. We first prove that the claim holds for π ∈ Π_NS[I⋆]. That is, for any π ∈ Π_NS[I⋆], factor set I, and s[I], we have

d^π_h(s[I]) = d^π_h(s[Ien]) · d_h(s[Iex]).   (21)

This yields the result, since for µ ∈ Π_mix[I⋆], Eq. (21) implies that

d^µ_h(s[I]) = E_{π∼µ}[d^π_h(s[I])] = E_{π∼µ}[d^π_h(s[Ien]) · d_h(s[Iex])] = E_{π∼µ}[d^π_h(s[Ien])] · d_h(s[Iex]) = d^µ_h(s[Ien]) · d_h(s[Iex]).

We now prove Eq. (21). Fix π ∈ Π_NS[I⋆], and observe that

d^π_h(s) (a) = E_{s_1∼d_1}[ d^π_h(s | s_1) ]
 (b) = E_{s_1∼d_1}[ d^π_h(s[I⋆] | s_1[I⋆]) · d_h(s[I⋆^c] | s_1[I⋆^c]) ]
 (c) = E_{s_1[I⋆]∼d_1}[ d^π_h(s[I⋆] | s_1[I⋆]) ] · E_{s_1[I⋆^c]∼d_1}[ d_h(s[I⋆^c] | s_1[I⋆^c]) ]
 = d^π_h(s[I⋆]) · d_h(s[I⋆^c]).   (22)

Relation (a) holds by the tower property, and relation (b) holds by the second claim of the lemma, because π is an endogenous policy. Relation (c) holds because s_1[I⋆] and s_1[I⋆^c] are independent (by the ExoMDP model assumption, we have d_1(s) = d_1(s[I⋆]) d_1(s[I⋆^c])).
The relation in Eq. (22) now implies the result:

d^π_h(s[I]) (a) = Σ_{s[I⋆\Ien]∈S[I⋆\Ien]} Σ_{s[I⋆^c\Iex]∈S[I⋆^c\Iex]} d^π_h(s)
 (b) = Σ_{s[I⋆\Ien]∈S[I⋆\Ien]} Σ_{s[I⋆^c\Iex]∈S[I⋆^c\Iex]} d^π_h(s[I⋆]) d_h(s[I⋆^c])
 = ( Σ_{s[I⋆\Ien]∈S[I⋆\Ien]} d^π_h(s[I⋆]) ) ( Σ_{s[I⋆^c\Iex]∈S[I⋆^c\Iex]} d_h(s[I⋆^c]) )
 = d^π_h(s[Ien]) · d_h(s[Iex]),

where (a) holds by the law of total probability and (b) holds by Eq. (22).

Lemma B.2 (Restriction lemma). Fix h, t ∈ [H] where t ≤ h − 1. Let µ ∈ Πmix [I⋆ ] and ρ ∈ ΠNS [I⋆ ] be
endogenous policies. Let J and I be two factor sets. Then, for all s [I] ∈ S [I] it holds that

max_{π∈Π[J]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = max_{π∈Π[Jen]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).

Let us briefly sketch the proof. To begin, we marginalize over the factor set J^c := [d] \ J at layer t. We then show that if µ and ρ are endogenous policies, then for all π ∈ Π and s[I] ∈ S[I],

d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = E_{s[J]∼d_t(· ; µ)}[ f(s[Jen], π(s[J])) g(s[Jex]) ],   (23)

where both f and g are maps to R_+. We observe that the policy π_f(s[Jen]) ∈ argmax_a f(s[Jen], a) also maximizes Eq. (23). The result follows by observing that π_f ∈ Π[Jen].
Proof of Lemma B.2. Fix s[I] ∈ S[I]. By the tower property,

d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = E_{s[J]∼d_t(· ; µ)}[ E_{s[J^c]∼d_t(·|s_t[J]=s[J] ; µ)}[ d_h(s[I] | s_t = s ; µ ◦_t π ◦_{t+1} ρ) ] ],   (24)

where we use that, by the Markov assumption on the dynamics, conditioning on the full state s at timestep t makes the future independent of the history. Define

(⋆) := d_h(s[I] | s_t = s ; µ ◦_t π ◦_{t+1} ρ),
(⋆⋆) := E_{s[J^c]∼d_t(·|s_t[J]=s[J] ; µ)}[ d_h(s[I] | s_t = s ; µ ◦_t π ◦_{t+1} ρ) ].

Analysis of term (⋆). Let π ∈ Π[J]. Fix s ∈ S at the t-th timestep, and observe that a = π(s[J]) is also fixed, since the policy π is a deterministic function of s[J]. Then

d_h(s[I] | s_t = s ; µ ◦_t π ◦_{t+1} ρ)
 (a) = d_h(s[I] | s_t = s, a_t = π(s[J]) ; ρ)
 (b) = d_h(s[Ien] | s_t[I⋆] = s[I⋆], a_t = π(s[J]) ; ρ) · d_h(s[Iex] | s_t[I⋆^c] = s[I⋆^c])
 =: f̄(s_t[I⋆], π(s[J])) · ḡ(s_t[I⋆^c]).   (25)

Relation (a) holds by the Markov property for the MDP, and relation (b) holds by the first statement of Lemma B.1, which shows that the endogenous and exogenous state factors are decoupled; note that the assumptions of Lemma B.1 hold because ρ is an endogenous policy and a = π(s[J]) is fixed. In addition, both f̄(·) and ḡ(·) are mappings to R_+.

Analysis of term (⋆⋆). We consider term (⋆⋆) and analyze it by marginalizing over the state factors not contained in s[J]. Observe that d_t(s[J^c] | s_t[J] = s[J] ; µ) also factorizes between the endogenous and exogenous factors, due to the decoupling lemma (Lemma B.1, Claim 3):

d_t(s[J^c] | s_t[J] = s[J] ; µ) = d_t(s[I⋆ \ Jen] | s_t[Jen] = s[Jen] ; µ) · d_t(s[I⋆^c \ Jex] | s_t[Jex] = s[Jex]).   (26)

Hence, we have

E_{s[J^c]∼d_t(·|s_t[J]=s[J] ; µ)}[ d_h(s[I] | s_t = s ; µ ◦_t π ◦_{t+1} ρ) ]
 (a) = E_{s[J^c]∼d_t(·|s_t[J]=s[J] ; µ)}[ f̄(s[I⋆], π(s[J])) ḡ(s[I⋆^c]) ]
 (b) = E_{s[I⋆\Jen]∼d_t(·|s_t[Jen]=s[Jen] ; µ)}[ f̄(s[I⋆], π(s[J])) ] · E_{s[I⋆^c\Jex]∼d_t(·|s_t[Jex]=s[Jex])}[ ḡ(s[I⋆^c]) ]
 =: f(s[Jen], π(s[J])) · g(s[Jex]),   (27)

where (a) holds by the calculation of term (⋆) in Eq. (25), and (b) holds by the decoupling of the occupancy measure d_t(s[J^c] | s_t[J] = s[J] ; µ) in Eq. (26).

Combining the results. Plugging the expression in Eq. (27) back into Eq. (24) yields

d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = E_{s[J]∼d_t(· ; µ)}[ f(s_t[Jen], π(s_t[J])) g(s_t[Jex]) ].   (28)

We conclude the proof by invoking Lemma A.7, which gives

max_{π∈Π[J]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) (a) = max_{π∈Π[J]} E_{s[J]∼d_t(· ; µ)}[ f(s[Jen], π(s[J])) g(s[Jex]) ]
 (b) = max_{π∈Π[Jen]} E_{s[J]∼d_t(· ; µ)}[ f(s[Jen], π(s[Jen])) g(s[Jex]) ]
 (c) = max_{π∈Π[Jen]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).

Relations (a) and (c) hold by Eq. (28). Relation (b) holds by invoking Lemma A.7 with X = S[Jen], Y = S[Jex], X × Y = S[J], f(x, a) = f(s[Jen], a), g(y) = g(s[Jex]), Π_{X×Y} = Π[J], and Π_X = Π[Jen].

The following result is proven as a consequence of the restriction lemma (Lemma B.2).

Lemma B.3 (Existence of endogenous policy cover). Fix h, t ∈ [H] with t ≤ h − 1. Let µ ∈ Πmix [I⋆ ] and
ρ ∈ ΠNS [I⋆ ] be endogenous policies. Let I be a factor set and I be a collection of factor sets with I⋆ ∈ I .
Then for all s [I] ∈ S [I],

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).

Proof of Lemma B.3. For all J = Jen ∪ Jex ∈ I and s [I] ∈ S [I], we have
max_{π∈Π[J]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) (a) = max_{π∈Π[Jen]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ)
 (b) ≤ max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ),   (29)

where (a) holds by Lemma B.2, and (b) holds because Π[Jen] ⊆ Π[I⋆] (since Jen ⊆ I⋆). Since Eq. (29) holds for all J ∈ I , we conclude that

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) ≤ max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).   (30)

On the other hand, since Π[I⋆] ⊆ Π[I ], it trivially holds that

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) ≥ max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).   (31)

Combining Eq. (30) and Eq. (31) yields the result.

Consider the problem of finding a policy π that maximizes

d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ),   (32)

where both µ and ρ are endogenous policies. Our next result (Lemma B.4) shows that if π̂ is an endogenous policy that is approximately optimal for reaching s[Ien], in the sense that

max_{π∈Π[I ]} d_h(s[Ien] ; µ ◦_t π ◦_{t+1} ρ) ≤ d_h(s[Ien] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ,   (33)

then it is also approximately optimal for Eq. (32), in the sense that

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) ≤ d_h(s[I] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ.

Lemma B.4 (Optimizing for endogenous factors is sufficient). Fix h, t ∈ [H] with t ≤ h − 1. Let µ ∈ Π_mix, π̂ ∈ Π, and ρ ∈ Π_NS be given. Let I be a factor set and I be a collection of factor sets such that I⋆ ∈ I . Fix s[I] ∈ S[I] and assume that:
(A1) µ, ρ, and π̂ are endogenous.
(A2) π̂ is approximately optimal for s[Ien]:

max_{π∈Π[I ]} d_h(s[Ien] ; µ ◦_t π ◦_{t+1} ρ) ≤ d_h(s[Ien] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ.

Then

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) ≤ d_h(s[I] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ.

Proof of Lemma B.4. By assumption (A1), µ and ρ are endogenous policies, so Lemma B.3 yields

max_{π∈Π[I ]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) = max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ).   (34)

Next, we observe that the following relations hold:

max_{π∈Π[I⋆]} d_h(s[I] ; µ ◦_t π ◦_{t+1} ρ) (a) = ( max_{π∈Π[I⋆]} d_h(s[Ien] ; µ ◦_t π ◦_{t+1} ρ) ) · d_h(s[Iex])
 (b) ≤ d_h(s[Ien] ; µ ◦_t π̂ ◦_{t+1} ρ) · d_h(s[Iex]) + ǫ
 (c) = d_h(s[Ien], s[Iex] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ
 = d_h(s[I] ; µ ◦_t π̂ ◦_{t+1} ρ) + ǫ.   (35)

Relation (a) holds by Lemma B.1, as µ ◦_t π ◦_{t+1} ρ is an endogenous policy. Relation (b) holds by assumption (A2) and because d_h(s[Iex]) ≤ 1. Relation (c) holds by Lemma B.1; note that the assumptions of the lemma are satisfied because µ ◦_t π̂ ◦_{t+1} ρ is endogenous. Combining Eq. (34) and Eq. (35) concludes the proof.

B.3 Structural Results for Value Functions


In this section we provide structural results concerning the value functions of endogenous policies in the ExoMDP model. These results leverage the assumption that the rewards depend only on endogenous components. We repeatedly invoke the notion of an endogenous MDP M_en = (S[I⋆], A, T_en, R_en, H, d_{1,en}), which corresponds to the restriction of an ExoMDP M to the endogenous component of the state space. Note that only endogenous policies are well-defined in the endogenous MDP. We also denote the state-action and state value functions of an endogenous policy π measured in M_en by Q^π_{h,en}(s[I⋆], a) and V^π_{h,en}(s[I⋆]), respectively.
Our first result is a straightforward extension of Proposition 5 in Efroni et al. (2021b). It shows that the
value function for any endogenous policy in an ExoMDP is an endogenous function in the sense that it only
depends on the endogenous state factors.
Lemma B.5 (Value functions for endogenous policies are endogenous). Let π ∈ Π_NS[I⋆] be an endogenous policy, and assume that the reward function is endogenous. Then, for any t ∈ [H] and s ∈ S, we have

V^π_t(s) = V^π_{t,en}(s[I⋆]) and Q^π_t(s, a) = Q^π_{t,en}(s[I⋆], a),

where V^π_{t,en} and Q^π_{t,en} are the value functions for π in the endogenous MDP M_en = (S[I⋆], A, T_en, R_en, H, d_{1,en}).

Proof of Lemma B.5. Let R = {R_h}_{h=1}^{H} denote the reward function. We prove the result via induction. The base case t = H holds by the assumption that the reward is endogenous. Next, assume the claim is correct for t + 1, and let us prove it for t. Since R_t is endogenous, the inductive hypothesis yields

V^π_t(s) = E_π[ R_{en,t}(s[I⋆], π_t(s[I⋆])) + V^π_{t+1,en}(s_{t+1}[I⋆]) | s_t = s, a_t = π_t(s[I⋆]) ]
 (a) = R_{en,t}(s[I⋆], π_t(s[I⋆])) + Σ_{s′[I⋆]∈S[I⋆]} T_en(s′[I⋆] | s[I⋆], π_t(s[I⋆])) V^π_{t+1,en}(s′[I⋆]) · Σ_{s′[I⋆^c]∈S[I⋆^c]} T(s′[I⋆^c] | s[I⋆^c])
 (b) = R_{en,t}(s[I⋆], π_t(s[I⋆])) + Σ_{s′[I⋆]∈S[I⋆]} T_en(s′[I⋆] | s[I⋆], π_t(s[I⋆])) V^π_{t+1,en}(s′[I⋆]),   (36)

where (a) holds by the factorization of the transition operator (see Eq. (1)), and (b) holds by marginalizing over the exogenous factors, since Σ_{s′[I⋆^c]∈S[I⋆^c]} T(s′[I⋆^c] | s[I⋆^c]) = 1. Finally, observe that Eq. (36) is precisely the value function for π in the endogenous MDP M_en = (S[I⋆], A, T_en, R_en, H, d_{1,en}), which concludes the proof.

Lemma B.6 (Performance difference lemma for endogenous policies). Let π, π′ ∈ Π_NS[I⋆] be endogenous policies. Then

J(π) − J(π′) = E_π[ Σ_{t=1}^{H} Q^{π′}_t(s_t[I⋆], π_t(s_t[I⋆])) − Q^{π′}_t(s_t[I⋆], π′_t(s_t[I⋆])) ].

Proof of Lemma B.6. For any endogenous policy π, observe that

J(π) := E_{s_1∼d_1}[V^π_1(s_1)] (a) = E_{s_1∼d_1}[V^π_1(s_1[I⋆])] (b) = E_{s_1[I⋆]∼d_{1,en}}[V^π_1(s_1[I⋆])] = J_en(π),   (37)

where relation (a) holds by Lemma B.5, and relation (b) holds by marginalizing out s_1[I⋆^c], since V^π_1(s_1[I⋆]) does not depend on this quantity; the final equality holds since J_en(π) is the average of this value with respect to the initial endogenous distribution. Using Eq. (37) and applying the standard performance difference lemma (Lemma A.1) to the endogenous MDP M_en now yields

J(π) − J(π′) = J_en(π) − J_en(π′) = E_π[ Σ_{t=1}^{H} Q^{π′}_t(s_t[I⋆], π_t(s_t[I⋆])) − Q^{π′}_t(s_t[I⋆], π′_t(s_t[I⋆])) ].

Lemma B.7 (Restriction lemma for endogenous rewards). Fix t ≤ h. Let µ ∈ Π_mix[I⋆] and ρ ∈ Π_NS[I⋆] be endogenous policies. Define

V_{t,h}(µ ◦_t π ◦_{t+1} ρ) := E_{µ ◦_t π ◦_{t+1} ρ}[ Σ_{t′=t}^{h} r_{t′} ].   (38)

Assume that R is an endogenous reward function. Then for any factor set I, we have

max_{π∈Π[I]} V_{t,h}(µ ◦_t π ◦_{t+1} ρ) = max_{π∈Π[Ien]} V_{t,h}(µ ◦_t π ◦_{t+1} ρ).

To prove this result, we generalize the proof technique used in the restriction lemma for state occupancy
measures (Lemma B.2).
Proof of Lemma B.7. Since µ ∈ Π_mix[I⋆] is an endogenous policy, the occupancy measure at the t-th timestep factorizes. That is, by the third statement of Lemma B.1, we have that

d_t(s[I] ; µ) = d_t(s[Ien] ; µ) · d_t(s[Iex]).

For each s[I] ∈ S[I], the conditional state occupancy measure factorizes as well:

d_t(s[I^c] | s_t[I] = s[I] ; µ) = d_t(s[I⋆ \ Ien] | s_t[Ien] = s[Ien] ; µ) · d_t(s[I⋆^c \ Iex] | s_t[Iex] = s[Iex]).   (39)

Let Q^ρ_{t,en} be the Q-function in the endogenous MDP M_en = (S[I⋆], A, T_en, R_en, h, d_{1,en}) when executing policy ρ starting from timestep t + 1. We can express the value function as follows:

V_{t,h}(µ ◦_t π ◦_{t+1} ρ) = E_µ[ Q^ρ_t(s_t, π_t(s_t[I])) ]
 (a) = E_µ[ Q^ρ_{t,en}(s_t[I⋆], π_t(s_t[I])) ]
 = E_{s[I]∼d_t(· ; µ)}[ E_{s[I^c]∼d_t(·|s_t[I]=s[I] ; µ)}[ Q^ρ_{t,en}(s[I⋆], π_t(s[I])) ] ]
 (b) = E_{s[I]∼d_t(· ; µ)}[ E_{s[I⋆\Ien]∼d_t(·|s_t[Ien]=s[Ien] ; µ)}[ Q^ρ_{t,en}(s[I⋆], π_t(s[I])) ] ].   (40)

Relation (a) holds by Lemma B.5, since ρ is an endogenous policy. Relation (b) holds by the decoupling of the conditional occupancy measure (Eq. (39)), and because Q^ρ_{t,en}(s[I⋆], π_t(s[I])) does not depend on the state factors in I⋆^c \ Iex, which are marginalized out.
To proceed, define

f(s[Ien], π_t(s[I])) := E_{s[I⋆\Ien]∼d_t(·|s_t[Ien]=s[Ien] ; µ)}[ Q^ρ_{t,en}(s[I⋆], π_t(s[I])) ].

With this notation, we can rewrite the expression in Eq. (40) as

V_{t,h}(µ ◦_t π ◦_{t+1} ρ) = E_{s[I]∼d_t(· ; µ)}[ f(s[Ien], π_t(s[I])) ].   (41)

We now invoke Lemma A.7, which shows that

max_{π∈Π[I]} V_{t,h}(µ ◦_t π ◦_{t+1} ρ) (a) = max_{π∈Π[I]} E_{s[I]∼d_t(· ; µ)}[ f(s[Ien], π(s[I])) ]
 (b) = max_{π∈Π[Ien]} E_{s[I]∼d_t(· ; µ)}[ f(s[Ien], π(s[Ien])) ]
 (c) = max_{π∈Π[Ien]} V_{t,h}(µ ◦_t π ◦_{t+1} ρ).

Relations (a) and (c) hold by Eq. (41). Relation (b) holds by invoking Lemma A.7 with X = S[Ien], Y = S[Iex], X × Y = S[I], f(x, a) = f(s[Ien], a), g(y) = 1, Π_{X×Y} = Π[I], and Π_X = Π[Ien].

C Noise-Tolerant Search over Endogenous Factors: Algorithmic Template
In this section we provide a general template for designing error-tolerant algorithms that search over endogenous factor sets. This template is used in both EndoPolicyOptimizationǫt,h and EndoFactorSelectionǫt,h (subroutines of OSSR).
Our algorithm design template, AbstractFactorSearch, is presented in Algorithm 4. Let us describe the motivation. Let Z be an abstract "dataset" (typically, a collection of trajectories), let ǫ > 0 be a precision parameter, and let Condition(Z, ǫ, I) ∈ {true, false} be an abstract function defined over factor sets I. AbstractFactorSearch addresses the problem of finding an endogenous factor set Î ⊆ I⋆ such that

Condition(Z, C · ǫ, Î) = true   (42)
for a numerical constant C ≥ 1, assuming that the endogenous factors I⋆ satisfy the condition themselves:

Condition(Z, ǫ, I⋆ ) = true. (43)

Algorithm 4 AbstractFactorSearch
1: require: abstract dataset Z, precision ǫ, initial endogenous factor set I_0 ⊆ I⋆.
2: for k′ = |I_0|, |I_0| + 1, ..., k do
3:  Set ǫ_{k′} = (1 + 1/k)^{k−k′} ǫ.
4:  for I ∈ I_{k′}(I_0) do
5:   if Condition(Z, ǫ_{k′}, I) = true then return Î ← I.
6: return fail.

For example, within EndoPolicyOptimizationǫt,h, Condition(Z, ǫ, I) checks whether policies that act on the factor set I lead to ǫ-optimal value for a given reward function (approximated using trajectories in Z).
AbstractFactorSearch begins with an initial set of endogenous factors I_0 ⊆ I⋆. Naturally, since I⋆ ∈ I≤k(I_0) and I⋆ is known to satisfy Eq. (43), a naive approach would be to enumerate over the collection I≤k(I_0) to find a factor set Î ∈ I≤k(I_0) that satisfies Eq. (42). For example, consider the following procedure:
• For each I ∈ I≤k(I_0), check whether Condition(Z, Cǫ, I) = true.
• If so, return Î ← I.
It is straightforward to see that this approach returns a factor set Î ∈ I≤k(I_0) that satisfies Eq. (42), but the issue is that there is nothing preventing Î from containing exogenous factors. AbstractFactorSearch resolves this problem by searching for factors in a bottom-up fashion. The algorithm begins by searching over factor sets with minimal cardinality (k′ = |I_0|), and gradually increases the size until a factor set satisfying Eq. (42) is found.
In more detail, observe that we have

I≤k(I_0) = ∪_{k′=|I_0|}^{k} I_{k′}(I_0), where I_{k′}(I_0) := {I′ ⊆ [d] | I_0 ⊆ I′, |I′| = k′}.

Starting from k′ = |I_0|, AbstractFactorSearch checks whether there exists a set of factors I ∈ I_{k′}(I_0) that satisfies Condition(···) with respect to an accuracy parameter ǫ_{k′} = (1 + 1/k)^{k−k′} ǫ; this choice allows for larger errors for smaller k′. When a set of factors I satisfies Eq. (42), AbstractFactorSearch halts and returns this set; otherwise, k′ is increased. For this approach to succeed, we assume that Condition satisfies the following property.
Assumption C.1. For any set of factors I = Ien ∪ Iex with |Iex | ≥ 1, it holds that

Condition(Z, ǫ|I| , I) = true =⇒ Condition(Z, ǫ|Ien | , Ien ) = true. (44)

We now describe three key steps used to prove that this scheme succeeds.
1. AbstractFactorSearch does not return fail. This follows immediately from the assumption that (43) is
satisfied.
2. AbstractFactorSearch returns an endogenous set of factors. Observe that the assumption I⋆ ∈ I≤k (I0 )
implies that for any I ∈ I≤k (I0 ), Ien := I⋆ ∩I ∈ I≤k (I0 ); this follows from Lemma A.10. Hence, if I
satisfies Eq. (42), Assumption C.1 implies that Ien satisfies Eq. (42) as well. Since AbstractFactorSearch
scans I≤k (I0 ) in a bottom-up fashion, this means it must return an endogenous factor set, since it
will verify that Ien satisfies Eq. (42) prior to I.
3. AbstractFactorSearch is near-optimal. Since (1 + 1/k)^{k−k′} ǫ ≤ 3ǫ for all k′ ∈ [k], the factor set Î returned by AbstractFactorSearch satisfies Condition(Z, 3ǫ, Î) = true.
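For concreteness, the following Python sketch mirrors Algorithm 4 under the assumption that factor sets are represented as subsets of {0, ..., d−1} and that Condition is supplied as a black-box predicate; the helper names are our own.

from itertools import combinations

def factors_of_size(I0, kprime, d):
    # All factor sets I with I0 ⊆ I ⊆ [d] and |I| = kprime, i.e. the collection I_{k'}(I0).
    rest = [i for i in range(d) if i not in I0]
    for extra in combinations(rest, kprime - len(I0)):
        yield frozenset(I0) | frozenset(extra)

def abstract_factor_search(Z, eps, I0, k, d, condition):
    for kprime in range(len(I0), k + 1):
        eps_k = (1 + 1 / k) ** (k - kprime) * eps  # looser tolerance for smaller factor sets
        for I in factors_of_size(I0, kprime, d):
            if condition(Z, eps_k, I):
                return I  # first (hence smallest) satisfying set
    return None  # corresponds to the fail branch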

Part II

Omitted Subroutines
D Finding a Near-Optimal Endogenous Policy: EndoPolicyOptimization
Algorithm 5 EndoPolicyOptimizationǫt,h : One-Step Endogenous Policy Optimization
// Find an endogenous policy π ∈ Π[I≤k] that approximately maximizes V_{t,h}(µ ◦_t π ◦_{t+1} ψ), where µ ∈ Π_mix and ψ ∈ Π_NS are fixed policies.
1: require:
 • Starting timestep t, end timestep h, and target precision ǫ ∈ (0, 1).
 • Collection {V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ)}_{π∈Π[I≤k]} of estimates of V_{t,h}(µ ◦_t π ◦_{t+1} ψ) for all π ∈ Π[I≤k].
2: for k′ = 0, 1, ..., k do
3:  Let ǫ_{k′} = (1 + 1/k)^{k−k′} ǫ.
4:  for I ∈ I_{k′} do
5:   Set is_cover = true if
    max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k′}.
6:   if is_cover = true then return: π̂ ∈ argmax_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ).
7: return: fail.

In this section, we introduce and analyze the EndoPolicyOptimizationǫt,h algorithm (Algorithm 5), which is
used in the optimization phase of OSSRǫ,δ
h (Appendix G) and in ExoPSDP (Appendix F). In Appendix D.1
we give a high-level description and intuition for the algorithm, and in Appendix D.2 we prove the main
theorem regarding its correctness and sample complexity.

D.1 Description of EndoPolicyOptimization

The goal of EndoPolicyOptimizationǫt,h is to return a policy π̂ ∈ Π[I] such that:
1. π̂ is endogenous, in the sense that π̂ ∈ Π[I] for some I ⊆ I⋆.
2. π̂ is near-optimal, in the sense that

max_{π∈Π[I≤k]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ V_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + O(ǫ),

where V_{t,h}(π) := E_π[ Σ_{t′=t}^{h} r_{t′} ] for a given reward function R.

EndoPolicyOptimization assumes access to approximate value functions V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) that are ǫ-close to the true value functions V_{t,h}(µ ◦_t π ◦_{t+1} ψ). Given these approximate value functions, finding a near-optimal policy is trivial: it suffices to take the empirical maximizer π̂ ∈ argmax_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ). However, finding a near-optimal endogenous policy is a more challenging task. For this, EndoPolicyOptimization applies the abstract endogenous factor search scheme described in Appendix C (AbstractFactorSearch), which regularizes toward factors with smaller cardinality.
EndoPolicyOptimizationǫt,h splits the set I≤k as I≤k = ∪_{k′=0}^{k} I_{k′}, where I_{k′} is the collection of factor sets with cardinality exactly k′, and follows the bottom-up search strategy of AbstractFactorSearch. Beginning from k′ = 0 and proceeding to k′ = k, the algorithm checks whether there exists a near-optimal policy in the class Π[I_{k′}]. If such a policy is found, the algorithm returns it; otherwise it proceeds to k′ + 1.
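As an illustration of how the template is instantiated (a sketch under our own representational assumptions, not the paper's implementation), the Line-5 check can be expressed as a Condition predicate and plugged into the abstract_factor_search sketch from Appendix C; here V_hat, policies_over, and all_policies are hypothetical helpers providing the estimated values and the policy classes Π[I] and Π[I≤k].

def endo_policy_optimization(V_hat, policies_over, all_policies, eps, k, d):
    # V_hat: dict mapping each candidate one-step policy to its estimated value V̂_{t,h}.
    # policies_over(I): iterable over Π[I]; all_policies: iterable over Π[I_{<=k}].
    best_overall = max(V_hat[pi] for pi in all_policies)

    def condition(_, eps_k, I):
        # Line 5: is the best value over Π[I] within eps_k of the best value over Π[I_{<=k}]?
        return best_overall <= max(V_hat[pi] for pi in policies_over(I)) + eps_k

    I_hat = abstract_factor_search(None, eps, set(), k, d, condition)
    if I_hat is None:
        return None  # fail branch (does not occur under the assumptions of Theorem D.1)
    return max(policies_over(I_hat), key=lambda pi: V_hat[pi])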

Intuition for correctness. We prove the correctness of the EndoPolicyOptimizationǫt,h procedure by follow-
ing the general template in Appendix C. In particular, we view EndoPolicyOptimizationǫt,h as a special case of
the AbstractFactorSearch (Algorithm 4) scheme with
Condition(Z, ǫ, I) = 1{ max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ }.

Most of the effort in proving the correctness of the algorithm is in showing that this condition satisfies Assumption C.1. In particular, we need to show that if some I ∈ I≤k satisfies the condition in Line 5,

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{|I|},

then Ien := I ∩ I⋆ also satisfies the condition, in the sense that

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[Ien]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{|Ien|}.

This can be shown to hold as a consequence of assumptions (A1) and (A2) in Theorem D.1. Assumption (A1) asserts that the following restriction property holds: for any I,

max_{π∈Π[I]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) = max_{π∈Π[Ien]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ).

Hence, optimizing over a larger policy class that acts on exogenous factors does not improve the value.
Assumption (A2) asserts that the estimates for Vt,h (µ ◦t π ◦t+1 ψ) are uniformly ǫ-close, so that optimizing
with respect to these estimates is sufficient.

Importance of the decoupling property. We emphasize that assumption (A1) is non-trivial. We show it
holds for several choices for the reward function in the ExoMDP (Lemma B.2 and Lemma B.7), which are
used when we invoke the algorithm within OSSR. However, the condition may not hold if the endogenous and exogenous factors are correlated. In this case, optimizing over exogenous state factors may improve the value, leading the algorithm to fail.

Formal guarantee for EndoPolicyOptimization. The following result shows that EndoPolicyOptimizationǫt,h returns a near-optimal endogenous policy.
Theorem D.1 (Correctness of EndoPolicyOptimizationǫt,h). Fix h ∈ [H] and t ∈ [h]. Let µ ∈ Π_mix and ψ ∈ Π_NS be fixed policies. Assume the following conditions hold:
(A1) Restriction property: for any set of factors I,

max_{π∈Π[I]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) = max_{π∈Π[Ien]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ).

(A2) Quality of estimation: for all π ∈ Π[I≤k],

| V_{t,h}(µ ◦_t π ◦_{t+1} ψ) − V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) | ≤ ǫ/12k.

Then the policy π̂ output by EndoPolicyOptimizationǫt,h satisfies the following properties:
1. π̂ is endogenous: π̂ ∈ Π[I], where I ⊆ I⋆.
2. π̂ is near-optimal: max_{π∈Π[I≤k]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ V_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + 4ǫ.

D.2 Proof of Theorem D.1


We use the three-step proof recipe described in Appendix C to prove correctness of EndoPolicyOptimization.

Step 1: EndoPolicyOptimizationǫt,h does not return fail. By definition, there exists I ∈ I≤k such that

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) = max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ).

Thus, the condition in Line 5 is satisfied, since ǫ_{k′} ≥ 0.

Step 2: EndoPolicyOptimizationǫt,h returns an endogenous policy. Since EndoPolicyOptimizationǫt,h does not return fail, it returns a policy π̂ ∈ Π[I] for some factor set I. We prove that I is an endogenous factor set, which implies that π̂ is an endogenous policy. We show this by proving the following claim:
Claim 1. If I satisfies the condition in Line 5 (is_cover = true for I), then Ien satisfies the condition as well (is_cover = true for Ien).
Given this claim, it is straightforward to see that EndoPolicyOptimizationǫt,h returns an endogenous policy. First, observe that for any I ∈ I≤k, we have Ien := I ∩ I⋆ ∈ I≤k by Lemma A.10 (since I⋆ ∈ I≤k). If |Ien| < |I|, then EndoPolicyOptimizationǫt,h verifies whether Ien ∈ I≤k satisfies Line 5 prior to verifying whether I ∈ I≤k satisfies the condition. It follows that the factor set returned by the algorithm must be endogenous.

Proof of Claim 1. Assume that I contains at least one exogenous factor, so that

|Ien| ≤ |I| − 1.   (45)

Suppose that is_cover = true for I. By construction, it holds that for k_1 := |I| ≤ k,

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k_1}.   (46)

This statement, which holds for the approximate values V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ), implies a similar statement for the true values V_{t,h}(µ ◦_t π ◦_{t+1} ψ). Specifically, Eq. (46) together with Lemma A.6 (which can be applied using assumption (A2)) implies that

max_{π∈Π[I≤k]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k_1} + ǫ/6k
 (a) = max_{π∈Π[Ien]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k_1} + ǫ/6k,   (47)

where (a) holds by the restriction property in assumption (A1).


We now relate the inequality in Eq. (47), which holds for the true values V_{t,h}(µ ◦_t π ◦_{t+1} ψ), back to an inequality on the approximate values. Using Lemma A.6 and assumption (A2) on Eq. (47), we have that

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[Ien]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k_1} + ǫ/3k
 (a) ≤ max_{π∈Π[Ien]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{k_2},   (48)

where (a) holds for all k_1, k_2 ∈ [k] such that k_2 ≤ k_1 − 1, since

ǫ_{k_1} + ǫ/3k := (1 + 1/k)^{k−k_1} ǫ + ǫ/3k ≤ (1 + 1/k)^{k−k_2} ǫ =: ǫ_{k_2}

by Lemma A.8. Setting k_2 = |Ien| ≤ k_1 − 1 = |I| − 1 (the cardinality of Ien is strictly smaller than that of I by Eq. (45)) and plugging this value into Eq. (48) yields

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[Ien]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{|Ien|}.   (49)

Hence, Ien also satisfies the condition in Line 5.

Step 3: EndoPolicyOptimizationǫt,h returns a near-optimal policy. When the condition in Line 5 of EndoPolicyOptimizationǫt,h holds and is_cover = true, the factor set I satisfies

max_{π∈Π[I≤k]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ max_{π∈Π[I]} V̂_{t,h}(µ ◦_t π ◦_{t+1} ψ) + ǫ_{|I|}
 = V̂_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + ǫ_{|I|}
 ≤ V̂_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + 3ǫ,   (50)

where the last relation holds because ǫ_{|I|} ≤ (1 + 1/k)^k ǫ ≤ 3ǫ. Applying Lemma A.6 with (A2) then gives

max_{π∈Π[I≤k]} V_{t,h}(µ ◦_t π ◦_{t+1} ψ) ≤ V_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + 3ǫ + ǫ/6k ≤ V_{t,h}(µ ◦_t π̂ ◦_{t+1} ψ) + 4ǫ.

E Selecting Endogenous Factors with Strong Coverage: EndoFactorSelection
Algorithm 6 EndoFactorSelectionǫt,h : Simultaneous Policy Cover for all Factors
// Find I such that reaching I implicitly leads to good coverage for all J ∈ I≤k(I^(t+1,h)).
1: require:
 • Starting timestep t and end timestep h, target precision ǫ ∈ (0, 1).
 • Set of endogenous factors I^(t+1,h) ⊆ I⋆.
 • Collection of policy sets {Γ^(t)[I]}_{I∈I≤k(I^(t+1,h))}, where Γ^(t)[I] = { π^(t)_{s[I]} | s[I] ∈ S[I] }.
 • Set of (t+1 → h) policies Ψ^(t+1,h) = { ψ^(t+1,h)_{s[I^(t+1,h)]} | s[I^(t+1,h)] ∈ S[I^(t+1,h)] }.
 • Collection D̂ of approximate occupancy measures for layer h under the sampling process µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h).
// Pick Ψ ∈ Γ^(t)[I] ◦_{t+1} Ψ^(t+1,h) that explores I ⊆ I⋆ and sufficiently explores all other factors.
2: for k′ = |I^(t+1,h)|, |I^(t+1,h)| + 1, .., k do
3:  Define ǫ_{k′} = (1 + 1/k)^{k−k′} 5ǫ.
4:  for I ∈ I_{k′}(I^(t+1,h)) do
   // Test whether reaching states in I leads to good coverage for all factors J ∈ I≤k(I^(t+1,h)).
5:   Set sufficient_cover = true if for all J ∈ I≤k(I^(t+1,h)) and for all s[J] ∈ S[J]:
    max_{π∈Π[I≤k]} d̂_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d̂_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩I]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ_{k′},   (51)
    where π^(t)_{s[J∩I]} ∈ Γ^(t)[J∩I].
    // Recall π^(t)_{s[J∩I]} ≈ argmax_{π∈Π[I≤k]} d̂_h(s[J∩I] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}).
6:   if sufficient_cover = true then
7:    Î ← I.
8:    return (Î, Γ^(t)[Î]).
9: return: fail. // Low-probability failure event.

In this section, we describe and analyze the EndoFactorSelectionǫt,h algorithm (Algorithm 6). EndoFactorSelectionǫt,h is a subroutine used in the selection phase of OSSRǫ,δh, and generalizes the selection phase used in OSSR.Exacth to the setting where only approximate occupancy measures are available. In Appendix E.1, we give a high-level description of EndoFactorSelectionǫt,h, provide intuition, and state the main theorem concerning its performance. Then, in Appendix E.2, we prove this result.

E.1 Description of EndoFactorSelection


To motivate EndoFactorSelectionǫt,h, let us first recall the selection phase of OSSR.Exacth (Line 7 of Algorithm 2). The selection phase assumes access to a collection of policy sets {Γ^(t)[I]}_{I∈I≤k(I^(t+1,h))}, which are calculated in the optimization step. In particular, for each set I and each s[I] ∈ S[I], π^(t)_{s[I]} ∈ Γ^(t)[I] is an endogenous policy that maximizes the probability of reaching s[I] at layer h, in the following sense:

π^(t)_{s[I]} ∈ argmax_{π∈Π[I≤k]} d_h(s[I] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}).

The selection phase of OSSR.Exacth finds the factor set Î ∈ I≤k(I^(t+1,h)) of minimal size such that for all J ∈ I≤k(I^(t+1,h)) and s[J] ∈ S[J],

max_{π∈Π[I≤k]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) = d_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩Î]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}).   (52)

At the end of the selection step, OSSR.Exacth outputs the tuple (Î, Γ^(t)[Î]). Since Î is chosen as the minimal factor set that satisfies Eq. (52), it can be shown that it is an endogenous factor set. Furthermore, Γ^(t)[Î] satisfies condition Eq. (52).
EndoFactorSelectionǫt,h is similar to OSSR.Exacth, but only requires access to approximate state occupancy measures. Analogous to OSSR.Exacth, the algorithm outputs a tuple (Î, Γ^(t)[Î]), where Î is an endogenous factor set and Γ^(t)[Î] ensures good coverage at layer h. However, since EndoFactorSelectionǫt,h only has access to approximate state occupancy measures, the policy set Γ^(t)[Î] returned by the algorithm is only guaranteed to satisfy an approximate version of Eq. (52):

max_{π∈Π[I≤k]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩Î]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + O(ǫ),   (53)

where π^(t)_{s[J∩Î]} ∈ Γ^(t)[J∩Î].

To find an endogenous factor set Î such that Γ^(t)[Î] satisfies Eq. (53), EndoFactorSelectionǫt,h follows the AbstractFactorSearch scheme described in Appendix C. It enumerates the collection of factor sets I≤k(I^(t+1,h)) in a bottom-up fashion, starting from factor sets of minimal cardinality, and checks whether each factor set approximately satisfies the optimality condition.

Intuition for correctness. To establish the correctness of EndoFactorSelectionǫt,h, we view the algorithm as an instance of AbstractFactorSearch with

Condition(Z, ǫ, I) = 1{ for all J ∈ I≤k(I^(t+1,h)) and all s[J] ∈ S[J]:
 max_{π∈Π[I≤k]} d̂_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d̂_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩I]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ },

where we recall that π^(t)_{s[J∩I]} ∈ Γ^(t)[J∩I] is the output of the optimization step (EndoPolicyOptimization). The analysis of EndoFactorSelectionǫt,h follows the recipe sketched in Appendix C. Most of our effort is devoted to proving that the condition in Eq. (44) required by AbstractFactorSearch holds for EndoFactorSelectionǫt,h. In particular, we wish to prove the following claim: if I satisfies the condition in Line 5 (sufficient_cover = true for I), then Ien satisfies the condition as well (sufficient_cover = true for Ien). To show that the statement is true, we use a key structural result, Lemma B.4, which generalizes certain structural results used in the analysis of OSSR.Exact (Proposition 3.1). Let µ and ρ be endogenous policies, and consider a fixed state factor s[I] ∈ S[I]. Lemma B.4 asserts that if an endogenous policy π_{s[Ien]} approximately maximizes the probability of reaching the endogenous part of s[I], which is given by

d_h(s[Ien] ; µ ◦_t π_{s[Ien]} ◦_{t+1} ρ),

then the policy also approximately maximizes the probability of reaching s[I], which is given by

d_h(s[I] ; µ ◦_t π_{s[Ien]} ◦_{t+1} ρ).

Hence, to approximately maximize the probability of reaching s[I], it suffices to execute a policy that approximately maximizes the probability of reaching the endogenous part of the state, s[Ien]. We use this observation to show that exogenous factors are redundant, in the sense that if sufficient_cover = true for I, then sufficient_cover = true for Ien; this proves the claim.
Formal guarantee for EndoFactorSelection. The following result is the main guarantee for EndoFactorSelectionǫt,h.

Theorem E.1 (Success of EndoFactorSelectionǫt,h). Fix h ∈ [H] and t ∈ [h]. Assume the following conditions hold:
(A1) Endogeneity of arguments. µ^(t) ∈ Π_mix[I⋆] is endogenous, Ψ^(t+1,h) contains only endogenous policies, and Γ^(t)[I] contains only endogenous policies for all I ∈ I≤k(I^(t+1,h)). In addition, I^(t+1,h) ⊆ I⋆.
(A2) Quality of estimation. D̂ is a collection of ǫ/12k-approximate state occupancy measures with respect to (µ^(t) ◦ Π[I≤k] ◦ Ψ^(t+1,h), I≤k(I^(t+1,h)), h) (Definition A.1).
(A3) Optimality for Γ^(t)[I]. For any factor set I ∈ I≤k(I^(t+1,h)) and any s[I] ∈ S[I], the policy π^(t)_{s[I]} ∈ Γ^(t)[I] satisfies the following optimality guarantee:

max_{π∈Π[I≤k]} d_h(s[I] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d_h(s[I] ; µ^(t) ◦_t π^(t)_{s[I]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + 4ǫ.

Then EndoFactorSelectionǫt,h does not output fail, and the tuple (Î, Γ^(t)[Î]) output by the algorithm satisfies the following guarantees:
1. Î ⊆ I⋆.
2. For all s[I⋆] ∈ S[I⋆], we have

max_{π∈Π[I⋆]} d_h(s[I⋆] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) − d_h(s[I⋆] ; µ^(t) ◦_t π^(t)_{s[Î]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ 16ǫ,

where we note that we can write s[I⋆] = (s[Î], s[I⋆ \ Î]) = (s[I^(t+1,h)], s[I⋆ \ I^(t+1,h)]) because I^(t+1,h), Î ⊆ I⋆.

E.2 Proof of Theorem E.1


We use the three-step proof strategy described in Appendix C to prove correctness for EndoFactorSelectionǫt,h .

Step 1: EndoFactorSelectionǫt,h does not return fail. We show that, given assumptions (A1)-(A3), EndoFactorSelectionǫt,h does not return fail. First, observe that I⋆ ∈ I≤k(I^(t+1,h)), since I^(t+1,h) ⊆ I⋆ by (A1) and |I⋆| ≤ k by assumption. We prove that EndoFactorSelectionǫt,h halts for I ← I⋆; that is, I⋆ satisfies the condition at Line 5 of EndoFactorSelectionǫt,h.
Fix I ∈ I≤k(I^(t+1,h)) and s[I] ∈ S[I]. Let Ien be the endogenous component of I (note that Ien ∈ I≤k(I^(t+1,h)), since I⋆ ∈ I≤k(I^(t+1,h)) and I≤k(I^(t+1,h)) is a π-system by Lemma A.10), so that s[I] = (s[Ien], s[Iex]). Consider the policy π^(t)_{s[Ien]} ∈ Γ^(t)[Ien]. By assumptions (A1) and (A3), π^(t)_{s[Ien]} is endogenous and satisfies

max_{π∈Π[I≤k]} d_h(s[Ien] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d_h(s[Ien] ; µ^(t) ◦_t π^(t)_{s[Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + 4ǫ.   (54)

Eq. (54) shows that π^(t)_{s[Ien]} reaches the endogenous component of s[I] near-optimally (when the roll-out policy ψ^(t+1,h)_{s[I^(t+1,h)]} is fixed). Combined with the fact that both π^(t)_{s[Ien]} and ψ^(t+1,h)_{s[I^(t+1,h)]} are endogenous (by (A1)), this allows us to apply Lemma B.4, which asserts that π^(t)_{s[Ien]} reaches any state factor s[I] with Ien ⊆ I near-optimally as well. In particular,

max_{π∈Π[I≤k]} d_h(s[I] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d_h(s[I] ; µ^(t) ◦_t π^(t)_{s[Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + 4ǫ.   (55)
◦ t+1 ψs[I (t+1,h) ]
+ 4ǫ. (55)

b is ǫ/12k-approximate with respect to (Π [I≤k (I (t+1,h) )] , I≤k (I (t+1,h) ) , h) (cf.


Now, observe that since D
(A2)), Eq. (55) and Lemma A.6 imply that
 
b (t) (t+1,h)
max dh s [I] ; µ ◦t π ◦t+1 ψs I (t+1,h)
π∈Π[I≤k ] [ ]
 
b (t) (t) (t+1,h)
≤ dh s [I] ; µ ◦t πs[Ien ] ◦t+1 ψs I (t+1,h) + 5ǫ. (56)
[ ]

Since Ien = I ∩ I⋆ , and since


k−k′ +1
5ǫ ≤ (1 + 1/k) 5ǫ := ǫk′

for all k ′ ∈ [k], this implies that the condition at Line 5 of EndoFactorSelectionǫt,h is satisfied by I⋆ .

Step 2: Proof of the first claim (Î ⊆ I⋆ is a set of endogenous factors). Since EndoFactorSelectionǫt,h does not return fail, it necessarily returns a pair (Î, Γ^(t)[Î]). We now show that Î is endogenous. To do so, we prove the following claim.
Lemma E.1. If I ∈ I≤k(I^(t+1,h)) satisfies the condition in Line 5 (sufficient_cover = true for I), then Ien satisfies the condition as well (sufficient_cover = true for Ien).
Conditioned on Lemma E.1, the result quickly follows. Observe that for any I ∈ I≤k(I^(t+1,h)), we have Ien ∈ I≤k(I^(t+1,h)) (again by Lemma A.10). Furthermore, if |Ien| < |I|, then EndoFactorSelection will check whether Ien satisfies the condition in Line 5 prior to checking whether I satisfies it. Thus, EndoFactorSelection necessarily returns a set of endogenous factors; it remains to prove Lemma E.1.
Proof of Lemma E.1. Fix I ∈ I≤k(I^(t+1,h)) with Iex ≠ ∅, and assume that I satisfies the condition in Line 5. That is, for k_1 := |I| ≤ k, it holds that for all J ∈ I≤k(I^(t+1,h)) and all s[J] = (s[I], s[J \ I]) ∈ S[J],

max_{π∈Π[I≤k]} d̂_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d̂_h(s[J] ; µ^(t) ◦_t π^(t)_{s[J∩I]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ_{k_1},   (57)

where π^(t)_{s[J∩I]} ∈ Γ^(t)[J∩I]. We will show that this implies that Ien also satisfies the condition in Line 5.
Ien satisfies the condition in Line 5. Since I satisfies Eq. (57) for all J ∈ I≤k(I^(t+1,h)), it must also satisfy the condition for all Jen ⊆ J. Fix J ∈ I≤k(I^(t+1,h)). Then for all s[Jen] ∈ S[Jen], we have

max_{π∈Π[I≤k]} d̂_h(s[Jen] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) ≤ d̂_h(s[Jen] ; µ^(t) ◦_t π^(t)_{s[Jen∩I]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ_{k_1}
 (a) = d̂_h(s[Jen] ; µ^(t) ◦_t π^(t)_{s[Jen∩Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ_{k_1},   (58)

where (a) follows because Jen ∩ I = Jen ∩ Ien.

Since (A2) asserts that D̂ is ǫ/12k-approximate with respect to (Π[I≤k(I^(t+1,h))], I≤k(I^(t+1,h)), h), we can relate the inequality above to the analogous inequality for the true occupancies using Lemma A.6. After multiplying both sides by d_h(s[Jex]) ∈ [0, 1], this yields

d_h(s[Jex]) · max_{π∈Π[I≤k]} d_h(s[Jen] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]})
 ≤ d_h(s[Jex]) · d_h(s[Jen] ; µ^(t) ◦_t π^(t)_{s[Jen∩Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) + ǫ_{k_1} + ǫ/6k.   (59)
◦t+1 ψs[I (t+1,h) ] + ǫk1 + ǫ/6k. (59)

We now manipulate both sides of Eq. (59) to relate these quantities to the occupancy measure for s[J]. This is done by appealing to the decoupling property for occupancy measures of endogenous policies (Appendix B.2). To begin, for the left-hand side of Eq. (59), we have

d_h(s[Jex]) · max_{π∈Π[I≤k]} d_h(s[Jen] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]})
 (a) = d_h(s[Jex]) · max_{π∈Π[I⋆]} d_h(s[Jen] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]})
 (b) = max_{π∈Π[I⋆]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]})
 (c) = max_{π∈Π[I≤k]} d_h(s[J] ; µ^(t) ◦_t π ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}),   (60)

where relations (a) and (c) hold by Lemma B.3 and relation (b) holds by Lemma B.1; note that the assumptions of these lemmas hold because µ^(t) and ψ^(t+1,h)_{s[I^(t+1,h)]} are assumed to be endogenous, and because π ∈ Π[I⋆] is also endogenous. Moving on, we analyze the right-hand side of Eq. (59). We have

d_h(s[Jex]) · d_h(s[Jen] ; µ^(t) ◦_t π^(t)_{s[Jen∩Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}) = d_h(s[J] ; µ^(t) ◦_t π^(t)_{s[Jen∩Ien]} ◦_{t+1} ψ^(t+1,h)_{s[I^(t+1,h)]}),   (61)

by Lemma B.1 (the assumptions of the lemma hold because µ^(t), π^(t)_{s[Jen∩Ien]}, and ψ^(t+1,h)_{s[I^(t+1,h)]} are endogenous).
and ψs[I (t+1,h) ] are endogenous).

Plugging Eq. (61) and Eq. (60) back into Eq. (59), we have that
 
(t+1,h)
max dh s[J ] ; µ(t) ◦t π ◦t+1 ψs[I (t+1,h) ]
π∈Π[I≤k ]
 
(t) (t+1,h)
≤ dh s[J ] ; µ(t) ◦t πs[J en ∩Ien ]
◦ t+1 ψs[I (t+1,h) ]
+ ǫk1 + ǫ/6k. (62)

It remains to relate this to the analogous inequality for the approximate occupancy measures. Since D b is
ǫ/12k-approximate with respect to (Π [I≤k (I (t+1,h) )] , I≤k (I (t+1,h) ) , h) by (A2), Lemma A.6, and Eq. (62)
imply that
 
max dbh s[J ] ; µ(t) ◦t π ◦t+1 ψs[I (t+1,h)
(t+1,h) ]
π∈Π[I≤k ]
 
≤ dbh s[J ] ; µ(t) ◦t πs[J
(t)
∩Ien ] ◦ t+1 ψ (t+1,h)
s[I (t+1,h) ]
+ ǫk1 + ǫ/3k
(a)  
≤ dbh s[J ] ; µ(t) ◦t πs[J
(t) (t+1,h)
∩Ien ] ◦t+1 ψs[I (t+1,h) ] + ǫk2 , (63)

where (a) holds for all k1 , k2 ∈ [k] such that k2 ≤ k1 − 1, since


k−k1 k−k2
ǫk1 + ǫ/3k := (1 + 1/k) 5ǫ + ǫ/3k ≤ (1 + 1/k) 5ǫ := ǫk2

by Lemma A.8 (with c = 5). Since |Ien | < |I| := k1 , we can set k2 = |Ien | in Eq. (63), which implies that
   
max dbh s[J ] ; µ(t) ◦t π ◦t+1 ψs[I
(t+1,h)
(t+1,h) ] ≤ b
dh s[J ] ; µ (t)
◦ (t) (t+1,h)
t s[J ∩Ien ] t+1 s[I (t+1,h) ] + ǫ|Ien | .
π ◦ ψ (64)
π∈Π[I≤k ]

Since Eq. (64) holds for all J ∈ I≤k (I (t+1,h) ) and s [J ] ∈ S[J ], this yields the result.

40
Step 3: Proof of second claim (Γ(t) [I] b is near-optimal). This claim is a direct consequence of the
condition in Line 5. Let Ib be the output of EndoFactorSelectionǫt,h . Since sufficient_cover = true, then the
conditions at Line 5 are satisfied, and for all J ∈ I≤k (I (t+1,h) ), for all s [J ] = (s [I (t+1) ] , s [J \ I (t+1,h) ]) ∈
S [J ] :
 
max dbh s[J ] ; µ(t) ◦t π ◦t+1 ψs[I (t+1,h)
(t+1,h) ]
π∈Π[I≤k ]
 
≤ dbh s [J ] ; µ(t) ◦t π (t) b ◦t+1 ψs[I
(t+1,h)
(t+1,h) ] + 15ǫ, (65)
s[J ∩I]


k−k
(t)
where πs[J ∩I] ∈ Γ [J ∩ I]; the upper bound holds because ǫk
(t)
′ := (1 + 1/k) 5ǫ ≤ 15ǫ for all k ′ ∈ [k].
Applying Eq. (65) with J ← I⋆ ∈ I≤k (I (t+1,h)
), and using Lemma A.6 (which is admissible by assumption
(A2)), we have that for all s [I⋆ ] ∈ S [I⋆ ],
 
(t+1,h)
max dh s[I⋆ ] ; µ(t) ◦t π ◦t+1 ψs[I (t+1,h) ]
π∈Π[I≤k ]
 
≤ dh s [I⋆ ] ; µ(t) ◦t π (t) b ◦t+1 ψs[I (t+1,h)
(t+1,h) ] + 16ǫ
s[I⋆ ∩I]
(a)  
(t) (t+1,h)
≤ dh s [I⋆ ] ; µ(t) ◦t πs[ b
I]
◦ t+1 ψs[I (t+1,h) ]
+ 16ǫ,

where (a) holds because I⋆ ∩ Ib = I,


b since Ib ⊆ I⋆ by the first claim.

41
F PSDP with Exogenous Information: ExoPSDP

Algorithm 7 ExoPSDP: PSDP with Exogenous Information


1: require:
• Target precision ǫ ∈ (0, 1) and failure probablitity δ ∈ (0, 1).
H
• Collection {Ψ(h) }h=2 of endogenous η/2-approximate policy covers.
2: initialize:  −2
• Let N = C · AS 4k H 2 k 3 log dSAH
δ ǫ for sufficiently large constant C > 0 and ǫ0 = ǫ
2S k H
.
• For all t ∈ [H], define µ(t) := Unf (Ψ(t) ).
b(H,H) = ∅.
• Let π
3: for t = H − 1, .., 1 do
/* Estimate average value functions via importance weighting. */
n oN
H
4: Get dataset (st,n , at,n , {rt′ ,n }t′ =1 ) b(t+1,H) .
by executing µ(t) ◦t Unf(A) ◦t+1 π
n=1
5: Estimate the (t → H) value for all π ∈ Π[I≤k ] via importance weighting:
!
1 X 1 {at,n = π (st,n )}
N H
X
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) = rt′ ,n .
N n=1 1/A
t′ =t

/* Apply policy optimization with estimated


 value functions. */ 

6: b(t) ← EndoPolicyOptimizationǫt,h
π 0
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) π∈Π[I≤k ]
.
7: b
π =π
(t,H)
b ◦t+1 π
(t)
b (t+1,H)
.
8: b(1,H) .
return: π

In this section we present and analyze the ExoPSDP algorithm (Algorithm 7). ExoPSDP is based on the
classical PSDP algorithm (Bagnell et al., 2004), but incorporates modifications to ensure that the policies
produced are endogenous. In Appendix F.1, we motivate ExoPSDP and state the main guarantee concerning
its performance (Theorem F.1). Then, in Appendix F.2, we prove this result.

F.1 Description of ExoPSDP


The ExoPSDP algorithm solves the following problem:
Given a collection of endogenous policy covers {Ψ(t) }H b that
t=1 for an ExoMDP M, find a policy π
is ǫ-optimal in the sense that J(b
π ) ≥ maxπ J(π) − ǫ.
To motivate the approach behind the algorithm, we first remind the reader of the classical PSDP algorithm.

Background on PSDP. Suppose we have a set of mixture policies {µ(h) }H h=1 that ensure good coverage at
every layer for an MDP M, and our goal is to optimize the MDP’s reward function. The PSDP algorithm
(Bagnell et al., 2004) addresses this problem by using the dynamic programming principle to learn a near-
optimal policy through a series of backward steps t = H, . . . , 1. Assume access to a policy class Π. At
each step t, assuming that step t + 1 has already produced a near-optimal (t + 1) → H policy π b(t+1,H) , the
bt+1:H ) for all π ∈ Π where (see also Eq. (38))
algorithm estimates the value function Vt,H (µ(t) ◦t π ◦t+1 π
" H
#
X
Vt,H (µ (t)
bt+1:H ) := Eµ(t) ◦t π◦t+1 πbt+1:H
◦t π ◦t+1 π rt′ .
t′ =t

The estimates are calculated via importance-weighting by


!
1 X 1 {at,n = π (st,n )}
N H
X
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) = rt′ ,n
N n=1 1/A
t′ =t

42
where the data is generated by rolling in with µ(t) , taking random action on the tth time-step and rolling out
bt+1:H using N trajectories. Then, PSDP computes
with π

π (t) ∈ argmax Vbt,H (µ(t) ◦t π ◦t+1 π


bt+1:H ) , (66)
π∈Π

b(t,h) = π (t) ◦t π
and sets π b := π
b(t+1,H) . The final policy π b(1,H) is guaranteed to be near-optimal as long as
(h) H
{µ }h=1 have good coverage.

Insufficiency of vanilla PSDP. The first issue with applying PSDP to the ExoMDP model is that, if we
d
want the policy class Π to contain all possible policies, we will have |Π| = Θ(AS ), which leads to sample
complexity scaling with log|Π| = Ω(poly(S d )); this is prohibitively large. An alternative policy class one my
k
hope can address this issue is Π[I≤k ]. Indeed, this class has much smaller cardinality: |Π[I≤k ]| = Θ(dk AS ).
However, for an ExoMDP, naively optimizing over this class via Eq. (66) may lead to roll-out policies π bt+1:H
that depend on the exogenous state factors, since there is no mechanism in place to ensure endogeneity. This
in turn may invalidate the realizability assumption needed to apply standard PSDP (see Misra et al. (2020),
Assumption 2). In particular, PSDP requires that the policy class Π contains the optimal policy in the sense
that

max Vt,H (µ(t) ◦t π ◦t+1 π


bt+1:H ) = max Vt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) . (67)
π∈ΠNS π∈Π

bt+1:H depends on the exogenous state factors, then the optimal policy that maximizes
If the roll-out policy π
bt+1:H ) may depend on exogenous state factors as well. Then, Eq. (67) may be violated
Vt,H (µ(t) ◦t π ◦t+1 π
when instantiating PSDP with the policy class Π[I≤k ].

A solution: ExoPSDP. To address the issues above, ExoPSDP applies an alternative to the optimization
step in (66). In particular, ExoPSDP uses the sub-routine EndoPolicyOptimization (see Line 6), which finds an
endogenous near-optimal policy. In particular, as long as π b(t+1,H) is endogenous, which can be guaranteed
inductively, EndoPolicyOptimization, will succeed in finding an endogenous policy at step t. Importantly, since
(i) the reward in a ExoMDP depends only on the endogenous factors, and (ii) the policy π b(t+1,H) is endogenous
(by the guarantees of EndoPolicyOptimization), πb can be shown to be near-optimal with respect to the entire
(t)

policy class Π. Hence, in spite of optimizing over the restricted policy class I≤k , we are able to find a
near-optimal policy with respect set of all policies. Using this argument inductively allows us to prove that
b(1,H) is near-optimal and endogenous.
π
H
Theorem F.1 (Main guarantee for ExoPSDP). Suppose that the sets {Ψ(t) }t=1 passed into ExoPSDP are
endogenous η/2-approximate policy covers for all t. Then, for any ǫ, δ > 0, with probability at least 1 − δ,
b(1,H) is endogenous.
1. π
b(1,H) is ǫ-optimal in the sense that
2. π

π (1,H) ) + ǫ.
max J(π) ≤ J(b
π∈ΠNS

 AH 4 k3 S 3k log 
( dAH
δ )
Furthermore, the algorithm uses at most N = O ǫ2 trajectories.

F.2 Proof of Theorem F.1



Fix a pair of endogenous policies π, πb ∈ ΠNS [I⋆ ]. Further, let Men = S, A, Ten , Rs[I⋆ ] , H, d1,en denote
the restriction of the ExoMDP to its endogenous component, and let Qπt,en(s[I⋆ ], a) denote the associated
state-action value function for Men .
We decompose the difference in performance as follows.

J(π) − J(b
π)

43
(a)
h
X h i
= Eπ Qπt,en
b
(st [I⋆ ] , πt (st [I⋆ ]) − Qπt,en
b
bt (st [I⋆ ]))
(st [I⋆ ] , π
t=1
h
X h i
≤ Es[I⋆ ]∼dt (· ; π) max Qπt,en
b
(s [I⋆ ] , a) − Qπt,en
b
bt (s [I⋆ ]))
(s [I⋆ ] , π
a
t=1
h
X X  
≤ max dt (s [I⋆ ] ; π) max Qπt,en
b
(s [I⋆ ] , a) − Qπt,en
b
bt (s [I⋆ ])) .
(s [I⋆ ] , π
π∈Π[I⋆ ] a
t=1 s[I⋆ ]∈S[I⋆ ]

(b) h
X X  
≤ 2S k dt (s [I⋆ ] ; µ(t) ) max Qπt,en
b
(s [I⋆ ] , a) − Qπt,en
b
bt (s [I⋆ ])) .
(s [I⋆ ] , π
a
t=1 s[I⋆ ]∈S[I⋆ ]
h
X h i
= 2S k Eµ(t) max Qπt,en
b
(st [I⋆ ] , a) − Qπt,en
b
bt,en (st [I⋆ ])) .
(st [I⋆ ] , π
a
t=1

(c)
h
X h i
= 2S k ′
max Eµ(t) Q π
b
t,en (s t [I⋆ ] , π ′
(s t [I⋆ ])) − Q π
b
t,en (s t [I⋆ ] , b
πt (s t [I⋆ ])) .
π ∈Π[I⋆ ]
t=1
h
X
= 2S k max Vt,H (µ(t) ◦t π ′ ◦t+1 π
b) − Vt,H (µ(t) ◦t π
b(t) ◦t+1 π
b) . (68)
π ′ ∈Π[I⋆ ]
t=1

The key steps above are justified as follows:


• Relation (a) holds by the performance difference lemma for endogenous policies (Lemma B.6), since
b ∈ ΠNS [I⋆ ] by assumption.
both π, π
• Relation (b) holds because
maxπ∈Π[I⋆ ] dt (· ; π)
≤ 2S k ,
dt (· ; µ(t) )
h−1
which is a consequence of Lemma A.2. In particular, we use that (i) {Ψ(t) }t=1 are endogenous η/2-
approximate policy covers, (ii) for all states, either maxπ∈Π[I⋆ ] dt (s[I⋆ ] ; π) ≥ η or maxπ∈Π[I⋆ ] dt (s[I⋆ ] ; π) =
0 (by the reachability assumption), and (iii)
max Qπt,en
b
(s [I⋆ ] , a) − Qπt,en
b
bt (s [I⋆ ])) ≥ 0.
(s [I⋆ ] , π
a

• Relation (c) holds by the skolemization principle (Lemma A.9).


Let GExoPSDP denote the success event for Lemma F.1 (stated and proven in the sequel), which is the event
in which for all t ∈ [H], EndoPolicyOptimizationǫt,h
0
b(t) such that
returns a policy π
b(t) is endogenous.
1. π
b(t) is near-optimal in the following sense:
2. π
max Vt,H (µ(t) ◦t π ′ ◦t+1 π
b(t+1,H) ) − Vt,H (µ(t) ◦t π
b(t) ◦t+1 π
b(t+1,H) ) ≤ ǫ0 . (69)
π ′ ∈Π[I⋆ ]

 AH 2 k3 S k log dAH 
( δ )
Lemma F.1 asserts that GExoPSDP holds with probability at least 1 − δ whenever N = Ω ǫ20
.
b(1,H) is endogenous. To show that the policy is near-
Conditioning on GExoPSDP , it follows immediately that π
optimal, we apply Eq. (68) with π b=π b(1,H) and bound each term in the sum using Eq. (69). Maximizing
over π ∈ ΠNS [I⋆ ] yields
π ) ≤ 2S k Hǫ0 = ǫ,
max J(π) − J(b
π∈Π[I⋆ ]

by the choice ǫ0 := ǫ/2S k H. Finally, by the fact that maxπ∈Π[I⋆ ] J(π) = maxπ∈ΠNS J(π), which holds
because the reward is endogenous (Efroni et al. (2021b), Proposition 5), we conclude the proof.

44
F.3 Computational Complexity of ExoPSDP
We now show that ExoPSDP can be implemented with computational complexity of

O dk N S k AH ,

where N is the number of trajectories. The main computational bottleneck of ExoPSDP occurs at Line 5
of EndoPolicyOptimizationǫt,h
0
. There, we need to optimize over Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) estimated by the
empirical averages (Line 5) for all I ∈ Ik . Meaning,

max Vbt,H (µ(t) ◦t π ◦t+1 π


bt+1:H )
π∈Π[I]

To sketch how to do this efficiently, we first show how to optimize over the set Π[I] when a factor set I is
fixed. We show that instead of enumerating over all policies, one can optimize Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) as
follows. Observe that
X (t)
Vbt,h (µ(t) ◦t π ◦t+1 π
bt+1:H ) = b µ ◦t π◦t+1 πbt+1:H (s[I], π (s[I])) ,
Q t,h
s[I]∈S[I]

where we note that |S[I]| ≤ S k , and where


N h
!
(t) 1 X X
b µ ◦t π◦t+1 πbt+1:H
Q t,h (s[I], a) := 1{st [I] = s[I], at = a} rn,t′ .
N n=1 ′ t =t

To maximize Vbt,h (µ(t) ◦t π ◦t+1 π bt+1:H ) it suffices to maximize each individual function
b µ(t) ◦t π◦t+1 π
bt+1:H
Qt (s[I], a). Letting
(t)
b µt
bI (s[I]) ∈ argmax Q
π
◦t π◦t+1 π
bt+1:H
(s[I], a) ,
a

we have that

max Vbt,h (µ(t) ◦t π ◦t+1 π


bt+1:H ) = Vbt,h (µ ◦t π
bI ◦t+1 ψ) .
π∈Π[I]

bI (s[I]) ∈ Π[I].
Furthermore, observe that π
This shows b bt+1:H ) with computational complexity
 that it is possible to solve max π∈Π[I] Vt,h (µ ◦t π ◦t+1 π
(t)

k ǫ
O N S A . Since EndoPolicyOptimizationt,h optimizes over all possible factor sets I ∈ I≤k where |I≤k | =
 
O dk for H times the total computational complexity is O dk N S k AH .

F.4 Application of EndoPolicyOptimization within ExoPSDP


In this section we state and prove Lemma F.1, which shows that the application of EndoPolicyOptimization
within ExoPSDP (Line 6) is admissible, in the sense that the preconditions required by the algorithm are
satisfied.
Lemma F.1 (Guarantees of EndoPolicyOptimization for ExoPSDP). Let precision parameter ǫ ∈ (0, 1) and
failure probability δ ∈ (0, 1) be given. Assume that the mixture policies µ(t) ∈ Πmix used in Algorithm 7 are
 AH 2 k3 S k log dAH 
( δ )
endogenous for all t. Then, if N = Ω ǫ2 trajectories are used for each layer, we have that
with probability at least 1 − δ, for all t:
b(t) is an endogenous policy.
1. π
b(t) is near-optimal in the sense that
2. π

max Vt,H (µ(t) ◦t π ′ ◦t+1 π


b(t+1,H) ) − Vt,H (µ(t) ◦t π
b(t) ◦t+1 π
b(t+1,H) ) ≤ 4ǫ.
π ′ ∈Π[I⋆ ]

45
Proof of Lemma F.1. Let G (t) denote the event in which
b(t) is an endogenous policy.
1. π
b(t) is near optimal:
2. π

max Vt,H (µ(t) ◦t π ′ ◦t+1 π


b(t+1,H) ) − Vt,H (µ(t) ◦t π
b(t) ◦t+1 π
b(t+1,H) ) ≤ 4ǫ.
π ′ ∈Π[I⋆ ]

We will prove that for any δ > 0,


(t′ )

P G (t) | ∩H
t′ =t+1 G ≥ 1 − δ, . (70)
 AH 2 k3 S k log 
( dAH
δ )
as long at least Ω ǫ2 trajectories are used at layer t. Whenever Eq. (70) holds, Lemma A.4
implies that
(t′ )

P ∩H
t=1 G ≥ 1 − Hδ, (71)

and scaling δ ← δ/H concludes the proof.


We now prove that Eq. (70) holds. To do so, we apply Theorem D.1 and verify that assumptions (A1) and
(A2) required by it hold.

(A1) Conditioning on the event ∩Ht′ =t+1 G
(t )
b(t+1,H) is an endogenous policy. In addition µ(t)
, we have that π
is an endogenous policy and the reward function is endogenous by assumption. Thus, the conditions
of Lemma B.7 are satisfied, and the restriction property holds:

max Vt,h (µ ◦t π ◦t+1 ψ) = max Vt,h (µ ◦t π ◦t+1 ψ) .


π∈Π[I] π∈Π[Ien ]

(A2) The proof of this result uses similar arguments to Lemma A.5 . Fix π ∈ Π[I≤k ] and observe that
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) is an unbiased estimator for Vt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ), and is bounded by
AH. Using Lemma A.3 and following the same steps as in the proof of Lemma A.5, we have that with
probability at least 1 − δ,

Vbt,H (µ(t) ◦t π ◦t+1 π


bt+1:H ) − Vt,H (µ(t) ◦t π ◦t+1 π
bt+1:H )
s   
AH 2 log 1δ AH log 1δ
≤O  + .
N N
 k

Taking a union bound over all π ∈ Π[I≤k ] and using that |Π[I≤k ]| ≤ O dk+1 AS , we have that with
probability at least 1 − δ,

Vbt,H (µ(t) ◦t π ◦t+1 π


bt+1:H ) − Vt,H (µ(t) ◦t π ◦t+1 π
bt+1:H )
s   
AH 2 kS k log dA AHkS k log dA
≤ O δ
+ δ .
N N

δ )
AH 2 k3 S k log( dA 
Hence, setting N = Ω ǫ2 and using that ǫ2 ≤ ǫ for ǫ ∈ (0, 1), we have that with
probability at least 1 − δ, for all π ∈ Π[I≤k ],
ǫ
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) − Vt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) ≤ .
12k

46
Part III

Additional Details and Proofs for Main Results


G OSSR Description and Proof of Theorem 3.1
In this section we present and analyze the full OSSRǫ,δh algorithm (Algorithm 8). The algorithm may be
thought of as a sample-based version of the OSSR.Exact algorithm described in Section 3.2. While OSSR.Exact
assumes exact access to state occupancy measures, OSSRǫ,δ h estimates the occupancy measures in a data-
driven fashion, which introduces the need to account for statistical errors.
This section is organized as follows. First, in Appendix G.1 we give a high-level overview of the algorithm
design principles behind OSSRǫ,δh . Then, in Appendix G.2, we prove the main result concerning its perfor-
mance, Theorem 3.1. Appendices G.3 and G.4 contain proofs for supporting results used in the proof of
Theorem 3.1.

G.1 OSSR: Algorithm Overview


The OSSRǫ,δ h algorithm follows the same template as OSSR.Exact: For each h ∈ [H], given policy covers
Ψ(1) , . . . , Ψ(h−1) , the algorithm builds a policy cover Ψ(h) for layer h in a backwards fashion using dynamic
programming. There are two differences from the exact algorithm. First, we only have sample access
to the underlying ExoMDP, the algorithm estimates the relevant occupancy measures for each backward
step using Monte Carlo rollouts and importance weighting. Second, the optimization and selection phases
from OSSR.Exact are replaced by error-tolerant variants given by the subroutines EndoPolicyOptimization and
EndoFactorSelection (Algorithm 5 in Appendix D and Algorithm 6 in Appendix E, respectively).

State occupancy estimation. In order to apply dynamic programming in the same fashion as OSSR.Exact,
each backward step 1 ≤ t ≤ h − 1 of OSSRǫ,δ h proceeds by building estimates for the layer-h occupancies
dh (s[I] ; µ(t) ◦t π ◦t+1 ψ (t+1,h) ) for all I ∈ I≤k , π ∈ Π[I≤k ] and ψ (t+1,h) ∈ Ψ(t+1,h) . This is accomplished
through Monte Carlo: We gather trajectories by running µ(t) up to layer t, sampling at ∼ Unf(A) uniformly,
then sampling ψ (t+1,h) ∼ Unf(Ψ(t+1,h) ) and using it to roll out from layer t + 1 to h. We then build estimates
by importance weighting the empirical frequencies. We appeal to uniform convergence to ensure that the
estimated occupancies are uniformly close for all I ∈ I≤k and π ∈ Π[I≤k ]; this argument critically uses
that |Ψ(t+1,h) | ≤ S k and log|Π[I≤k ]| ≤ O kS k log (dA) , as well as the fact that we only require convergence
for factors of size at most k.

Error-tolerant backward state refinement. Given the estimated state occupancy measures above,
each backward step 1 ≤ t ≤ h − 1 of OSSRǫ,δ h follows the general optimization-selection template used in
OSSR.Exact. For the optimization step (Line 7), it applies the subroutine EndoPolicyOptimizationǫt,h (Algo-
rithm 5 in Appendix D), which finds a collection of endogenous “one-step” policy covers (Γ(t) [I])I∈I≤k (I (t+1,h) ) ,
(t)
which have the property that for all I ∈ I≤k (I (t+1,h) ) and s ∈ S, the t → h policy πs[I] ◦ ψs(t+1,h) (approx-
[I (t+1,h) ]
imately) maximizes the probability that sh [I] = s[I]. Then, at selection step (Line 9), OSSRǫ,δ h applies the
subroutine EndoFactorSelectionǫt,h (Algorithm 6 in Appendix E), which selects a single factor set I (t,h) ⊆ I⋆
such that—by choosing Ψ(t,h) to be the composition of Γ(t) [I (t,h) ] and Ψ(t+1,h) —we obtain an (approximate)
t → h policy cover.
Full descriptions and proofs of correctness for EndoPolicyOptimizationǫt,h and EndoFactorSelectionǫt,h are given in
Appendix D and Appendix E. Briefly, both subroutines are based on approximate versions of the constraints
used in the optimization and selection phase for OSSR.Exact (Line 5 and Line 7 of Algorithm 2), but ensuring
endogeneity of the resulting factors is more challenging due to approximation errors, and it no longer suffices
to simply search for the factor set with minimum cardinality. Instead, we search for factor sets that satisfy
approximate versions of Line 5 and Line 7 with an additive regularization term based on cardinality. We

47
Algorithm 8 OSSRǫ,δ h : Optimization-Selection State Refinement
1: require:
• Timestep h, precision parameter ǫ > 0, failure probability δ ∈ (0, 1).
h−1
• Policy covers {Ψ(t) }t=1 for steps 1, . . . , h − 1.
• Upper bound k ≥ 0 on the cardinality of I⋆ .
2: initialize:
• Let I (h,h) ← ∅ and Ψ(h,h) ← ∅.
 −2
• Define N = CAS 4k H 2 k 3 log dSAH
δ ǫ for sufficiently large constant C > 0, and let ǫ0 := ǫ
2S k H
.
3: for t = h − 1, h − 2, .., 1 do
Estimate occupancy measures
N
4: Collect dataset {(st,n , at,n , ψn(t+1,h) , sh,n )}n=1 by drawing N trajectories from the process:
• Execute µ(t) := Unf(Ψ(t) ) up to layer t (resulting in state st,n ).
• Sample action at,n ∼ Unf(A) and play it, transitioning to st+1,n in the process.
• Sample ψn(t+1,h) ∼ Unf(Ψ(t+1,h) ) and execute it from layers t + 1 to h (resulting in sh,n ).
5: For each I ∈ I≤k , π ∈ Π[I≤k ], and ψ (t+1,h) ∈ Ψ(t+1,h) , define

1 X 1{at,n = π(st,n ), ψn(t+1,h) = ψ (t+1,h) , sh,n [I] = s[I]}


N
dbh (s[I] ; µ(t) ◦t π ◦t+1 ψ (t+1,h) ) = .
N n=1 (1/|A|) · (1/|Ψ(t+1,h) |)
n o
6: b (t,h) := dbh (· ; µ(t) ◦t π ◦t+1 ψ (t+1,h) ) | π ∈ Π(I≤k ), ψ (t+1,h) ∈ Ψ(t+1,h) ) .
Let D
Phase I: Optimization (Algorithm 5 in Appendix D)
(t)
// Beginning from any state at layer t, πs[I] ◦t+1 ψ(t+1,h)

(t+1,h)
 maximizes probability that sh [I] = s[I].
s I
7: For each I ∈ I≤k (I (t+1,h) ) and s [I] ∈ S [I], let
n  o 
(t)
πs[I] ← EndoPolicyOptimizationǫt,h
0
dbh s[I] ; µ(t) ◦t π ◦t+1 ψs[I
(t+1,h)
t+1,h ] .
π∈Π[I≤k ]

 (t)
8: Let Γ(t) [I] := πs[I] | s[I] ∈ S[I] .
Phase II: Selection (Algorithm 6 in Appendix E)
(t,h) (t)
 (t,h)  (t+1,h)

// Find factor set I ⊆ I⋆ such that Γ I  has good coverage for all factors in I≤k I .
9: (I (t,h)
, Γ [I
(t) (t,h)
]) ← EndoFactorSelectionǫt,h
0 b (t,h) .
{Γ(t) [I]}I∈I≤k (I (t+1,h) ) ; I (t+1,h) , Ψ(t+1,h) , D
Policy composition
10: b then for each s[I (t,h) ] ∈ S[I (t,h) ] define ψ (t,h)(t,h) := π (t) (t,h) ◦t ψ (t+1,h)
Let I (t,h) ← I, .
n o s[I ] s[I ] s[I (t+1,h) ]
(t,h)
11: Let Ψ(t,h) ← ψs[I (t,h) ] : s[I (t,h) ] ∈ S[I (t,h) ] .

12: return Ψ(h) := Ψ(1,h) . // Policy cover for timestep h.

48
show that as long as this penalty is carefully chosen as a function of the statistical error in the occupancy
estimates, the resulting factor sets will be endogenous while inducing sufficient amount of exploration (with
high probability).
In Appendix C, we provide a general template for designing error-tolerant algorithms that search for en-
dogenous factors using the approach described; both EndoPolicyOptimizationǫt,h and EndoFactorSelectionǫt,h are
special cases of this template.

G.2 Proof of Theorem 3.1


We now restate and prove Theorem 3.1, which shows that OSSRǫ,δ h learns an endogenous ǫ-optimal policy
cover with sample complexity depending only logarithmically on the number of factors d.
h−1
Theorem 3.1 (Sample complexity of OSSR). Suppose that OSSRǫ,δ h is invoked with {Ψ }t=1 , where each
(t)

Ψ(t) is an endogenous, η/2-approximate policy cover for layer t. Then with probability at least 1 − δ, the set
Ψ(h) returned by OSSRǫ,δ
h is an endogenous ǫ-approximate
 −2  policy cover for layer h, and has |Ψ | ≤ S . The
(h) k
4k 2 3 dSAH
algorithm uses at most O AS H k log δ ·ǫ episodes.
Proof of Theorem 3.1. We begin by defining a success event for ExoRL.
Definition G.1 (Success of OSSR at the layer h). G (h) is defined as the event in which the following properties
hold:
1. Ψ(h) is an endogenous η/2-approximate policy cover for layer h.
2. I (h) contains only endogenous factors.
In addition, we define G (<h) = ∩h−1

h′ =1 G . The following intermediate result—proven in the sequel (Ap-
(h )

pendix G.3)—serves as our starting point.


Theorem G.1 (Success of State Refinement). Fix h ∈ [H] and condition on G (<h) . Then, for any ǫ > 0
(recalling that ǫ0 := 2Sǫk H ), by setting
   
dSAH
2k 3
N = Θ AS k log · ǫ−2
0 ,
δ
ǫ,δ
OSSRh guarantees that with probability at least 1 − δ, for all t ≤ h,
1. I (t,h) ⊆ I⋆ , and Ψ(t,h) contains only endogenous policies.
2. For all s ∈ S,
   
(t+1,h) (t,h)
max dh s[I⋆ ]; µ(t) ◦t π ◦t+1 ψs[I (t+1,h) ] − dh s[I⋆ ]; µ(t) ◦t ψs[I (t,h) ] ≤ ǫ0 , (72)
π∈Π[I⋆ ]

(t,h) (t+1,h)
where we recall that ψs[I (t,h) ] ∈ Ψ
(t,h)
and ψs[I (t+1,h) ] ∈ Ψ
(t+1,h)
.
We now show that conditioned on the event in Theorem G.1, the set Ψ(h) is an endogenous, ǫ-approximate
policy cover (as long as ǫ0 is chosen to be sufficiently small). In particular, we will show that for all
s[I⋆ ] ∈ S[I⋆ ] there exists a policy ψ ∈ Ψ(h) such that

max dh (s[I⋆ ]; π) ≤ dh (s[I⋆ ]; ψ) + ǫ. (73)


π∈ΠNS [I⋆ ]

Fix s[I⋆ ] ∈ S[I⋆ ]. From first part of Theorem G.1, we have that I (1,h) ⊆ I⋆ , so we can write s[I⋆ ] =
(1,h)
(s[I (1,h) ], s[I⋆ \ I (1,h) ]) . We will show that the policy ψs[I (1,h) ] ∈ Ψ
(h)
= Ψ(1,h) maximizes the probability of
reaching s[I⋆ ] ∈ S[I⋆ ] in the sense of Eq. (73).
Define a endogenous “reward function” Rs[I⋆ ] , with

Rs[I⋆ ],h (sh [I⋆ ]) := 1 {sh [I⋆ ] := s [I⋆ ]}

49
and Rs[I⋆ ],t (·) := 0 for t 6= h. Letting rs[I⋆ ],t := Rs[I⋆ ],t (st [I⋆ ]), we can write
" h
#
X
dh (s [I⋆ ] ; π) := Eπ rs[I⋆ ],t . (74)
t=1

That is, we can viewthe state occupancy dh (s [I⋆ ] ; π) as the state value function for the ExoMDP M :=
S, A, T, Rs[I⋆ ] , H, d1 . Let π ∈ ΠNS [I⋆ ] be an endogenous policy. We let Men = S, A, Ten , Rs[I⋆ ] , H, d1,en
denote the endogenous component of this MDP, and let Qπt,en (s[I⋆ ], a) denote the associated state-action
value function for Men .
To proceed, we use the representation above within the performance difference lemma (Lemma B.6) to bound
(1,h)
the suboptimality of ψs[I (1,h) ] by a sum of "per-step" errors for each of the backward steps. In particular for

any pair of endogenous policies π, ψ ∈ ΠNS [I⋆ ], Lemma B.6 implies that

dh (s [I⋆ ] ; π) − dh (s [I⋆ ] ; ψ)
(a) X
h h i
= Eπ Qψ ψ
t,en (st [I⋆ ] , πt (st [I⋆ ]) − Qt,en (st [I⋆ ] , ψt (st [I⋆ ]))
t=1
h
X h i
≤ Es[I⋆ ]∼dt (· ; π) max Qψ ψ
t,en (s [I⋆ ] , a) − Qt,en (s [I⋆ ] , ψt (s [I⋆ ]))
a
t=1
h
X X  
≤ max dt (s [I⋆ ] ; π ′ ) max Qψ ψ
t,en (s [I⋆ ] , a) − Qt,en (s [I⋆ ] , ψt (s [I⋆ ])) .
π ′ ∈ΠNS [I⋆ ] a
t=1 s[I⋆ ]∈S[I⋆ ]

(b) h
X X  
≤ 2S k dt (s [I⋆ ] ; µ(t) ) max Qψ ψ
t,en (s [I⋆ ] , a) − Qt,en (s [I⋆ ] , ψt (s [I⋆ ]))
a
t=1 s[I⋆ ]∈S[I⋆ ]
h
X h i
= 2S k Es[I⋆ ]∼dt (· ; µ(t) ) max Qψ
t,en (s [I⋆ ] , a) − Q ψ
t,en (s [I⋆ ] , ψt (s [I⋆ ]))
a
t=1
h 
X 
(c) k (t) ′ (t)
= 2S max dh (sh [I⋆ ] ; µ ◦t π ◦t+1 ψ) − dh (sh [I⋆ ] ; µ ◦t ψ) . (75)
π ′ ∈Π[I⋆ ]
t=1

We justify the steps above as follows:


• The equality (a) follows from Lemma B.6).
• Relation (b) holds because
maxπ∈Π[I⋆ ] dt (· ; π)
≤ 2S k
dt (· ; µ(t) )
which is a consequence of Lemma A.2. In particular, we use that (i) {Ψ(t) }h−1t=1 are endogenous η/2-
approximate policy covers, (ii) either maxπ∈Π[I⋆] dt (s[I⋆ ] ; π) ≥ η or maxπ∈Π[I⋆] dt (s[I⋆ ] ; π) = 0 for
all s[I⋆ ] ∈ S[I⋆ ] by the reachability assumption, and (iii) nonnegativity:

max Qψ ψ
t,en (s [I⋆ ] , a) − Qt,en (s [I⋆ ] , ψt (s [I⋆ ])) ≥ 0.
a

• Relation (c) holds by the skolemization principle (Lemma A.9) and the tower rule for conditional
probabilities.
Recall that the event defined in Theorem G.1 (Eq. (72)) implies that for all t ≤ h,

max dh (sh [I⋆ ] ; µ(t) ◦t π ′ ◦t+1 ψs(t+1,h) − dh (sh [I⋆ ] ; µ(t) ◦t ψs(t,h) ) ≤ ǫ0 .
π ′ ∈Π[I⋆ ] [I (t+1,h) ] [I (t,h) ]

50
Plugging this bound into Eq. (75) with ψ ← ψs(1,h) , we have that for all endogenous policies π,
[I (1,h) ]

dh (s [I⋆ ] ; π) − dh (s [I⋆ ] ; ψs(1,h) ) ≤ 2S k Hǫ0 .


[I (1,h) ]

By using that ǫ0 := ǫ/2S k H and taking the maximum with respect to π ∈ Π[I⋆ ], we conclude that for all
s [I⋆ ] = (s [I (1,h) ] , s [I⋆ \ I (1,h) ]), the policy ψs(1,h) satisfies
[I (1,h) ]

max dh (s [I⋆ ] ; π) − dh (s [I⋆ ] ; ψs(1,h) ) ≤ ǫ. (76)


π∈Π[I⋆ ] [I (1,h) ]

This establishes that the set Ψ(h) is an endogenous ǫ-approximate


 −2  policy cover. With this choice for ǫ0 ,
the total sample complexity is O AS 4k H 2 k 3 log dSAH
δ · ǫ . Finally, we note that as a consequence of
Theorem G.1, we have I (1,h) ⊆ I⋆ as desired. We have |Ψ(h) | ≤ S k by construction.

G.3 Proof of Theorem G.1 (Success of State Refinement Step)


In this section we prove Theorem G.1, a supporting result used in the proof of Theorem 3.1. The result
shows for each step t, the optimization and selection phases in OSSRǫ,δ
h lead to a set of endogenous t → h
policies Ψ(t,h) , as long as certain preconditions are satisfied.
Theorem G.1 (Success of State Refinement). Fix h ∈ [H] and condition on G (<h) . Then, for any ǫ > 0
(recalling that ǫ0 := 2Sǫk H ), by setting
   
dSAH
N = Θ AS 2k k 3 log · ǫ−2
0 ,
δ
ǫ,δ
OSSRh guarantees that with probability at least 1 − δ, for all t ≤ h,
1. I (t,h) ⊆ I⋆ , and Ψ(t,h) contains only endogenous policies.
2. For all s ∈ S,
   
(t+1,h) (t,h)
max dh s[I⋆ ]; µ(t) ◦t π ◦t+1 ψs[I (t+1,h) ] − dh s[I⋆ ]; µ(t) ◦t ψs[I (t,h) ] ≤ ǫ0 , (72)
π∈Π[I⋆ ]

(t,h) (t+1,h)
where we recall that ψs[I (t,h) ] ∈ Ψ
(t,h)
and ψs[I (t+1,h) ] ∈ Ψ
(t+1,h)
.

Proof of Theorem G.1. The event G (<h) (Definition G.1) holds by assumption, which implies that the
policy sets Ψ(t) for t ∈ [h − 1] contain only endogenous policies. As a result,

µ(t) := Unf (Ψ(t) ) . (77)

is an endogenous mixture policy. To proceed, we define some intermediate success events which will be used
throughout the proof. First, for t ≤ h define

G1(t,h) := {Ψ(t,h) contains only endogenous policies, and I (t,h) ⊆ I⋆ }.

Observe when G1(t,h) holds, we can express all states s[I⋆ ] ∈ S[I⋆ ] as

s[I⋆ ] = (s[I (t,h) ], s[I⋆ \ I (t,h) ]) = (s[I (t+1,h) ], s[I⋆ \ I (t+1,h) ]) ,

since I (t+1,h) ⊆ I (t,h) ⊆ I⋆ . Next, we define an event G2(t,h) via

G2(t,h) :=
 
(t+1,h) (t,h)
∀s[I⋆ ] ∈ S[I⋆ ] : max dh (s[I⋆ ]; µ(t) ◦t π ◦t+1 ψs[I (t+1,h) ] ) − dh (s[I⋆ ]; µ (t)
◦ ψ
t s[I (t,h) ] ) ≤ ǫ 0 ,
π∈Π[I⋆ ]

51
(t+1,h) (t,h)
where we recall that ψs[I (t+1,h) ] ∈ Ψ
(t,h)
and ψs[I (t,h) ] ∈ Ψ
(t,h)
. Finally, let G (t,h) := G1(t,h) ∩ G2(t,h) . We will
prove that for all t ≤ h,
′ 
P G (t,h) | ∩ht′ =t+1 G (t ,h) , G (<h) ≥ 1 − δ/H. (78)


Taking a union bound (Lemma A.4), this implies that P ∩ht′ =1 G (t ,h) | G (<h) ≥ 1 − δ, which establishes
Theorem G.1.


Proving Eq. (78). Let t < h be fixed, and condition on ∩ht′ =t+1 G (t ,h) and G (<h) . We will show that
whenever these events hold and the estimated occupancy measures have sufficiently high accuracy, G (t,h)
holds. Formally, recalling Definition A.1, define an event
n o
(t,h)
Gstat = Db is ǫ0 -approximate with respect to (µ(t) ◦t Π[I≤k ] ◦t+1 Ψ(t+1,h) , I≤k (I (t+1,h) ) , h) . (79)
12k

Our goal is to show that conditioned on ∩ht′ =t+1 G (t ,h) and G (<h) , Gstat
(t,h)
=⇒ G (t,h) , so that

′  ′  (a)
P G (t,h) | ∩ht′ =t+1 G (t ,h) , G (<h) ≥ P Gstat
(t,h)
| ∩ht′ =t+1 G (t ,h) , G (<h) ≥ 1 − δ.

Here (a) is a consequence of Lemma A.5, which asserts that by setting


   
dSA
N = Ω AS 2k k 3 log · ǫ−2
0 , (80)
δ

the estimated state occupancies D b produced in Line 5 of OSSRǫ,δ are ǫ0 /12k-approximate with respect to
h
(µ(t) ◦t Π[I≤k ] ◦t+1 Ψ(t+1,h) , I≤k (I (t+1,h) ) , h), in the sense of Definition A.1. We formally verify that the
preconditions required to apply Lemma A.5 are satisfied at the end of the proof for completeness.

We now prove that conditioned on ∩ht′ =t+1 G (t ,h) and G (<h) , Gstat
(t,h)
=⇒ G (t,h) . This relies on two claims:
Success of EndoPolicyOptimization and success of EndoFactorSelection.

Success of EndoPolicyOptimizationǫt,h
0
. We appeal to Lemma G.1, verifying that the assumptions it requires,

(A1) and (A2), are satisfied (conditioned on ∩ht′ =t+1 G (t ,h) and G (<h) ).
(A1) µ(t) is an endogenous policy when G (<h) holds (see Eq. (77)) and Ψ(t+1,h) contains only endogenous
policies whenever G (t+1,h) holds.
b is ǫ0 /12k-approximate with respect to (Π[I≤k ], I≤k (I (t+1,h) ) , h) whenever Gstat
(A2) D (t,h)
holds.
Thus, Lemma G.1 implies that for all I ∈ I≤k (I (t,h) ) and s[I] ∈ S[I], the respective invocation of the sub-
routine EndoPolicyOptimizationǫt,h
0 (t)
outputs a policy πs[I] ∈ Γ(t) [I] that is (i) endogenous, and (ii) near-optimal
in the following one-step sense:
   
(t+1,h) (t) (t+1,h)
max dh s[I] ; µ(t) ◦t π ◦t+1 ψs[I t+1,h ] ≤ dh s[I] ; µ(t) ◦t πs[I] ◦t+1 ψs[I t+1,h ] + 4ǫ0 . (81)
π∈Π[I≤k ]

Success of EndoFactorSelectionǫt,h
0
. We appeal to Theorem E.1, verifying that the assumptions (A1)-(A3)
required by it are satisfied.
(A1) µ(t) is endogenous whenever G (<h) holds. Whenever G (t+1,h) holds, we are guaranteed that Ψ(t+1,h)
(t+1,h)
contains only endogenous policies, so that ψs[I t+1,h ] ∈ Ψ
(t+1,h)
is endogenous in particular.
b is ǫ0 /12k-approximate with respect to (Π[I≤k ], I≤k (I (t+1,h) ) , h) by Gstat
(A2) D (t,h)
.
(A3) Due to the success of EndoPolicyOptimizationǫt,h
0
(verified above), the condition in Eq. (81) is satisfied.
Hence, by Theorem E.1, EndoFactorSelectionǫt,h
0
returns a tuple (I (t,h) , Ψ(t,h) [I (t,h) ]) such that
1. I (t,h) ⊆ I⋆ .

52
2. For all s ∈ S,
   
(t) (t+1,h) (t) (t) (t+1,h)
max dh s [I⋆ ] ; µ ◦t π ◦t+1 ψs I (t+1,h) − dh s [I⋆ ] ; µ ◦t πs[I (t,h) ] ◦t+1 ψs I (t+1,h) ≤ 16ǫ0 ,
π∈Π[I⋆ ] [ ] [ ]

where we recall that ψs(t+1,h) (t)


∈ Ψ(t+1,h) and πs[I (t,h) ] ∈ Γ
(t)
[I (t,h) ].
[I (t+1,h) ]

Wrapping up. Scaling ǫ0 ← ǫ0 /16 and δ ← δ/H, and recalling that ψs(t,h) ∈ Ψ(t,h) is given by
[I (t,h) ]

ψs(t,h) (t)
:= πs[I (t,h) ] ◦t+1 ψ
(t+1,h)
,
[I (t,h) ] s[I (t+1,h) ]

we have that for all t < h,


′ (h′ )

P G (t,h) | ∩ht′ =t+1 G (t ,h) , ∩h−1
h′ =1 G ≥ 1 − δ/H,

proving the result.

Verifying conditions of Lemma A.5. We conclude by verifying that the four conditions required by Lemma A.5

hold, conditioned on ∩ht′ =t+1 G (t ,h) and G (<h) ; this justifies the application in the prequel.
 (t+1,h)
1. By construction, Ψ(t+1,h) = ψs[I (t,h) ] | s [I
(t+1,h)
] ∈ S [I (t+1,h) ] . Thus, |Ψ(t+1,h) | = |S [I (t,h) ]| ≤ S k ,
since |I (t+1,h)
| ≤ k.
 k

2. We have |Π[I≤k ]| ≤ O dk AS , since the number of factor sets of size at most k is

Xk    k
d ed 

≤ ≤ O dk , (82)

k k
k =0

k
and for any factor set I with |I| ≤ k we have |Π[I]| ≤ AS .

3. |I≤k (I (t+1,h) )| ≤ |I≤k | ≤ O dk by Eq. (82),
4. For any fixed set I with |I| ≤ k, we have |S [I]| ≤ S k .

G.4 Application of EndoPolicyOptimization in OSSR


The main guarantee for the EndoPolicyOptimizationǫt,h subroutine (Theorem D.1) implies that the policy πs[I] (t)

returned in Line 7 of OSSR is endogenous, as well as near-optimal in the following this sense:
   
(t+1,h) (t) (t+1,h)
max dh s[I] ; µ(t) ◦t π ◦t+1 ψs[I t+1,h ] ≤ dh s[I] ; µ(t) ◦t πs[I] ◦t+1 ψs[I t+1,h ] + O(ǫ).
π∈Π[I≤k ]

In this subsection we state and prove Lemma G.1, which shows that the preconditions (A1) and (A2) required
to apply Theorem D.1 are satisfied, so that the claim above indeed holds.
Lemma G.1. Fix h ∈ [H] and t ≤ h. Suppose that the following conditions hold:
(C1) µ(t) ∈ Πmix [I⋆ ] is endogenous and Ψ(t+1,h) contains only endogenous policies.
b of occupancy measures is ǫ/12k-approximate with respect to
(C2) The collection D
(µ ◦ Π[I≤k ] ◦ Ψ(t+1,h) , I≤k (I (t+1,h) ) , h) .
(t)

Then assumptions (A1) and (A2) of Theorem D.1 are satisfied when EndoPolicyOptimizationǫt,h is invoked
within OSSR, and for all I ∈ I≤k (I (t+1,h) ):

53
n o
(t)
1. The set Γ(t) [I] = πs[I] | s[I] ∈ S[I] contains only endogenous policies.
(t)
2. For all s [I] ∈ S[I], the policy πs[I] ∈ Γ [I] satisfies
 
max dh s [I] ; µ(t) ◦t π ◦t+1 ψs(t+1,h)
π∈Π[I≤k ] [I (t+1,h) ]
 
(t) (t) (t+1,h)
≤ dh s [I] ; µ ◦t πs[I] ◦t+1 ψs I (t+1,h) + 4ǫ.
[ ]

Proof of Lemma G.1. Toward proving the result,we begin with a basic observation. Fix I ∈ I≤k (I (t+1,h) )
and s [I] ∈ S [I]. Define an MDP S, A, T, Rs[I] , h where Rs[I],h = 1 {sh [I] = s[I]} and Rs[I],h′ = 0 for all
h′ 6= h. Observe that the occupancy measure for s[I] at layer h is equivalent to the (t, h) value function in
this MDP:
   
Vt,h µ(t) ◦t π ◦t+1 ψs(t+1,h) = dh s [I] ; µ (t)
◦ t π ◦ ψ (t+1,h)
t+1 s I (t+1,h) . (83)
[I (t+1,h) ] [ ]

We now show that assumptions (A1) and (A2) of Theorem D.1 hold when the theorem is invoked with this
value function, from which the result will follow.

Verifying assumption (A1) of Theorem D.1. The policies µ(t) and ψs(t+1,h) ∈ Ψ(t+1,h) are endogenous by
[I (t+1,h) ]
condition (C1). Hence, the assumptions of the restriction lemma (Lemma B.2) are satisfied, which gives
   
(t) (t+1,h) (t) (t+1,h)
max dh s [I] ; µ ◦t π ◦t+1 ψs I (t+1,h) = max dh s [I] ; µ ◦t π ◦t+1 ψs I (t+1,h)
π∈Π[I] [ ] π∈Π[Ien ] [ ]
   
⇐⇒ max Vt,h µ(t) ◦t π ◦t+1 ψs(t+1,h) = max Vt,h µ(t) ◦t π ◦t+1 ψs(t+1,h) .
π∈Π[I] [I (t+1,h) ] π∈Π[Ien ] [I (t+1,h) ]

Verifying assumption (A2) of Theorem D.1. By condition (C2), we have that D b is ǫ/12k-approximate with
respect to (µ ◦ Π[I≤k ] ◦ Ψ
(t) (t+1,h)
, I≤k (I (t+1,h)
) , h), and hence
   
dbh s [I] ; µ(t) ◦t π ◦t+1 ψs(t+1,h) − d h s [I] ; µ (t)
◦ t π ◦ t+1 ψ (t+1,h)
≤ ǫ/12k
[I (t+1,h) ] s[I (t+1,h) ]
   
⇐⇒ Vbt,h µ(t) ◦t π ◦t+1 ψs(t+1,h) − V t,h µ (t)
◦ t π ◦ t+1 ψ (t+1,h)
≤ ǫ/12k.
[I (t+1,h) ] s[I (t+1,h) ]

54
H Proof of Theorem 4.1 (Correctness of ExoRL)
In this section we formally prove Theorem 4.1, which shows that ExoRL (Algorithm 3) learns an ǫ-optimal
policy for a general ExoMDP. The correctness of ExoRL is essentially a direct corollary of the results derived
for OSSR and PSDP in Appendix G and Appendix F. The high probability guarantee for OSSR (Theorem 3.1)
η/2,δ
implies that iteratively applying OSSRh results in an endogenous η/2-approximate policy covers for every
layer h ∈ [H]. Conditioning on this event, ExoPSDP is guaranteed to find an ǫ-optimal policy with high
probability (Theorem F.1).
Theorem 4.1 (Sample complexity of ExoRL). ExoRL, when invoked with parameter, ǫ ∈ (0, 1) and δ ∈ (0, 1),
returns an ǫ-optimal policy with probability at least 1 − δ, and does so using at most
   
dSAH 
O AS 3k H 2 (S k + H 2 )k 3 log · ǫ−2 + η −2
δ
episodes.
Proof of Theorem 4.1. We first show that OSSR results in a near-optimal (endogenous) policy cover,
then show that the application of ExoPSDP is successful.

η/2,δ
Application of OSSR. Let G (h) denote the event in which OSSRh ({Ψ(t) }h−1 t=1 ) returns an endogenous η/2-
approximate policy cover Ψ(h) with |Ψ(h) | ≤ S k , andlet G (<h) := ∩h−1 G
h =1 

(h)
. Theorem 3.1 states that for all
AS 4k
H 2 3
k log ( dSAH
δ ) η/2,δ
h ≥ 2, if we condition on G (<h) , then given N = O η2 samples, OSSRh ensures that
G (h) holds probability
 at least 1 − δ . Furthermore, G (1) holds trivially for h = 1. By Lemma A.4, this implies
H
that P ∩h=1 G (h)
≥ 1 − Hδ. Scaling δ ← δ/2H, we conclude that given
!
AS 4k H 2 k 3 log dSAH
δ
NOSSR = O
η2

η/2,δ H
samples across all applications of OSSRh , the collection {Ψ(h) }h=1 is a set of endogenous η/2-approximate
policy covers with probability at least 1 − δ/2. We denote this event by GOSSR , so that P (GOSSR ) ≥ 1 − δ/2.

Application of PSDP. Conditioned on the event GOSSR , the conditions of Theorem F.1 hold, so that the
application of ExoPSDP is admissible. As a result, given
!
AS 3k H 4 k 3 log dSAH
δ
NExoPSDP = O
ǫ2

samples, ExoPSDP finds an endogenous ǫ-optimal policy. We denote this event by GExoPSDP , so that P (GExoPSDP | GOSSR ) ≥
1 − δ/2.

Concluding the proof. ExoRL returns an endogenous ǫ-optimal policy when GOSSR and GExoPSDP hold, and by
the union bound P (GOSSR ∩ GExoPSDP ) ≥ 1 − δ. The total number of samples is
 !
AS 4k H 2 k 3 log dSAH
δ AS 3k H 4 k 3 log dSAH
δ
N = NOSSR + NExoPSDP ≤ O + .
η2 ǫ2

H.1 Computational Complexity of ExoRL


The ExoRL procedure can be implemented with O(dk N S k AH) runtime. In Appendix F.3, we show that
ǫ,δ
ExoPSDP can be implemented in runtime O(dk N S k AH). Similarly, OSSRh can be implemented with

55
runtime O(dk N S k A). The most computationally demanding aspect of OSSR is optimizing the function
Vbt,H (µ(t) ◦t π ◦t+1 π
bt+1:H ) over the policy class Π[I≤k ]. As shown in Appendix F.3, this procedure can be
implemented with runtime O(dk N S k A), which is repeated for H times in ExoRL.

56

You might also like