Nearly Horizon-Free Offline Reinforcement Learning

We revisit offline reinforcement learning on episodic time-homogeneous tabular Markov Decision Processes (MDPs) with S states, A actions, and planning horizon H. Given N collected episodes of data with minimum cumulative reaching probability d_m, we obtain the first set of nearly H-free sample complexity bounds for evaluation and planning using the empirical MDPs: 1. For offline evaluation, we obtain an Õ(√(1/(N d_m))) error rate, which matches the lower bound and has no additional dependency on (S, A) in the higher-order term, unlike previous works <cit.>. 2. For offline policy optimization, we obtain an Õ(√(1/(N d_m)) + S/(N d_m)) error rate, improving upon the best known result of <cit.>, which has additional H and S factors in the main term. Furthermore, this bound approaches the Ω(√(1/(N d_m))) lower bound up to logarithmic factors and a higher-order term. To the best of our knowledge, these are the first set of nearly horizon-free bounds in offline reinforcement learning.


1 Introduction

Reinforcement Learning (RL) aims to learn to make sequential decisions that maximize long-term reward in an unknown environment, and has demonstrated many successes in games (Mnih et al., 2013; Silver et al., 2016), robotics (Andrychowicz et al., 2020), and automatic algorithm design (Dai et al., 2017). These successes rely on deploying algorithms that can directly interact with the environment to improve the policy in a trial-and-error manner. However, such direct interaction with the real environment to collect samples can be expensive or even impossible in other real-world applications, e.g., education (Mandel et al., 2014), health and medicine (Murphy et al., 2001; Gottesman et al., 2019), conversational AI (Ghandeharioun et al., 2019), and recommendation systems (Chen et al., 2019). Instead, we are given a collection of logged experience generated by potentially multiple and possibly unknown behavior policies.

This sampling difficulty inspired the recent development of offline reinforcement learning (Levine et al., 2020), which includes evaluating different policies, known as offline policy evaluation (OPE), and improving the current policy, known as offline policy optimization (OPO), using only the given experience without any further interaction. OPE, as well as OPO, is notoriously difficult, as unbiased estimators of the policy value may suffer exponentially increasing variance in terms of the horizon in the worst case (Li et al., 2015; Jiang and Li, 2016).
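As a hedged illustration of this blow-up (a toy calculation, not from the paper): for trajectory-wise importance sampling with i.i.d. per-step weights w = π(a|s)/π_b(a|s) satisfying E[w] = 1, the variance of the product of H weights is E[w²]^H − 1, which grows exponentially in H. The two-action policies below are hypothetical.

```python
# Toy illustration: variance of trajectory-wise importance weights grows
# exponentially with horizon H -- the "curse of horizon".
# Per-step weight w = pi(a|s) / pi_b(a|s); for iid steps with E[w] = 1,
# Var(prod_{h=1}^H w_h) = E[w^2]^H - 1.

def is_weight_variance(second_moment: float, horizon: int) -> float:
    """Variance of a product of `horizon` iid weights with E[w] = 1."""
    return second_moment ** horizon - 1.0

# Behavior policy picks two actions with probs (0.5, 0.5); target with (0.9, 0.1).
# Per-step weights are 1.8 and 0.2, so E[w] = 1 and
# E[w^2] = 0.5 * 1.8**2 + 0.5 * 0.2**2 = 1.64.
second_moment = 0.5 * (0.9 / 0.5) ** 2 + 0.5 * (0.1 / 0.5) ** 2
for H in (1, 10, 20):
    print(H, is_weight_variance(second_moment, H))
```

Already at H = 20 the variance exceeds 10^4, while at H = 1 it is only 0.64.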

To overcome the “curse of horizon” in OPE, marginalized importance sampling (MIS) based estimators were introduced (Liu et al., 2018; Xie et al., 2019), and further improved by (Nachum et al., 2019a; Uehara et al., 2020; Yang et al., 2020) for practical offline settings without knowledge of the behavior policies. The basic idea of this family of estimators is to estimate the marginal state-action density ratio between the target policy and the empirical data to compensate for the mismatch. Algorithmically, the marginal density ratio estimation can be implemented by a plug-in estimator (Xie et al., 2019; Yin and Wang, 2020), a temporal-difference update (Hallak and Mannor, 2017; Gelada and Bellemare, 2019), or by solving a min-max optimization (Nachum et al., 2019a; Uehara et al., 2020; Zhang et al., 2020a; Yang et al., 2020). Straightforwardly, these OPE estimators can be used as one component of offline policy optimization, resulting in the algorithms of (Nachum et al., 2019b; Yin et al., 2020; Liu et al., 2020a).

The motivation of this work is to investigate in depth the statistical performance of offline MIS-based RL algorithms. To identify the essential statistical factors behind the superior empirical performance of MIS-based methods, and to avoid unnecessary complication, our analysis focuses on offline RL under the episodic time-homogeneous tabular Markov decision process (MDP) model with S states, A actions, and per-step reward upper bounded by 1. Our analysis exploits the equivalence between the model-based plug-in estimator, LSTDQ (Duan and Wang, 2020), and MIS estimators, and therefore applies to all of these methods in our setting.

With the additional assumption that the total reward of each episode is upper bounded by 1 almost surely, our main contributions can be summarized as follows:

• For offline evaluation, we obtain a finite-sample error bound of Õ(√(1/(N d_m))), where N is the number of episodes and d_m is the minimum density of a specific state-action pair in the offline data, which lies in (0, 1/(SA)]. This matches the lower bound up to logarithmic factors. We emphasize that the bound has no additional dependency on (S, A) in the higher-order term, unlike the known results of Yin and Wang (2020); Yin et al. (2020).

• For offline policy optimization, we obtain an asymptotically optimal performance gap of Õ(√(1/(N d_m)) + S/(N d_m)), which matches the lower bound up to logarithmic factors and a higher-order term. This result improves the best known result of Cui and Yang (2020), which has additional H or S factors in the main term.

• Technically, we generalize the recursion-based method introduced in Zhang et al. (2020b) to the offline setting with a plug-in solver, which could be of independent interest.

To the best of our knowledge, these are the first set of nearly horizon-free bounds for both offline policy evaluation and offline policy optimization for time-homogeneous MDP.

2 Related Work

In this section, we briefly discuss the related literature in three categories, i.e., offline evaluation, offline policy optimization, and horizon-free online reinforcement learning. Notice that, in the setting that assumes an additional generative model (a.k.a. a simulator), typical model-based algorithms first query an equal number of samples from each state-action pair, and then perform offline evaluation/policy optimization based on the queried data. Thus we view reinforcement learning with a generative model as a special instance of offline reinforcement learning. We emphasize that our results match the lower bounds and achieve horizon-free dependency for both offline evaluation and offline policy optimization. To make the comparison fair, for methods and analyses that do not assume Assumption 3.4, we scale the error and sample complexity by assuming the per-step reward is upper bounded by 1 − γ and 1/H under the infinite-horizon and finite-horizon settings, respectively.

2.1 Offline Evaluation

Offline evaluation has drawn much attention recently, due to its broad applications in real scenarios. For OPE in infinite-horizon MDPs, Li et al. (2020) show that the plug-in estimator achieves, under Assumption 3.4, an error rate that matches the lower bound in (Pananjady and Wainwright, 2020) up to logarithmic factors. For OPE in finite-horizon time-inhomogeneous MDPs, Yin and Wang (2020); Yin et al. (2020) provide an error bound for the MIS estimator under the uniform reward assumption, which matches the lower bound (Jiang and Li, 2016) up to logarithmic factors and an additional higher-order term. We remark that, in Jiang and Li (2016); Yin and Wang (2020), the authors provide stronger Cramer-Rao type upper and lower bounds that depend on the specific MDP instance, which cannot be evaluated in practice. We summarize the main results for offline evaluation in the tabular case in Table 1.

Beyond the tabular setting, Duan and Wang (2020) consider the performance of the plug-in estimator with linear function approximation under the linear MDP assumption, and Kallus and Uehara (2019, 2020) provide more detailed analyses of the statistical properties of different kinds of estimators under various assumptions; these are not directly comparable to our work.

2.2 Offline Policy Optimization

Offline policy optimization for infinite-horizon MDPs dates back to Azar et al. (2013). Li et al. (2020) recently show that a perturbed version of model-based planning finds an ε-optimal policy in infinite-horizon MDPs with a number of queried transitions that matches the lower bound up to logarithmic factors, when the total reward is upper bounded by 1. For the finite-horizon time-inhomogeneous MDP setting, Yin et al. (2020) show that model-based planning identifies an ε-optimal policy with an episode complexity that matches the lower bound for time-inhomogeneous MDPs up to logarithmic factors. When it comes to finite-horizon time-homogeneous MDPs, the only known result (Cui and Yang, 2020) provides an episode complexity for model-based planning that is additional H or S factors away from the lower bound. We summarize the results for offline policy optimization in the tabular case in Table 2.

There are also works considering offline policy optimization with function approximation (e.g. Chen and Jiang, 2019; Xie and Jiang, 2020). Recently, Liu et al. (2020b) also introduce a new perspective on performing offline policy optimization within a local policy set when the offline data is not sufficiently exploratory.

We also remark that a concurrent work (Yin et al., 2021) considers solving offline policy optimization with model-free Q-learning with variance reduction. Under the assumption that the per-step reward is upper bounded by 1/H, their proposed algorithm matches the sample complexity lower bound up to logarithmic factors without an explicit higher-order term. However, as several papers have discussed (e.g. Li et al., 2020), model-free algorithms generally suffer from a sample size barrier, i.e., they need a minimum number of episodes of data just to run the algorithm, and thus suffer from a constrained range of the target accuracy ε. In contrast, model-based algorithms can accommodate a much broader range of sample sizes, and thus a much broader range of ε. How to translate their results to the setting where the total reward is upper bounded by 1 remains unclear. We also remark that they provide a more fine-grained discussion on general forms of the minimum reaching probability.

2.3 Horizon-Free Online Reinforcement Learning

Recently, several works have obtained nearly horizon-free sample complexity bounds in online reinforcement learning. Whether, in the time-homogeneous setting, the sample complexity needs to scale polynomially with H was an open problem raised by Jiang and Agarwal (2018). The problem was addressed by Wang et al. (2020), who proposed an ε-net over the optimal policies and a simulation-based algorithm to obtain a sample complexity that scales only logarithmically with H, though their dependency on S and A is not optimal and their algorithm is not computationally efficient. The sample complexity bound was later substantially improved by Zhang et al. (2020b), who obtained a bound that nearly matches the contextual bandit (tabular MDP with H = 1) lower bound up to logarithmic factors and a lower-order term. Their key ideas are 1) a new bonus function and 2) a recursion-based approach to obtain a tight bound on the sample complexity. We generalize their recursion-based analysis to the offline setting in our paper.

3 Problem Setup

3.1 Notation

Throughout this paper, we use [n] to denote the set {1, 2, …, n} for n ∈ ℕ, and Δ(𝒳) to denote the set of all probability measures over the event set 𝒳. Moreover, for simplicity, we use ι to denote a polylogarithmic factor of the relevant parameters (which can change with the context), where δ is the failure probability. We use Õ and Ω̃ to denote upper and lower bounds up to logarithmic factors.

3.2 Markov Decision Process

A Markov Decision Process (MDP) is one of the most standard models studied in the reinforcement learning community, usually denoted as ℳ = (𝒮, 𝒜, R, P, μ), where 𝒮 is the state space, 𝒜 is the action space, R is the (possibly random) reward, P is the transition function, and μ is the initial state distribution. We additionally define r(s, a) := 𝔼[R(s, a)] to denote the expected reward.

We focus on the episodic tabular MDP with number of states S, number of actions A, horizon length H,¹ and the time-homogeneous setting,² in which P and R do not depend on the level h. ¹A common belief is that one can always transfer results between the episodic time-homogeneous setting and the infinite-horizon setting by substituting the horizon H in the episodic setting with the “effective horizon” 1/(1 − γ) of the infinite-horizon setting. However, this argument does not always hold; for example, the dependency-decoupling technique used in Agarwal et al. (2020); Li et al. (2020) cannot be directly applied in the episodic setting. ²Some previous works consider the time-inhomogeneous setting (e.g. Jin et al., 2018), where P_h and R_h can vary across different h. It is noteworthy that an additional factor in the sample complexity is needed to identify an ε-optimal policy for time-inhomogeneous MDPs compared with time-homogeneous MDPs. Transferring an improved analysis from the time-homogeneous setting to the time-inhomogeneous setting is easy (we only need to replace P with P_h), but not vice versa, as an analysis for the time-inhomogeneous setting typically does not exploit the invariant transition sufficiently.

A (potentially non-stationary) policy is defined as π = {π_h}_{h∈[H]}, where π_h : 𝒮 → Δ(𝒜). We can then define the following value function and action-value function (i.e. the Q-function):

  V^π_h(s) := 𝔼_{P,π}[ ∑_{t=h}^{H} R(s_t, a_t) | s_h = s ],   (1)
  Q^π_h(s,a) := 𝔼_{P,π}[ ∑_{t=h}^{H} R(s_t, a_t) | (s_h, a_h) = (s, a) ],   (2)

which are the expected cumulative rewards under the transition P and policy π, starting from state s and state-action pair (s, a), respectively. It is easy to see that:

  V^π_h(s) = 𝔼_{a∼π(·|s)}[Q^π_h(s, a)],   (3)

as well as the Bellman equation (define V^π_{H+1}(s) := 0):

  Q^π_h(s, a) = r(s, a) + 𝔼_{s′∼P(·|s,a)}[V^π_{h+1}(s′)].   (4)

Notice that, even though P and r are invariant under the change of h, V^π_h and Q^π_h always depend on h in the episodic setting, which introduces additional technical difficulties in the analysis compared with the infinite-horizon setting.

The expected cumulative reward of ℳ under policy π is defined as:

  v^π := 𝔼_{s∼μ}[V^π_1(s)],

and the ultimate goal of reinforcement learning is to find the optimal policy of ℳ, which is defined as:

  π* = argmax_π v^π.

We additionally define the following reaching probabilities:

  ξ^π_h(s) := ℙ_{μ,P,π}(s_h = s),   ξ^π_h(s, a) := ℙ_{μ,P,π}(s_h = s, a_h = a),

which represent the probability of reaching state s and state-action pair (s, a) at time step h. It is easy to show that

  ∑_s ξ^π_h(s) = 1, ∀h ∈ [H],   ∑_{s,a} ξ^π_h(s, a) = 1, ∀h ∈ [H],

and we also have the following relations between ξ^π_h(s) and ξ^π_h(s, a):

  ξ^π_h(s, a) = ξ^π_h(s) π(a|s),   ξ^π_{h+1}(s′) = ∑_{s,a} ξ^π_h(s, a) P(s′|s, a).

With the definition of ξ^π_h(s, a) at hand, we can conclude the following equation:

  v^π = ∑_{s,a} ( ∑_{h∈[H]} ξ^π_h(s, a) ) r(s, a),   (5)

which provides a dual perspective on policy evaluation (Yang et al., 2020).
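The forward recursion for ξ^π_h and the dual identity (5) can be checked numerically. The following is a minimal sketch on a randomly generated toy MDP (all sizes and names are our own, not from the paper): we compute the reaching probabilities forward, sum up occupancy-weighted rewards, and compare against backward dynamic programming.

```python
import numpy as np

# Toy check of the dual identity v^pi = sum_{s,a} (sum_h xi_h(s,a)) r(s,a)
# against backward dynamic programming on a small random MDP.
S, A, H = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'
r = rng.uniform(0, 1.0 / H, size=(S, A))         # per-step reward <= 1/H => total <= 1
mu = np.ones(S) / S                              # initial state distribution
pi = rng.dirichlet(np.ones(A), size=S)           # stationary policy pi[s] over actions

# Forward recursion: xi_h(s,a) = xi_h(s) pi(a|s); xi_{h+1}(s') = sum xi_h(s,a) P(s'|s,a)
xi_s = mu.copy()
v_dual = 0.0
for h in range(H):
    xi_sa = xi_s[:, None] * pi                   # xi_h(s, a)
    v_dual += (xi_sa * r).sum()
    xi_s = np.einsum("sa,sat->t", xi_sa, P)

# Backward DP: V_{H+1} = 0; Q_h = r + P V_{h+1}; V_h(s) = sum_a pi(a|s) Q_h(s,a)
V = np.zeros(S)
for h in range(H):
    Q = r + np.einsum("sat,t->sa", P, V)
    V = (pi * Q).sum(axis=1)
v_dp = float(mu @ V)
print(abs(v_dual - v_dp))                        # agreement up to float error
```

The two values agree up to floating-point error, mirroring the equivalence between the occupancy-measure (MIS) view and the model-based DP view used throughout the paper.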

3.3 Offline Reinforcement Learning

Generally, P and r are not revealed to the learner, which means we can only learn about ℳ and identify the optimal policy with data from different kinds of sources. In offline reinforcement learning, the learner only has access to a collection of data 𝒟 = {(s_i, a_i, r_i, s′_i)}_{i∈[n]}, where r_i ∼ R(s_i, a_i) and s′_i ∼ P(·|s_i, a_i), collected over N episodes with a (known or unknown) behavior policy π^BEH (so that n = NH). For simplicity, define n(s, a) as the number of data points with (s_i, a_i) = (s, a), and n(s, a, s′) as the number of data points with (s_i, a_i, s′_i) = (s, a, s′).

With 𝒟, the learner is generally asked to perform two kinds of tasks. The first is offline evaluation (a.k.a. off-policy evaluation), which aims to estimate v^π for a given π. The second is offline improvement, which aims to find a π̂ that performs well on ℳ. We are interested in the statistical limits due to the limited amount of data, and in how to approach these statistical limits with simple and computationally efficient algorithms.

3.4 Assumptions

Here we summarize the assumptions we use throughout the paper. [Bounded Total Reward] For any policy π, the total reward of an episode satisfies ∑_{h∈[H]} R(s_h, a_h) ≤ 1 almost surely, with R(s, a) ≥ 0 for all (s, a). Notice that this also means V^π_h(s) ≤ 1 almost surely. This is the key assumption used in Wang et al. (2020); Zhang et al. (2020b) to escape the curse of horizon in episodic reinforcement learning. As discussed in Jiang and Agarwal (2018); Wang et al. (2020); Zhang et al. (2020b), this assumption is also more general than the uniformly bounded reward assumption R(s, a) ≤ 1/H. Thus, all of the results in this paper can be generalized to the uniformly bounded reward setting with a proper scaling of the bounded total reward.

[Data Coverage] For all (s, a), the cumulative reaching probability of the behavior policy satisfies (1/H) ∑_{h∈[H]} ξ^{π^BEH}_h(s, a) ≥ d_m > 0. This assumption has been used in Yin and Wang (2020); Yin et al. (2020) and is similar to the concentration coefficient assumption originating from Munos (2003). Intuitively, the performance of offline reinforcement learning should depend on d_m, as state-action pairs with fewer visitations introduce more uncertainty.

Notice that d_m ≤ 1/(SA). Assuming access to the generative model, we can query an equal number of samples from each state-action pair, in which case d_m = 1/(SA). For offline data sampled with a fixed behavior policy π^BEH, we can view

  d_m ≈ (1/H) min_{s,a} ∑_{h∈[H]} ξ^{π^BEH}_h(s, a),

which measures the quality of exploration of π^BEH; when the number of episodes N is large enough, by standard concentration, we know n(s, a) ≳ N H d_m, ∀(s, a).
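A hedged sketch of this quantity (toy MDP and uniform behavior policy of our own choosing): compute the cumulative reaching probabilities of π^BEH forward, take the minimum over (s, a), and verify the sanity bound d_m ≤ 1/(SA).

```python
import numpy as np

# Toy computation of d_m ~ (1/H) min_{s,a} sum_h xi_h(s,a) for a behavior policy,
# plus the check d_m <= 1/(S*A) (the cumulative probabilities sum to H over SA pairs).
S, A, H = 3, 2, 5
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition kernel
mu = np.ones(S) / S
pi_b = np.ones((S, A)) / A                       # uniform behavior policy

cum = np.zeros((S, A))                           # accumulates sum_h xi_h(s, a)
xi_s = mu.copy()
for h in range(H):
    xi_sa = xi_s[:, None] * pi_b
    cum += xi_sa
    xi_s = np.einsum("sa,sat->t", xi_sa, P)

d_m = cum.min() / H
print(d_m, 1.0 / (S * A))                        # d_m never exceeds 1/(S*A)
```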

4 Offline Evaluation

In this section, we consider the offline evaluation problem, which is the basis of the offline improvement. We first introduce the plug-in estimator we consider, which is equivalent to different kinds of estimators that are widely used in practice. Then we show the error bound of the plug-in estimator, and provide the proof sketch of the error bound.

4.1 Estimator

Here we consider the plug-in estimator, which first builds the empirical MDP from the data:

  P̂(s′|s, a) = n(s, a, s′) / n(s, a),   (6)
  r̂(s, a) = ∑_{i∈[n]} r_i · 1{(s_i, a_i) = (s, a)} / n(s, a),   (7)

where 1{·} is the indicator function, and then correspondingly defines v̂^π, V̂^π_h, and Q̂^π_h by substituting the P and r in v^π, V^π_h, and Q^π_h with P̂ and r̂. This can be done with dynamic programming; see Algorithm 1 for the details.
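Since Algorithm 1 is only referenced here, the following is an assumed minimal implementation of the plug-in pipeline: build (P̂, r̂) via (6)-(7) from transition tuples, then evaluate a policy by backward dynamic programming on the empirical model (the fallback for unvisited pairs is our own choice).

```python
import numpy as np

def empirical_mdp(data, S, A):
    """Build (P_hat, r_hat) from an iterable of (s, a, r, s_next) tuples."""
    n_sas = np.zeros((S, A, S))
    r_sum = np.zeros((S, A))
    for s, a, r, s2 in data:
        n_sas[s, a, s2] += 1
        r_sum[s, a] += r
    n_sa = n_sas.sum(axis=2)
    safe = np.maximum(n_sa, 1)                   # avoid division by zero
    P_hat = n_sas / safe[:, :, None]
    P_hat[n_sa == 0] = 1.0 / S                   # unvisited (s,a): arbitrary uniform fallback
    r_hat = r_sum / safe
    return P_hat, r_hat

def evaluate(P_hat, r_hat, pi, H):
    """Backward DP for V_hat^pi_1 with V_{H+1} = 0; pi[s] is a distribution over actions."""
    S = r_hat.shape[0]
    V = np.zeros(S)
    for _ in range(H):
        Q = r_hat + np.einsum("sat,t->sa", P_hat, V)
        V = (pi * Q).sum(axis=1)
    return V                                     # v_hat^pi = mu @ V

# toy check: state 0 always transitions to state 1 with reward 0.5
data = [(0, 0, 0.5, 1), (0, 0, 0.5, 1), (1, 0, 0.0, 0)]
P_hat, r_hat = empirical_mdp(data, S=2, A=1)
V = evaluate(P_hat, r_hat, pi=np.ones((2, 1)), H=2)
print(V[0])   # 0.5 = r_hat(0,0) + V_hat_2(1), and V_hat_2(1) = 0
```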

We correspondingly introduce the reaching probabilities in the empirical MDP as:

  ξ̂^π_h(s) := ℙ_{μ,P̂,π}(s_h = s),   ξ̂^π_h(s, a) := ℙ_{μ,P̂,π}(s_h = s, a_h = a).

The plug-in estimator has been studied in Duan and Wang (2020) under the assumption of linear transitions, and it is known that the plug-in estimator is equivalent to the MIS estimator proposed in Yin and Wang (2020) and to a certain version of the DualDICE estimator with batch updates proposed in Nachum et al. (2019a), due to the observation that

  v̂^π = ∑_{s,a} ( ∑_{h∈[H]} ξ̂^π_h(s, a) ) r̂(s, a).

4.2 Theoretical Guarantee

Under Assumption 3.4 and Assumption 3.4, suppose N ≳ ι/d_m; then

  |v^π − v̂^π| ≤ √( ι/(N d_m) )

holds with probability at least 1 − δ, where N is the number of episodes, d_m is the minimum visiting probability, and ι absorbs the polylog factors.

Remark

Theorem 4.2 shows that, even with the simplest plug-in estimator, we can match the minimax lower bound for offline evaluation in time-homogeneous MDPs up to logarithmic factors³, which means that, for offline evaluation with the plug-in estimator, the time-homogeneous MDP is not harder than bandits. ³To the best of our knowledge, no lower bound has been presented for the finite-horizon time-homogeneous setting, so we provide a minimax lower bound in Theorem C.1 in the Appendix.

Remark

The assumption N ≳ ι/d_m is mild and necessary: with fewer episodes, there can exist some under-explored state-action pair that unavoidably leads to a constant error, as suggested by the lower bound. Such an assumption is also needed in previous works like Yin and Wang (2020); Yin et al. (2020), even though their bounds are policy-dependent in time-inhomogeneous MDPs.

Remark

We remark that, although problem-dependent upper bounds as in Jiang and Li (2016); Kallus and Uehara (2019) and computable confidence bounds as in Duan and Wang (2020) are possible for finite-horizon time-inhomogeneous MDPs, those results cannot be directly translated into nearly horizon-free bounds, which is out of the scope of this paper.

4.3 Proof Sketch

The proofs of all of the technical lemmas can be found in Appendix B. Our proof is organized as follows: we first decompose the estimation error into the errors introduced by reward estimation and transition estimation with Lemma 4.3; Lemma 4.3 then provides an upper bound for the error introduced by r̂. For the error introduced by P̂, we first show that it can be upper bounded by a total-variance term (in (9)). A naive bound for the total variance would introduce additional H factors, so we recursively upper bound the total variance via higher moments (see Lemma 4.3). Solving the recursion in Lemma 4.3 and putting everything together, we eventually obtain the bound in Theorem 4.2. [Value Difference Lemma]

  v^π − v̂^π = ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) [ r(s, a) − r̂(s, a) + ∑_{s′} (P(s′|s, a) − P̂(s′|s, a)) V^π_{h+1}(s′) ].   (8)

The following lemma provides an upper bound on the estimation error introduced by r̂, i.e., the first term in (8). [Error from Reward Estimation] Suppose Assumption 3.4 holds; then

  | ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) [r(s, a) − r̂(s, a)] | ≤ √(ι/(N d_m)) + ι/(N d_m)

holds with probability at least 1 − δ. We then use a recursive method to bound the error introduced by P̂, i.e., the second term in (8).

Denote

  Δ_1 := | ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) · ∑_{s′} (P(s′|s, a) − P̂(s′|s, a)) V^π_{h+1}(s′) |.

With Bernstein’s inequality, we know that with high probability

  Δ_1 ≤ ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) [ √( Var_{P(s,a)}(V^π_{h+1}(s′)) · ι / n(s, a) ) + ι/n(s, a) ]
      ≤ √(ι/(N d_m)) · √( ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) Var_{P(s,a)}(V^π_{h+1}(s′)) ) + ι/(N d_m),   (9)

where the second inequality is due to the Cauchy-Schwarz inequality together with Assumption 3.4.

We then consider the total-variance term in (9). A naive bound of this term would lead to an extra H factor. Therefore, we upper bound it with higher-order variances recursively through the following lemma. [Variance Recursion] Assume 0 ≤ V_h(s) ≤ 1 for all h ∈ [H], s ∈ 𝒮; then we have that

  ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) Var_{P(s,a)}(V_{h+1}(s′)^{2^i})
    ≤ | ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) ∑_{s′} [P(s′|s, a) − P̂(s′|s, a)] V_{h+1}(s′)^{2^{i+1}} |
      + 2^{i+1} [ ∑_s μ(s) V_1(s) + | ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) ∑_{s′} [P(s′|s, a) − P̂(s′|s, a)] V_{h+1}(s′) | ].   (10)

Again, by Bernstein's inequality and the Cauchy-Schwarz inequality applied to the first term in (10), with high probability we have

  | ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) ∑_{s′} [P(s′|s, a) − P̂(s′|s, a)] V^π_{h+1}(s′)^{2^{i+1}} |
    ≤ √(ι/(N d_m)) · √( ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) Var_{P(s,a)}(V^π_{h+1}(s′)^{2^{i+1}}) ) + ι/(N d_m).   (11)

Define

  V_1(i) := ∑_{h∈[H]} ∑_{s,a} ξ̂^π_h(s, a) Var_{P(s,a)}(V^π_{h+1}(s′)^{2^i});

applying Lemma 4.3 with (11), we have the recursion as

  V_1(i) ≤ √( (ι/(N d_m)) V_1(i+1) ) + ι/(N d_m) + 2^{i+1} (Δ_1 + v^π).   (12)

Notice that V_1(i) ≤ H for all i ≥ 0. Now we can solve the recursion with the following lemma. For the recursion formula:

  V(i) ≤ √(λ_1 V(i+1)) + λ_1 + 2^{i+1} λ_2,

with V(i) ≤ H for all i ≥ 0, we have that

  V(0) ≤ 6(λ_1 + λ_2),

and we need to apply the recursion at most O(log₂ H) times.

Remark

We want to emphasize that the recursion needs to be applied at most O(log₂ H) times. Thus, by the union bound, the recursion only introduces an additional polylogarithmic factor in the error, which can be absorbed by ι. For simplicity, we still use ι to denote the polylog factors in the following derivation.
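The collapse of the recursion can be illustrated numerically (toy numbers of our own, not from the paper): starting from the trivial bound V(K) ≤ H at the deepest level and unrolling backwards, the bound on V(0) falls to the order of λ₁ + λ₂, independently of H.

```python
import math

# Numeric illustration of the moment recursion
#   V(i) <= sqrt(lam1 * V(i+1)) + lam1 + 2**(i+1) * lam2,
# unrolled backwards from the trivial bound V(depth) <= H.
def solve_recursion(H: float, lam1: float, lam2: float, depth: int) -> float:
    bound = H                                    # trivial bound at the deepest level
    for i in reversed(range(depth)):
        bound = math.sqrt(lam1 * bound) + lam1 + 2 ** (i + 1) * lam2
    return bound

H, lam1, lam2 = 100.0, 1e-3, 1e-3
b0 = solve_recursion(H, lam1, lam2, depth=10)
print(b0, 6 * (lam1 + lam2))                     # b0 falls below 6 * (lam1 + lam2)
```

Even though the trivial bound is H = 100, ten levels of the recursion bring V(0) below 6(λ₁ + λ₂) = 0.012, matching the statement of the lemma on these toy numbers.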

Applying Lemma 4.3 with λ_1 = ι/(N d_m) and λ_2 = Δ_1 + v^π, we have that

  V_1(0) = O( ι/(N d_m) + Δ_1 + v^π ),

and also notice from (9) that

  Δ_1 ≤ √( (ι/(N d_m)) V_1(0) ) + ι/(N d_m).

Combining these two inequalities, we have that

  Δ_1 ≤ O( √( (ι/(N d_m)) (ι/(N d_m) + Δ_1 + v^π) ) ) + ι/(N d_m).

Suppose N ≳ ι/d_m; then with Assumption 3.4 (which implies v^π ≤ 1), we have that (see Appendix A for the details)

  Δ_1 ≤ O( √(ι/(N d_m)) + ι/(N d_m) ).   (13)

Combining (13) with Lemma 4.3 and the decomposition (8), we conclude the proof of Theorem 4.2.

5 Offline Policy Optimization

In this section, we further consider the offline policy optimization problem, which is the ultimate goal of offline reinforcement learning. We introduce the model-based planning algorithm, probably the simplest algorithm for offline policy optimization. We then analyze the performance gap between the policy obtained by model-based planning and the optimal policy.

5.1 Model-Based Planning

We consider the optimal policy on the empirical MDP ℳ̂, which can be defined as

  π̂* = argmax_π v̂^π.   (14)

Notice that π̂* can be obtained by dynamic programming on the empirical MDP, which is also known as model-based planning. We illustrate the algorithm in Algorithm 2. We remark that our analysis is independent of the algorithm used for solving (14). In other words, the results also apply to optimization-based planning with the empirical MDP (Puterman, 2014; Mohri et al., 2012), as long as it solves (14).
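Since Algorithm 2 is only referenced here, the following is an assumed minimal sketch of model-based planning: backward induction on a given empirical model (P̂, r̂), which solves (14) exactly for finite MDPs and returns a (possibly non-stationary) greedy policy.

```python
import numpy as np

def plan(P_hat, r_hat, H):
    """Backward induction on the empirical MDP; returns (policy, V_hat_1)."""
    S, A = r_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)         # possibly non-stationary policy
    for h in reversed(range(H)):
        Q = r_hat + np.einsum("sat,t->sa", P_hat, V)
        policy[h] = Q.argmax(axis=1)             # greedy action per state
        V = Q.max(axis=1)
    return policy, V                             # V is V_hat^*_1

# toy check: every action stays put; action 1 in state 0 collects reward 0.5 per step
P_hat = np.zeros((2, 2, 2))
for s in range(2):
    P_hat[s, :, s] = 1.0
r_hat = np.array([[0.0, 0.5], [0.0, 0.0]])
policy, V1 = plan(P_hat, r_hat, H=2)
print(V1[0])   # 1.0: take action 1 at both steps
```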

5.2 Theoretical Guarantee

Under Assumption 3.4 and Assumption 3.4, suppose N ≳ ι/d_m; then with probability at least 1 − δ we have (recall ι absorbs polylog factors):

  | v^{π*} − v^{π̂*} | ≤ √(ι/(N d_m)) + Sι/(N d_m).
Remark

Theorem 5.2 provides a bound approaching the minimax lower bound up to logarithmic factors and a higher-order term, which shows that the error of offline policy optimization does not scale polynomially with the horizon. Notice that once N is large enough for the main term to dominate the higher-order term, we obtain an error bound of Õ(√(1/(N d_m))), which can be translated to a sample complexity that matches the best known result for the finite-horizon time-homogeneous setting (Zhang et al., 2020b). We conjecture that the additional S factor in the higher-order term is only an artifact of our analysis (see Lemma 5.3) and can be eliminated with a more delicate analysis. We leave this as an open question.

5.3 Proof Sketch

We first make the following standard decomposition:

  v^{π*} − v^{π̂*} = (v^{π*} − v̂^{π*}) + (v̂^{π*} − v̂^{π̂*}) + (v̂^{π̂*} − v^{π̂*})
                 ≤ (v^{π*} − v̂^{π*})  [Error on Fixed Policy]  +  (v̂^{π̂*} − v^{π̂*})  [Error on Data-Dependent Policy],   (15)

where the inequality uses v̂^{π*} − v̂^{π̂*} ≤ 0, by the optimality of π̂* on the empirical MDP.

The first term characterizes the difference in evaluating the fixed optimal policy π* on the original MDP versus the empirical MDP, and the second term characterizes the difference in evaluating the data-dependent planning result π̂* on the original MDP versus the empirical MDP.

We can directly apply Theorem 4.2 to bound the first term in (15). However, as π̂* has a complicated statistical dependency on the empirical MDP, we cannot apply Theorem 4.2 to the second term in (15), the evaluation error on a data-dependent policy. Notice that a direct application of the absorbing-MDP technique introduced in Agarwal et al. (2020); Li et al. (2020) to the second term would introduce additional H or S factors in the main term, as shown in Cui and Yang (2020). Thus, we use a recursive method to keep the main term tight while only introducing an additional S factor in the higher-order term, which keeps the final error horizon-free.

Similar to the offline evaluation case, we make the following decomposition based on Lemma 4.3:

  v̂^{π̂*} − v^{π̂*} = ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) [ Δ_r(s, a) + Δ_P(s, a) + Δ_{PV}(s, a) ],

where

  Δ_r(s, a) := r̂(s, a) − r(s, a),
  Δ_P(s, a) := ∑_{s′} (P̂(s′|s, a) − P(s′|s, a)) V^{π*}_{h+1}(s′),
  Δ_{PV}(s, a) := ∑_{s′} (P̂(s′|s, a) − P(s′|s, a)) (V^{π̂*}_{h+1}(s′) − V^{π*}_{h+1}(s′)).

For the inner products of ξ̂^{π̂*}_h with the Δ_r and Δ_P terms, since r and V^{π*} are independent of the data, we can apply the result for offline evaluation identically, which leads to an O(√(ι/(N d_m)) + ι/(N d_m)) error. We then consider the higher-order term.

To deal with the dependency between V^{π̂*} and P̂ in the higher-order term Δ_{PV}, we introduce the following lemma: Assume 0 ≤ V_h(s) ≤ 1 for all h ∈ [H], s ∈ 𝒮; then we have that with high probability,

  | ∑_{s′} (P(s′|s, a) − P̂(s′|s, a)) V_h(s′) | ≤ √( S · Var_{P(s,a)}(V_h(s′)) · ι / n(s, a) ) + Sι/n(s, a).
Remark

Lemma 5.3 has been widely used in the design and analysis of online reinforcement learning algorithms (e.g. Zhang et al., 2020b). It holds even when V depends on P̂, though at the cost of an additional S factor. This is the source of the additional S factor in the higher-order term, and we believe a more fine-grained analysis can help avoid this factor.

With Lemma 5.3, we can apply the recursion-based technique again to bound the error introduced by Δ_{PV}, and finally obtain the bound in Theorem 5.2. By Lemma 5.3 and the Cauchy-Schwarz inequality, we have that

  Δ_2 := | ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) Δ_{PV}(s, a) |
       ≤ ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) [ √( S · Var_{P(s,a)}(V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′)) · ι / n(s, a) ) + Sι/n(s, a) ]
       ≤ √(Sι/(N d_m)) · √( ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) Var_{P(s,a)}(V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′)) ) + Sι/(N d_m).   (16)

Now we turn to the total-variance term in (16). We again bound this term with the recursive method. With Lemma 4.3, we have that

  ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) Var_{P(s,a)}((V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′))^{2^i})
    ≤ | ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) ∑_{s′} [P(s′|s, a) − P̂(s′|s, a)] (V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′))^{2^{i+1}} |
      + 2^{i+1} [ ∑_s μ(s) (V^{π*}_1(s) − V^{π̂*}_1(s)) + | ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) ∑_{s′} [P(s′|s, a) − P̂(s′|s, a)] (V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′)) | ].   (17)

We can further apply Lemma 5.3 and the Cauchy-Schwarz inequality to the first term in (17), which eventually leads to the recursion formula. Specifically, denote

  V_2(i) := ∑_{h∈[H]} ∑_{s,a} ξ̂^{π̂*}_h(s, a) Var_{P(s,a)}((V^{π*}_{h+1}(s′) − V^{π̂*}_{h+1}(s′))^{2^i}),

we have the recursion as

  V_2(i) ≤ √( (Sι/(N d_m)) V_2(i+1) ) + Sι/(N d_m) + 2^{i+1} (Δ_2 + v^{π*} − v^{π̂*}).

This recursion can be solved similarly to (12) by applying Lemma 4.3 with λ_1 = Sι/(N d_m) and λ_2 = Δ_2 + v^{π*} − v^{π̂*}, which leads to

  V_2(0) ≤ O( Sι/(N d_m) + v^{π*} − v^{π̂*} + Δ_2 ).   (18)

Meanwhile, from (16) we have that

  Δ_2 ≤ √( (Sι/(N d_m)) V_2(0) ) + Sι/(N d_m).   (19)

Combining (18) and (19), we have

  Δ_2 ≤ O( √( (Sι/(N d_m)) (Sι/(N d_m) + v^{π*} − v^{π̂*} + Δ_2) ) ) + Sι/(N d_m).