1 Introduction
Reinforcement Learning (RL) aims to learn to make sequential decisions that maximize the long-term reward in an unknown environment, and has demonstrated plenty of successes in games (Mnih et al., 2013; Silver et al., 2016), robotics (Andrychowicz et al., 2020), and automatic algorithm design (Dai et al., 2017). These successes rely on being able to deploy algorithms that directly interact with the environment to improve the policy in a trial-and-error fashion. However, such direct interaction with the real environment to collect samples can be expensive or even impossible in other real-world applications, e.g., education (Mandel et al., 2014), health and medicine (Murphy et al., 2001; Gottesman et al., 2019), conversational AI (Ghandeharioun et al., 2019) and recommendation systems (Chen et al., 2019). Instead, we are given a collection of logged experience generated by potentially multiple and possibly unknown behavior policies.
This sampling difficulty inspired the recent development of offline reinforcement learning (Levine et al., 2020), which covers both evaluating different policies, known as offline policy evaluation (OPE), and improving the current policy, known as offline policy optimization (OPO), based only on the given experience without any further interaction. OPE, as well as OPO, is notoriously difficult, as unbiased estimators of the policy value may suffer from exponentially increasing variance in terms of the horizon in the worst case (Li et al., 2015; Jiang and Li, 2016). To overcome the "curse of horizon" in OPE, marginalized importance sampling (MIS) based estimators have been introduced
(Liu et al., 2018; Xie et al., 2019), and were further improved by Nachum et al. (2019a); Uehara et al. (2020); Yang et al. (2020) for practical offline settings without knowledge of the behavior policies. The basic idea of this family of estimators is to estimate the marginal state-action density ratio between the target policy and the empirical data to compensate for the mismatch. Algorithmically, the marginal density ratio estimation can be implemented by a plug-in estimator (Xie et al., 2019; Yin and Wang, 2020), a temporal-difference update (Hallak and Mannor, 2017; Gelada and Bellemare, 2019), or by solving an optimization problem (Nachum et al., 2019a; Uehara et al., 2020; Zhang et al., 2020a; Yang et al., 2020). These OPE estimators can naturally be used as a component for offline policy optimization, resulting in the algorithms of Nachum et al. (2019b); Yin et al. (2020); Liu et al. (2020a).

The motivation of this work is to investigate in depth the statistical performance of offline MIS-based RL algorithms. To identify the essential statistical factors behind the superior empirical performance of MIS-based methods and to avoid unnecessary complication, our analysis focuses on offline RL under the episodic time-homogeneous tabular Markov decision process (MDP) model with $S$ states, $A$ actions, and bounded per-step reward. Our analysis exploits the equivalence between the model-based plug-in estimator, LSTDQ (Duan and Wang, 2020), and MIS estimators, and therefore applies to all of these methods in our setting.
With the additional assumption that the total reward of an episode is upper bounded almost surely, our main contributions can be summarized as follows:

For offline evaluation, we obtain a finite-sample error bound in terms of the number of episodes and the minimum density of the state-action pairs in the offline data, which matches the lower bound up to logarithmic factors. We emphasize that the bound has no additional horizon dependency in the higher-order term, unlike the known results of Yin and Wang (2020); Yin et al. (2020).

For offline policy optimization, we obtain an asymptotically optimal performance gap, which matches the lower bound up to logarithmic factors and a higher-order term. This result improves the best known result of Cui and Yang (2020) by removing additional polynomial horizon factors from the main term.

Technically, we generalize the recursion method introduced in Zhang et al. (2020b) to the offline setting with a plug-in solver, which could be of independent interest.
To the best of our knowledge, these are the first nearly horizon-free bounds for both offline policy evaluation and offline policy optimization in time-homogeneous MDPs.
2 Related Work
In this section, we briefly discuss the related literature in three categories, i.e., offline evaluation, offline policy optimization, and horizon-free online reinforcement learning. Notice that, in the setting that assumes an additional generative model (a.k.a. a simulator), typical model-based algorithms first query an equal number of samples from each state-action pair, and then perform offline evaluation/policy optimization based on the queried data. Thus we view reinforcement learning with a generative model as a special instance of offline reinforcement learning. We emphasize that our results match the lower bounds and achieve horizon-free dependency for both offline evaluation and offline policy optimization. To make the comparison fair, for methods and analyses that do not assume Assumption 3.4, we scale the error and sample complexity by assuming the per-step reward is uniformly bounded, in the infinite-horizon and finite-horizon settings correspondingly.
2.1 Offline Evaluation
Table 1: Comparison of results for offline evaluation in the tabular case.

Analysis | Setting | Non-Uniform Reward | Sample Complexity
Li et al. (2020) | Infinite Horizon | Yes |
Pananjady and Wainwright (2020) | Infinite Horizon | Yes |
Yin and Wang (2020) | Finite Horizon, time-inhomogeneous | No |
Jiang and Li (2016) | Finite Horizon, time-inhomogeneous | No |
This work | Finite Horizon, time-homogeneous | Yes |
Lower Bound | Finite Horizon, time-homogeneous | Yes |
Offline evaluation has drawn much attention recently, due to its broad applications in real scenarios. For OPE in infinite-horizon MDPs, Li et al. (2020) show that the plug-in estimator can achieve an error bound under Assumption 3.4 that matches the lower bound in Pananjady and Wainwright (2020) up to logarithmic factors. For OPE in finite-horizon time-inhomogeneous MDPs, Yin and Wang (2020); Yin et al. (2020) provide an error bound for the MIS estimator under the uniform reward assumption, which matches the lower bound of Jiang and Li (2016) up to logarithmic factors and an additional higher-order term. We remark that, in Jiang and Li (2016); Yin and Wang (2020), the authors provide stronger Cramer-Rao type upper and lower bounds that depend on the specific MDP instance, which cannot be evaluated in practice. We summarize the main results for offline evaluation in the tabular case in Table 1.
Beyond the tabular setting, Duan and Wang (2020) consider the performance of the plug-in estimator with linear function approximation under the linear MDP assumption, and Kallus and Uehara (2019, 2020) provide more detailed analyses of the statistical properties of different kinds of estimators under different assumptions, which are not directly comparable to our work.
2.2 Offline Policy Optimization
Table 2: Comparison of results for offline policy optimization in the tabular case.

Analysis | Setting | Non-Uniform Reward | Sample Complexity
Agarwal et al. (2020) | Infinite Horizon | No |
Li et al. (2020) | Infinite Horizon | Yes |
Yin et al. (2020) | Finite Horizon, time-inhomogeneous | No |
Cui and Yang (2020) | Finite Horizon, time-homogeneous | No |
Zhang et al. (2020b) | Finite Horizon, time-homogeneous, Online | Yes |
This Work | Finite Horizon, time-homogeneous | Yes |
Lower Bound | Finite Horizon, time-homogeneous | Yes |
Offline policy optimization for infinite-horizon MDPs dates back to Azar et al. (2013). Li et al. (2020) recently showed that a perturbed version of model-based planning can find the optimal policy with a near-optimal number of queried transitions in infinite-horizon MDPs when the total reward is upper bounded, matching the lower bound up to logarithmic factors. For the finite-horizon time-inhomogeneous MDP setting, Yin et al. (2020) show that model-based planning can identify the optimal policy with a number of episodes that matches the lower bound for time-inhomogeneous MDPs up to logarithmic factors. When it comes to finite-horizon time-homogeneous MDPs, the only known result (Cui and Yang, 2020) provides an episode complexity for model-based planning that is still away from the lower bound. We summarize the results for offline policy optimization in the tabular case in Table 2.
There are also works considering offline policy optimization with function approximation (e.g. Chen and Jiang, 2019; Xie and Jiang, 2020). Recently, Liu et al. (2020b) also introduced a new perspective of performing offline policy optimization within a local policy set when the offline data is not sufficiently exploratory.
We also remark that a concurrent work (Yin et al., 2021) considers solving offline policy optimization with model-free learning and variance reduction. Under the assumption that the per-step reward is upper bounded, their proposed algorithm matches the sample complexity lower bound up to logarithmic factors without an explicit higher-order term. However, as several papers have discussed (e.g. Li et al., 2020), model-free algorithms generally suffer from a sample size barrier: they need a minimum number of episodes of data just to run, and can therefore only achieve a constrained range of target errors. In contrast, model-based algorithms can accommodate a much broader range of sample sizes, and thus a much broader range of target errors. How to translate their results to the setting where the total reward is upper bounded almost surely remains unclear. We also remark that they provide a more fine-grained discussion of general forms of the minimum reaching probability.
2.3 HorizonFree Online Reinforcement Learning
Recently, several works have obtained nearly horizon-free sample complexity bounds in online reinforcement learning. Whether, in the time-homogeneous setting, the sample complexity needs to scale polynomially with the horizon was an open problem raised by Jiang and Agarwal (2018). The problem was addressed by Wang et al. (2020), who proposed an $\epsilon$-net over the optimal policies and a simulation-based algorithm to obtain a sample complexity that only scales logarithmically with the horizon, though their dependency on the other parameters is not optimal and their algorithm is not computationally efficient. The sample complexity bound was later substantially improved by Zhang et al. (2020b), who obtained a bound that nearly matches the contextual bandit (tabular MDP with horizon one) lower bound up to logarithmic factors. Their key ideas are 1) a new bonus function and 2) a recursion-based approach to obtain a tight bound on the sample complexity. We generalize their recursion-based analysis to the offline setting in our paper.
3 Problem Setup
3.1 Notation
Throughout this paper, we use $[N]$ to denote the set $\{1, \ldots, N\}$ for a positive integer $N$, and $\Delta(\mathcal{X})$ to denote the set of all probability measures over the event set $\mathcal{X}$. Moreover, for simplicity, we absorb polylogarithmic factors (which can change with the context) involving the failure probability $\delta$ into our bounds. We use $\tilde O$ and $\tilde \Omega$ to denote upper and lower bounds up to logarithmic factors.
3.2 Markov Decision Process
The Markov Decision Process (MDP) is one of the most standard models studied in the reinforcement learning community, usually denoted as $M = (\mathcal{S}, \mathcal{A}, R, P, \mu)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R$ is the reward, $P$ is the transition function, and $\mu$ is the initial state distribution. We additionally define $r(s, a) := \mathbb{E}[R(s, a)]$ to denote the expected reward.
We focus on the episodic tabular MDP with number of states $S$, number of actions $A$, and horizon length $H$, in the time-homogeneous setting, i.e., the reward and transition do not depend on the level $h$.

Footnote 1: A common belief is that one can always reproduce results between the episodic time-homogeneous setting and the infinite-horizon setting by substituting the horizon in the episodic setting with the "effective horizon" of the infinite-horizon setting. However, this argument does not always hold; for example, the dependency-decoupling technique used in Agarwal et al. (2020); Li et al. (2020) cannot be directly applied in the episodic setting.

Footnote 2: Some previous work considers the time-inhomogeneous setting (e.g. Jin et al., 2018), where the reward and transition can vary across different levels $h$. It is noteworthy that an additional factor in the sample complexity is needed to identify the optimal policy in time-inhomogeneous MDPs compared with time-homogeneous MDPs. Transferring an analysis from the time-homogeneous setting to the time-inhomogeneous setting is easy (one only needs to replace the shared transition with the per-level one), but not vice versa, as an analysis for the time-inhomogeneous setting may not exploit the invariance of the transition sufficiently.
A (potentially non-stationary) policy is defined as $\pi = \{\pi_h\}_{h \in [H]}$, where each $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$. We can then define the following value function and action-value function (i.e. the Q-function):

(1) $V_h^\pi(s) = \mathbb{E}\left[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\middle|\, s_h = s\right]$,

(2) $Q_h^\pi(s, a) = \mathbb{E}\left[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\middle|\, s_h = s, a_h = a\right]$,

which are the expected cumulative rewards under the transition $P$ and the policy $\pi$, starting from state $s$ and state-action pair $(s, a)$, respectively. It is easy to see that:

(3) $V_h^\pi(s) = \sum_{a \in \mathcal{A}} \pi_h(a \mid s)\, Q_h^\pi(s, a)$,

as well as the Bellman equation (define $V_{H+1}^\pi \equiv 0$):

(4) $Q_h^\pi(s, a) = r(s, a) + \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V_{h+1}^\pi(s')$.
Notice that, even though $r$ and $P$ are invariant under the change of $h$, $V_h^\pi$ and $Q_h^\pi$ always depend on $h$ in the episodic setting, which introduces additional technical difficulties in the analysis compared with the infinite-horizon setting.
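To make this concrete, the backward dynamic programming implied by (3) and (4) can be sketched as follows. This is an illustrative sketch in our own notation; the array layout and function name are our choices, not the paper's:

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Backward Bellman recursion for a time-homogeneous episodic tabular MDP.

    P  : (S, A, S) array, P[s, a, s2] = P(s2 | s, a)
    r  : (S, A) array of expected rewards
    pi : (H, S, A) array, pi[h, s, a] = pi_h(a | s)
    Returns V with shape (H + 1, S) and Q with shape (H, S, A);
    the terminal value V[H] is identically zero.
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):           # last level first
        Q[h] = r + P @ V[h + 1]              # (4): Q_h = r + <P(.|s,a), V_{h+1}>
        V[h] = (pi[h] * Q[h]).sum(axis=1)    # (3): V_h = E_{a ~ pi_h} Q_h
    return V, Q
```

A single backward sweep of this code costs O(H S^2 A) time, which is one reason plug-in approaches remain computationally cheap in the tabular setting.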
The expected cumulative reward of $M$ under policy $\pi$ is defined as:

$J(\pi) := \mathbb{E}_{s_1 \sim \mu}\left[V_1^\pi(s_1)\right],$

and the ultimate goal of reinforcement learning is finding the optimal policy of $M$, which is defined as:

$\pi^\star := \operatorname*{argmax}_\pi J(\pi).$
We additionally define the following reaching probabilities:

$d_h^\pi(s) := \Pr(s_h = s \mid \pi), \qquad d_h^\pi(s, a) := \Pr(s_h = s, a_h = a \mid \pi),$

which represent the probability of reaching state $s$ and state-action pair $(s, a)$ at time step $h$. It is easy to show that

$d_h^\pi(s, a) = d_h^\pi(s)\, \pi_h(a \mid s),$

and we also have the following relation between $d_h^\pi$ and $d_{h+1}^\pi$:

$d_{h+1}^\pi(s') = \sum_{s, a} d_h^\pi(s, a)\, P(s' \mid s, a).$

With the definition of $d_h^\pi$ at hand, we can conclude the following equation:

(5) $J(\pi) = \sum_{h=1}^{H} \sum_{s, a} d_h^\pi(s, a)\, r(s, a),$

which provides a dual perspective on policy evaluation (Yang et al., 2020).
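The forward recursion for the reaching probabilities and the dual evaluation formula (5) can be sketched as follows (our own illustrative code; the function and variable names are not from the paper):

```python
import numpy as np

def occupancy_measures(P, pi, mu, H):
    """Forward recursion for the reaching probabilities.

    d_s[h, s]     = Pr(s_h = s) under pi, starting from mu,
    d_sa[h, s, a] = Pr(s_h = s, a_h = a) = d_s[h, s] * pi[h, s, a].
    """
    S = mu.shape[0]
    A = pi.shape[2]
    d_s = np.zeros((H, S))
    d_sa = np.zeros((H, S, A))
    d_s[0] = mu
    for h in range(H):
        d_sa[h] = d_s[h][:, None] * pi[h]
        if h + 1 < H:
            # d_{h+1}(s2) = sum_{s,a} d_h(s,a) P(s2 | s, a)
            d_s[h + 1] = np.einsum('sa,sat->t', d_sa[h], P)
    return d_s, d_sa

def dual_value(P, r, pi, mu, H):
    """Policy value via the occupancy measure, i.e. (5): J = sum_h <d_h, r>."""
    _, d_sa = occupancy_measures(P, pi, mu, H)
    return float((d_sa * r).sum())
```

Summing the occupancy measure against the reward recovers the same value as backward dynamic programming, which is exactly the dual identity (5).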
3.3 Offline Reinforcement Learning
Generally, $r$ and $P$ are not revealed to the learner, which means we can only learn about the MDP and identify the optimal policy from data. In offline reinforcement learning, the learner only has access to a collection of logged transition tuples (each consisting of a state, an action, a reward, and a next state), collected over a number of episodes with a (known or unknown) behavior policy. For simplicity, define $n(s, a)$ as the number of data points whose state-action pair equals $(s, a)$, and $n(s, a, s')$ as the number of data points that additionally transition to $s'$.
With the offline data, the learner is generally asked to perform two kinds of tasks. The first is offline evaluation (a.k.a. off-policy evaluation), which aims to estimate the value of a given target policy. The second is offline improvement, which aims to find a policy that performs well on the true MDP. We are interested in the statistical limits imposed by the limited amount of data, and in how to approach these limits with simple and computationally efficient algorithms.
3.4 Assumptions
Here we summarize the assumptions we use throughout the paper. [Bounded Total Reward] For any policy and any trajectory, the total reward of an episode is upper bounded almost surely. Notice that this also means the value functions are bounded almost surely. This is the key assumption used in Wang et al. (2020); Zhang et al. (2020b) to escape the curse of horizon in episodic reinforcement learning. As discussed in Jiang and Agarwal (2018); Wang et al. (2020); Zhang et al. (2020b), this assumption is more general than the uniformly bounded reward assumption. Thus, all of the results in this paper can be generalized to the uniformly bounded reward setting with a proper scaling of the total-reward bound.
[Data Coverage] The minimum visitation probability over all state-action pairs under the behavior data distribution, denoted $d_m$, is strictly positive. This assumption has been used in Yin and Wang (2020); Yin et al. (2020) and is similar to the concentration coefficient assumption originating from Munos (2003). Intuitively, the performance of offline reinforcement learning should depend on $d_m$, as state-action pairs with fewer visits introduce more uncertainty.
Notice that $d_m \le 1/(SA)$. Assuming access to the generative model, we can query an equal number of samples from each state-action pair, which corresponds to $d_m = 1/(SA)$. For offline data sampled with a fixed behavior policy, we can view $d_m$ as the minimum reaching probability of the behavior policy over all state-action pairs, which measures the quality of its exploration; when the number of episodes is large enough, by standard concentration, the empirical visitation frequencies concentrate around these reaching probabilities.
4 Offline Evaluation
In this section, we consider the offline evaluation problem, which is the basis of offline improvement. We first introduce the plug-in estimator that we consider, which is equivalent to several estimators widely used in practice. We then present the error bound of the plug-in estimator and provide a proof sketch.
4.1 Estimator
Here we consider the plug-in estimator, which first builds the empirical MDP from the data:

(6) $\hat r(s, a) = \frac{1}{n(s, a)} \sum_{i} r_i\, \mathbb{1}\{(s_i, a_i) = (s, a)\},$

(7) $\hat P(s' \mid s, a) = \frac{n(s, a, s')}{n(s, a)},$

where $\mathbb{1}\{\cdot\}$ is the indicator function and $n(s, a)$ and $n(s, a, s')$ count the occurrences of $(s, a)$ and $(s, a, s')$ in the data. We then correspondingly define $\hat V_h^\pi$, $\hat Q_h^\pi$, and $\hat J(\pi)$ by substituting the $r$ and $P$ in $V_h^\pi$, $Q_h^\pi$, and $J(\pi)$ with $\hat r$ and $\hat P$. This can be done with dynamic programming; see Algorithm 1 for the details.

We correspondingly introduce the reaching probabilities $\hat d_h^\pi$ in the empirical MDP, defined as above but with $\hat P$ in place of $P$.

The plug-in estimator has been studied in Duan and Wang (2020) under the assumption of linear transitions, and it is known that the plug-in estimator is equivalent to the MIS estimator proposed in Yin and Wang (2020) and to a certain version of the DualDICE estimator with batch updates proposed in Nachum et al. (2019a), due to the observation that

$\hat J(\pi) = \sum_{h=1}^{H} \sum_{s, a} \hat d_h^\pi(s, a)\, \hat r(s, a).$
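The construction of the empirical model in (6)-(7) can be sketched as follows. This is an illustrative sketch with our own naming; in particular, the zero-reward self-loop fallback for unvisited pairs is one common convention we assume here, not something prescribed by the paper:

```python
import numpy as np

def build_empirical_mdp(data, S, A):
    """Estimate (r_hat, P_hat) from logged tuples (s, a, reward, s2):
    empirical mean rewards and empirical transition frequencies,
    as in (6)-(7). Unvisited (s, a) pairs get zero reward and a
    self-loop (our assumed convention)."""
    n = np.zeros((S, A))
    r_sum = np.zeros((S, A))
    cnt = np.zeros((S, A, S))
    for s, a, rew, s2 in data:
        n[s, a] += 1.0
        r_sum[s, a] += rew
        cnt[s, a, s2] += 1.0
    # Divide only where n(s, a) > 0; leave zeros elsewhere.
    r_hat = np.divide(r_sum, n, out=np.zeros_like(r_sum), where=n > 0)
    P_hat = np.divide(cnt, n[:, :, None],
                      out=np.zeros_like(cnt), where=n[:, :, None] > 0)
    for s in range(S):
        for a in range(A):
            if n[s, a] == 0:
                P_hat[s, a, s] = 1.0  # self-loop for unseen pairs
    return r_hat, P_hat, n
```

The resulting `(r_hat, P_hat)` can then be fed into the dynamic-programming evaluation of Algorithm 1 to produce the plug-in estimate.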
4.2 Theoretical Guarantee
Under Assumption 3.4 (Bounded Total Reward) and Assumption 3.4 (Data Coverage), suppose the number of episodes $n$ is sufficiently large; then the error of the plug-in estimate is bounded, with probability at least $1 - \delta$, by a quantity that depends only on $n$ and the minimum visitation probability $d_m$, up to the absorbed polylog factors.
Remark
Theorem 4.2 shows that, even with the simplest plug-in estimator, we can match the minimax lower bound for offline evaluation in time-homogeneous MDPs up to logarithmic factors (to the best of our knowledge, no lower bound has been presented for the finite-horizon time-homogeneous setting, so we provide a minimax lower bound in Theorem C.1 in the Appendix). This means that, for offline evaluation with the plug-in estimator, time-homogeneous MDPs are not harder than bandits.
Remark
The assumption on the number of episodes is mild and necessary: with too few episodes, there can exist some under-explored state-action pair that unavoidably leads to constant error, as suggested by the lower bound. Such an assumption is also needed in previous work such as Yin and Wang (2020); Yin et al. (2020), even though their bounds are policy-dependent in time-inhomogeneous MDPs.
Remark
We remark that, although problem-dependent upper bounds as in Jiang and Li (2016); Kallus and Uehara (2019) and computable confidence bounds as in Duan and Wang (2020) are possible for finite-horizon time-inhomogeneous MDPs, these results cannot be directly translated into nearly horizon-free bounds, which is out of the scope of this paper.
4.3 Proof Sketch
The proofs of all technical lemmas can be found in Appendix B. Our proof is organized as follows: we first decompose the estimation error into the errors introduced by reward estimation and transition estimation with Lemma 4.3; Lemma 4.3 then provides an upper bound for the error introduced by $\hat r$. For the error introduced by $\hat P$, we first show that it can be upper bounded by a total-variance term (in (9)). A naive bound for the total variance would introduce additional horizon factors, so we recursively upper bound the total variance via higher moments (see Lemma 4.3). Solving the recursion in Lemma 4.3 and putting everything together, we eventually obtain the bound in Theorem 4.2.

[Value Difference Lemma]

(8) $\hat J(\pi) - J(\pi) = \sum_{h=1}^{H} \sum_{s, a} \hat d_h^\pi(s, a)\,\bigl(\hat r(s, a) - r(s, a)\bigr) + \sum_{h=1}^{H} \sum_{s, a} \hat d_h^\pi(s, a)\,\bigl(\hat P(\cdot \mid s, a) - P(\cdot \mid s, a)\bigr)^{\top} V_{h+1}^{\pi}$
The following lemma provides an upper bound on the estimation error introduced by $\hat r$, i.e., the first term in (8). [Error from Reward Estimation] Suppose Assumption 3.4 holds; then the corresponding bound holds with probability at least $1 - \delta$.

We then use a recursive method to bound the error introduced by $\hat P$, i.e., the second term in (8).
Denote
With Bernstein’s inequality, we know that with high probability
(9) 
where the second inequality is due to the Cauchy-Schwarz inequality combined with Assumption 3.4.
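For reference, the form of Bernstein's inequality invoked in such arguments is the standard one; we state it here in generic notation of our own ($n$ i.i.d. samples with range $b$ and variance $\sigma^2$), not in the paper's symbols:

```latex
% Bernstein's inequality: for i.i.d. random variables X_1, ..., X_n
% with |X_i| <= b and variance sigma^2, for any delta in (0, 1),
\Pr\left[\;\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1]\right|
  \;\ge\; \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}}
  \;+\; \frac{2b\log(2/\delta)}{3n}\;\right] \;\le\; \delta .
```

The variance-dependent first term is what makes the total-variance analysis below pay off: whenever the variance is small, the bound improves over Hoeffding-type range-only bounds.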
We then consider the total-variance term in (9). A naive bound of this term would lead to an extra horizon factor; therefore, we upper bound it with higher-order variances recursively through the following lemma. [Variance Recursion] Under the preceding assumptions, we have that
(10) 
Again, by Bernstein's inequality and the Cauchy-Schwarz inequality, with high probability, we have
(11)  
Define
applying Lemma 4.3 with (11), we obtain the recursion
(12) 
Now we can solve the recursion with the following lemma. For the recursion formula above, under suitable initial conditions and for a sufficiently large sample size, the recursion yields the desired bound, and we need to perform the recursion at most logarithmically many times.
Remark
We want to emphasize that the recursion needs to be performed at most logarithmically many times. Thus, by the union bound, the recursion only introduces an additional logarithmic factor in the error, which can be absorbed into the polylog factors. For simplicity, we still use the same notation for the polylog factors in the following derivation.
5 Offline Policy Optimization
In this section, we further consider the offline policy optimization problem, which is the ultimate goal of offline reinforcement learning. We introduce the model-based planning algorithm, which is arguably the simplest algorithm for offline policy optimization. We then analyze the performance gap between the policy obtained by model-based planning and the optimal policy.
5.1 ModelBased Planning
We consider the optimal policy of the empirical MDP, which can be defined as

(14) $\hat\pi^\star := \operatorname*{argmax}_\pi \hat J(\pi),$

where $\hat J(\pi)$ denotes the value of $\pi$ in the empirical MDP. Notice that $\hat\pi^\star$ can be obtained by dynamic programming on the empirical MDP, which is also known as model-based planning. We illustrate the algorithm in Algorithm 2. We remark that our analysis is independent of the algorithm used for solving (14). In other words, the result also applies to optimization-based planning with the empirical MDP (Puterman, 2014; Mohri et al., 2012), as long as it solves (14).
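As a concrete sketch, the dynamic-programming solution of (14) by backward induction can look as follows; this is illustrative code with our own naming, not the paper's Algorithm 2 verbatim:

```python
import numpy as np

def model_based_planning(P_hat, r_hat, H):
    """Backward induction on the empirical MDP (model-based planning).

    Returns a deterministic (possibly non-stationary) greedy policy
    pi[h, s] = argmax_a Q_hat_h(s, a) and its empirical value V_hat.
    """
    S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r_hat + P_hat @ V[h + 1]   # empirical Bellman backup
        pi[h] = Q.argmax(axis=1)       # greedy action per state
        V[h] = Q.max(axis=1)
    return pi, V
```

Since the maximization is performed level by level, the returned policy is non-stationary in general, matching the non-stationary policy class defined in Section 3.2.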
5.2 Theoretical Guarantee
Under Assumption 3.4 (Bounded Total Reward) and Assumption 3.4 (Data Coverage), suppose the number of episodes is sufficiently large; then with probability at least $1 - \delta$ we have the following bound on the performance gap (recall that the polylog factors are absorbed):
Remark
Theorem 5.2 provides a bound approaching the minimax lower bound up to logarithmic factors and a higher-order term, which shows that the error of offline policy optimization does not scale polynomially with the horizon. Notice that, for a sufficiently small target error, the error bound can be translated into a sample complexity that matches the best known sample complexity for the finite-horizon time-homogeneous setting (Zhang et al., 2020b). We conjecture that the additional factor in the higher-order term is only an artifact of our analysis (see Lemma 5.3) and can be eliminated with a more delicate analysis. We leave this as an open question.
5.3 Proof Sketch
We first make the following standard decomposition:

(15) $J(\pi^\star) - J(\hat\pi^\star) = \bigl(J(\pi^\star) - \hat J(\pi^\star)\bigr) + \bigl(\hat J(\pi^\star) - \hat J(\hat\pi^\star)\bigr) + \bigl(\hat J(\hat\pi^\star) - J(\hat\pi^\star)\bigr) \le \bigl(J(\pi^\star) - \hat J(\pi^\star)\bigr) + \bigl(\hat J(\hat\pi^\star) - J(\hat\pi^\star)\bigr),$

where the inequality uses the optimality of $\hat\pi^\star$ in the empirical MDP, i.e., $\hat J(\pi^\star) \le \hat J(\hat\pi^\star)$. The first term characterizes the evaluation difference of the optimal policy between the original MDP and the empirical MDP, and the second term characterizes the evaluation difference of the planning result between the original MDP and the empirical MDP.

We can directly apply Theorem 4.2 to bound the first term in (15). However, as $\hat\pi^\star$ has a complicated statistical dependency on the data, we cannot apply Theorem 4.2 to the second term in (15), which is an evaluation error for a data-dependent policy. Notice that a direct application of the absorbing-MDP technique introduced in Agarwal et al. (2020); Li et al. (2020) to the second term would introduce additional horizon factors in the main term, as shown in Cui and Yang (2020). Thus, we use a recursive method that keeps the main term tight while only introducing an additional factor in the higher-order term, which keeps the final error horizon-free.
Similar to the offline evaluation case, we make the following decomposition based on Lemma 4.3:

where the terms are defined analogously to those in (8). For the inner-product terms involving $\hat r - r$ and $\hat P - P$ with the fixed policy $\pi^\star$, since $\hat r$ and $\hat P$ are independent of $\pi^\star$, we can identically apply the result for offline evaluation, which leads to the main-term error. We then consider the higher-order term.
To deal with the dependency between $\hat\pi^\star$ and the data in the higher-order term, we introduce the following lemma: under the preceding assumptions, we have that, with high probability,
Remark
Lemma 5.3 has been widely used in the design and analysis of online reinforcement learning algorithms (e.g. Zhang et al., 2020b). It holds even when the policy depends on the data; however, this comes at the cost of an additional factor. This is the source of the additional factor in the higher-order term, and we believe a more fine-grained analysis could help avoid it.
With Lemma 5.3, we can apply the recursion-based technique again to bound the error introduced by $\hat P$, and finally obtain the bound in Theorem 5.2. By Lemma 5.3 and the Cauchy-Schwarz inequality, we have that

(16)

Now we turn to the total-variance term in (16). We again bound this term with the recursive method. With Lemma 4.3, we have that

(17)

We can further apply Lemma 5.3 and the Cauchy-Schwarz inequality to the first term in (17), which eventually leads to the recursion formula. Specifically, denoting the higher-order variance terms as before, we have the recursion: