Reinforcement Learning (RL) aims to learn to make sequential decisions that maximize the long-term reward in an unknown environment, and it has demonstrated many successes in games (Mnih et al., 2013; Silver et al., 2016), robotics (Andrychowicz et al., 2020), and automatic algorithm design (Dai et al., 2017). These successes rely on being able to deploy algorithms that directly interact with the environment to improve the policy in a trial-and-error way. However, such direct interaction with the real environment to collect samples can be expensive or even impossible in other real-world applications, e.g., education (Mandel et al., 2014), health and medicine (Murphy et al., 2001; Gottesman et al., 2019), conversational AI (Ghandeharioun et al., 2019), and recommendation systems (Chen et al., 2019). Instead, we are given a collection of logged experience generated by potentially multiple and possibly unknown behavior policies.
This sampling difficulty has inspired the recent development of offline reinforcement learning (Levine et al., 2020), including evaluating different policies, known as offline policy evaluation (OPE), and improving the current policy, known as offline policy optimization (OPO), using only the given experience without any further interaction. Both OPE and OPO are notoriously difficult, as unbiased estimators of the policy value may suffer from variance that increases exponentially with the horizon in the worst case (Li et al., 2015; Jiang and Li, 2016).
To overcome the “curse of horizon” in OPE, marginalized importance sampling (MIS) based estimators have been introduced (Liu et al., 2018; Xie et al., 2019) and further improved (Nachum et al., 2019a; Uehara et al., 2020; Yang et al., 2020) for practical offline settings without knowledge of the behavior policies. The basic idea of this family of estimators is to estimate the marginal state-action density ratio between the target policy and the empirical data in order to compensate for the distribution mismatch. Algorithmically, the marginal density ratio can be estimated by a plug-in estimator (Xie et al., 2019; Yin and Wang, 2020), by temporal-difference updates (Hallak and Mannor, 2017; Gelada and Bellemare, 2019), or by solving a min-max optimization problem (Nachum et al., 2019a; Uehara et al., 2020; Zhang et al., 2020a; Yang et al., 2020). These OPE estimators can then be used directly as a component of offline policy optimization, resulting in the algorithms of Nachum et al. (2019b), Yin et al. (2020), and Liu et al. (2020a).
The motivation of this work is to investigate in depth the statistical performance of offline MIS-based RL algorithms. To identify the essential statistical factors behind the superior empirical performance of MIS-based methods and to avoid unnecessary complications, our analysis focuses on offline RL under the episodic time-homogeneous tabular Markov decision process (MDP) model with states, actions and per-step reward upper bounded by . Our analysis exploits the equivalence between the model-based plug-in estimator, LSTDQ (Duan and Wang, 2020), and MIS estimators, and therefore applies generally to all of these methods in our setting.
With an additional assumption that the total reward can be upper bounded by almost surely, our main contributions can be summarized as follows:
For offline evaluation, we obtain a finite-sample complexity where is the number of episodes and is the minimum density of a specific state-action pair in the offline data that lies in , which matches the lower bound up to logarithmic factors. We emphasize that the bound has no additional dependency in the higher-order term, unlike the known results of Yin and Wang (2020) and Yin et al. (2020).
For offline policy optimization, we obtain an asymptotically optimal performance gap of , which matches the lower bound up to a -factor and a higher-order term. This result improves the best known result of Cui and Yang (2020) by additional or factors in the main term.
Technically, we generalize the recursion method introduced in Zhang et al. (2020b) to the offline setting with a plug-in solver, which may be of independent interest.
To the best of our knowledge, these are the first nearly horizon-free bounds for both offline policy evaluation and offline policy optimization in time-homogeneous MDPs.
2 Related Work
In this section, we briefly discuss the related literature in three categories, i.e., offline evaluation, offline policy optimization, and horizon-free online reinforcement learning. Notice that, for the setting that assumes access to a generative model (a.k.a. a simulator), typical model-based algorithms first query an equal number of samples from each state-action pair, and then perform offline evaluation or policy optimization based on the queried data. Thus we view reinforcement learning with a generative model as a special instance of offline reinforcement learning. We emphasize that our results match the lower bound and achieve horizon-free dependency for both offline evaluation and offline policy optimization. To make the comparison fair, for methods and analyses that do not rely on Assumption 3.4, we scale the error and sample complexity by assuming that the per-step reward is upper bounded by and under the infinite-horizon and finite-horizon settings, respectively.
2.1 Offline Evaluation
|Analysis|Setting|Non-Uniform Reward|Sample Complexity|
|---|---|---|---|
|Li et al. (2020)|Infinite Horizon|Yes||
|Pananjady and Wainwright (2020)|Infinite Horizon|Yes||
|Yin and Wang (2020)|Finite Horizon time-inhomogeneous|No||
|Jiang and Li (2016)|Finite Horizon time-inhomogeneous|No||
|This Work|Finite Horizon time-homogeneous|Yes||
|Lower Bound|Finite Horizon time-homogeneous|Yes||
Offline evaluation has drawn considerable attention recently, due to its broad application in real scenarios. For OPE in infinite-horizon MDPs, Li et al. (2020) show that the plug-in estimator can achieve an error of under Assumption 3.4, which matches the lower bound in (Pananjady and Wainwright, 2020) up to logarithmic factors. For OPE in finite-horizon time-inhomogeneous MDPs, Yin and Wang (2020); Yin et al. (2020) provide an error bound of for the MIS estimator under the uniform reward assumption, which matches the lower bound (Jiang and Li, 2016) up to logarithmic factors and an additional higher-order term. We remark that, in Jiang and Li (2016); Yin and Wang (2020), the authors provide stronger Cramér-Rao type upper and lower bounds that depend on the specific MDP instance, which cannot be evaluated in practice. We summarize the main results on offline evaluation for the tabular case in Table 1.
Beyond the tabular setting, Duan and Wang (2020) consider the performance of the plug-in estimator with linear function approximation under the assumption of a linear MDP, and Kallus and Uehara (2019, 2020) provide more detailed analyses of the statistical properties of different kinds of estimators under different assumptions, which are not directly comparable to our work.
2.2 Offline Policy Optimization
|Analysis|Setting|Non-Uniform Reward|Sample Complexity|
|---|---|---|---|
|Agarwal et al. (2020)|Infinite Horizon|No||
|Li et al. (2020)|Infinite Horizon|Yes||
|Yin et al. (2020)|Finite Horizon time-inhomogeneous|No||
|Cui and Yang (2020)|Finite Horizon time-homogeneous|No||
|Zhang et al. (2020b)|Finite Horizon time-homogeneous Online|Yes||
|This Work|Finite Horizon time-homogeneous|Yes||
|Lower Bound|Finite Horizon time-homogeneous|Yes||
Offline policy optimization for infinite-horizon MDPs dates back to Azar et al. (2013). Li et al. (2020) recently show that a perturbed version of model-based planning can find an $\epsilon$-optimal policy within queries of transitions in an infinite-horizon MDP when the total reward is upper bounded by , which matches the lower bound up to logarithmic factors. For the finite-horizon time-inhomogeneous MDP setting, Yin et al. (2020) show that model-based planning can identify an $\epsilon$-optimal policy with episodes, which matches the lower bound for time-inhomogeneous MDPs up to logarithmic factors. When it comes to finite-horizon time-homogeneous MDPs, the only known result (Cui and Yang, 2020) provides a episode complexity for model-based planning, which is away from the lower bound. We summarize the results on offline policy optimization for the tabular case in Table 2.
There are also works considering offline policy optimization with function approximation (e.g. Chen and Jiang, 2019; Xie and Jiang, 2020, 2020). Recently, Liu et al. (2020b) also introduce a new perspective on performing offline policy optimization within a local policy set when the offline data is not sufficiently exploratory.
We also remark that a concurrent work (Yin et al., 2021) considers solving offline policy optimization with model-free Q-learning with variance reduction. Under the assumption that the per-step reward is upper bounded by , their proposed algorithm can match the sample complexity lower bound up to logarithmic factors without an explicit higher-order term. However, as several papers have discussed (e.g. Li et al., 2020), model-free algorithms generally suffer from a sample size barrier: they need at least episodes of data to run, and are thus restricted to a constrained range of (notice that the possible range of error under the assumption that the per-step reward is upper bounded by should be ). In contrast, model-based algorithms can accommodate a much broader range of sample sizes, and thus a much broader range of . How to translate their results to the setting with total reward upper bounded by is still unclear. We also remark that they provide a more fine-grained discussion of general forms of the minimum reaching probability.
2.3 Horizon-Free Online Reinforcement Learning
Recently, several works have obtained nearly horizon-free sample complexity bounds in online reinforcement learning. Whether the sample complexity needs to scale polynomially with in the time-homogeneous setting was an open problem raised by Jiang and Agarwal (2018). The problem was addressed by Wang et al. (2020), who proposed an $\epsilon$-net over the optimal policies and a simulation-based algorithm to obtain a sample complexity that scales only logarithmically with , though their dependency on and is not optimal and their algorithm is not computationally efficient. The sample complexity bound was later substantially improved by Zhang et al. (2020b), who obtained an bound, which nearly matches the contextual bandits (tabular MDP with ) lower bound up to an factor. Their key ideas are 1) a new bonus function and 2) a recursion-based approach to obtain a tight bound on the sample complexity. We generalize their recursion-based analysis to the offline setting in this paper.
3 Problem Setup
Throughout this paper, we use to denote the set for , and to denote the set of all probability measures over the event set . Moreover, for simplicity, we use to denote the (which may change with the context), where is the failure probability. We use and to denote upper and lower bounds up to logarithmic factors.
3.2 Markov Decision Process
Markov Decision Process (MDP) is one of the most standard models studied in the reinforcement learning community, usually denoted as , where is the state space, is the action space, is the reward, is the transition function, and is the initial state distribution. We additionally define to denote the expected reward.
We focus on the episodic tabular MDP with number of states , number of actions , the horizon length ¹ and the time-homogeneous setting ², in which and do not depend on the level .
¹ A common belief is that we can always reproduce the results between the episodic time-homogeneous setting and the infinite-horizon setting by substituting the horizon in the episodic setting with the “effective horizon” in the infinite-horizon setting. However, this argument does not always hold; for example, the dependency decoupling technique used in Agarwal et al. (2020); Li et al. (2020) cannot be directly applied in the episodic setting.
² Some previous works consider the time-inhomogeneous setting (e.g. Jin et al., 2018), where and can vary for different . It is noteworthy that we need an additional factor in the sample complexity to identify an $\epsilon$-optimal policy for a time-inhomogeneous MDP compared with a time-homogeneous MDP. Transforming the improvement analysis from the time-homogeneous setting to the time-inhomogeneous setting is easy (we only need to replace with ), but not vice versa, as the analysis for the time-inhomogeneous setting probably does not exploit the invariant transition sufficiently.
A (potentially non-stationary) policy is defined as , where , . We can then define the following value function and action-value function (i.e. the Q-function) as:
which is the expected cumulative reward under the transition and policy starting from and . It is easy to see that:
as well as the Bellman equation (define ):
Notice that, even though and remain invariant under the change of , and always depend on in the episodic setting, which introduces additional technical difficulties in the analysis compared with the infinite-horizon setting.
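The Bellman recursion described above can be illustrated with a minimal sketch of backward induction on a small tabular MDP. The array names `P`, `r`, `pi`, `mu` and the function `evaluate_policy` are illustrative choices rather than notation from the paper, and the value at the level after the last step is assumed to be zero, which is the usual convention. Note that the transition and reward arrays are shared by all levels (time-homogeneity), while the value arrays still carry a level index.

```python
import numpy as np

def evaluate_policy(P, r, pi, mu):
    """Backward induction for a finite-horizon, time-homogeneous tabular MDP.

    P:  (S, A, S) transition probabilities, shared by all levels h.
    r:  (S, A) expected rewards, shared by all levels h.
    pi: (H, S, A) non-stationary policy; pi[h, s] is a distribution over actions.
    mu: (S,) initial state distribution.
    Returns (J, V, Q) with V of shape (H + 1, S) and Q of shape (H, S, A).
    """
    H, S, A = pi.shape
    V = np.zeros((H + 1, S))            # V[H] = 0: assumed terminal convention
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = r + P @ V[h + 1]         # Bellman equation for the Q-function
        V[h] = (pi[h] * Q[h]).sum(axis=1)
    return mu @ V[0], V, Q

# Hypothetical example: random transitions, reward 1/H at every step.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
r = np.full((S, A), 1.0 / H)
pi = np.full((H, S, A), 1.0 / A)             # uniformly random policy
mu = np.full(S, 1.0 / S)
J, V, Q = evaluate_policy(P, r, pi, mu)
print(J, V[:, 0])
```

Because every step pays exactly 1/H in this example, the value-to-go depends only on the number of remaining steps, which makes the level dependence of the value function easy to see even though `P` and `r` do not change with the level.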
The expected cumulative reward of under policy is defined as:
and the ultimate goal for reinforcement learning is finding the optimal policy of , which is defined as:
We additionally define the following reaching probabilities:
that represent the probability of reaching state and state-action pair at time step . It’s easy to show that
and we also have the following relations between and
With the definition of at hand, we can conclude the following equation:
which provides a dual perspective on policy evaluation (Yang et al., 2020).
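The dual view can be checked numerically. The sketch below assumes the standard reading of the reaching probabilities (the marginal probability of being at a state-action pair at a given step) and verifies that summing reaching probability times reward over all steps gives the same value as backward induction. The names `reaching_probabilities`, `value_backward`, `xi_sa` and the random test instance are hypothetical.

```python
import numpy as np

def reaching_probabilities(P, pi, mu):
    """Forward recursion for the state-action reaching probabilities.

    Returns xi_sa of shape (H, S, A), where xi_sa[h, s, a] is the probability
    of visiting (s, a) at step h under policy pi and initial distribution mu.
    """
    H, S, A = pi.shape
    xi_sa = np.zeros((H, S, A))
    xi_s = mu.copy()
    for h in range(H):
        xi_sa[h] = xi_s[:, None] * pi[h]               # state marginal times policy
        xi_s = np.einsum("sa,sat->t", xi_sa[h], P)     # push forward one step
    return xi_sa

def value_backward(P, r, pi, mu):
    """Primal view: backward induction on the value function."""
    H = pi.shape[0]
    V = np.zeros(P.shape[0])
    for h in range(H - 1, -1, -1):
        V = (pi[h] * (r + P @ V)).sum(axis=1)
    return mu @ V

rng = np.random.default_rng(1)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A)) / H
pi = rng.dirichlet(np.ones(A), size=(H, S))
mu = rng.dirichlet(np.ones(S))

xi_sa = reaching_probabilities(P, pi, mu)
J_dual = (xi_sa * r).sum()            # sum over steps and state-action pairs
J_primal = value_backward(P, r, pi, mu)
print(J_dual, J_primal)
```

The two numbers agree up to floating-point error, and each slice `xi_sa[h]` sums to one because it is a probability distribution over state-action pairs.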
3.3 Offline Reinforcement Learning
Generally, and are not revealed to the learner, which means we can only learn about and identify the optimal policy with data from different kinds of sources. In offline reinforcement learning, the learner only has access to a collection of data where and , collected in episodes with a (known or unknown) behavior policy (so that ). For simplicity, define as the number of data that , while is the number of data that .
With , the learner is generally asked to perform two kinds of tasks. The first is offline evaluation (a.k.a. off-policy evaluation), which aims to estimate for a given . The second is offline improvement, which aims to find a that performs well on . We are interested in the statistical limit imposed by the limited amount of data, and in how to approach this limit with simple and computationally efficient algorithms.
Here we summarize the assumptions we use throughout the paper. [Bounded Total Reward] , we have almost surely, where , , and , . Notice that this also means almost surely. This is the key assumption used in Wang et al. (2020); Zhang et al. (2020b) to escape the curse of horizon in episodic reinforcement learning. As discussed in Jiang and Agarwal (2018); Wang et al. (2020); Zhang et al. (2020b), this assumption is also more general than the uniformly bounded reward assumption: . Thus, all of the results in this paper can be generalized to the uniformly bounded reward with a proper scaling of the bounded total reward.
[Data Coverage] , . This assumption has been used in Yin and Wang (2020); Yin et al. (2020) and is similar to the concentration coefficient assumption originating from Munos (2003). Intuitively, the performance of offline reinforcement learning should depend on , as state-action pairs with fewer visits introduce more uncertainty.
Notice that, . Assuming access to the generative model, we can query an equal number of samples from each state-action pair, where . For offline data sampled with a fixed behavior policy , we can view
which measures the quality of exploration for , and when the number of episodes , by standard concentration, we know , .
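To make the coverage quantity concrete, the following sketch counts visits in a list of logged transitions and reports the smallest fraction of samples that falls on any state-action pair. Treating this fraction as an empirical proxy for the coverage parameter is an assumption made for illustration; the tuple layout `(s, a, reward, s_next)` and the function names are also hypothetical.

```python
import numpy as np

def visitation_counts(transitions, S, A):
    """Count visits to each (s, a) and each (s, a, s') in the logged data."""
    n_sa = np.zeros((S, A), dtype=np.int64)
    n_sas = np.zeros((S, A, S), dtype=np.int64)
    for s, a, _reward, s_next in transitions:
        n_sa[s, a] += 1
        n_sas[s, a, s_next] += 1
    return n_sa, n_sas

def empirical_min_coverage(n_sa):
    """Smallest fraction of all samples landing on a single state-action pair."""
    return n_sa.min() / n_sa.sum()

# Generative-model style data: the same number of samples m for every pair.
S, A, m = 3, 2, 5
balanced = [(s, a, 0.0, (s + a) % S)
            for s in range(S) for a in range(A) for _ in range(m)]
n_sa, n_sas = visitation_counts(balanced, S, A)
d_hat = empirical_min_coverage(n_sa)

# Skewed data: one pair is visited only once, so coverage drops.
skewed = balanced + [(0, 0, 0.0, 0)] * 70
d_hat_skewed = empirical_min_coverage(visitation_counts(skewed, S, A)[0])
print(d_hat, d_hat_skewed)
```

With perfectly balanced data the proxy equals one over the number of state-action pairs, which is the largest value it can take; any imbalance in the behavior data lowers it.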
4 Offline Evaluation
In this section, we consider the offline evaluation problem, which is the basis of the offline improvement. We first introduce the plug-in estimator we consider, which is equivalent to different kinds of estimators that are widely used in practice. Then we show the error bound of the plug-in estimator, and provide the proof sketch of the error bound.
Here we consider the plug-in estimator, which first builds the empirical MDP from the data:
where is the indicator function, and then correspondingly defines , and , , by substituting the and in , and with and . This computation can be carried out with dynamic programming; see Algorithm 1 for details.
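A minimal sketch of this two-stage procedure follows: count-based estimates of the transition and reward, then dynamic programming on the estimated model. It is not a transcription of Algorithm 1. The function names, the tuple layout of the data, the choice to pass the initial distribution `mu` as a known input, and the handling of unvisited pairs (all-zero rows) are assumptions made for illustration.

```python
import numpy as np

def build_empirical_mdp(transitions, S, A):
    """Count-based estimates of the transition kernel and the mean reward."""
    n_sa = np.zeros((S, A))
    n_sas = np.zeros((S, A, S))
    r_sum = np.zeros((S, A))
    for s, a, reward, s_next in transitions:
        n_sa[s, a] += 1
        n_sas[s, a, s_next] += 1
        r_sum[s, a] += reward
    # Assumed convention: pairs that never appear keep all-zero estimates.
    denom = np.maximum(n_sa, 1.0)
    return n_sas / denom[:, :, None], r_sum / denom

def plug_in_value(P_hat, r_hat, pi, mu):
    """Evaluate pi on the empirical MDP by backward induction."""
    H = pi.shape[0]
    V = np.zeros(P_hat.shape[0])
    for h in range(H - 1, -1, -1):
        V = (pi[h] * (r_hat + P_hat @ V)).sum(axis=1)
    return mu @ V

# Tiny hypothetical dataset with two states and one action.
# From state 0 the data shows one self-loop and one move to state 1.
data = [(0, 0, 1.0, 0), (0, 0, 1.0, 1), (1, 0, 0.0, 1)]
S, A, H = 2, 1, 2
P_hat, r_hat = build_empirical_mdp(data, S, A)
pi = np.ones((H, S, A))               # only one action is available
mu = np.array([1.0, 0.0])             # episodes start in state 0
J_hat = plug_in_value(P_hat, r_hat, pi, mu)
print(P_hat[0, 0], J_hat)
```

In this example the estimated transition from state 0 is an even split between the two states, so the estimated value is the first-step reward plus half of the second-step reward.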
We correspondingly introduce the reaching probabilities in the empirical MDP as:
The plug-in estimator has been studied in Duan and Wang (2020) under the assumption of linear transition, and it’s known that the plug-in estimator is equivalent to the MIS estimator proposed in Yin and Wang (2020) and a certain version of DualDICE estimator with batch update proposed in Nachum et al. (2019a), due to the observation that
4.2 Theoretical Guarantee
holds with probability at least , where is the number of episodes, is the minimum visiting probability and absorbs the polylog factors.
Theorem 4.2 shows that, even with the simplest plug-in estimator, we can match the minimax lower bound for offline evaluation in time-homogeneous MDP up to the logarithmic factors ³, which means that, for offline evaluation with plug-in estimator, time-homogeneous MDP is not harder than the bandits.
³ To the best of our knowledge, no lower bound has been presented for the finite horizon time-homogeneous setting, so we provide a minimax lower bound in Theorem C.1 in Appendix.
The assumption is mild and necessary: with only episodes, there can exist some under-explored state-action pair that unavoidably leads to constant error, as suggested by the lower bound. Such an assumption is also needed in previous work like Yin and Wang (2020); Yin et al. (2020), even though their bounds are policy-dependent in time-inhomogeneous MDPs.
We remark that, although problem-dependent upper bounds like those of Jiang and Li (2016); Kallus and Uehara (2019) and computable confidence bounds like that of Duan and Wang (2020) are possible for finite-horizon time-inhomogeneous MDPs, these results cannot be directly translated into a nearly horizon-free bound, and doing so is beyond the scope of this paper.
4.3 Proof Sketch
The proofs of all technical lemmas can be found in Appendix B. Our proof is organized as follows: we first decompose the estimation error into the errors introduced by reward estimation and transition estimation with Lemma 4.3, then Lemma 4.3 provides an upper bound for the error introduced by . For the error introduced by , we first show that it can be upper bounded by a total variance term (in (9)). A naive bound for the total variance will introduce additional factors, so we recursively upper bound the total variance via higher moments (see Lemma 4.3). Solving the recursion in Lemma 4.3 and putting everything together, we eventually obtain the bound in Theorem 4.2.
[Value Difference Lemma]
holds with probability at least . We then use a recursive method to bound the error introduced by , i.e., the second term in (8).
With Bernstein’s inequality, we know that with high probability
where the second inequality is due to the Cauchy-Schwarz inequality together with Assumption 3.4.
We then consider in (9). A naive bound of this term will lead to an extra factor. Therefore, we upper bound it with higher-order variances recursively through the following lemma. [Variance Recursion] Assume , , then we have that
Again, by Bernstein’s inequality and Cauchy-Schwarz inequality, with high probability, we have
Notice that , . Now we can solve the recursion with the following lemma: For the recursion formula:
with , if , then we have that
and we need to do the recursion at most times.
We want to emphasize that, the recursion needs to be done at most times. Thus, by union bound, such recursion only introduces an additional factor in the error when , that can be absorbed by . For simplicity, we still use to denote the polylog factors in the following derivation.
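As background for why a total variance term need not grow with the horizon, the sketch below checks the standard law of total variance on a known MDP: for a deterministic policy and deterministic rewards, the variance of the total return equals the variance of the initial value plus the expected sum of one-step variances of the next-state value. This is a related textbook identity, not the recursion lemma of this section, which additionally has to handle the mismatch between the empirical and true models. All names and the random instance are hypothetical.

```python
import numpy as np

def total_variance_check(P, r, policy, mu):
    """Compare Var(return) with the sum of expected one-step value variances.

    P: (S, A, S) transitions, r: (S, A) deterministic rewards,
    policy: (H, S) integer actions (deterministic policy), mu: (S,) initial law.
    """
    H, S = policy.shape
    idx = np.arange(S)
    V = np.zeros((H + 1, S))     # expected return-to-go
    M = np.zeros((H + 1, S))     # second moment of return-to-go
    for h in range(H - 1, -1, -1):
        Ph = P[idx, policy[h]]   # (S, S) transition matrix under the policy
        rh = r[idx, policy[h]]   # (S,) reward under the policy
        V[h] = rh + Ph @ V[h + 1]
        M[h] = rh ** 2 + 2 * rh * (Ph @ V[h + 1]) + Ph @ M[h + 1]
    var_return = mu @ M[0] - (mu @ V[0]) ** 2

    xi = mu.copy()               # state reaching probabilities
    total = mu @ V[0] ** 2 - (mu @ V[0]) ** 2   # variance from the initial state
    for h in range(H):
        Ph = P[idx, policy[h]]
        step_var = Ph @ V[h + 1] ** 2 - (Ph @ V[h + 1]) ** 2
        total += xi @ step_var
        xi = xi @ Ph
    return var_return, total

rng = np.random.default_rng(2)
S, A, H = 4, 3, 6
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A)) / H          # total reward lies in [0, 1]
policy = rng.integers(A, size=(H, S))
mu = np.full(S, 1.0 / S)
var_return, total = total_variance_check(P, r, policy, mu)
print(var_return, total)
```

Since the total reward in this example lies in the unit interval, its variance is at most 1/4 regardless of the horizon, whereas bounding each of the one-step variances separately and adding them up would give a bound that grows linearly with the horizon.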
5 Offline Policy Optimization
In this section, we further consider the offline policy optimization problem, which is the ultimate goal for offline reinforcement learning. We introduce the model-based planning algorithm, which is probably the simplest algorithm for offline policy optimization. Then we analyze the performance gap of the policy obtained by model-based planning to the optimal policy.
5.1 Model-Based Planning
We consider the optimal policy on the empirical MDP , which can be defined as
Notice that can be obtained by dynamic programming with the empirical MDP, which is also known as model-based planning. We illustrate the algorithm in Algorithm 2. We remark that our analysis is independent of the algorithm used for solving (14). In other words, the result also applies to the optimization-based planning with the empirical MDPs (Puterman, 2014; Mohri et al., 2012), as long as it solves (14).
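The dynamic programming step can be sketched as backward induction with a maximization over actions, applied to an estimated model. This is an illustrative sketch rather than a transcription of Algorithm 2; the function name `plan`, the array layout, and the toy model (used here in place of an empirical MDP built from data) are assumptions.

```python
import numpy as np

def plan(P_hat, r_hat, H):
    """Finite-horizon optimal planning on a (possibly estimated) tabular MDP.

    Returns a deterministic non-stationary policy of shape (H, S) and the
    optimal value at the first step, of shape (S,).
    """
    S, A = r_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=np.int64)
    for h in range(H - 1, -1, -1):
        Q = r_hat + P_hat @ V            # optimal Q-function at step h
        policy[h] = Q.argmax(axis=1)     # greedy action (ties go to the lowest index)
        V = Q.max(axis=1)
    return policy, V

# Toy model: in state 0, action 0 pays 0.1 and stays, action 1 pays nothing
# but moves to state 1, where every step pays 0.5.
P_hat = np.zeros((2, 2, 2))
P_hat[0, 0, 0] = 1.0
P_hat[0, 1, 1] = 1.0
P_hat[1, :, 1] = 1.0
r_hat = np.array([[0.1, 0.0],
                  [0.5, 0.5]])
policy, V = plan(P_hat, r_hat, H=3)
print(policy[:, 0], V)
```

In the toy model the planned policy in state 0 moves to state 1 during the first two steps but collects the small immediate reward at the last step, which shows that the optimal policy depends on the step even though the model itself is time-homogeneous.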
5.2 Theoretical Guarantee
Theorem 5.2 provides a bound approaching the minimax lower bound up to logarithmic factors and a higher-order term, which shows that the error of offline policy optimization does not scale polynomially with the horizon. Notice that if , we can obtain an error bound of , which can be translated to a sample complexity that matches the best known sample complexity for the finite-horizon time-homogeneous setting (Zhang et al., 2020b). We conjecture that the additional factor in the higher-order term is only an artifact of the analysis (see Lemma 5.3) and can be eliminated with a more delicate analysis. We leave this as an open question.
5.3 Proof Sketch
We first make the following standard decomposition:
The first term characterizes the evaluation difference of the optimal policy between the original MDP and the empirical MDP, and the second term characterizes the evaluation difference of the planning result from the empirical MDP between the original MDP and the empirical MDP.
We can directly apply Theorem 4.2 to bound the first term in (5.3). However, as has a complicated statistical dependency on , we cannot apply Theorem 4.2 to the second term in (5.3), which is the evaluation error of a data-dependent policy. Notice that a direct application of the absorbing MDP techniques introduced in Agarwal et al. (2020); Li et al. (2020) to the second term will introduce additional or factors in the main term, as shown in Cui and Yang (2020). Thus, we use a recursive method to keep the main term tight while only introducing an additional factor in the higher-order term, which keeps the final error horizon-free.
Similar to the case in the offline evaluation, we make the following decomposition based on Lemma 4.3:
For the inner product of with and term, as and are independent of , we can apply the result for offline evaluation unchanged, which leads to a error. We then consider the higher-order term.
To deal with the dependency of and in the high-order term , we introduce the following lemma: Assume , , then we have that with high probability,
Lemma 5.3 has been widely used in the design and analysis of online reinforcement learning algorithms (e.g. Zhang et al., 2020b). It holds even when depends on , however, at the cost of an additional factor. This is the source of the additional factor in the higher-order term, and we believe a more fine-grained analysis can help avoid this additional factor.
With Lemma 5.3, we can apply the recursion-based technique again to bound the error introduced by , and finally obtain the bound in Theorem 5.2. By Lemma 5.3 and the Cauchy-Schwarz inequality, we have that
we have the recursion as