1 Introduction
Providing accurate and trustworthy estimates of a policy’s long-term value in a decision-making environment is an important problem in reinforcement learning (RL). Typically, due to cost or safety constraints, one must perform this estimation without actually running the policy in the live environment. Instead, one must predict the value of the policy using only a limited set of experience collected by some other logging (or behavior) policies acting in the sequential environment. This problem is generally referred to as off-policy evaluation (OPE) Precup (2000). The OPE problem is especially relevant to many practical domains, such as health Murphy et al. (2001); Liao et al. (2019), education Mandel et al. (2014), and recommendation systems Swaminathan et al. (2017), where accurate evaluation of a new policy is critical to maximize safety and minimize the risks associated with deploying a new policy Thomas (2015).
Perhaps the most straightforward approach to OPE is to use the given finite dataset of experience to determine the environment’s empirically observed initialization, transition, and reward probabilities, and then to evaluate the expected value of the target policy in this
empirical environment. This straightforward approach is known as the direct method (DM) Dudík et al. (2011); Voloshin et al. (2019). In addition to encompassing model-based (MB) methods Thomas and Brunskill (2016); Hanna et al. (2017), this general paradigm is also implicitly implemented by Q-evaluation (QE), or its parametric counterpart fitted Q-evaluation (FQE) Paine et al. (2020); Voloshin et al. (2019); Bradtke and Barto (1996). Indeed, the mathematical equivalence of QE and MB, even under certain function approximation schemes, has been recently demonstrated Duan and Wang (2020). Although the DM paradigm is a straightforward and intuitive approach, it is traditionally seen as undesirable because it yields biased estimates. That is, the estimates returned by QE or MB over multiple experiments on randomly sampled finite datasets are not centered around the true value of the target policy. This fact has led much of the OPE literature to focus on a variety of importance sampling (IS) based approaches Precup (2000); Li et al. (2011); Jiang and Li (2015); Liu et al. (2018); Nachum et al. (2019a), for which unbiased estimates are feasible. However, the ability to provide unbiased estimates is not necessary in many practical applications. Rather, in many practical scenarios where safety is a key concern Thomas (2015), the ability to provide unbiased estimates is less relevant than the need for high-confidence and accurate lower or upper bounds on the true value of the target policy. Efron’s bootstrap Efron (1987) is a well-known method in statistics for deriving confidence intervals from biased estimates, and so it may be a promising technique for use in conjunction with DM Hanna et al. (2017). Still, while bootstrapping is a simple approach widely used in statistics, it is not always guaranteed to yield accurate confidence intervals Putter and Van Zwet (2012); Abadie and Imbens (2008), and in the case of MB or QE, where the OPE estimate is a complex function of the input data, it is not immediately clear whether Efron’s bootstrap is valid.
In this paper, we investigate the validity of Efron’s bootstrap applied to DM. We derive theoretical guarantees showing that, if certain conditions are satisfied, Efron’s bootstrap applied to DM yields asymptotically accurate confidence intervals. The conditions we identify – namely, sufficient sample size and sufficient coverage of the underlying experience data distribution – may not hold in many practical scenarios. Therefore, we use insights from our derivations to suggest mechanisms – noisy rewards and regularization – for mitigating the effect of these issues in practice. We present empirical results in tabular settings that demonstrate the validity of our theory and the benefit of our heuristic mechanisms. Extending our methods to more complex environments with function approximation, we present state-of-the-art results, showing that MB and QE with Efron’s bootstrap can yield accurate and useful confidence intervals on challenging continuous control benchmarks.
2 Background
We consider the standard Markov Decision Process (MDP) setting
Puterman (1994), in which the environment is specified by a tuple M = ⟨S, A, R, T, μ, γ⟩, consisting of a state space S, an action space A, a reward distribution function R, a transition probability function T, an initial state distribution μ, and a discount γ ∈ [0, 1). A policy π interacts with the environment iteratively, starting with an initial state s₀ ∼ μ. For simplicity, we will restrict the text to consider the infinite-horizon setting, although all results apply in the finite-horizon setting as well.
In this work, we largely focus on estimation of the value ρ(π) of a given target policy π, defined as the expected accumulated reward of π in M, averaged over time via discounting:
(1) ρ(π) := (1 − γ) E[ Σ_{t=0}^∞ γ^t r_t ],  where s₀ ∼ μ, a_t ∼ π(s_t), r_t ∼ R(s_t, a_t), s_{t+1} ∼ T(s_t, a_t).
We consider the off-policy setting, in which we do not have explicit knowledge of μ, R, or T. Rather, we only have access to a finite empirical dataset of experience samples from these distributions. More concretely, we have a dataset D = {(s^(k), a^(k), r^(k), s′^(k))}_{k=1}^n consisting of tuples independently sampled via
(2) (s^(k), a^(k)) ∼ d,  r^(k) ∼ R(s^(k), a^(k)),  s′^(k) ∼ T(s^(k), a^(k)),
where d is some unknown distribution over state-action pairs. We will abuse notation at times and use d to denote both the joint distribution on tuples (s, a, r, s′) and the appropriately marginalized and conditioned distributions. The finite dataset induces its own empirical distribution over tuples, which we denote
(3) d̂(s, a, r, s′) := (1/n) Σ_{k=1}^n δ_{(s^(k), a^(k), r^(k), s′^(k))}(s, a, r, s′),
where δ_x is the Dirac delta distribution centered at x. The empirical distribution over tuples in turn determines an empirical initial state distribution μ̂, an empirical reward distribution function R̂, and an empirical transition probability function T̂. To appropriately define these when D has poor coverage of the state or action space, we define R̂(s, a) := R_prior(s, a) and T̂(s, a) := T_prior(s, a) for all (s, a) such that d̂(s, a) = 0, for some fixed prior distribution functions R_prior, T_prior.
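The construction above amounts to simple counting over the dataset. The following is a minimal sketch, assuming tabular (hashable) states and actions; all names are illustrative, not the paper's code:

```python
# Sketch: constructing the empirical MDP components (R_hat, T_hat) from a
# finite dataset of (s, a, r, s') tuples, falling back to fixed priors for
# state-action pairs with no data.
from collections import defaultdict

def empirical_mdp(dataset, prior_reward=0.0, prior_next=None):
    """dataset: list of (s, a, r, s_next) tuples sampled i.i.d."""
    counts = defaultdict(int)            # visit counts n(s, a)
    reward_sum = defaultdict(float)      # sum of observed rewards at (s, a)
    next_counts = defaultdict(lambda: defaultdict(int))  # counts of (s, a) -> s'
    for s, a, r, s_next in dataset:
        counts[(s, a)] += 1
        reward_sum[(s, a)] += r
        next_counts[(s, a)][s_next] += 1

    def R_hat(s, a):
        n = counts[(s, a)]
        return reward_sum[(s, a)] / n if n > 0 else prior_reward

    def T_hat(s, a):
        n = counts[(s, a)]
        if n == 0:
            return prior_next  # prior transition distribution for uncovered pairs
        return {sp: c / n for sp, c in next_counts[(s, a)].items()}

    return R_hat, T_hat
```

The empirical initial state distribution μ̂ can be tallied analogously from the first states of episodes.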
The direct method (DM) uses the empirically observed μ̂, R̂, T̂ to estimate ρ(π) as ρ̂(π) := ρ(π; μ̂, R̂, T̂), i.e., the value of π in the empirical MDP.
The direct method may be implemented explicitly through a model-based (MB) procedure, where μ̂, R̂, T̂ are either determined analytically or approximated by parametric models via maximum likelihood. Then, ρ̂(π) is approximated by Monte Carlo trajectories of π rolled out using these models. Alternatively, DM can also be implemented in a model-free fashion via Q-evaluation (QE). In this approach, a value function Q is iteratively learned via the Bellman backup procedure,
(4) Q_{k+1}(s, a) ← E_{r ∼ R̂(s,a), s′ ∼ T̂(s,a), a′ ∼ π(s′)}[ r + γ Q_k(s′, a′) ].
Ignoring issues of function approximation, this procedure converges to a fixed point Q̂^π, which is the Q-value function of π under the empirical MDP.¹ Once this fixed point is determined, the value of π may be approximated as ρ̂(π) = E_{s₀ ∼ μ̂, a₀ ∼ π(s₀)}[Q̂^π(s₀, a₀)]. When the iterative procedure in (4) is performed via a regression over parameterized Q-functions, this procedure is known as fitted Q-evaluation (FQE). The reader may look to Voloshin et al. (2019) for a review of a variety of instantiations of the direct method.
¹ When D has poor coverage, the fixed point depends on the initial values Q₀. The fixed point is still the Q-value function of π under an empirical MDP, where the prior reward and transition functions are implicitly defined by the initialization of Q.
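In the tabular case, the backup of Eq. (4) can be iterated directly to its fixed point. A minimal sketch, with illustrative data structures (dictionaries over a small finite MDP):

```python
# Sketch: tabular Q-evaluation (QE), iterating the Bellman backup of Eq. (4)
# to its fixed point on a finite (empirical) MDP.
def q_evaluation(states, actions, R, T, policy, gamma, iters=500):
    """R[(s, a)] -> mean reward; T[(s, a)] -> {s': prob}; policy[s] -> {a: prob}."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        # One full Bellman backup sweep; the RHS references the previous Q.
        Q = {
            (s, a): R[(s, a)] + gamma * sum(
                p * sum(policy[sp][ap] * Q[(sp, ap)] for ap in actions)
                for sp, p in T[(s, a)].items())
            for s in states for a in actions
        }
    return Q
```

Plugging the empirical R̂, T̂ into this routine and averaging Q̂^π over μ̂ and π yields the DM estimate ρ̂(π).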
Although DM via either MB or QE is straightforward, it generally yields biased estimates of ρ(π):
(5) E_D[ρ̂(π)] ≠ ρ(π).
Still, unbiased estimates are not strictly necessary in practical risk-sensitive applications, where one would rather have access to accurate confidence intervals, and the bias of a single point estimate is irrelevant. In the statistics literature, Efron’s bootstrap (Algorithm 1) is widely used to provide asymptotically accurate confidence intervals, even when point estimates of the statistic are biased, and applying it to DM methods has been proposed in the past Hanna et al. (2017). However, Efron’s bootstrap is not always guaranteed to yield accurate confidence intervals Putter and Van Zwet (2012); Abadie and Imbens (2008). In this paper, we investigate conditions under which Efron’s bootstrap applied to DM is guaranteed to yield accurate confidence intervals, and suggest mechanisms to improve the validity of the confidence intervals when these conditions do not hold.
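In the spirit of Algorithm 1, a minimal percentile-bootstrap sketch (not the bias-corrected BCa/ABC variants) looks as follows; all names are illustrative:

```python
# Sketch: Efron's percentile bootstrap. Resample the dataset with replacement,
# recompute the statistic on each resample, and take empirical quantiles of the
# resulting estimates as the confidence interval.
import random

def bootstrap_ci(data, statistic, alpha=0.05, num_resamples=1000, seed=0):
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(num_resamples)
    )
    lo = estimates[int((alpha / 2) * num_resamples)]
    hi = estimates[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi
```

For OPE, `statistic` would be the full DM pipeline: rebuild the empirical MDP from the resampled tuples and re-solve for ρ̂(π).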
Before getting into our main contributions, we list a few useful assumptions. For ease of exposition, we state these assumptions and our theoretical results with respect to countable state and action sets S and A; this allows us to avoid technical details from measure theory.
Assumption 1 (Bounded rewards).
The rewards of the MDP are bounded by some finite constant R_max: |r| ≤ R_max almost surely for r ∼ R(s, a), for all (s, a).
For the next assumption, we make use of the discounted on-policy distribution d^π, which measures the likelihood of the policy π encountering state-action pair (s, a) when interacting with M Nachum and Dai (2020):
(6) d^π(s, a) := (1 − γ) Σ_{t=0}^∞ γ^t Pr(s_t = s, a_t = a | s₀ ∼ μ, a_t ∼ π(s_t), s_{t+1} ∼ T(s_t, a_t)).
Assumption 2 (Sufficient data coverage).
There exists β > 0 such that for any (s, a), d^π(s, a) > 0 implies d(s, a) ≥ β.
As we will discuss later, Assumption 2 is very strong and often not satisfied in practice (e.g., in infinite state or action spaces).
3 Investigating the Validity of Efron’s Bootstrap
We begin by presenting a theoretical result showing the validity of using Efron’s bootstrap based on estimates prescribed by the direct method.
Theorem 1 (Correctness of DM with bootstrapping).
Under Assumptions 1 and 2, the use of Algorithm 1 with the DM estimator ρ̂(π) yields confidence intervals which are asymptotically correct, in the sense that
(7) Pr( ρ(π) ∈ [C_lo, C_hi] ) = 1 − α + O_P(n^{−1}),
where O_P is used to denote order in probability. Additionally, the one-sided confidence intervals are asymptotically correct at rate O_P(n^{−1/2}). These asymptotic rates may be improved by using more sophisticated bootstrapping methods in place of Algorithm 1, such as BCa or ABC DiCiccio and Efron (1996).
Proof.
(Sketch) First, it is clear by the definition of d̂ in (2) and Assumption 2 that ρ̂(π) → ρ(π) as n → ∞. Thus it is left to show that the bootstrap yields correct intervals around ρ̂(π). Sufficient conditions for correctness of Efron’s bias-corrected bootstrap are known, and they are given by smoothness (specifically, Hadamard differentiability²) of the functional ρ(π; ·) evaluated in a neighborhood (i.e., a sufficiently small ball) around the true distribution Wasserman (2006); Politis et al. (2012); Hall (2013). In the appendix, we show that under the assumption of bounded rewards (Assumption 1) the derivative for a general distribution D satisfies
² See the appendix for a definition of Hadamard differentiability.
(8) ‖∂ρ(π; D)/∂D(s, a, ·, ·)‖ = O( d^π_D(s, a) / D(s, a) ),
where d^π_D is the discounted on-policy distribution of π under D. When D = d, we have d^π_D = d^π. In the appendix, we show that this derivative is bounded within a sufficiently small neighborhood of d, given sufficient coverage of d (Assumption 2), and this completes the proof. ∎
Although necessary conditions for the validity of Efron’s bootstrap are not known in general, Hadamard differentiability is the key property typically used to prove validity. Our derivations make it clear that Assumption 2 is necessary to ensure Hadamard differentiability of ρ(π; ·); otherwise, a small change in D may take d^π_D outside the support of D, causing divergence in the derivative (8). In contrast, a weaker variant of this assumption – requiring only that d(s, a) > 0 whenever d^π(s, a) > 0 – which appears in previous OPE literature Nachum et al. (2019a), is not sufficiently strong to guarantee differentiability in the neighborhood of d. We encapsulate this in the following theorem.
Theorem 2 (Necessity of Assumption 2).
Suppose Assumption 1 holds and define the functional ν : D ↦ ρ(π; D). There exists d with uniformly bounded ratios d^π(s, a)/d(s, a) such that ν is not Hadamard differentiable within any neighborhood of d.
Proof.
See the appendix. ∎
Theorem 2 is somewhat disappointing, as Assumption 2 is strong and often not satisfied in practice; in continuous state or action settings, it is almost never satisfied.
In addition to the need for Assumption 2, the other major shortcoming of Theorem 1 is that it only guarantees correct intervals asymptotically. For any finite n, the confidence intervals yielded by Efron’s bootstrap will generally exhibit undercoverage, and in practice this can lead to overly confident intervals. Indeed, in the extreme case of n = 1, there will be no variation in the bootstrapped estimates of ρ̂(π), leading to confidence intervals that are single points.
In the following subsections, we elaborate on our suggested mechanisms for appropriately compensating for these two main theoretical shortcomings of Efron’s bootstrap applied to DM.
3.1 Regularizations for Insufficient Coverage
To better understand the need for sufficient coverage, we can look at a simple scenario illustrated in Figure 1. If the data distribution includes a state s but does not cover the action a chosen by the policy at s, then the estimates R̂(s, a), T̂(s, a) will be set to the priors R_prior(s, a), T_prior(s, a). However, if the data distribution includes the state-action pair (s, a) with even a tiny probability, the estimates are changed to their empirical values. In general, this change is not smooth (i.e., not Hadamard differentiable) with respect to the underlying data distribution, and this leads to issues with the validity of Efron’s bootstrap applied to ρ̂(π).
It is thus clear that to ensure validity of Efron’s bootstrap, we require estimates that are smoother around d. For example, smoother empirical reward and transition functions may be found by defining biased rewards and transitions in terms of some fixed ε > 0,
(9) R̃(s, a) := (1 − λ(s, a)) R̂(s, a) + λ(s, a) R_prior(s, a),
(10) T̃(s, a) := (1 − λ(s, a)) T̂(s, a) + λ(s, a) T_prior(s, a),  where λ(s, a) := ε / (d̂(s, a) + ε).
These biased functions would yield a regularized DM estimate:
(11) ρ̃(π) := ρ(π; μ̂, R̃, T̃).
This estimator is provably amenable to statistical bootstrapping regardless of data coverage, albeit at the cost of providing intervals for a biased estimate of ρ(π), as stated by the following theorem.
Theorem 3 (Correctness of regularized DM with bootstrapping).
Proof.
See the appendix. ∎
For succinctness, we have expressed Theorem 3 in terms of the specific R̃, T̃ defined above. In general, the guarantees of the theorem hold for any suitably smooth R̃, T̃, i.e., reward and transition functions that are locally differentiable around d; see the appendix for details. This more general result is promising for function approximation settings. In such settings, when using model-based evaluation or fitted Q-evaluation, it is straightforward to smooth out the estimated reward and transition functions via a number of standard regularizations. For example, in our experiments with neural network function approximators, we utilize standard weight decay, which acts as a regularization towards prior reward and transition functions implicitly defined by the network structure.
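In the tabular case, one simple count-based instantiation of this kind of smoothing can be sketched as follows. The mixture weight ε/(n + ε) below is an illustrative choice (shrinking toward the prior when data is scarce), not necessarily the paper's exact form:

```python
# Sketch: smoothing an empirical reward estimate toward a prior, with a
# weight that vanishes as the visit count n(s, a) grows. The same mixture
# can be applied elementwise to the empirical transition probabilities.
def smoothed_reward(n_visits, empirical_mean, prior_mean, eps=1.0):
    lam = eps / (n_visits + eps)  # -> 1 with no data, -> 0 with abundant data
    return (1.0 - lam) * empirical_mean + lam * prior_mean
```

Because the smoothed estimate varies continuously with the visit counts, small perturbations of the data distribution no longer cause the discontinuous prior-to-empirical jump described above.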
3.2 Noisy Rewards
Even with sufficient coverage or appropriate regularization, the computed confidence intervals will generally be overconfident and undercover the true value, especially in low-data regimes. This is due to the fact that for finite n, the empirical variance of the functional ρ̂(π) over the bootstrapped datasets is in general an underestimate of the true variance.
To incorporate additional variance, we propose to augment the dataset via perturbations applied to the observed rewards,
(13) r̃^(k) := r^(k) + ε^(k),  ε^(k) ∼ N(0, σ²).
Note that the variance of the rewards in the empirical dataset is thereby increased by σ². Given the augmented dataset D̃, one may perform Algorithm 1 as-is, sampling bootstrapped datasets each of n elements. This same technique of augmenting a dataset with noisy rewards has been used in the bandit literature as a way to perform better exploration Kveton et al. (2019, 2018). As in this previous literature, a large enough σ would be sufficient to compensate for the inherent undercoverage in bootstrapping, although in practice a much smaller σ can still yield good coverage.
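The augmentation in (13) amounts to a one-line transformation of the dataset before bootstrapping. A sketch assuming zero-mean Gaussian noise (the distributional choice here follows the text above; names are illustrative):

```python
# Sketch: the noisy-reward augmentation -- perturb each observed reward with
# independent zero-mean Gaussian noise of scale sigma, leaving states and
# actions untouched, before running the bootstrap as usual.
import random

def add_reward_noise(dataset, sigma, seed=0):
    rng = random.Random(seed)
    return [(s, a, r + rng.gauss(0.0, sigma), sp) for (s, a, r, sp) in dataset]
```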
With noisy rewards, we are able to compensate for the undercoverage described in Theorems 1 and 3. However, this generally comes at the cost of overcoverage. In practice, the parameter σ provides a way to trade off between safety in small-data regimes and looseness of the confidence intervals. In our experiments, we found that a suitably chosen σ provides a reasonable trade-off for our considered environments.
4 Related Work
Our paper focuses on producing confidence bounds for off-policy evaluation and therefore follows a long line of work on high-confidence off-policy evaluation (HCOPE) Thomas et al. (2015a). Many of the existing methods for HCOPE focus on importance sampling (IS) based estimators, in which the rewards of a trajectory are reweighted according to an inverse propensity ratio to yield an unbiased estimate of ρ(π) Precup (2000). Given a dataset with several trajectories, one may derive several unbiased estimates and then use concentration inequalities to derive high-confidence lower and upper bounds on the true average Thomas et al. (2015a). Since these concentration inequalities typically require unbiased estimates, they are not applicable to the direct method.
In terms of statistical bootstrapping, there have been several instances of its use for off-policy evaluation. Specifically, Thomas et al. (2015b) combined statistical bootstrapping with IS to derive OPE confidence intervals. Unlike for DM, the validity of Efron’s bootstrap with IS is straightforward, since the functional in this case is the standard mean. We are aware of one previous instance in which statistical bootstrapping was used for high-confidence policy evaluation with DM; specifically, Hanna et al. (2017) propose to use Efron’s bootstrap in conjunction with model-based learning, similar to the present work. However, the validity of using Efron’s bootstrap is not addressed in this previous work. The theoretical investigation we present is a key contribution of our paper. Notably, we found that the direct use of Efron’s bootstrap is misguided without either strong assumptions or, as we suggest, mechanisms like regularization and noisy rewards. Furthermore, our experimental work presents strong results on continuous control benchmarks, while previous work mostly focuses on tabular domains.
Outside of the narrow scope of HCOPE, the ideas behind Efron’s bootstrap have inspired a number of existing RL algorithms. Specifically, statistical bootstrapping has been proposed as a mechanism for exploration; e.g., bootstrapped DQN Osband et al. (2016, 2017). However, in practice, the type of bootstrapping performed in these algorithms is far from that prescribed by Efron’s bootstrap. Usually, an ensemble of models is learned over the whole dataset, without any resampling or bias correction, and thus the theory behind the bootstrap does not readily apply. Although this simple paradigm has achieved impressive results on hard exploration environments Nachum et al. (2019b), in our initial experiments for off-policy evaluation we found the naive ensembling approach to yield poor confidence intervals. In the bandits literature, ideas from statistical bootstrapping have also been investigated as an exploration mechanism Kveton et al. (2018); Hao et al. (2019). While we have focused on policy evaluation, extending the insights and derivations of the present paper to propose better algorithms for exploratory policy learning (or, conversely, safe policy learning) is an interesting avenue for future work.
5 Experiments
We evaluate our methods first in a discrete tabular domain, where we investigate how well the coverage of the estimated bootstrap intervals matches the intended coverage and show how reward noise can assist in low-data regimes. Sufficient coverage is not much of an issue in finite domains,³ and so we continue to a more difficult set of continuous control tasks from OpenAI Gym Brockman et al. (2016), where we evaluate the use of appropriately regularized function approximators in conjunction with bootstrapping and noisy rewards.
³ In finite domains, Assumption 2 reduces to requiring d(s, a) > 0 whenever d^π(s, a) > 0.
5.1 Tabular Tasks
We use Frozen Lake as a discrete domain for tabular experiments. In this environment, the agent navigates a discrete world from a start state to a goal state. The environment dynamics are stochastic, and some actions lead to episode termination. We use a fixed discount γ. We use a target policy that is near-optimal in this domain. We collect an experience dataset using a behavior policy derived from the target policy by injecting 0.2-greedy noise (this noise reduces the value of the behavior policy relative to the target). For this task, policy evaluation with DM via either MB or QE can be equivalently solved using the exact tabular method, so we plot a single variant labelled DM.
We present empirical results in Figure 2. We plot the results of using Efron’s bootstrap with DM to construct confidence intervals at a fixed confidence level across a number of dataset sizes. The results show the empirical coverage of the estimated confidence intervals, as measured over 200 randomly sampled datasets (each dataset is then resampled repeatedly for computing bootstrap estimates). We find that DM with bootstrapping is able to achieve near-correct empirical coverage as the dataset size grows. As suggested by the theory, the bootstrap typically underestimates the desired coverage, and this is severe in low-data regimes (when the number of episodes is small).
We also show the results of using noisy rewards to combat this low-data issue. We perturb the rewards with Gaussian noise as described in Section 3.2. The resulting difference in performance in the low-data regime is striking; DM with the noisy bootstrap is able to yield near-correct coverage, although, as expected, it typically slightly overestimates the desired coverage.
As a point of comparison, we plot a number of other high-confidence policy evaluation methods: IS with bootstrapping, IS with the empirical Bernstein inequality, IS with Student’s t, IS with Hoeffding’s inequality, and doubly robust (DR) IS with bootstrapping (see Thomas et al. (2015a, b); Hanna et al. (2017)). We find that all of these previous methods mostly either severely underestimate or severely overestimate the desired coverage. There is potential for our proposed noisy rewards to be beneficial for some of these baselines as well (e.g., DR with bootstrapping), and this is a promising avenue for future work.
[Figure caption fragment] For FQE with noisy bootstrapping, the noise scale corresponds to a coefficient applied to the standard deviation of observed rewards in the dataset. Some of the variants (FQE without weight decay, IS bootstrap) at times produce intervals which are wholly outside the plotted range.
5.2 Continuous Control Tasks
We now evaluate the use of bootstrapping on continuous control tasks from OpenAI Gym Brockman et al. (2016). Due to high computational demands, we focus on Reacher, HalfCheetah, and Hopper. We follow a protocol similar to Nachum et al. (2019a). First, we generate a near-optimal policy by training SAC Haarnoja et al. (2018). The target policy is set to be this near-optimal policy with a fixed variance. The datasets are sampled via a suboptimal policy derived from the near-optimal policy by replacing its variance with a larger fixed quantity (depending on the task). We train all networks for one million steps using stochastic gradient descent via the Adam optimizer Kingma and Ba (2014) with a fixed learning rate and minibatch size. As a form of regularization to combat issues with sufficient coverage, we apply weight decay (L2 regularization) to all methods, unless otherwise specified.
We present the computed intervals of FQE and MB in Figure 3. Focusing first on the effect of reward noise, we look at the ablation presented by the three FQE variants in these plots (see the appendix for an ablation over MB variants). In extreme low-data regimes (10 trajectories), the variance of vanilla FQE intervals is large and coverage of the true value suffers (especially for Reacher). With increased reward noise scales, coverage of the true value improves, but at times at the cost of a wider interval.
Next, we consider the issue of sufficient coverage. By default, we apply L2 regularization to FQE. In Figure 3 we present a variant without L2 regularization. We find the absence of this regularization to have a detrimental effect on performance. At times, the intervals computed by unregularized FQE are so inaccurate that they are outside the range of the plot. We find that the regularized version of FQE exhibits more stable performance. We found regularization of MB to also be crucial. The MB method plotted here uses L2 regularization on the weights and clips states and rewards generated during modelbased rollouts. Although not plotted, we found that without these regularizations, the MB bootstrap intervals diverge. In some instances, we can see the consequences of these strong regularizations in terms of biased intervals that do not cover the true value, such as in HalfCheetah.
Overall, we conclude that DM approaches using bootstrapping and our proposed mechanisms can yield strong performance in these difficult domains. Between FQE and MB, FQE appears to be better suited for these domains, although both methods show substantial improvement over existing approaches (IS with bootstrapping).⁴
⁴ DR with bootstrapping produces even worse intervals, and so we do not plot it.
6 Conclusion
We have investigated the validity of Efron’s bootstrap for computing confidence intervals with respect to the direct method (DM) for offpolicy evaluation. Our theoretical results show that Efron’s bootstrap is valid given that specific conditions – sufficient data size and sufficient coverage – are satisfied. While these conditions are often not satisfied in practice, there are a number of heuristic mechanisms that can be employed to mitigate their effects, although at a cost of overly conservative or biased intervals. Still, empirically we find that these mechanisms can be used to yield impressive performance for OPE in challenging environments. In the future, we hope to use the ideas and techniques presented here and apply them to policy optimization problems, where safety is also a key concern.
Broader Impact
Our work focuses on the practically relevant problem of offpolicy evaluation. Interestingly, our work reveals the potential issues with applying a wellknown technique – Efron’s bootstrap – without considering its validity. Our work shows that Efron’s bootstrap may often not be valid. Although we propose mechanisms to remedy this, our solutions are not foolproof. In a practical setting, where many of our assumptions may not hold, one must take special care when applying our method to mitigate risks of failure.
Acknowledgments
Thanks to Jonathan Tompson, Andy Zeng, Branislav Kveton, and others at Google Research for contributing helpful thoughts and discussions.
References
 On the failure of the bootstrap for matching estimators. Econometrica 76(6), pp. 1537–1557.
 Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1–3), pp. 33–57.
 OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Bootstrap confidence intervals. Statistical Science, pp. 189–212.
 Minimax-optimal off-policy evaluation with linear function approximation. arXiv preprint arXiv:2002.09516.
 Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
 Better bootstrap confidence intervals. Journal of the American Statistical Association 82(397), pp. 171–185.
 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 The bootstrap and Edgeworth expansion. Springer Science & Business Media.
 High confidence off-policy evaluation with models. arXiv preprint arXiv:1606.06126.
 Bootstrapping with models: confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence.
 Bootstrapping upper confidence bound. In Advances in Neural Information Processing Systems, pp. 12123–12133.
 Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Perturbed-history exploration in stochastic multi-armed bandits. arXiv preprint arXiv:1902.10089.
 Garbage in, reward out: bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154.
 Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306.
 Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366.
 Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 International Conference on Autonomous Agents and Multiagent Systems, pp. 1077–1084.
 Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96(456), pp. 1410–1423.
 DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
 Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866.
 Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618.
 Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034.
 Deep exploration via randomized value functions. Journal of Machine Learning Research.
 Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055.
 Subsampling. Springer Science & Business Media.
 Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80.
 Markov decision processes: discrete stochastic dynamic programming.
 Resampling: consistency of substitution estimators. In Selected Works of Willem van Zwet, pp. 245–266.
 Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
 Off-policy evaluation for slate recommendation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3635–3645.
 Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2139–2148.
 High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence.
 High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2380–2388.
 Safe reinforcement learning. Ph.D. thesis, University of Massachusetts.
 Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854.
 All of Nonparametric Statistics. Springer Science & Business Media.
Appendix A Proofs
a.1 Hadamard Differentiability
We provide a definition of Hadamard differentiability, which is a key property for showing validity of Efron’s bootstrap. The following is paraphrased from Wasserman (2006).
Definition 1.
Suppose ν is a functional mapping distributions over tuples (s, a, r, s′) to ℝ. Denote by L the linear space generated by such distributions. The functional ν is said to be Hadamard differentiable at D if there exists a linear functional ν′_D on L such that for any sequences t_n → 0 and h_n → h in L with D + t_n h_n ∈ L,
(14) lim_{n→∞} ( ν(D + t_n h_n) − ν(D) ) / t_n = ν′_D(h).
a.2 Proof of Theorem 1
As in the main text, we use μ_D, R_D, T_D to denote the initial state, conditional reward, and conditional transition distributions determined by a distribution D over tuples, and let r̄_D(s, a) := E_{r ∼ R_D(s,a)}[r]. For ease of notation, we will use matrix notation and assume finite state and action spaces (an extension to Hilbert spaces with linear operators is straightforward). The functional may be expressed as
(15) ρ(π; D) = (1 − γ) r̄_D^⊤ (I − γ Π T_D)^{−1} Π μ_D,
where we use Π to denote the matrix mapping distributions over states to distributions over state-action pairs, with actions sampled according to π.
Note that, assuming $P$ corresponds to the true environment distributions $(d_0, R, T)$, the components of this expression for $\rho$ at $P$ yield the on-policy distribution $d^\pi$ and the values $Q^\pi$. Specifically,
(16) $d^\pi = (1-\gamma)\, (I - \gamma\, \Pi_\pi T)^{-1} \Pi_\pi d_0,$
(17) $Q^\pi = \big(I - \gamma\, (\Pi_\pi T)^\top\big)^{-1} R.$
For general $P$, these expressions will yield $\hat{d}^\pi$, the on-policy distribution in the empirical MDP, and $\hat{Q}^\pi$, the values in the empirical MDP, respectively.
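As a concrete sanity check, the matrix expression for $\rho(P)$ can be evaluated directly on a small tabular MDP. The following sketch is illustrative only (not the paper's code; all variable names are ours): it builds a random MDP and confirms that the $Q$-based form of $\rho$ agrees with the equivalent occupancy-based form $\rho = \langle d^\pi, R \rangle$.

```python
import numpy as np

# Illustrative check of the matrix form of rho(P) on a random tabular MDP:
#   rho = (1-gamma) (Pi d0)^T (I - gamma (Pi T)^T)^{-1} R   (Q-based form)
#   rho = <d_pi, R>, d_pi = (1-gamma)(I - gamma Pi T)^{-1} Pi d0   (occupancy form)
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
nSA = nS * nA

d0 = rng.dirichlet(np.ones(nS))                  # initial state distribution
R = rng.uniform(size=nSA)                        # expected rewards R(s, a)
T = rng.dirichlet(np.ones(nS), size=nSA).T       # T[s', (s,a)] transition probs
pi = rng.dirichlet(np.ones(nA), size=nS)         # target policy pi[s, a]

# Pi maps state distributions to state-action distributions:
# (Pi d)(s, a) = pi(a | s) d(s).
Pi = np.zeros((nSA, nS))
for s in range(nS):
    Pi[s * nA:(s + 1) * nA, s] = pi[s]

I = np.eye(nSA)
Q = np.linalg.solve(I - gamma * (Pi @ T).T, R)                       # values
d_pi = (1 - gamma) * np.linalg.solve(I - gamma * Pi @ T, Pi @ d0)    # occupancy

rho_via_Q = (1 - gamma) * (Pi @ d0) @ Q
rho_via_d = d_pi @ R
assert np.isclose(rho_via_Q, rho_via_d)
```

The two forms agree because $(I - \gamma\, \Pi_\pi T)^{-\top} = (I - \gamma\, (\Pi_\pi T)^\top)^{-1}$.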
As mentioned in the proof sketch, the validity of Theorem 1 rests on the Hadamard differentiability of $\rho$ at all $\tilde{P}$ in a neighborhood around $P$. In addition to local Hadamard differentiability, one must also have that the derivative linear functional $\rho'_P$ satisfy
(18) $\mathbb{E}_{z \sim P}\big[\rho'_P(\delta_z - P)\big] = 0 \quad \text{and} \quad 0 < \mathbb{E}_{z \sim P}\big[\rho'_P(\delta_z - P)^2\big] < \infty.$
See Theorems 3.19 and 3.21 in [40] for more information. In the text below, we will show that $\rho$ is indeed Hadamard differentiable with derivative satisfying
(19) $\rho'_P(\delta_z - P) = (1-\gamma)\Big(V^\pi(s_0) - \mathbb{E}_{\tilde{s} \sim d_{0,P}}[V^\pi(\tilde{s})]\Big) + \dfrac{d^\pi(s,a)}{d_P(s,a)}\Big(r + \gamma V^\pi(s') - Q^\pi(s,a)\Big),$
for $z = (s_0, s, a, r, s')$, where $V^\pi(s) := \mathbb{E}_{a \sim \pi(s)}[Q^\pi(s,a)]$ and $d_P$ denotes the state-action marginal of $P$.
The result (19) in conjunction with Assumption 2 will immediately make it clear that $\mathbb{E}_{z \sim P}[\rho'_P(\delta_z - P)^2] < \infty$. Moreover, the linear nature of the functional $\rho'_P$ with respect to $\delta_z$ makes it clear that $\mathbb{E}_{z \sim P}[\rho'_P(\delta_z - P)] = 0$, thus showing the validity of the bootstrap.
We now continue to characterize the linear functional $\rho'_P$. We will first derive, via standard Fréchet differentiation, the derivatives of $\rho$ with respect to $R$, $d_0$, and $T$, for $P$ that satisfy Assumption 1. We will later use these results in conjunction with Assumption 2 to show the Hadamard differentiability of $\rho$ with respect to $\tilde{P}$ in a ball around $P$.
• $\partial \rho / \partial R$: It is clear from (16) that $\partial \rho / \partial R = d^\pi$, since $\rho(P) = (d^\pi)^\top R$ and $d^\pi$ does not depend on $R$.
• $\partial \rho / \partial d_0$: It is clear from (17) that $\partial \rho / \partial d_0 = (1-\gamma)\, \Pi_\pi^\top Q^\pi$; i.e., $\partial \rho / \partial d_0(s) = (1-\gamma)\, V^\pi(s)$.
• $\partial \rho / \partial T$: This derivation is not as trivial as the previous two. Still, it may be approached in a straightforward manner by utilizing the policy gradient theorem [33]. Although the policy gradient theorem is typically used to derive gradients of $\rho$ with respect to $\pi$, we may apply it here, interpreting $T$ as the stationary "policy" whose gradient we wish to calculate ("transitions" are now between state-action pairs and the "actions" are choices of next states). Specifically, we may rewrite (15) as
(20) $\rho(P) = (1-\gamma)\, (\Pi_\pi d_0)^\top \big(I - \gamma\, (\Pi_\pi T)^\top\big)^{-1} R,$
where $R$ is constant with respect to $T$. This way, we deduce that $\partial \rho / \partial T(s' \mid s,a) = \gamma\, d^\pi(s,a)\, V^\pi(s')$ for all $(s, a, s')$.
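The first of these partial derivatives is easy to verify numerically. The sketch below (ours, with hypothetical variable names, not the paper's code) compares a finite-difference gradient of $\rho$ with respect to $R$ against the occupancy $d^\pi$; since $\rho$ is linear in $R$, the two agree to machine precision.

```python
import numpy as np

# Finite-difference check of d rho / d R = d^pi on a random tabular MDP.
rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.8
nSA = nS * nA

d0 = rng.dirichlet(np.ones(nS))
R = rng.uniform(size=nSA)
T = rng.dirichlet(np.ones(nS), size=nSA).T       # T[s', (s,a)]
pi = rng.dirichlet(np.ones(nA), size=nS)
Pi = np.zeros((nSA, nS))
for s in range(nS):
    Pi[s * nA:(s + 1) * nA, s] = pi[s]

def rho(R_vec):
    # rho(P) = (1-gamma) (Pi d0)^T (I - gamma (Pi T)^T)^{-1} R
    Q = np.linalg.solve(np.eye(nSA) - gamma * (Pi @ T).T, R_vec)
    return (1 - gamma) * (Pi @ d0) @ Q

d_pi = (1 - gamma) * np.linalg.solve(np.eye(nSA) - gamma * Pi @ T, Pi @ d0)

# Central finite differences in each coordinate of R.
eps = 1e-6
grad = np.array([(rho(R + eps * e) - rho(R - eps * e)) / (2 * eps)
                 for e in np.eye(nSA)])
assert np.allclose(grad, d_pi, atol=1e-8)
```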
With these three partial derivatives calculated, we may continue to show differentiability of $\rho$ in a neighborhood around $P$. Without loss of generality, we assume that $d_P$ has full support; if not, we may simply ignore all tuples outside of its support, since they do not affect $\rho$ or $\rho'_P$ (note that by Assumption 2 this means $d^\pi$ also has full support).
Now we continue to characterize the derivative of $\rho$. Denote the derivative of $\rho$ at $P$ by $\rho'_P$, where $\rho'_P$ is defined to be the linear functional satisfying,
(21) $\rho'_P(\delta_z - P) = \lim_{t \to 0} \dfrac{\rho\big(P + t(\delta_z - P)\big) - \rho(P)}{t},$
for all tuples $z = (s_0, s, a, r, s')$. We analyze the behavior of these directional limits. We again split our analysis into three parts:
• Influence of $r$. The influence of $r$ is in the empirical average reward function at $(s,a)$: $R_P(s,a) = \mathbb{E}_P[r \mid s,a]$. At a change of $P + t(\delta_z - P)$, this value is updated to
(22) $R_{P + t(\delta_z - P)}(s,a) = \dfrac{(1-t)\, d_P(s,a)\, R_P(s,a) + t\, r}{(1-t)\, d_P(s,a) + t}.$
The derivative of this expression at $t = 0$ is $\frac{r - R_P(s,a)}{d_P(s,a)}$. Combined with the partial derivative $\partial \rho / \partial R = d^\pi$ computed earlier, we find the total influence on $\rho$ is $\frac{d^\pi(s,a)}{d_P(s,a)}\big(r - R_P(s,a)\big)$ as $t \to 0$.
• Influence of $s_0$. The influence of $s_0$ is in the empirical initial state distribution $d_{0,P}$, which is updated to $(1-t)\, d_{0,P} + t\, \delta_{s_0}$. To deduce the influence on $\rho$, we combine with the partial derivative $\partial \rho / \partial d_0 = (1-\gamma)\, \Pi_\pi^\top Q^\pi$ computed earlier, and find the change in $\rho$ to be $(1-\gamma)\big(V^\pi(s_0) - \mathbb{E}_{\tilde{s} \sim d_{0,P}}[V^\pi(\tilde{s})]\big)$.
• Influence of $s'$. As for the reward, the influence here is in the empirical transition probabilities $T_P(\cdot \mid s,a)$, which is updated to
(23) $T_{P + t(\delta_z - P)}(\cdot \mid s,a) = \dfrac{(1-t)\, d_P(s,a)\, T_P(\cdot \mid s,a) + t\, \delta_{s'}}{(1-t)\, d_P(s,a) + t}.$
The derivative of this expression at $t = 0$ is $\frac{\delta_{s'} - T_P(\cdot \mid s,a)}{d_P(s,a)}$. Combining this with the known partials of $\rho$ with respect to $T$, we find that the total influence on $\rho$ is $\gamma\, \frac{d^\pi(s,a)}{d_P(s,a)}\big(V^\pi(s') - \mathbb{E}_{\tilde{s} \sim T_P(s,a)}[V^\pi(\tilde{s})]\big)$ as $t \to 0$.
We may see that each of these influences on $\rho$ is linear in $\delta_z - P$. By Assumption 1, $|r|$ is uniformly bounded, as are $\|Q^\pi\|_\infty$ and $\|V^\pi\|_\infty$. Thus, in conjunction with the Riesz representation theorem, we deduce that the derivative satisfies
(24) $\rho'_P(\delta_z - P) = (1-\gamma)\Big(V^\pi(s_0) - \mathbb{E}_{\tilde{s} \sim d_{0,P}}[V^\pi(\tilde{s})]\Big) + \dfrac{d^\pi(s,a)}{d_P(s,a)}\Big(r + \gamma V^\pi(s') - Q^\pi(s,a)\Big).$
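The closed-form derivative (24) can be validated against the directional limit (21) by finite differences. The following sketch is ours, not the paper's code; the finite reward support and the smoothing of $P$ toward uniform are assumptions made purely so that all conditionals are well defined. Here $P$ is an explicit joint table over tuples $(s_0, (s,a), r, s')$.

```python
import numpy as np

# Compare the influence function (24) against a central finite difference of
# the directional derivative (21) for a single tuple z.
rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.7
rewards = np.array([0.0, 1.0])        # assumed finite reward support
nSA, nR = nS * nA, len(rewards)

P = rng.dirichlet(np.ones(nS * nSA * nR * nS)).reshape(nS, nSA, nR, nS)
P = 0.5 * P + 0.5 / P.size            # smooth so all entries are bounded away from 0
pi = rng.dirichlet(np.ones(nA), size=nS)
Pi = np.zeros((nSA, nS))
for s in range(nS):
    Pi[s * nA:(s + 1) * nA, s] = pi[s]

def components(P):
    d0 = P.sum(axis=(1, 2, 3))                 # initial state marginal
    dP = P.sum(axis=(0, 2, 3))                 # state-action marginal
    R = (P.sum(axis=(0, 3)) @ rewards) / dP    # E[r | s, a]
    T = (P.sum(axis=(0, 2)) / dP[:, None]).T   # T[s', (s,a)]
    return d0, dP, R, T

def rho(P):
    d0, _, R, T = components(P)
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(nSA) - gamma * Pi @ T, Pi @ d0)
    return d_pi @ R

def influence(P, z):
    # Closed-form rho'_P(delta_z - P), eq. (24), for z = (s0, (s,a), r_idx, s').
    s0, sa, ri, s1 = z
    d0, dP, R, T = components(P)
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(nSA) - gamma * Pi @ T, Pi @ d0)
    Q = np.linalg.solve(np.eye(nSA) - gamma * (Pi @ T).T, R)
    V = Pi.T @ Q                               # V(s) = sum_a pi(a|s) Q(s,a)
    return ((1 - gamma) * (V[s0] - d0 @ V)
            + d_pi[sa] / dP[sa] * (rewards[ri] + gamma * V[s1] - Q[sa]))

z = (0, 1, 1, 2)
delta = -P.copy()
delta[z] += 1.0                                 # delta = dirac_z - P
t = 1e-5
fd = (rho(P + t * delta) - rho(P - t * delta)) / (2 * t)
assert np.isclose(fd, influence(P, z), atol=1e-6)
```

Summing the influence over $z \sim P$ also recovers the zero-mean property in (18), as the test below confirms.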
Now consider an arbitrary distribution $\tilde{P} \in \mathcal{P}$ and the directional limit
(25) $\lim_{t \to 0} \dfrac{\rho\big(P + t(\tilde{P} - P)\big) - \rho(P)}{t}.$
Analogous to the derivations above, we may find,
• The empirical average reward at a change of $P + t(\tilde{P} - P)$ is updated to
(26) $R_{P + t(\tilde{P} - P)}(s,a) = \dfrac{(1-t)\, d_P(s,a)\, R_P(s,a) + t\, d_{\tilde{P}}(s,a)\, R_{\tilde{P}}(s,a)}{(1-t)\, d_P(s,a) + t\, d_{\tilde{P}}(s,a)}.$
• The empirical initial state distribution at a change of $P + t(\tilde{P} - P)$ is updated to
(27) $d_{0,\, P + t(\tilde{P} - P)} = (1-t)\, d_{0,P} + t\, d_{0,\tilde{P}}.$
• The empirical transition probabilities at a change of $P + t(\tilde{P} - P)$ are updated to
(28) $T_{P + t(\tilde{P} - P)}(\cdot \mid s,a) = \dfrac{(1-t)\, d_P(s,a)\, T_P(\cdot \mid s,a) + t\, d_{\tilde{P}}(s,a)\, T_{\tilde{P}}(\cdot \mid s,a)}{(1-t)\, d_P(s,a) + t\, d_{\tilde{P}}(s,a)}.$
By considering the limits of (26), (27), (28) as $t \to 0$, it is clear that
(29) $\rho'_P(\tilde{P} - P) = \mathbb{E}_{z \sim \tilde{P}}\big[\rho'_P(\delta_z - P)\big].$
To show Hadamard differentiability, we invoke Assumption 2, which implies that there exists a sufficiently small $\epsilon$ such that the ball centered at $P$ with radius $\epsilon$ has uniformly bounded $d^\pi_{\tilde{P}} / d_{\tilde{P}}$. Since the support of $d^\pi_{\tilde{P}}$ is contained within the support of $d_{\tilde{P}}$, this means that the same ball has uniformly bounded $\rho'_{\tilde{P}}$. Moreover, it is clear that $\rho'_{\tilde{P}} \to \rho'_P$ within this ball uniformly, and so the directional derivatives of (26), (27), and (28) converge uniformly with $t$. Thus, there exists a sufficiently small ball around $P$ within which $\rho$ is Hadamard differentiable. This completes our proof.
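For completeness, the procedure whose validity this theorem supports can be sketched end-to-end: resample the dataset of tuples with replacement (Efron's bootstrap), rebuild the empirical MDP, and re-evaluate the target policy in it. This is an illustrative implementation under our own naming, with a synthetic dataset standing in for $\mathcal{D}$; the uniform fallback for unvisited state-action pairs is our assumption, not the paper's.

```python
import numpy as np

# Efron's bootstrap over the direct-method (empirical-MDP) estimator.
rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
nSA = nS * nA

pi = rng.dirichlet(np.ones(nA), size=nS)        # target policy
Pi = np.zeros((nSA, nS))
for s in range(nS):
    Pi[s * nA:(s + 1) * nA, s] = pi[s]

# Synthetic dataset of n i.i.d. tuples; columns: s0, s, a, r, s'.
n = 2000
sa = rng.integers(nSA, size=n)
data = np.column_stack([rng.integers(nS, size=n), sa // nA, sa % nA,
                        rng.uniform(size=n), rng.integers(nS, size=n)])

def direct_method(tuples):
    # Build (d0_hat, R_hat, T_hat) from tuples, then evaluate pi in the
    # resulting empirical MDP via the matrix form of rho.
    s0 = tuples[:, 0].astype(int)
    sa = tuples[:, 1].astype(int) * nA + tuples[:, 2].astype(int)
    r, s1 = tuples[:, 3], tuples[:, 4].astype(int)
    d0 = np.bincount(s0, minlength=nS) / len(tuples)
    R = np.zeros(nSA)
    T = np.full((nS, nSA), 1.0 / nS)            # uniform fallback (assumption)
    for j in range(nSA):
        m = sa == j
        if m.any():
            R[j] = r[m].mean()
            T[:, j] = np.bincount(s1[m], minlength=nS) / m.sum()
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(nSA) - gamma * Pi @ T, Pi @ d0)
    return d_pi @ R

# Resample the dataset with replacement and re-estimate.
estimates = np.array([direct_method(data[rng.integers(n, size=n)])
                      for _ in range(200)])
lo, hi = np.percentile(estimates, [2.5, 97.5])  # 95% percentile interval
```

The percentile interval here is the simplest choice; the asymptotic normality of the bootstrapped estimates is what the Hadamard differentiability argument above guarantees.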
A.3 Proof of Theorem 2
First, a brief sketch: if Assumption 2 does not hold, then for any ball around $P$, one may find a distribution $\tilde{P}$ near $P$ that places mass outside of the support of $P$, and this will cause discontinuities in $\rho$.
Now more concretely: Consider an MDP with state space . The MDP’s initial state distribution is . The MDP has a single action and the transition function is defined as,
(30)  
(31)  