1 Introduction
Many sequential decision problems, including diabetes treatment (Bastani, 2014), digital marketing (Theocharous et al., 2015), and robot control (Lillicrap et al., 2015), are modeled as Markov decision processes (MDPs) and solved using reinforcement learning (RL) algorithms. One important problem when applying RL to real problems is policy evaluation. The goal in policy evaluation is to estimate the expected return (sum of rewards) produced by a policy. We refer to this policy as the evaluation policy, $\pi_e$. The standard policy evaluation approach is to repeatedly deploy $\pi_e$ and average the resulting returns. While this naïve Monte Carlo estimator is unbiased, it may have high variance.
Methods that evaluate $\pi_e$ while selecting actions according to $\pi_e$ are termed on-policy. Previous work has addressed variance reduction for on-policy returns (Zinkevich et al., 2006; White & Bowling, 2009; Veness et al., 2011). An alternative approach is to estimate the performance of $\pi_e$ while following a different behavior policy, $\pi_b$. Methods that evaluate $\pi_e$ with data generated from $\pi_b$ are termed off-policy. Importance sampling (IS) is one standard approach for using off-policy data in RL. IS reweights returns observed while executing $\pi_b$ such that they are unbiased estimates of the performance of $\pi_e$.
Presently, IS is usually used when off-policy data is already available or when executing $\pi_e$ is impractical. If $\pi_b$ is not chosen carefully, IS often has high variance (Thomas et al., 2015). For this reason, an implicit assumption in the RL community has generally been that on-policy evaluation is more accurate when it is feasible. However, IS can also be used for variance reduction when done with an appropriately selected distribution of returns (Hammersley & Handscomb, 1964). While IS-based variance reduction has been explored in RL, this prior work has required knowledge of the environment’s transition probabilities and remains on-policy (Desai & Glynn, 2001; Frank et al., 2008; Ciosek & Whiteson, 2017). In contrast to this earlier work, we show how careful selection of the behavior policy can lead to lower-variance policy evaluation than using the evaluation policy, without requiring knowledge of the environment’s transition probabilities.
In this paper, we formalize the selection of $\pi_b$ as the behavior policy search problem. We introduce a method for this problem that adapts the policy parameters of $\pi_b$ with gradient descent on the variance of importance sampling. Empirically, we demonstrate that behavior policy search with our method lowers the mean squared error of estimates compared to on-policy estimates. To the best of our knowledge, this work is the first to propose adapting the behavior policy to obtain better policy evaluation in RL. Furthermore, we present the first method to address this problem.
2 Preliminaries
This section details the policy evaluation problem setting, the Monte Carlo and Advantage Sum on-policy methods, and importance sampling for off-policy evaluation.
2.1 Background
We use notational standard MDPNv1 (Thomas, 2015), and for simplicity, we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite.^{1} Let $H := (S_0, A_0, R_0, S_1, \ldots, S_L, A_L, R_L)$ be a trajectory and $g(H) := \sum_{t=0}^{L} \gamma^t R_t$ be the discounted return of trajectory $H$. Let $\rho(\pi) := \mathbf{E}[g(H) \mid H \sim \pi]$ be the expected discounted return when the stochastic policy $\pi$ is used from $S_0$ sampled from the initial state distribution. In this work, we consider parameterized policies, $\pi_\theta$, where the distribution over actions is determined by the vector $\theta$. We assume that the transitions and reward function are unknown and that $L$ is finite.
^{1} The methods and theoretical results discussed in this paper are applicable to both finite and infinite $\mathcal{S}$ and $\mathcal{A}$ as well as partially observable Markov decision processes.
We are given an evaluation policy, $\pi_e$, for which we would like to estimate $\rho(\pi_e)$. We assume there exists a policy parameter vector $\theta_e$ such that $\pi_e = \pi_{\theta_e}$ and that this vector is known. We consider an incremental setting where, at iteration $i$, we sample a single trajectory $H_i$ with a policy $\pi_{\theta_i}$ and add $\{H_i, \theta_i\}$ to a set $\mathcal{D}$. We use $\mathcal{D}_i$ to denote the set at iteration $i$. Methods that always (i.e., $\forall i$) choose $\theta_i = \theta_e$ are on-policy; otherwise, the method is off-policy. A policy evaluation method, PE, uses all trajectories in $\mathcal{D}_i$ to estimate $\rho(\pi_e)$. Our goal is to design a policy evaluation algorithm that produces estimates of $\rho(\pi_e)$ that have low mean squared error (MSE). Formally, the goal of policy evaluation with PE is to minimize $\mathbf{E}[(\mathrm{PE}(\mathcal{D}_i) - \rho(\pi_e))^2]$. While other measures of policy evaluation accuracy could be considered, we follow earlier work in using MSE (e.g., Thomas & Brunskill, 2016; Precup et al., 2000).
We focus on unbiased estimators of $\rho(\pi_e)$. While biased estimators (e.g., bootstrapping methods (Sutton & Barto, 1998), approximate models (Kearns & Singh, 2002), etc.) can sometimes produce lower MSE estimates, they are problematic for high-risk applications requiring confidence intervals. For unbiased estimators, minimizing variance is equivalent to minimizing MSE.
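The equivalence between variance and MSE for unbiased estimators follows from the standard bias-variance decomposition. The short derivation below is a sketch in assumed notation: $\hat{\rho}$ stands for any estimator and $\rho$ for the true expected return of the evaluation policy.

```latex
\begin{align*}
\mathrm{MSE}[\hat{\rho}]
  &= \mathbf{E}\big[(\hat{\rho} - \rho)^2\big] \\
  &= \underbrace{\mathbf{E}\big[(\hat{\rho} - \mathbf{E}[\hat{\rho}])^2\big]}_{\mathrm{Var}[\hat{\rho}]}
   + \underbrace{\big(\mathbf{E}[\hat{\rho}] - \rho\big)^2}_{\mathrm{Bias}[\hat{\rho}]^2}
\end{align*}
```

When $\mathbf{E}[\hat{\rho}] = \rho$, the second term vanishes and the MSE equals the variance.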
2.2 Monte Carlo Estimates
Perhaps the most commonly used policy evaluation method is the on-policy Monte Carlo (MC) estimator. The estimate of $\rho(\pi_e)$ at iteration $i$ is the average return:
$$\mathrm{MC}(\mathcal{D}_i) := \frac{1}{i+1} \sum_{j=0}^{i} g(H_j)$$
This estimator is unbiased and strongly consistent given mild assumptions.^{2} However, this method can have high variance.
^{2} Being a strongly consistent estimator of $\rho(\pi_e)$ means that $\Pr\left(\lim_{i \to \infty} \mathrm{MC}(\mathcal{D}_i) = \rho(\pi_e)\right) = 1$. If $\rho(\pi_e)$ exists, MC is strongly consistent by the Khintchine strong law of large numbers (Sen & Singer, 1993).
2.3 Advantage Sum Estimates
Like the Monte Carlo estimator, the advantage sum (ASE) estimator selects $\theta_i = \theta_e$ for all $i$. However, it introduces a control variate to reduce the variance without introducing bias. This control variate requires an approximate model of the MDP to be provided. Let the reward function of this model be given as $\hat{r}$. Let $\hat{q}^{\pi_e}$ and $\hat{v}^{\pi_e}$ be the action-value function and state-value function of $\pi_e$ in this approximate model. Then, the advantage sum estimator is given by:
Intuitively, ASE is replacing part of the randomness of the Monte Carlo return with the known expected return under the approximate model. Provided is sufficiently correlated with , the variance of ASE is less than that of MC.
Notice that, like the MC estimator, the ASE estimator is on-policy, in that the behavior policy is always the policy that we wish to evaluate. Intuitively, it may seem like this choice should be optimal. However, we will show that it is not: selecting behavior policies that are different from the evaluation policy can result in estimates of $\rho(\pi_e)$ that have lower variance.
2.4 Importance Sampling
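Importance sampling is a method for reweighting returns from a behavior policy, $\pi_\theta$, such that they are unbiased estimates of the return of the evaluation policy. In RL, the reweighted IS return of a trajectory, $H$, sampled from $\pi_\theta$ is:
$$\mathrm{IS}(H, \theta) := g(H) \prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_\theta(A_t \mid S_t)}$$
The IS off-policy estimator is then a Monte Carlo estimate of $\mathbf{E}[\mathrm{IS}(H, \theta) \mid H \sim \pi_\theta]$:
$$\mathrm{IS}(\mathcal{D}_i) := \frac{1}{i+1} \sum_{j=0}^{i} \mathrm{IS}(H_j, \theta_j)$$
In RL, importance sampling allows off-policy data to be used as if it were on-policy. In this case, the variance of the IS estimate is often much worse than the variance of on-policy MC estimates because the behavior policy is not chosen to minimize variance, but is a policy that is dictated by circumstance.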
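The reweighting described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the one-state, one-step problem, its rewards, and the names `make_policy` and `sample_trajectory` are hypothetical, and a trajectory is assumed to be a list of (state, action, reward) tuples.

```python
import random

def discounted_return(trajectory, gamma=1.0):
    """g(H): discounted sum of rewards over a list of (state, action, reward) steps."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))

def is_return(trajectory, pi_e, pi_b, gamma=1.0):
    """Importance-sampled return: g(H) times the product of per-step probability ratios."""
    weight = 1.0
    for s, a, _ in trajectory:
        weight *= pi_e(s, a) / pi_b(s, a)
    return weight * discounted_return(trajectory, gamma)

def is_estimate(trajectories, pi_e, behavior_policies, gamma=1.0):
    """Average IS return; trajectory j is assumed to come from behavior_policies[j]."""
    returns = [is_return(h, pi_e, pi_b, gamma)
               for h, pi_b in zip(trajectories, behavior_policies)]
    return sum(returns) / len(returns)

# Hypothetical toy problem: one state, one step, two actions with fixed rewards.
REWARD = {0: 1.0, 1: 10.0}

def make_policy(p_action_1):
    """Policy pi(s, a) that picks action 1 with probability p_action_1 in every state."""
    return lambda s, a: p_action_1 if a == 1 else 1.0 - p_action_1

def sample_trajectory(pi, rng):
    a = 1 if rng.random() < pi(0, 1) else 0
    return [(0, a, REWARD[a])]

pi_e = make_policy(0.1)   # evaluation policy
pi_b = make_policy(0.5)   # behavior policy that generated the data
rng = random.Random(0)
data = [sample_trajectory(pi_b, rng) for _ in range(20000)]
estimate = is_estimate(data, pi_e, [pi_b] * len(data))
true_value = 0.9 * REWARD[0] + 0.1 * REWARD[1]  # expected return of pi_e in this toy problem
print(estimate, true_value)
```

In this toy problem the behavior policy happens to be well matched to the returns, so the IS returns vary little; a poorly chosen behavior policy would instead inflate the variance, as discussed above.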
3 Behavior Policy Search
Importance sampling was originally intended as a variance reduction technique for Monte Carlo evaluation (Hammersley & Handscomb, 1964). When an evaluation policy rarely samples trajectories with high-magnitude returns, a Monte Carlo evaluation will have high variance. If a behavior policy can increase the probability of observing such trajectories, then the off-policy IS estimate will have lower variance than an on-policy Monte Carlo estimate. In this section, we first describe the theoretical potential for variance reduction with an appropriately selected behavior policy. In general, this policy will be unknown. Thus, we propose a policy evaluation subproblem, the behavior policy search problem, solutions to which will adapt the behavior policy to provide lower mean squared error policy performance estimates. To the best of our knowledge, we are the first to propose behavior policy adaptation for policy evaluation.
3.1 The Optimal Behavior Policy
An appropriately selected behavior policy can lower variance to zero. While this fact is generally known for importance sampling, we show here that this policy exists for any MDP and evaluation policy under two restrictive assumptions: all returns are positive and the domain is deterministic. In the following section, we describe how an initial policy can be adapted towards the optimal behavior policy even when these conditions fail to hold.
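Let $w_\pi(H) := \prod_{t=0}^{L} \pi(A_t \mid S_t)$. Consider a behavior policy $\pi_b^\star$ such that for any trajectory, $H$, the importance-sampled return equals the true expected return:
$$\rho(\pi_e) = g(H) \frac{w_{\pi_e}(H)}{w_{\pi_b^\star}(H)}$$
Rearranging the terms of this expression yields:
$$w_{\pi_b^\star}(H) = g(H) \frac{w_{\pi_e}(H)}{\rho(\pi_e)} \qquad (1)$$
Thus, if we can select $\pi_b^\star$ such that the probability of observing any $H \sim \pi_b^\star$ is $g(H) / \rho(\pi_e)$ times the likelihood of observing $H \sim \pi_e$, then the IS estimate has zero MSE with only a single sampled trajectory. Regardless of $H$, the importance-sampled return will equal $\rho(\pi_e)$.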
Furthermore, the policy $\pi_b^\star$ exists within the space of all feasible stochastic policies. Consider that a stochastic policy can be viewed as a mixture policy over time-dependent (i.e., action selection depends on the current time step) deterministic policies. For example, in an MDP with one state, two actions, and a horizon of $L$ there are $2^L$ possible time-dependent deterministic policies, each of which defines a unique sequence of actions. We can express any evaluation policy as a mixture of these deterministic policies. The optimal behavior policy can be expressed similarly and thus the optimal behavior policy exists.
Unfortunately, the optimal behavior policy depends on the unknown value $\rho(\pi_e)$ as well as the unknown reward function (via $g(H)$). Thus, while there exists an optimal behavior policy for IS, which is not $\pi_e$, in practice we cannot analytically determine $\pi_b^\star$. Additionally, $\pi_b^\star$ may not be representable by any $\theta$ in our policy class.
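The zero-variance property can be checked numerically on a small example. The sketch below assumes a hypothetical deterministic problem with one state, two actions, a horizon of two, strictly positive rewards, and no discounting; it enumerates every action sequence, builds the trajectory distribution given by Equation (1), and confirms that every importance-sampled return equals the true value. All constants are illustrative.

```python
from itertools import product

HORIZON = 2                  # hypothetical horizon
REWARD = {0: 1.0, 1: 5.0}    # hypothetical positive, deterministic rewards per action
PI_E = {0: 0.7, 1: 0.3}      # evaluation policy's action probabilities (single state)

def g(actions):
    """Undiscounted return of an action sequence."""
    return sum(REWARD[a] for a in actions)

def w_e(actions):
    """Probability of the action sequence under the evaluation policy."""
    p = 1.0
    for a in actions:
        p *= PI_E[a]
    return p

trajectories = list(product([0, 1], repeat=HORIZON))

# True expected return of the evaluation policy, computed by enumeration.
rho = sum(w_e(h) * g(h) for h in trajectories)

# Equation (1): optimal behavior probability of each trajectory.
w_b_star = {h: g(h) * w_e(h) / rho for h in trajectories}

# Importance-sampled return of each trajectory under the optimal behavior distribution.
is_returns = {h: g(h) * w_e(h) / w_b_star[h] for h in trajectories}

for h in trajectories:
    print(h, round(w_e(h), 4), round(w_b_star[h], 4), round(is_returns[h], 4))
```

Because the rewards are positive, the weights in `w_b_star` are non-negative and sum to one, so they form a valid distribution over trajectories. Computing them required `rho` and the reward function, which is exactly why this policy cannot be determined analytically in practice.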
3.2 The Behavior Policy Search Problem
Since the optimal behavior policy cannot be analytically determined, we instead propose the behavior policy search (BPS) problem for finding $\pi_b$ that lowers the MSE of estimates of $\rho(\pi_e)$. A BPS problem is defined by the inputs:
1. An evaluation policy $\pi_e$ with policy parameters $\theta_e$.
2. An off-policy policy evaluation algorithm, $\mathrm{OPE}(H, \theta)$, that takes a trajectory, $H \sim \pi_\theta$, or, alternatively, a set of trajectories, and returns an estimate of $\rho(\pi_e)$.
A BPS solution is a policy, $\pi_{\theta_b}$, such that off-policy estimates with OPE have lower MSE than on-policy estimates. Methods for this problem are BPS algorithms.
Recall we have formalized policy evaluation within an incremental setting where one trajectory for policy evaluation is generated each iteration. At the $i^{\text{th}}$ iteration, a BPS algorithm selects a behavior policy that will be used to generate a trajectory, $H_i$. The policy evaluation algorithm, OPE, then estimates $\rho(\pi_e)$ using trajectories in $\mathcal{D}_i$. Naturally, the selection of the behavior policy depends on how OPE estimates $\rho(\pi_e)$.
In a BPS problem, the $i^{\text{th}}$ iteration proceeds as follows. First, given all of the past behavior policies, $\{\pi_{\theta_j}\}_{j=0}^{i-1}$, and the resulting trajectories, $\{H_j\}_{j=0}^{i-1}$, the BPS algorithm must select $\theta_i$. The policy $\pi_{\theta_i}$ is then run for one episode to create the trajectory $H_i$. Then the BPS algorithm uses OPE to estimate $\rho(\pi_e)$ given the available data, $\mathcal{D}_i := \{(\theta_j, H_j)\}_{j=0}^{i}$. In this paper, we consider the one-step problem of selecting $\theta_i$ and estimating $\rho(\pi_e)$ at iteration $i$ in a way that minimizes MSE. That is, we do not consider how our selection of $\theta_i$ will impact our future ability to select an appropriate $\theta_j$ for $j > i$ and thus to produce more accurate estimates in the future.
One natural question is: if we are given a limit on the number of trajectories that can be sampled, is it better to “spend” some of our limited trajectories on BPS instead of using on-policy estimates? Since each $\mathrm{OPE}(H_i, \theta_i)$ is an unbiased estimator of $\rho(\pi_e)$, we can use all sampled trajectories to compute $\mathrm{OPE}(\mathcal{D}_i)$. Provided $\mathrm{Var}[\mathrm{OPE}(H_i, \theta_i)] \le \mathrm{Var}[\mathrm{MC}]$ for all iterations, then, in expectation, a BPS algorithm will always achieve lower MSE than MC, showing that it is, in fact, worthwhile to do so. This claim is supported by our empirical study.
4 Behavior Policy Gradient Theorem
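We now introduce our primary contributions: an analytic expression for the gradient of the mean squared error of the IS estimator and a stochastic gradient descent algorithm that adapts $\theta$ to minimize the MSE between the IS estimate and $\rho(\pi_e)$. Our algorithm, Behavior Policy Gradient (BPG), begins with on-policy estimates and adapts the behavior policy with gradient descent on the MSE with respect to $\theta$. The gradient of the MSE with respect to the policy parameters is given by the following theorem:
Theorem 1.
$$\frac{\partial}{\partial \theta} \mathrm{MSE}[\mathrm{IS}(H, \theta)] = \mathbf{E}\left[ -\mathrm{IS}(H, \theta)^2 \sum_{t=0}^{L} \frac{\partial}{\partial \theta} \log \pi_\theta(A_t \mid S_t) \right]$$
where $\mathrm{IS}(H, \theta)$ is the importance-sampled return of a trajectory $H$ generated by $\pi_\theta$ and the expectation is taken over $H \sim \pi_\theta$.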
Proof.
Proofs for all theoretical results are included in Appendix A. ∎
BPG uses stochastic gradient descent in place of exact gradient descent, replacing the intractable expectation in Theorem 1 with an unbiased estimate of the true gradient. In our experiments, we sample a batch, $B_i$, of $k$ trajectories with $\pi_{\theta_i}$ to lower the variance of the gradient estimate at iteration $i$. In the BPS setting, sampling a batch of trajectories is equivalent to holding $\theta$ fixed for $k$ iterations and then updating $\theta$ with the $k$ most recent trajectories used to compute the gradient estimate.
Full details of BPG are given in Algorithm 1. At iteration $i$, BPG samples a batch, $B_i$, of $k$ trajectories and adds the batch to a data set $\mathcal{D}$ (Lines 4-5). Then BPG updates $\theta$ with an empirical estimate of Theorem 1 (Line 6). After $n$ iterations, the estimate of $\rho(\pi_e)$ is $\mathrm{IS}(\mathcal{D}_n)$ as defined in Section 2.4.
Given that the step size, $\alpha_i$, is consistent with standard gradient descent convergence conditions, BPG will converge to a behavior policy that locally minimizes the variance (Bertsekas & Tsitsiklis, 2000). At best, BPG converges to the globally optimal behavior policy within the parameterization of $\pi_e$. Since the parameterization of $\pi_e$ determines the class of representable distributions, it is possible that the theoretically optimal behavior policy is unrepresentable under this parameterization. Nevertheless, a suboptimal behavior policy still yields better estimates of $\rho(\pi_e)$, provided it decreases variance compared to on-policy returns.
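The procedure described above can be sketched on a toy problem. The code below is an illustrative approximation under stated assumptions, not a reproduction of Algorithm 1: it uses a hypothetical one-step problem with two actions, a single-parameter sigmoid policy, and arbitrarily chosen batch size, step size, and iteration count. The update is a sample average of the per-trajectory term inside the expectation of Theorem 1.

```python
import math
import random

REWARD = {0: 1.0, 1: 10.0}        # hypothetical deterministic one-step rewards
THETA_E = math.log(0.1 / 0.9)     # evaluation policy takes action 1 with probability 0.1

def prob_action_1(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def action_prob(theta, a):
    p1 = prob_action_1(theta)
    return p1 if a == 1 else 1.0 - p1

def grad_log_prob(theta, a):
    """d/dtheta log pi_theta(a) for the sigmoid parameterization."""
    p1 = prob_action_1(theta)
    return 1.0 - p1 if a == 1 else -p1

def is_return(theta, a):
    """Importance-sampled return of a one-step trajectory generated by pi_theta."""
    return action_prob(THETA_E, a) / action_prob(theta, a) * REWARD[a]

def is_variance(theta):
    """Exact variance of the IS return when actions are sampled from pi_theta."""
    second = sum(action_prob(theta, a) * is_return(theta, a) ** 2 for a in (0, 1))
    mean = sum(action_prob(theta, a) * is_return(theta, a) for a in (0, 1))
    return second - mean ** 2

def bpg(num_iterations=500, batch_size=100, step_size=0.02, seed=0):
    rng = random.Random(seed)
    theta = THETA_E               # start on-policy
    all_is_returns = []
    for _ in range(num_iterations):
        grad = 0.0
        for _ in range(batch_size):
            a = 1 if rng.random() < prob_action_1(theta) else 0
            ret = is_return(theta, a)
            all_is_returns.append(ret)
            # Per-trajectory sample of the gradient in Theorem 1.
            grad += -(ret ** 2) * grad_log_prob(theta, a)
        theta -= step_size * grad / batch_size   # gradient descent on the MSE
    estimate = sum(all_is_returns) / len(all_is_returns)
    return theta, estimate

theta_b, estimate = bpg()
print("on-policy variance:", is_variance(THETA_E))
print("variance under learned behavior policy:", is_variance(theta_b))
print("IS estimate from all trajectories:", estimate)
```

Every sampled trajectory contributes to the final estimate, including those gathered while the behavior policy was still being adapted, which mirrors the incremental setting of Section 3.2. The closed-form `is_variance` is available only because the toy problem is fully known; it is used here purely to check the outcome.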
4.1 Control Variate Extension
In cases where an approximate model is available, we can further lower variance by adapting the behavior policy of the doubly robust estimator (Jiang & Li, 2016; Thomas & Brunskill, 2016). Based on a similar intuition as the Advantage Sum estimator (Section 2.3), the Doubly Robust (DR) estimator uses the value functions of an approximate model as a control variate to lower the variance of importance sampling.^{3} We show here that we can adapt the behavior policy to lower the mean squared error of DR estimates. We denote this new method DR-BPG for Doubly Robust Behavior Policy Gradient.
^{3} DR lowers the variance of per-decision importance sampling, which importance samples the per-time-step reward.
Let and recall that and are the state and action value functions of in the approximate model. The DR estimator is:
We can reduce the mean squared error of DR with gradient descent using unbiased estimates of the following corollary to Theorem 1:
Corollary 1.
where and the expectation is taken over .
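The first term of $\frac{\partial}{\partial \theta} \mathrm{MSE}[\mathrm{DR}(H, \theta)]$ is analogous to the gradient of the importance sampling estimate with $\mathrm{IS}(H, \theta)$ replaced by $\mathrm{DR}(H, \theta)$. The second term accounts for the covariance of the DR terms.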
AS and DR both assume access to a model; however, they make no assumption about where the model comes from except that it must be independent of the trajectories used to compute the final estimate. In practice, AS and DR perform best when all trajectories are used to estimate the model and then used to estimate $\rho(\pi_e)$ (Thomas & Brunskill, 2016). However, for DR-BPG, changes to the model change the surface of the MSE objective we seek to minimize, and thus DR-BPG will only converge once the model stops changing. In our experiments, we consider both a changing and a fixed model.
4.2 Connection to REINFORCE
BPG is closely related to existing work in policy gradient RL (cf. Sutton et al., 2000) and we draw connections between one such method and BPG to illustrate how BPG changes the distribution of trajectories. REINFORCE (Williams, 1992) attempts to maximize $\rho(\pi_\theta)$ through gradient ascent on $\theta$ using the following unbiased gradient of $\rho(\pi_\theta)$:
$$\frac{\partial}{\partial \theta} \rho(\pi_\theta) = \mathbf{E}\left[ g(H) \sum_{t=0}^{L} \frac{\partial}{\partial \theta} \log \pi_\theta(A_t \mid S_t) \,\middle|\, H \sim \pi_\theta \right]$$
Intuitively, REINFORCE increases the probability of all actions taken during $H$ as a function of $g(H)$. This update increases the probability of actions that lead to high-return trajectories. BPG can be interpreted as REINFORCE where the return of a trajectory is the square of its importance-sampled return. Thus BPG increases the probability of all actions taken along $H$ as a function of $\mathrm{IS}(H, \theta)^2$. The magnitude of $\mathrm{IS}(H, \theta)^2$ depends on two qualities of $H$:
1. $g(H)^2$ is large (i.e., a high-magnitude event).
2. $H$ is rare relative to its probability under the evaluation policy (i.e., $\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_\theta(A_t \mid S_t)}$ is large).
These two qualities demonstrate a balance in how BPG changes trajectory probabilities. Increasing the probability of a trajectory under $\pi_\theta$ will decrease $\mathrm{IS}(H, \theta)^2$, and so BPG increases the probability of a trajectory when $g(H)^2$ is large enough to offset the decrease in $\mathrm{IS}(H, \theta)^2$ caused by decreasing the importance weight.
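The parallel between the two updates can be made concrete with a small sketch. The code below assumes a hypothetical single-state problem with a softmax policy over action preferences; `reinforce_direction` and `bpg_direction` are illustrative names that return the per-trajectory ascent direction for REINFORCE and the per-trajectory descent direction on the MSE for BPG (the negative of the term in Theorem 1), which differ only in the scalar weight.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_softmax(prefs, action):
    """Gradient of log softmax(prefs)[action] with respect to the preferences."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def weighted_score(actions, weight, prefs):
    """weight * sum_t grad log pi_theta(A_t) for a single-state trajectory."""
    total = [0.0] * len(prefs)
    for a in actions:
        for i, g_i in enumerate(grad_log_softmax(prefs, a)):
            total[i] += weight * g_i
    return total

def reinforce_direction(actions, ret, theta):
    """REINFORCE: weight the score function by the return g(H)."""
    return weighted_score(actions, ret, theta)

def bpg_direction(actions, ret, theta, theta_e):
    """BPG: weight the score function by the squared importance-sampled return."""
    pi_b, pi_e = softmax(theta), softmax(theta_e)
    w = 1.0
    for a in actions:
        w *= pi_e[a] / pi_b[a]
    return weighted_score(actions, (w * ret) ** 2, theta)

# Hypothetical trajectory: one step, action 1, return 3, sampled on-policy.
theta = [0.0, 0.0]
print(reinforce_direction([1], 3.0, theta))
print(bpg_direction([1], 3.0, theta, theta_e=[0.0, 0.0]))
```

Both directions raise the preference for the sampled action; the BPG weight grows with the squared return and with the importance weight, matching the two qualities listed above.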
5 Empirical Study
This section presents an empirical study of variance reduction through behavior policy search. We design our experiments to answer the following questions:
1. Can behavior policy search with BPG reduce policy evaluation MSE compared to on-policy estimates in both tabular and continuous domains?
2. Does adapting the behavior policy of the Doubly Robust estimator with DR-BPG lower the MSE of the Advantage Sum estimator?
3. Does the rarity of actions that cause high-magnitude rewards affect the performance gap between BPG and Monte Carlo estimates?
5.1 Experimental Setup
We address our first experimental question by evaluating BPG in three domains. We briefly describe each domain here; full details are available in Appendix C.
The first domain is a 4x4 Gridworld. We obtain two evaluation policies by applying REINFORCE to this task, starting from a policy that selects actions uniformly at random. We then select one evaluation policy, $\pi_1$, from the early stages of learning (an improved policy but still far from converged) and one after learning has converged, $\pi_2$. We run all experiments once with $\pi_e = \pi_1$ and a second time with $\pi_e = \pi_2$.
Our second and third tasks are the continuous control Cartpole Swing Up and Acrobot tasks implemented within RLLAB (Duan et al., 2016). The evaluation policy in each domain is a neural network that maps the state to the mean of a Gaussian distribution. Policies are partially optimized with trust-region policy optimization (Schulman et al., 2015) applied to a randomly initialized policy.
5.2 Main Results
Gridworld Experiments
Figure 1 compares BPG to Monte Carlo for both Gridworld policies, $\pi_1$ and $\pi_2$. Our main point of comparison is the mean squared error (MSE) of both estimates at iteration over trials. For $\pi_1$, BPG significantly reduces the MSE of on-policy estimates (Figure 1(a)). For $\pi_2$, BPG also reduces MSE; however, the improvement is only marginal.
At the end of each trial we used the final behavior policy to collect more trajectories and estimate $\rho(\pi_1)$. In comparison to a Monte Carlo estimate with trajectories from $\pi_1$, MSE is 85.48% lower with this improved behavior policy. For $\pi_2$, the MSE is 31.02% lower. This result demonstrates that BPG can find behavior policies that substantially lower MSE.
To understand the disparity in performance between these two instances of policy evaluation, we plot the distribution of $g(H)$ under each $\pi_e$ (Figures 1(c) and 1(d)). These plots show the variance of $\pi_1$ to be much higher; it sometimes samples returns with twice the magnitude of any sampled by $\pi_2$. To quantify this difference, we also measure the variance of $\mathrm{IS}(H, \theta_i)$ as $\mathbf{E}[\mathrm{IS}(H, \theta_i)^2] - \mathbf{E}[\mathrm{IS}(H, \theta_i)]^2$, where the expectations are estimated with 10,000 trajectories. This evaluation is repeated 5 times per iteration and the reported variance is the mean over these evaluations. The decrease in variance for each policy is shown in Figure 1(e). The high initial variance means there is much more room for BPG to improve the behavior policy when $\pi_e$ is the partially optimized policy.
We also test the sensitivity of BPG to the learning rate parameter. A critical issue in the use of BPG is selecting the step size parameter $\alpha$. If $\alpha$ is set too high, we risk making too large an update to $\theta$, potentially stepping to a worse behavior policy. If we are too conservative, then it will take many iterations to obtain a noticeable improvement over Monte Carlo estimation. Figure 1(f) shows variance reduction for a number of different $\alpha$ values in the Gridworld domain. We found BPG in this domain was robust to a variety of step size values. We do not claim this result is representative for all problem domains; step size selection in the behavior policy search problem is an important area for future work.
Continuous Control
Figure 2 shows reduction of MSE on the Cartpole Swing Up and Acrobot domains. Again we see that BPG reduces MSE faster than Monte Carlo evaluation. In contrast to the discrete Gridworld experiment, this experiment demonstrates the applicability of BPG to the continuous control setting. While BPG significantly outperforms Monte Carlo evaluation in Cartpole Swing Up, the gap is much smaller in Acrobot. This result also demonstrates that BPG (and behavior policy search) remains applicable when the policy must generalize across different states.
5.3 Control Variate Extensions
In this section, we evaluate the combination of model-based control variates with behavior policy search. Specifically, we compare the AS estimator with Doubly Robust BPG (DR-BPG). In these experiments we use a 10x10 stochastic gridworld. The added stochasticity increases the difficulty of building an accurate model from trajectories.
Since these methods require a model, we construct this model in one of two ways. The first method uses all trajectories in $\mathcal{D}$ to build the model and then uses the same set to estimate $\rho(\pi_e)$ with ASE or DR. The second method uses trajectories from the first iterations to build the model and then fixes the model for the remaining iterations. For DR-BPG, behavior policy search starts at iteration under this second condition. We call the first method “update” and the second method “fixed.” The update method invalidates the theoretical guarantees of these methods but learns a more accurate model. In both instances, we build maximum likelihood tabular models.
Figure 3 demonstrates that combining BPG with a model-based control variate (DR-BPG) can lead to further reduction of MSE compared to the control variate alone (ASE). Specifically, with the fixed model, DR-BPG outperformed all other methods. DR-BPG using the update method for building the model performed competitively with ASE, although not statistically significantly better. We also evaluate the final learned behavior policy of the fixed model variant of DR-BPG. For a batch size of trajectories, the DR estimator with this behavior policy improves upon the ASE estimator with the same model by 56.9%.
For DR-BPG, estimating the model with all data still allowed steady progress towards lower variance. This result is interesting since a changing model changes the surface of our variance objective and thus gradient descent on the variance has no theoretical guarantees of convergence. Empirically, we observe that setting the learning rate for DR-BPG was more challenging for either model type. Thus, while we have shown BPG can be combined with control variates, more work is needed to produce a robust method.
5.4 Rareness of Event
Our final experiment aims to understand how the gap between on- and off-policy variance is affected by the probability of rare events. The intuition for why behavior policy search can lower the variance of on-policy estimates is that a well-selected behavior policy can cause rare and high-magnitude events to occur. We test this intuition by varying the probability of a rare, high-magnitude event and observing how this change affects the performance gap between on- and off-policy evaluation. For this experiment, we use a variant of the deterministic Gridworld where taking the UP action in the initial state (the upper left corner) causes a transition to the terminal state with a reward of . We use $\pi_1$ from our earlier Gridworld experiments but we vary $p$, the probability of choosing UP when in the initial state. So with probability $p$ the agent will receive a large reward and end the trajectory. We use a constant learning rate of for all values of $p$ and run BPG for iterations. We plot the relative decrease of the variance as a function of $p$ over 100 trials for each value of $p$. We use relative variance to normalize across problem instances. Note that under this measure, even when $p$ is close to $1$, the relative variance is not equal to zero because as $p$ approaches $1$ the initial variance also goes to zero.
This experiment illustrates that as the initial variance increases, the amount of improvement BPG can achieve increases. As $p$ becomes closer to $1$, the initial variance becomes closer to zero and BPG barely improves over the variance of Monte Carlo (in terms of absolute variance there is no improvement). When $\pi_e$ rarely takes the high-rewarding UP action ($p$ close to $0$), BPG improves policy evaluation by increasing the probability of this action. This experiment supports our intuition for why off-policy evaluation can outperform on-policy evaluation.
6 Related Work
Behavior policy search and BPG are closely related to existing work on adaptive importance sampling. While adaptive importance sampling has been studied in the estimation literature, we focus here on adaptive importance sampling for MDPs and Markov Reward Processes (i.e., an MDP with a fixed policy). Existing work on adaptive IS in RL has considered changing the transition probabilities to lower the variance of policy evaluation (Desai & Glynn, 2001; Frank et al., 2008) or lower the variance of policy gradient estimates (Ciosek & Whiteson, 2017). Since the transition probabilities are typically unknown in RL, adapting the behavior policy is a more general approach to adaptive IS. Ciosek and Whiteson also adapt the distribution of trajectories with gradient descent on the variance (Ciosek & Whiteson, 2017) with respect to parameters of the transition probabilities. The main focus of their work is increasing the probability of simulated rare events so that policy improvement can learn an appropriate response. In contrast, we address the problem of policy evaluation and differentiate with respect to the (known) policy parameters.
The cross-entropy method (CEM) is a general method for adaptive importance sampling. CEM attempts to minimize the Kullback-Leibler divergence between the current sampling distribution and the optimal sampling distribution. As discussed in Section 3.1, this optimal behavior policy only exists under a set of restrictive conditions. In contrast, we adapt the behavior policy by minimizing variance.
Other methods exist for lowering the variance of on-policy estimates. In addition to the control variate technique used by the Advantage Sum estimator (Zinkevich et al., 2006; White & Bowling, 2009), Veness et al. (2011) consider using common random numbers and antithetic variates to reduce the variance of roll-outs in Monte Carlo Tree Search (MCTS). These techniques require a model of the environment (as is typical for MCTS) and do not appear to be applicable to the general RL policy evaluation problem. BPG could potentially be applied to find a lower-variance roll-out policy for MCTS.
In this work we have focused on unbiased policy evaluation. When the goal is to minimize MSE, it is often permissible to use biased methods such as temporal difference learning (van Seijen & Sutton, 2014), model-based policy evaluation (Kearns & Singh, 2002; Strehl et al., 2009), or variants of weighted importance sampling (Precup et al., 2000). It may be possible to use similar ideas to BPG to reduce bias and variance, although this appears to be difficult since the bias contribution to the mean squared error is squared and thus any gradient involving bias requires knowledge of the estimator’s bias. We leave behavior policy search with biased off-policy methods to future work.
7 Discussion and Future Work
Our experiments demonstrate that behavior policy search with BPG can lower the variance of policy evaluation. One open question is characterizing the settings where adapting the behavior policy substantially improves over on-policy estimates. Towards answering this question, our Gridworld experiment showed that when $\pi_e$ has little variance, BPG can only offer marginal improvement. BPG increases the probability of observing rare events with a high magnitude. If the evaluation policy never sees such events, then there is little benefit to using BPG. However, in expectation and with an appropriately selected step size, BPG will never lower the data efficiency of policy evaluation.
It is also necessary that the evaluation policy contributes to the variance of the returns. If all variance is due to the environment, then it seems unlikely that BPG will offer much improvement. For example, Ciosek and Whiteson (2017) consider a variant of the Mountain Car task where the dynamics can trigger a rare event (independent of the action) in which rewards are multiplied by . No behavior policy adaptation can lower the variance due to this event.
One limitation of gradient-based BPS methods is the necessity of good step size selection. In theory, BPG can never lead to worse policy evaluation compared to on-policy estimates. In practice, a poorly selected step size may cause a step to a worse behavior policy at step $i$, which may increase the variance of the gradient estimate at step $i+1$. Future work could consider methods for adaptive step sizes, second-order methods, or natural behavior policy gradients.
One interesting direction for future work is incorporating behavior policy search into policy improvement. A similar idea was explored by Ciosek and Whiteson (2017), who used off-environment learning to improve the performance of policy gradient methods. The method presented in that work is limited to simulated environments with differential dynamics. Adapting the behavior policy is a potentially much more general approach.
8 Conclusion
We have introduced the behavior policy search problem in order to improve estimation of $\rho(\pi_e)$ for an evaluation policy $\pi_e$. We present a solution to this problem, Behavior Policy Gradient, which adapts the behavior policy with stochastic gradient descent on the variance of the importance sampling estimator. Experiments demonstrate that BPG lowers the mean squared error of estimates of $\rho(\pi_e)$ compared to on-policy estimates. We also demonstrate that BPG can further decrease the MSE of estimates in conjunction with a model-based control variate method.
9 Acknowledgements
We thank Daniel Brown and the anonymous reviewers for useful comments on the work and its presentation. This work has taken place in the Personal Autonomous Robotics Lab (PeARL) and Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. PeARL research is supported in part by NSF (IIS-1638107, IIS-1617639). LARG research is supported in part by NSF (CNS-1330072, CNS-1305287, IIS-1637736, IIS-1651089), ONR (21C184-01), AFOSR (FA9550-14-1-0087), Raytheon, Toyota, AT&T, and Lockheed Martin. Josiah Hanna is supported by an NSF Graduate Research Fellowship. Peter Stone serves on the Board of Directors of Cogitai, Inc. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.
References

Bastani (2014)
Bastani, Meysam.
Modelfree intelligent diabetes management using machine learning
. PhD thesis, Master’s thesis, Department of Computing Science, University of Alberta, 2014.  Bertsekas & Tsitsiklis (2000) Bertsekas, Dimitri P. and Tsitsiklis, John N. Gradient convergence in gradient methods with erros. 10:627–642, 2000.
 Ciosek & Whiteson (2017) Ciosek, Kamil and Whiteson, Shimon. OFFER: Offenvironment reinforcement learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.

 Desai & Glynn (2001) Desai, Paritosh Y. and Glynn, Peter W. Simulation in optimization and optimization in simulation: A Markov chain perspective on adaptive Monte Carlo algorithms. In Proceedings of the 33rd Conference on Winter Simulation, pp. 379–384. IEEE Computer Society, 2001.
 Duan et al. (2016) Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Frank et al. (2008) Frank, Jordan, Mannor, Shie, and Precup, Doina. Reinforcement learning in the presence of rare events. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 336–343. ACM, 2008.
 Hammersley & Handscomb (1964) Hammersley, J. M. and Handscomb, D. C. Monte Carlo Methods. Methuen & Co. Ltd., London, p. 40, 1964.
 Jiang & Li (2016) Jiang, Nan and Li, Lihong. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2–3):209–232, 2002.
 Lillicrap et al. (2015) Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 759–766, 2000.
 Schulman et al. (2015) Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
 Sen & Singer (1993) Sen, P.K. and Singer, J.M. Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, 1993.
 Strehl et al. (2009) Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
 Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
 Sutton et al. (2000) Sutton, Richard S., McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS), 2000.
 Theocharous et al. (2015) Theocharous, Georgios, Thomas, Philip S., and Ghavamzadeh, Mohammad. Personalized ad recommendation systems for lifetime value optimization with guarantees. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1806–1812, 2015.
 Thomas (2015) Thomas, Philip S. A notation for Markov decision processes. ArXiv, arXiv:1512.09075v1, 2015.
 Thomas & Brunskill (2016) Thomas, Philip S. and Brunskill, Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Thomas et al. (2015) Thomas, Philip S., Theocharous, Georgios, and Ghavamzadeh, Mohammad. High confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2015.
 van Seijen & Sutton (2014) van Seijen, Harm and Sutton, Richard S. True online TD(λ). In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 14, pp. 692–700, 2014.
 Veness et al. (2011) Veness, J., Lanctot, M., and Bowling, M. Variance reduction in Monte-Carlo tree search. In Proceedings of the 24th Conference on Neural Information Processing Systems, pp. 1836–1844, 2011.
 White & Bowling (2009) White, M. and Bowling, M. Learning a value analysis tool for agent evaluation. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pp. 1976–1981, 2009.
 Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
 Zinkevich et al. (2006) Zinkevich, M., Bowling, M., Bard, N., Kan, M., and Billings, D. Optimal unbiased estimators for evaluating agent performance. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 573–578, 2006.
Appendix A Proof of Theorem 1
In Appendix A, we give the full derivation of our primary theoretical contribution: the importance-sampling (IS) variance gradient. We also present the variance gradient for the doubly robust (DR) estimator.
We first derive an analytic expression for the gradient of the variance of an arbitrary, unbiased off-policy policy evaluation (OPE) estimator. Importance sampling is one such off-policy policy evaluation estimator. From our general derivation we derive the gradient of the variance of the IS estimator and then extend it to the DR estimator.
A.1 Variance Gradient of an Unbiased Off-Policy Policy Evaluation Method
We first present a lemma from which the variance gradients of both the IS and DR estimators can be derived.
Lemma 1 gives the gradient of the mean squared error (MSE) of an unbiased offpolicy policy evaluation method.
Lemma 1.
Proof.
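We begin by decomposing $\Pr(H \mid \pi)$ into two components: one that depends on $\pi$ and one that does not. Let
$$w_\pi(H) := \prod_{t=0}^{L} \pi(A_t \mid S_t)$$
and
$$p(H) := \frac{\Pr(H \mid \pi)}{w_\pi(H)}$$
for any $\pi$ such that $H \in \operatorname{supp}(\pi)$ (any such $\pi$ will result in the same value of $p(H)$). These two definitions mean that $\Pr(H \mid \pi) = p(H)\, w_\pi(H)$.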
The MSE of the OPE estimator is given by:
(2) 
Since the OPE estimator is unbiased, i.e., $\mathbf{E}[\mathrm{OPE}(H,\theta) \mid H \sim \pi_\theta] = \rho(\pi_e)$, the second term is zero and so:
(3)  
(4)  
(5) 
To obtain the gradient, we differentiate $\mathrm{MSE}[\mathrm{OPE}(H,\theta)]$ with respect to $\theta$:
(6)  
(7)  
(8)  
(9)  
(10) 
Consider the last factor of the last term in more detail:
(11)  
(12)  
(13) 
where (a) comes from the multi-factor product rule. Continuing from (10) we have that:
(14) 
∎
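As a reading aid, the argument of Lemma 1 can be condensed into a few lines. This is a sketch under assumed notation rather than a transcription of the numbered steps above: $\theta$ denotes the behavior policy parameters, $\rho(\pi_e)$ the expected return of the evaluation policy, $w_\theta(h) = \prod_t \pi_\theta(a_t \mid s_t)$ the policy-dependent factor of the trajectory probability, and $p(h)$ the policy-independent factor. Trajectories are treated as discrete so that expectations are sums.

```latex
\begin{align*}
\mathrm{MSE}[\mathrm{OPE}(H,\theta)]
  &= \mathbf{E}\left[\mathrm{OPE}(H,\theta)^2 \mid H \sim \pi_\theta\right] - \rho(\pi_e)^2
  && \text{(unbiasedness)} \\
  &= \sum_{h} p(h)\, w_\theta(h)\, \mathrm{OPE}(h,\theta)^2 - \rho(\pi_e)^2, \\
\frac{\partial}{\partial\theta}\mathrm{MSE}[\mathrm{OPE}(H,\theta)]
  &= \sum_{h} p(h)\left[\frac{\partial w_\theta(h)}{\partial\theta}\,\mathrm{OPE}(h,\theta)^2
     + w_\theta(h)\,\frac{\partial}{\partial\theta}\mathrm{OPE}(h,\theta)^2\right], \\
\frac{\partial w_\theta(h)}{\partial\theta}
  &= w_\theta(h)\sum_{t}\frac{\partial}{\partial\theta}\log\pi_\theta(a_t \mid s_t)
  && \text{(product rule, likelihood ratio)} \\
\frac{\partial}{\partial\theta}\mathrm{MSE}[\mathrm{OPE}(H,\theta)]
  &= \mathbf{E}\left[\mathrm{OPE}(H,\theta)^2\sum_{t}\frac{\partial}{\partial\theta}\log\pi_\theta(A_t \mid S_t)
     + \frac{\partial}{\partial\theta}\mathrm{OPE}(H,\theta)^2 \,\middle|\, H\sim\pi_\theta\right].
\end{align*}
```

The term $\rho(\pi_e)^2$ does not depend on $\theta$, which is why it vanishes from the gradient.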
A.2 Behavior Policy Gradient Theorem
We now use Lemma 1 to prove the Behavior Policy Gradient Theorem, which is our main theoretical contribution.
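Theorem 1.
$$\frac{\partial}{\partial\theta}\mathrm{MSE}[\mathrm{IS}(H,\theta)] = \mathbf{E}\left[-\mathrm{IS}(H,\theta)^2 \sum_{t=0}^{L}\frac{\partial}{\partial\theta}\log\pi_\theta(A_t \mid S_t)\right],$$
where the expectation is taken over $H \sim \pi_\theta$.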
Proof.
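We first derive $\frac{\partial}{\partial\theta}\mathrm{IS}(H,\theta)^2$. Theorem 1 then follows directly from using $\mathrm{IS}(H,\theta)$ as $\mathrm{OPE}(H,\theta)$ in Lemma 1.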
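where (a) comes from the multi-factor product rule and using the likelihood-ratio trick (i.e., $\frac{\partial}{\partial\theta}\pi_\theta(A \mid S) = \pi_\theta(A \mid S)\,\frac{\partial}{\partial\theta}\log\pi_\theta(A \mid S)$).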
Substituting this expression into Lemma 1 completes the proof:
∎
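The Behavior Policy Gradient Theorem can be checked numerically on a toy problem. The sketch below assumes a one-step problem with a softmax behavior policy, a deterministic return per action, and hypothetical values for `theta`, `pi_e`, and `returns`. It is not the experimental setup of the paper. Because the action set is small, the expectation is computed exactly by enumeration and compared against a finite-difference gradient of the IS variance.

```python
import numpy as np

def softmax(theta):
    """Softmax distribution over actions for a parameter vector theta."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def is_variance(theta, pi_e, returns):
    """Exact variance of the IS estimator when actions are drawn from softmax(theta)."""
    pi_b = softmax(theta)
    is_vals = returns * pi_e / pi_b          # IS estimate for each possible action
    rho = np.sum(pi_e * returns)             # expected return of the evaluation policy
    return np.sum(pi_b * is_vals ** 2) - rho ** 2

def bpg_gradient(theta, pi_e, returns):
    """Exact E[-IS^2 * d/dtheta log pi_theta(A)] with A ~ softmax(theta)."""
    pi_b = softmax(theta)
    is_vals = returns * pi_e / pi_b
    score = np.eye(len(theta)) - pi_b        # row a: gradient of log pi_theta(a)
    return (pi_b * -(is_vals ** 2)) @ score

def finite_difference_gradient(theta, pi_e, returns, eps=1e-6):
    """Central finite-difference gradient of the exact IS variance."""
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (is_variance(theta + step, pi_e, returns)
                   - is_variance(theta - step, pi_e, returns)) / (2 * eps)
    return grad

# Hypothetical toy values, chosen only for illustration.
theta = np.array([0.2, -0.1, 0.4])
pi_e = np.array([0.5, 0.3, 0.2])
returns = np.array([1.0, -2.0, 3.0])

analytic = bpg_gradient(theta, pi_e, returns)
numeric = finite_difference_gradient(theta, pi_e, returns)
```

In this toy setting the two gradients agree up to finite-difference error, and a small step along the negative gradient lowers the exact IS variance.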
A.3 Doubly Robust Estimator
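Our final theoretical result is a corollary to the Behavior Policy Gradient Theorem: an extension of the IS variance gradient to the doubly robust (DR) estimator. Recall that for a single trajectory DR is given as:
$$\mathrm{DR}(H,\theta) := \hat{v}(S_0) + \sum_{t=0}^{L}\gamma^t w_t \left(R_t - \hat{q}(S_t,A_t) + \gamma\hat{v}(S_{t+1})\right),$$
where $\hat{v}$ is the state-value function of $\pi_e$ under an approximate model, $\hat{q}$ is the action-value function of $\pi_e$ under the model, and $w_t := \prod_{j=0}^{t}\frac{\pi_e(A_j \mid S_j)}{\pi_\theta(A_j \mid S_j)}$.
The gradient of the mean squared error of the DR estimator is given by the following corollary to the Behavior Policy Gradient Theorem:
Corollary 1.
$$\frac{\partial}{\partial\theta}\mathrm{MSE}[\mathrm{DR}(H,\theta)] = \mathbf{E}\left[\mathrm{DR}(H,\theta)^2\sum_{t=0}^{L}\frac{\partial}{\partial\theta}\log\pi_\theta(A_t \mid S_t) - 2\,\mathrm{DR}(H,\theta)\sum_{t=0}^{L}\gamma^t\delta_t w_t\sum_{i=0}^{t}\frac{\partial}{\partial\theta}\log\pi_\theta(A_i \mid S_i)\right],$$
where $\delta_t := R_t - \hat{q}(S_t,A_t) + \gamma\hat{v}(S_{t+1})$ and the expectation is taken over $H \sim \pi_\theta$.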
Proof.
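As with Theorem 1, we first derive $\frac{\partial}{\partial\theta}\mathrm{DR}(H,\theta)^2$. Corollary 1 then follows directly from using $\mathrm{DR}(H,\theta)$ as $\mathrm{OPE}(H,\theta)$ in Lemma 1.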
Thus the gradient is:
∎
The expression for the DR behavior policy gradient is more complex than the expression for the IS behavior policy gradient. Lowering the variance of DR involves accounting for the covariance of the sum of terms. Intuitively, accounting for the covariance increases the complexity of the expression for the gradient.
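To make the single-trajectory DR estimator concrete, the following sketch computes it from per-step quantities. It assumes the common form of the estimator (Jiang & Li, 2016; Thomas & Brunskill, 2016): the model value of the start state plus importance-weighted, discounted model residuals. The function name and the array layout are illustrative choices, not taken from the paper.

```python
def doubly_robust(rewards, ratios, q_hat, v_hat, gamma=1.0):
    """Doubly robust estimate of the return from one trajectory.

    rewards[t]: observed reward R_t
    ratios[t]:  pi_e(A_t | S_t) / pi_b(A_t | S_t)
    q_hat[t]:   model action value for (S_t, A_t)
    v_hat[t]:   model state value for S_t; must have one more entry than
                rewards, the last being the value after the final step
    """
    estimate = v_hat[0]
    weight = 1.0
    for t, reward in enumerate(rewards):
        weight *= ratios[t]                          # cumulative importance weight w_t
        residual = reward - q_hat[t] + gamma * v_hat[t + 1]
        estimate += (gamma ** t) * weight * residual
    return estimate
```

Two limiting cases help when reading the gradient expression above. With a zero model and equal policies the estimate is the discounted return, and with a model that predicts every step exactly the residuals vanish and the estimate equals the model value of the start state, whatever the importance weights are.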
Appendix B BPG’s Off-Policy Estimates are Unbiased
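This appendix proves that BPG’s estimate is an unbiased estimate of $\rho(\pi_e)$. If only trajectories from a single behavior policy $\pi_{\theta_i}$ were used, then the IS estimate would clearly be an unbiased estimate of $\rho(\pi_e)$. The difficulty is that BPG’s estimate at a given iteration depends on the behavior policy parameters $\theta_i$ of all iterations so far, and each $\theta_i$ is not independent of the others. Nevertheless, we prove here that BPG produces an unbiased estimate of $\rho(\pi_e)$ at each iteration. Specifically, we will show that the IS estimate at each iteration is an unbiased estimate of $\rho(\pi_e)$, where the estimate is conditioned on the initial behavior policy parameters. To make the dependence of $\theta_i$ on $\theta_{i-1}$ explicit, we will write $\theta_i$ as a function of $\theta_{i-1}$ and of the trajectories sampled from $\pi_{\theta_{i-1}}$. We use $\Pr(h \mid \theta)$ as shorthand for $\Pr(H = h \mid \theta)$.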
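The core of the unbiasedness argument can be sketched with the law of total expectation. The notation here is assumed for illustration: $\theta_i$ is the (random) behavior policy parameter vector at iteration $i$ and $H_i \sim \pi_{\theta_i}$ is a trajectory sampled at that iteration. This condensed argument is not the authors' full proof. It also assumes that every $\pi_{\theta_i}$ assigns nonzero probability to each action that $\pi_e$ can take, so that IS is unbiased for any fixed $\theta_i$.

```latex
\begin{align*}
\mathbf{E}\left[\mathrm{IS}(H_i,\theta_i)\right]
  &= \mathbf{E}\left[\,\mathbf{E}\left[\mathrm{IS}(H_i,\theta_i) \mid \theta_i\right]\right]
  && \text{(law of total expectation)} \\
  &= \mathbf{E}\left[\rho(\pi_e)\right]
  && \text{(IS is unbiased for any fixed } \theta_i\text{)} \\
  &= \rho(\pi_e).
\end{align*}
```

Averaging such estimates over iterations preserves unbiasedness by linearity of expectation, even though the estimates are not independent.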
Notice that, even though BPG’s off-policy estimates at each iteration are unbiased, they are not statistically independent. This means that concentration inequalities, like Hoeffding’s inequality, cannot be applied directly. We conjecture that the conditional independence properties of BPG (specifically that $H_i$ is independent of $H_{i-1}$ given $\theta_i$) are sufficient for Hoeffding’s inequality to be applicable.
Appendix C Supplemental Experiment Description
This appendix contains experimental details in addition to the details contained in Section 5 of the paper.
Gridworld:
This domain is a 4x4 Gridworld with a terminal state with reward at , a state with reward at , a state with reward at , and all other states having reward . The action set contains the four cardinal directions and actions move the agent in its intended direction (except when moving into a wall which produces no movement). The agent begins in (0,0), , and . All policies use softmax action selection with temperature where the probability of taking an action in a state is given by:
We obtain two evaluation policies by applying REINFORCE to this task, starting from a policy that selects actions uniformly at random. We then select one evaluation policy from the early stages of learning (an improved policy, but still far from converged) and one after learning has converged. We run our set of experiments once with each of the two evaluation policies. The ground truth value of the expected return is computed with value iteration for both evaluation policies.
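The softmax action selection used by the Gridworld policies can be sketched as follows. The tabular parameterization (one preference per state-action pair, stored in a hypothetical array `theta`) and the default temperature are assumptions made for illustration.

```python
import numpy as np

def softmax_policy(theta, state, temperature=1.0):
    """Action probabilities for `state` under softmax action selection.

    theta: array of shape (num_states, num_actions) holding action preferences
    temperature: larger values flatten the distribution toward uniform
    """
    prefs = theta[state] / temperature
    prefs = prefs - prefs.max()      # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()
```

With all preferences equal, the policy selects the four actions uniformly at random, which corresponds to the starting point for REINFORCE described above.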
Stochastic Gridworld:
The layout of this Gridworld is identical to the deterministic Gridworld except the terminal state is at and the reward state is at . When the agent moves, it moves in its intended direction with probability 0.9, otherwise it goes left or right with equal probability. Noise in the environment increases the difficulty of building an accurate model from trajectories.
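The noisy transition rule can be sketched as a function that returns the distribution over next states. Two points are assumptions: "left or right" is read as relative to the intended direction, and states are hypothetical `(row, col)` pairs on the 4x4 grid. Rewards and terminal states are omitted.

```python
GRID_SIZE = 4
# Offsets as (row, col); the list below orders the actions clockwise.
ACTIONS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
CLOCKWISE = ["up", "right", "down", "left"]

def move(state, action):
    """Deterministic move; moving into a wall produces no movement."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE:
        return (new_row, new_col)
    return state

def transition_distribution(state, action, p_intended=0.9):
    """Map each reachable next state to its probability."""
    idx = CLOCKWISE.index(action)
    left_of = CLOCKWISE[(idx - 1) % 4]     # direction 90 degrees to the left
    right_of = CLOCKWISE[(idx + 1) % 4]    # direction 90 degrees to the right
    p_slip = (1.0 - p_intended) / 2.0
    dist = {}
    for direction, prob in ((action, p_intended), (left_of, p_slip), (right_of, p_slip)):
        next_state = move(state, direction)
        dist[next_state] = dist.get(next_state, 0.0) + prob
    return dist
```

Probabilities of outcomes that lead to the same cell, for example two blocked moves in a corner, are merged into a single entry.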
Continuous Control:
We evaluate BPG on two continuous control tasks: Cartpole Swing Up and Acrobot. Both tasks are implemented within RLLAB (Duan et al., 2016) (full details of the tasks are given in Appendix 1.1). The single task modification we make is that in Cartpole Swing Up, when a trajectory terminates due to moving out of bounds we give a penalty of . This modification increases the variance of . We use and . Policies are represented as conditional Gaussians with mean determined by a neural network with two hidden layers of 32 tanh units each and a state-independent diagonal covariance matrix. In Cartpole Swing Up, the evaluation policy was learned with 10 iterations of the TRPO algorithm (Schulman et al., 2015) applied to a randomly initialized policy. In Acrobot, the evaluation policy was learned with 60 iterations. The ground truth value of the expected return in both domains is computed with 1,000,000 Monte Carlo rollouts.
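The policy class described above, a conditional Gaussian with a two-hidden-layer tanh network for the mean and a state-independent diagonal covariance, can be sketched in NumPy. The initialization scale, the linear output layer, and all function names are assumptions for illustration. This is not the RLLAB implementation.

```python
import numpy as np

def init_policy(state_dim, action_dim, hidden=32, seed=0):
    """Randomly initialize a Gaussian policy with two tanh hidden layers."""
    rng = np.random.default_rng(seed)
    sizes = [state_dim, hidden, hidden, action_dim]
    weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    log_std = np.zeros(action_dim)           # state-independent diagonal covariance
    return {"weights": weights, "biases": biases, "log_std": log_std}

def mean_action(params, state):
    """Mean of the action distribution: tanh hidden layers, linear output."""
    h = state
    for W, b in zip(params["weights"][:-1], params["biases"][:-1]):
        h = np.tanh(h @ W + b)
    return h @ params["weights"][-1] + params["biases"][-1]

def log_prob(params, state, action):
    """Log density of `action` under the diagonal Gaussian policy."""
    mu = mean_action(params, state)
    log_std = params["log_std"]
    z = (action - mu) / np.exp(log_std)
    return float(-0.5 * np.sum(z ** 2) - np.sum(log_std)
                 - 0.5 * len(mu) * np.log(2.0 * np.pi))
```

The importance weights used by IS are ratios of such densities, so in practice they are computed as exponentiated differences of summed log probabilities along a trajectory.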
Domain Independent Details
In all experiments we subtract a constant control variate (or baseline) in the gradient estimate from Theorem 1. The baseline is and our new gradient estimate is:
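Adding or subtracting a constant does not change the gradient in expectation since the expected value of the score function, $\mathbf{E}\left[\sum_{t}\frac{\partial}{\partial\theta}\log\pi_\theta(A_t \mid S_t)\right]$, is zero. BPG with a baseline has lower variance so that the estimated gradient is closer in direction to the true gradient.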
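A batch version of the gradient estimate with a constant baseline can be sketched as below. The sign convention (subtracting the baseline from the per-trajectory weight $-\mathrm{IS}^2$) and the function names are assumptions. The helper `expected_score` verifies, for a softmax policy over a few actions, that the expected score is zero, which is why the baseline leaves the expected gradient unchanged.

```python
import numpy as np

def batch_gradient_estimate(is_vals, score_sums, baseline=0.0):
    """Average of (-IS^2 - baseline) * sum_t grad log pi_theta(A_t | S_t).

    is_vals:    shape (m,), IS estimate for each of m trajectories
    score_sums: shape (m, d), summed score-function vectors for each trajectory
    """
    is_vals = np.asarray(is_vals, dtype=float)
    score_sums = np.asarray(score_sums, dtype=float)
    weights = -(is_vals ** 2) - baseline
    return np.mean(weights[:, None] * score_sums, axis=0)

def expected_score(pi_b):
    """Exact E[grad log pi(A)] for a softmax policy with action probabilities pi_b."""
    pi_b = np.asarray(pi_b, dtype=float)
    score = np.eye(len(pi_b)) - pi_b     # row a: gradient of log pi(a)
    return pi_b @ score
```

In a finite batch the sampled scores do not average exactly to zero, so the baseline does change individual estimates. It changes only their variance, not their expectation.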
We use batch sizes of trajectories per iteration for Gridworld experiments and size for the continuous control tasks. The stepsize parameter was determined by a sweep over
Early Stopping Criterion
In all experiments we run BPG for a fixed number of iterations. In general, BPS can continue for a fixed number of iterations or until the variance of the IS estimator stops decreasing. The true variance is unknown but can be estimated by sampling a set of trajectories with the current behavior policy and computing the uncentered variance, i.e., the mean of the squared IS estimates. This measure can be used to empirically evaluate the quality of each behavior policy or to determine when a BPS algorithm should terminate behavior policy improvement.
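The stopping rule described above can be sketched in a few lines. The uncentered variance is taken to be the mean of the squared IS estimates. The `patience` rule in `should_stop` is a hypothetical criterion added for illustration and is not part of the paper.

```python
import numpy as np

def uncentered_variance(is_estimates):
    """Mean of squared IS estimates from trajectories of one behavior policy."""
    is_estimates = np.asarray(is_estimates, dtype=float)
    return float(np.mean(is_estimates ** 2))

def should_stop(history, patience=3):
    """Stop when the last `patience` values do not improve on the earlier best.

    history: uncentered-variance estimates, one per BPS iteration, oldest first
    """
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) >= best_before
```

The uncentered variance differs from the variance of an unbiased estimator only by the squared expected return, which does not depend on the behavior policy, so comparing uncentered values across iterations ranks behavior policies the same way.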