1 Introduction
Reinforcement learning (RL) algorithms based on the policy gradient theorem (Sutton et al., 2000; Marbach and Tsitsiklis, 2001) have recently enjoyed great success in various domains, e.g., achieving human-level performance on Atari games (Mnih et al., 2016). The original policy gradient theorem is on-policy and used to optimize the on-policy objective. However, in many cases, we would prefer to learn off-policy to improve data efficiency (Lin, 1992) and exploration (Osband et al., 2018). To this end, the Off-Policy Policy Gradient (OPPG) Theorem (Degris et al., 2012; Maei, 2018; Imani et al., 2018) was developed and has been widely used (Silver et al., 2014; Lillicrap et al., 2015; Wang et al., 2016; Gu et al., 2017; Ciosek and Whiteson, 2017; Espeholt et al., 2018).
Ideally, an off-policy algorithm should optimize the off-policy analogue of the on-policy objective. In the continuing RL setting, this analogue would be the performance of the target policy in expectation w.r.t. the stationary distribution of the target policy, which is referred to as the alternative life objective (White, 2018; Ghiassian et al., 2018). This objective corresponds to the performance of the target policy when deployed. However, OPPG optimizes a different objective, the performance of the target policy in expectation w.r.t. the stationary distribution of the behavior policy. This objective is referred to as the excursion objective (White, 2018; Ghiassian et al., 2018), as it corresponds to the excursion setting (Sutton et al., 2016). Unfortunately, the excursion objective can be misleading about the performance of the target policy when deployed, as we illustrate in Section 3.
It is infeasible to optimize the alternative life objective directly in the off-policy continuing setting. Instead, we propose to optimize the counterfactual objective, which approximates the alternative life objective. In the excursion setting, an agent in the stationary distribution of the behavior policy considers a hypothetical excursion that follows the target policy. The return from this hypothetical excursion is an indicator of the performance of the target policy. The excursion objective measures this return w.r.t. the stationary distribution of the behavior policy, using samples generated by executing the behavior policy. By contrast, evaluating the alternative life objective requires samples from the stationary distribution of the target policy, to which the agent does not have access. In the counterfactual objective, we use a new parameter γ̂ ∈ [0, 1] to control how counterfactual the objective is, akin to Gelada and Bellemare (2019). With γ̂ = 0, the counterfactual objective uses the stationary distribution of the behavior policy to measure the performance of the target policy, recovering the excursion objective. With γ̂ = 1, the counterfactual objective is fully decoupled from the behavior policy and uses the stationary distribution of the target policy to measure the performance of the target policy, recovering the alternative life objective. As in the excursion objective, the excursion is never actually executed and the agent always follows the behavior policy.
Our contributions are threefold. First, we introduce the counterfactual objective. We motivate this objective empirically with an example MDP that highlights the difference between the alternative life objective and the excursion objective. We also motivate it theoretically, by proving that the counterfactual objective can recover both the excursion objective and the alternative life objective smoothly by manipulating γ̂. Second, we prove the Generalized Off-Policy Policy Gradient (GOPPG) Theorem, which gives the policy gradient of the counterfactual objective. Third, using an emphatic approach (Sutton et al., 2016) to compute an unbiased sample of this policy gradient, we develop the Generalized Off-Policy Actor-Critic (GeoffPAC) algorithm. We evaluate GeoffPAC empirically in challenging robot simulation tasks with neural network function approximators. GeoffPAC significantly outperforms the actor-critic algorithms proposed by Degris et al. (2012); Imani et al. (2018), and to our best knowledge, GeoffPAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

2 Background
We use a time-indexed capital letter (e.g., S_t) to denote a random variable. We use a bold capital letter (e.g., X) to denote a matrix and a bold lowercase letter (e.g., x) to denote a column vector. If f is a scalar function defined on a finite set, we use its corresponding bold lowercase letter f to denote its vector form. We use I to denote the identity matrix and 1 to denote an all-one column vector.

We consider an infinite-horizon MDP (Puterman, 2014) consisting of a finite state space S, a finite action space A, a bounded reward function r: S × A × S → R, and a transition kernel p. We consider a transition-based discount function γ: S × A × S → [0, 1] (White, 2017) for unifying continuing tasks and episodic tasks. At time step t, an agent at state S_t takes an action A_t according to a policy π(·|S_t). The agent then proceeds to a new state S_{t+1} according to p(·|S_t, A_t) and gets a reward R_{t+1} satisfying E[R_{t+1}] = r(S_t, A_t, S_{t+1}). The return of π at time step t is G_t ≜ R_{t+1} + γ_{t+1} G_{t+1}, where γ_{t+1} ≜ γ(S_t, A_t, S_{t+1}). We use v_π to denote the value function of π, which is defined as v_π(s) ≜ E_π[G_t | S_t = s]. Like White (2017), we assume v_π exists for all π. We use q_π to denote the state-action value function of π. We use P_π to denote the transition matrix induced by π, i.e., [P_π]_{s,s'} ≜ Σ_a π(a|s) p(s'|s, a). We assume the chain induced by π is ergodic and use d_π to denote the corresponding stationary distribution. We define D_π ≜ diag(d_π).
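For a small MDP, the stationary distribution d_π just defined can be computed exactly by solving d_π^T P_π = d_π^T together with the normalization Σ_s d_π(s) = 1. A minimal numpy sketch (the transition matrix below is an arbitrary illustration, not from the paper):

```python
import numpy as np

def stationary_distribution(P):
    """Solve d^T P = d^T with sum(d) = 1 for an ergodic transition matrix P."""
    n = P.shape[0]
    # Stack (P^T - I) d = 0 with the normalization 1^T d = 1, solve by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])   # arbitrary ergodic chain for illustration
d_pi = stationary_distribution(P_pi)
```

For this chain the balance equation gives d_π = (5/6, 1/6).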
In the off-policy setting, an agent aims to learn a target policy π but follows a behavior policy μ. We use the same assumption of coverage as Sutton and Barto (2018), i.e., π(a|s) > 0 implies μ(a|s) > 0. We assume the chain induced by μ is ergodic and use d_μ to denote its stationary distribution. Similarly, D_μ ≜ diag(d_μ). We define ρ(s, a) ≜ π(a|s) / μ(a|s) and ρ_t ≜ ρ(S_t, A_t).
Typically, there are two kinds of tasks in RL: prediction and control.
Prediction: In prediction, we are interested in finding the value function of a given policy π. Temporal Difference (TD) learning (Sutton, 1988) is perhaps the most popular algorithm for prediction. TD enjoys convergence guarantees in both on- and off-policy tabular settings. TD can also be combined with linear function approximation. The update rule for on-policy linear TD is w ← w + α Δ_t, where α is a step size and Δ_t is an incremental update. Here we use v̂ to denote an estimate of v_π parameterized by w. Tsitsiklis and Van Roy (1997) prove the convergence of on-policy linear TD. In off-policy linear TD, the update Δ_t is weighted by ρ_t. The divergence of off-policy linear TD is well documented (Tsitsiklis and Van Roy, 1997). To address this issue, Gradient TD (GTD, Sutton et al. 2009) was proposed. Instead of bootstrapping from the prediction of a successor state like TD, GTD computes the gradient of the projected Bellman error directly. GTD is a true stochastic gradient method and enjoys convergence guarantees. However, GTD is a two-timescale method, involving two sets of parameters and two learning rates, which makes it hard to use in practice. To address this issue, Emphatic TD (ETD, Sutton et al. 2016) was proposed. ETD introduces an interest function i to specify user preferences for different states. With function approximation, we typically cannot get accurate predictions for all states and must thus trade off between them. States are usually weighted by d_μ(s) in the off-policy setting (e.g., in GTD), but with the interest function, we can explicitly weight them by d_μ(s) i(s) in our objective and/or weight the update at time t via M_t, where M_t is the emphasis that accumulates previous interests in a certain way. In the simplest form of ETD, we have M_t ≜ F_t, where F_t ≜ γ_t ρ_{t−1} F_{t−1} + i(S_t) and F_{−1} is a constant. The update Δ_t is weighted by ρ_t M_t. In practice, we usually set i(s) ≡ 1.
Inspired by ETD, Hallak and Mannor (2017) propose to weight the update at S_t via c(S_t) in the Consistent Off-Policy TD (COP-TD) algorithm, where c(s) ≜ d_π(s) / d_μ(s) is the density ratio, which is also known as the covariate shift (Gelada and Bellemare, 2019). To learn c via stochastic approximation, Hallak and Mannor (2017) propose the COP operator. However, the COP operator does not have a unique fixed point, and extra normalization and projection are used to ensure convergence (Hallak and Mannor, 2017). To address this limitation, Gelada and Bellemare (2019) further propose the discounted COP operator.
Gelada and Bellemare (2019) define a new transition matrix P̃_π ≜ γ̂ P_π + (1 − γ̂) 1 d_μ^T, where γ̂ ∈ [0, 1) is a constant. Following this matrix, an agent either proceeds to the next state according to P_π w.p. γ̂ or gets reset to d_μ w.p. 1 − γ̂. Gelada and Bellemare (2019) prove that P̃_π is ergodic and

d_γ̂ ≜ (1 − γ̂)(I − γ̂ P_π^T)^{−1} d_μ    (1)

is the stationary distribution of P̃_π when γ̂ < 1. However, it is not clear whether lim_{γ̂→1} d_γ̂ = d_π holds or not. With c(s) ≜ d_γ̂(s) / d_μ(s), Gelada and Bellemare (2019) prove that
c = γ̂ D_μ^{−1} P_π^T D_μ c + (1 − γ̂) 1,    (2)
yielding the learning rule

ĉ(S_{t+1}) ← ĉ(S_{t+1}) + α (γ̂ ρ(S_t, A_t) ĉ(S_t) + (1 − γ̂) − ĉ(S_{t+1})),    (3)

where ĉ is an estimate of c and α is a step size. A semi-gradient is used when ĉ is a parameterized function (Gelada and Bellemare, 2019). For a small γ̂ (depending on the difference between π and μ), Gelada and Bellemare (2019) prove contraction for linear function approximation. For a large γ̂ or nonlinear function approximation, they provide an extra normalization loss for the sake of the constraint E_{s∼d_μ}[c(s)] = 1. Gelada and Bellemare (2019) use ĉ(S_t) to weight the update at S_t in Discounted COP-TD. They demonstrate empirical success in Atari games (Bellemare et al., 2013) with pixel inputs.
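Equations (1)–(3) can be checked numerically on a small chain (a numpy sketch with arbitrary example matrices, not from the paper): the closed form for d_γ̂ is stationary for the reset chain P̃_π, and the expected version of learning rule (3) converges to the ratio d_γ̂ / d_μ:

```python
import numpy as np

P_pi = np.array([[0.2, 0.8], [0.7, 0.3]])   # target chain (arbitrary example)
d_mu = np.array([0.4, 0.6])                 # behavior stationary dist (arbitrary)
gamma_hat, beta = 0.9, 0.5
D_mu, D_inv = np.diag(d_mu), np.diag(1.0 / d_mu)

# Closed form (1): d_gamma^T = (1 - gamma_hat) d_mu^T (I - gamma_hat P_pi)^{-1},
# the stationary distribution of the reset chain
# P_tilde = gamma_hat P_pi + (1 - gamma_hat) 1 d_mu^T.
d_gamma = (1 - gamma_hat) * d_mu @ np.linalg.inv(np.eye(2) - gamma_hat * P_pi)
P_tilde = gamma_hat * P_pi + (1 - gamma_hat) * np.outer(np.ones(2), d_mu)

# Expected version of learning rule (3); its fixed point satisfies (2),
# c = gamma_hat D_mu^{-1} P_pi^T D_mu c + (1 - gamma_hat) 1.
c = np.ones(2)                              # estimate of d_gamma / d_mu
for _ in range(500):
    target = gamma_hat * D_inv @ P_pi.T @ D_mu @ c + (1 - gamma_hat)
    c = (1 - beta) * c + beta * target
```

The damped iteration converges because the linear part has spectral radius γ̂ < 1.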
Control: In this paper, we focus on policy-based control. In the on-policy continuing setting, we seek to optimize the objective

J_π ≜ Σ_s d_π(s) i(s) v_π(s),    (4)

which is equivalent to optimizing the average reward if both γ and i are constant (White, 2017). We usually set i(s) ≡ 1. We assume π is parameterized by θ. In the rest of this paper, all gradients are taken w.r.t. θ unless otherwise specified, and we consider the gradient for only one component of θ for simplicity.
In the off-policy continuing setting, Degris et al. (2012) propose to optimize the excursion objective

J_μ ≜ Σ_s d_μ(s) i(s) v_π(s)    (5)

instead of the alternative life objective J_π. We can compute the policy gradient of J_μ as

∇J_μ = Σ_s d_μ(s) i(s) Σ_a (∇π(a|s) q_π(s, a) + π(a|s) ∇q_π(s, a)).    (6)
Degris et al. (2012) prove in the Off-Policy Policy Gradient (OPPG) theorem that we can ignore the term Σ_a π(a|s) ∇q_π(s, a) without introducing bias for a tabular policy^1 when i(s) ≡ 1, yielding a gradient update ρ(S_t, A_t) ∇log π(A_t|S_t) q_π(S_t, A_t), where S_t is sampled from d_μ and A_t is sampled from μ(·|S_t). Based on this, Degris et al. (2012) propose the Off-Policy Actor-Critic (OffPAC) algorithm. For a policy using a general function approximator, Imani et al. (2018) propose a new OPPG theorem. They define

M_t ≜ (1 − λ) i(S_t) + λ F_t,  F_t ≜ γ_t ρ_{t−1} F_{t−1} + i(S_t),

where λ ∈ [0, 1] is a constant used to optimize the bias-variance trade-off, and prove that ρ_t M_t ∇log π(A_t|S_t) q_π(S_t, A_t) is an unbiased sample of ∇J_μ when t → ∞ and for a general interest function i. Based on this, Imani et al. (2018) propose the Actor-Critic with Emphatic weightings (ACE) algorithm. ACE is an emphatic approach where M_t is the emphasis used to reweight the update.

^1 See Errata in Degris et al. (2012), also in Imani et al. (2018); Maei (2018).

3 The Counterfactual Objective
Figure 1: (a) The two-circle MDP; rewards are 0 unless specified on the edge. (b) The probability of transitioning to B from A under the target policy during training. (c) The influence of γ̂ and λ on the final solution found by GeoffPAC.

We now introduce the counterfactual objective
J_γ̂ ≜ Σ_s d_γ̂(s) î(s) v_π(s),    (7)

where î is a user-defined interest function. Similarly, we can set î to 1 for the continuing setting, but we proceed with a general î. When γ̂ = 1, J_γ̂ recovers the alternative life objective J_π. When γ̂ = 0, J_γ̂ recovers the excursion objective J_μ. To motivate the counterfactual objective J_γ̂, we first present the two-circle MDP (Figure 1a) to highlight the difference between J_π and J_μ.
In the two-circle MDP, an agent only needs to make a decision in state A. The behavior policy μ proceeds to B or C randomly with equal probability. The discount factor is a constant for all transitions. We consider a continuing setting and set i(s) ≡ 1. Obviously, v_π(B) and v_π(C) hardly change w.r.t. π due to discounting, and v_π(C) exceeds v_π(B). To maximize the excursion objective J_μ, the target policy would prefer transitioning to state C to maximize v_π(A). However, the policy maximizing J_π (i.e., maximizing the average reward) would prefer transitioning to state B, which is what we usually want in an on-policy setting. Hence, maximizing the excursion objective gives an unexpected solution. This effect can also occur with a larger discount factor if the path is longer. With function approximation, the discrepancy can be magnified due to state aliasing, where we may want to make a trade-off between different states according to d_π instead of d_μ (White, 2018).
One solution to this problem is to set the interest function î in J_γ̂ in a clever way. However, it is not clear how to achieve this without domain knowledge; Imani et al. (2018) simply set the interest to 1. Another solution might be to optimize J_π directly in off-policy learning, if one could use importance sampling ratios to fully correct d_μ to d_π, as Precup et al. (2001) propose for value-based methods in the episodic setting. However, this solution suffers from high variance and is infeasible for the continuing setting (Sutton et al., 2016).
In this paper, we propose to optimize J_γ̂ instead. As we prove below, lim_{γ̂→1} J_γ̂ = J_π, indicating that when γ̂ approaches 1, J_γ̂ can get arbitrarily close to J_π. Furthermore, we show empirically that a small γ̂ (e.g., 0.6 in the two-circle MDP) is enough to generate a different solution from maximizing J_μ.
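The interpolation can be made concrete on any small MDP (a numpy sketch with arbitrary dynamics, not the two-circle MDP itself, taking i ≡ î ≡ 1): J_γ̂ equals the excursion objective J_μ at γ̂ = 0 and approaches the alternative life objective J_π as γ̂ → 1.

```python
import numpy as np

def stationary(P):
    """Left Perron eigenvector of P, normalized to a distribution."""
    vals, vecs = np.linalg.eig(P.T)
    d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return d / d.sum()

P_pi = np.array([[0.1, 0.9], [0.8, 0.2]])   # target chain (arbitrary)
P_mu = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior chain (arbitrary)
r_pi = np.array([1.0, 0.0])                 # expected reward per state under pi
gamma = 0.99
d_pi, d_mu = stationary(P_pi), stationary(P_mu)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def J(gamma_hat):
    # Counterfactual objective with interest 1: J = sum_s d_gamma(s) v_pi(s).
    d_g = (1 - gamma_hat) * d_mu @ np.linalg.inv(np.eye(2) - gamma_hat * P_pi)
    return d_g @ v_pi

J_mu, J_pi = d_mu @ v_pi, d_pi @ v_pi
```

Evaluating J at γ̂ values between 0 and 1 traces a smooth path from J_μ to J_π.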
Lemma 1
Assuming P_π is ergodic, the sequence {P_π^t(s, ·)} converges to d_π uniformly in s as t → ∞; i.e., ‖P_π^t(s, ·) − d_π‖_{TV} ≤ C α^t for all s, where C > 0 is a constant and α ∈ (0, 1).
Proof. The pointwise convergence for each s is a standard conclusion for ergodic MDPs (Theorem 4.9 in Levin and Peres 2017). To prove uniform convergence, we need an s-independent bound on the distance between P_π^t(s, ·) and d_π. Details are provided in the supplementary materials.
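The geometric, state-independent rate asserted in the lemma is easy to observe numerically (arbitrary ergodic chain for illustration): the worst-case total variation distance over starting states shrinks geometrically in t.

```python
import numpy as np

P = np.array([[0.1, 0.9], [0.8, 0.2]])   # arbitrary ergodic chain
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
d /= d.sum()                              # stationary distribution

def tv_worst(P, d, t):
    """Worst-case (over starting states) total variation distance of P^t rows to d."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - d).sum(axis=1).max()

dists = [tv_worst(P, d, t) for t in (1, 5, 10, 20)]
```

For this chain the contraction rate is governed by the second eigenvalue of P (here −0.7).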
Theorem 1
Assuming P_π is ergodic and î ≡ i, we have lim_{γ̂→1} d_γ̂ = d_π and lim_{γ̂→1} J_γ̂ = J_π.
4 Generalized Off-Policy Policy Gradient
In this section, we derive an estimator for ∇J_γ̂ and show in Proposition 1 that it is unbiased. Our (standard) assumptions are given in the supplementary materials. The OPPG theorem (Imani et al., 2018) leaves us the freedom to choose the interest function i in J_μ. In this paper, we set i(s) ≜ d_γ̂(s) î(s) / d_μ(s), which, to our best knowledge, is the first time that a nontrivial interest is used. With this choice, J_γ̂ = Σ_s d_μ(s) i(s) v_π(s). Hence, i depends on π and we cannot invoke OPPG directly as ∇i ≠ 0. However, we can still invoke the remaining parts of OPPG:

Σ_s d_μ(s) i(s) ∇v_π(s) = lim_{t→∞} E_μ[ρ_t M_t ∇log π(A_t|S_t) q_π(S_t, A_t)],    (8)

where M_t is the emphasis computed with the interest i as in Section 2. We now compute the gradient ∇d_γ̂.
Theorem 2 (Generalized Off-Policy Policy Gradient Theorem)

∇J_γ̂ = γ̂ Σ_s d_μ(s) g(s) î(s) v_π(s) + Σ_s d_μ(s) i(s) ∇v_π(s),

where

b ≜ D_μ^{−1} (∇P_π)^T D_μ c,  g ≜ (I − γ̂ D_μ^{−1} P_π^T D_μ)^{−1} b.

Proof. We first use the product rule of calculus and plug in i(s) = d_γ̂(s) î(s) / d_μ(s):

∇J_γ̂ = Σ_s î(s) v_π(s) ∇d_γ̂(s) + Σ_s d_μ(s) i(s) ∇v_π(s).

The second term follows directly from (8). For the first term, we take gradients on both sides of (2). We have ∇c = γ̂ D_μ^{−1} P_π^T D_μ ∇c + γ̂ D_μ^{−1} (∇P_π)^T D_μ c. Solving this linear system for ∇c leads to ∇c = γ̂ (I − γ̂ D_μ^{−1} P_π^T D_μ)^{−1} b = γ̂ g. With ∇d_γ̂ = D_μ ∇c, the first term follows easily. □
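Differentiating (1) directly gives the equivalent form ∇d_γ̂ = γ̂ (I − γ̂ P_π^T)^{−1} (∇P_π)^T d_γ̂, which can be sanity-checked against a finite-difference approximation (a numpy sketch on an arbitrary one-parameter two-state chain; the parameterization is an illustrative assumption):

```python
import numpy as np

d_mu = np.array([0.4, 0.6])   # behavior stationary dist (arbitrary, theta-independent)
gamma_hat = 0.9

def P_pi(theta):
    """Arbitrary one-parameter target chain: a sigmoid controls row 0."""
    s = 1.0 / (1.0 + np.exp(-theta))
    return np.array([[s, 1.0 - s], [0.7, 0.3]])

def dP_dtheta(theta):
    s = 1.0 / (1.0 + np.exp(-theta))
    ds = s * (1.0 - s)
    return np.array([[ds, -ds], [0.0, 0.0]])

def d_gamma(theta):
    # Equation (1): d_gamma^T = (1 - gamma_hat) d_mu^T (I - gamma_hat P_pi)^{-1}.
    return (1 - gamma_hat) * d_mu @ np.linalg.inv(np.eye(2) - gamma_hat * P_pi(theta))

theta, eps = 0.3, 1e-6
dg = d_gamma(theta)
# Analytic gradient of d_gamma w.r.t. theta.
grad = gamma_hat * np.linalg.inv(np.eye(2) - gamma_hat * P_pi(theta).T) @ dP_dtheta(theta).T @ dg
# Central finite-difference check.
fd = (d_gamma(theta + eps) - d_gamma(theta - eps)) / (2 * eps)
```

The two gradients agree to numerical precision, confirming the matrix identity behind the ∇d_γ̂ term.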
Now we use an emphatic approach to provide an unbiased sample of ∇J_γ̂. Following the emphatic treatment of Sutton et al. (2016); Imani et al. (2018), we define an intrinsic interest (in contrast with the user-defined extrinsic interest î) that serves as a sample for b, an emphasis F_t that accumulates previous interests and translates b into g, and a constant λ ∈ [0, 1] for a bias-variance trade-off similar to Sutton et al. (2016); Imani et al. (2018). We then define the resulting per-step estimate Δ_t and proceed to show that Δ_t is an unbiased sample of ∇J_γ̂ when t → ∞.
Lemma 2
For λ = 1, we have lim_{t→∞} E_μ[F_t | S_t = s] = g(s) / d_μ(s) for all s.

Proof. The proof follows techniques similar to Sutton et al. (2016); Hallak and Mannor (2017) and is provided in the supplementary materials.
Proposition 1
With a fixed policy π and γ̂ ∈ [0, 1), we have lim_{t→∞} E_μ[Δ_t] = ∇J_γ̂.

Proof. The first term follows directly from Proposition 1 in Imani et al. (2018); the second term involves Lemma 2 and other conditional independence relations. Details are provided in the supplementary materials.
So far, we have discussed the policy gradient for a single dimension of the policy parameter θ, so the quantities above are all scalars. When we compute policy gradients for the whole θ in parallel, ρ_t remains a scalar, while the intrinsic-interest-related quantities become vectors of the same size as θ. This is because our intrinsic interest "function" is a multi-dimensional random variable, instead of a deterministic scalar function like î. We, therefore, generalize the concept of interest.
So far, we have also assumed access to the true covariate shift c and the true value function v_π. We can plug in their estimates ĉ and v̂, yielding the Generalized Off-Policy Actor-Critic (GeoffPAC) algorithm. The covariate shift estimate ĉ can be learned via the learning rule in (3). The value estimate v̂ can be learned by any off-policy prediction algorithm, e.g., one-step off-policy TD (Sutton and Barto, 2018), GTD, (Discounted) COP-TD, or V-trace (Espeholt et al., 2018). Pseudocode of GeoffPAC is provided in the supplementary materials.
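To make the assembly concrete, here is a heavily simplified tabular schematic of how the pieces could fit together: a covariate-shift learner via rule (3), an off-policy TD critic, and an emphatically weighted actor step. The emphasis recursion below is ACE-style and the interest is set to the covariate-shift estimate; variable names, trace recursions, and step sizes are illustrative assumptions, not GeoffPAC's exact pseudocode (which is in the supplementary materials).

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # arbitrary dynamics p(s'|s,a)
R = rng.normal(size=(nS, nA))                   # arbitrary rewards
mu = np.full((nS, nA), 1.0 / nA)                # uniform random behavior policy

theta = np.zeros((nS, nA))   # softmax target-policy parameters
w = np.zeros(nS)             # tabular critic v-hat
c_hat = np.ones(nS)          # covariate-shift estimate, learned via rule (3)
gamma, gamma_hat, lam = 0.9, 0.8, 1.0
a_c, a_v, a_pi = 0.05, 0.05, 0.01

def pi_probs(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

F, rho_prev, s = 0.0, 1.0, 0
for t in range(300):
    a = rng.choice(nA, p=mu[s])
    s2 = rng.choice(nS, p=P[s, a])
    r = R[s, a]
    rho = pi_probs(theta, s)[a] / mu[s, a]
    # Covariate shift: learning rule (3).
    c_hat[s2] += a_c * (gamma_hat * rho * c_hat[s] + (1 - gamma_hat) - c_hat[s2])
    # Critic: one-step off-policy TD weighted by rho.
    delta = r + gamma * w[s2] - w[s]
    w[s] += a_v * rho * delta
    # Actor: emphatically weighted off-policy gradient step with interest
    # i(s) = c_hat(s); the emphasis recursion here is ACE-style (illustrative).
    F = gamma * rho_prev * F + c_hat[s]
    M = (1 - lam) * c_hat[s] + lam * F
    q_sample = r + gamma * w[s2]          # crude one-step estimate of q_pi
    p = pi_probs(theta, s)
    grad_logpi = -p
    grad_logpi[a] += 1.0
    theta[s] += a_pi * rho * M * q_sample * grad_logpi
    rho_prev, s = rho, s2
```

The design point to notice is that only one stream of behavior-policy experience drives all three learners; no target-policy rollouts are ever executed.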
We now discuss two potential practical issues with GeoffPAC. First, GOPPG requires t → ∞. In practice, this means μ has been executed for a long time, which can be satisfied by a warm-up period before training. Second, GOPPG provides an unbiased sample for a fixed policy π. Once π is updated, the emphatic traces are invalidated, as is ĉ. As their update rules do not have a learning rate, we cannot simply use a larger learning rate for them as we would do for a critic. This issue also appeared in Imani et al. (2018). In principle, we could store previous transitions in a replay buffer (Lin, 1992) and replay them for a certain number of steps after π is updated. In this way, we can satisfy the requirement t → ∞ and get up-to-date traces. In practice, we found this unnecessary. When we use a small learning rate for π, we assume π changes slowly and ignore this invalidation effect.
5 Experimental Results
Our experiments aim to answer the following questions. 1) Can GeoffPAC find the same solution as on-policy policy gradient algorithms in the two-circle MDP, as promised? 2) How does the excursion length (controlled by γ̂) influence the solution? 3) Can GeoffPAC scale up to challenging tasks like robot simulation in MuJoCo with neural network function approximators? 4) Can the counterfactual objective in GeoffPAC translate into a performance improvement over OffPAC and ACE? 5) How does GeoffPAC compare with other downstream extensions of OPPG, e.g., DDPG?
5.1 Two-circle MDP
We implemented tabular versions of ACE and GeoffPAC for the two-circle MDP. The behavior policy μ was random, and we monitored the probability of transitioning from A to B under the target policy π. In Figure 1b, we plot this probability during training. The curves are averaged over 30 runs, and the shaded regions indicate standard error. We set λ = 1 so that both ACE and GeoffPAC are unbiased. For GeoffPAC, γ̂ was set to 0.6. ACE converges to the policy that maximizes J_μ as expected, while GeoffPAC converges to the policy that maximizes J_π, the policy we want in on-policy training. Figure 1c shows how manipulating λ and γ̂ can influence the final solution. In this two-circle MDP, λ has little influence on the final solution, while manipulating γ̂ can significantly change the final solution.

5.2 Robot Simulation
Evaluation: We benchmarked OffPAC, ACE, and GeoffPAC on five MuJoCo robot simulation tasks from OpenAI Gym (Brockman et al., 2016). As all the original tasks are episodic, we adopted techniques similar to White (2017) to compose continuing tasks: we set the discount function to 0.99 for all non-termination transitions and to 0 for all termination transitions, and the agent was teleported back to the initial states upon termination. The interest function was always 1. This setting complies with the common training scheme for MuJoCo tasks (Lillicrap et al., 2015; Asadi and Williams, 2016). However, we interpret the tasks as continuing tasks. As a consequence, J_π, instead of the episodic return, is the proper metric to measure the performance of a policy π. The behavior policy μ is a fixed uniformly random policy, the same as in Gelada and Bellemare (2019). The data generated by μ is significantly different from that of any meaningful policy in those tasks; thus, this setting exhibits a high degree of off-policyness. We monitored J_π periodically during training. To evaluate J_π, states were sampled according to d_π, and v_π was approximated via Monte Carlo returns. Evaluation based on the commonly used total undiscounted episodic return criterion, together with more discussion about this criterion, is provided in the supplementary materials. The curves under the two criteria are almost identical.
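The transition-based discount used to build these continuing tasks can be sketched as follows (illustrative trajectory, not MuJoCo): the discount is applied on non-termination transitions and set to 0 on termination, so returns naturally reset at episode boundaries while the stream of experience stays continuing.

```python
# Each step: (reward, terminal_flag); terminal transitions get discount 0.
trajectory = [(1.0, False), (0.5, False), (2.0, True), (1.0, False), (1.0, False)]

def returns(traj, gamma=0.99):
    """Backward pass computing G_t = R_{t+1} + gamma_{t+1} G_{t+1},
    with transition-based discount gamma_{t+1} = 0 at termination."""
    G, out = 0.0, []
    for r, terminal in reversed(traj):
        g = 0.0 if terminal else gamma
        G = r + g * G
        out.append(G)
    return list(reversed(out))

gs = returns(trajectory)
```

At the terminal transition the return is just the final reward, and the steps after it start a fresh discounted sum.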
Implementation: Although emphatic algorithms have enjoyed great theoretical success (Yu, 2015; Hallak et al., 2016; Sutton et al., 2016; Imani et al., 2018), their empirical success is still limited to simple domains (e.g., simple hand-crafted Markov chains, cart-pole balancing) with linear function approximation. To our best knowledge, this is the first time that emphatic algorithms have been evaluated in challenging robot simulation tasks with neural network function approximators. To stabilize training, we adopted the A3C paradigm (Mnih et al., 2016) with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer. All three algorithms share the same architecture and the same parameterization. We first tuned hyperparameters for OffPAC; ACE and GeoffPAC inherited common hyperparameters from OffPAC. For DDPG, we used the same architecture and hyperparameters as Lillicrap et al. (2015). More details are provided in the supplementary materials.

Results: We first studied the influence of λ on ACE and the influence of λ and γ̂ on GeoffPAC in HalfCheetah. The results are reported in the supplementary materials. We found ACE was not sensitive to λ and fixed it for all experiments. For GeoffPAC, we found one combination of λ and γ̂ produced good empirical results and used this combination for all remaining tasks. All curves are averaged over 10 independent runs, and shaded regions indicate standard error. Figure 2 compares GeoffPAC, ACE, and OffPAC. GeoffPAC significantly outperforms ACE and OffPAC in three out of five tasks; the performance on Walker and Reacher is similar. This performance improvement supports our claim that optimizing J_γ̂ can better approximate optimizing J_π than optimizing J_μ does. We also report the performance of a random agent for reference. Figure 3 compares GeoffPAC and DDPG. GeoffPAC outperforms DDPG in Hopper and Swimmer. DDPG with a uniformly random behavior policy exhibits high instability in HalfCheetah, Walker, and Hopper. This is expected because DDPG fully ignores the discrepancy between π and μ. As training progresses, this discrepancy grows and finally yields a performance drop. This is not a fair comparison in that many design choices for DDPG and GeoffPAC differ (e.g., one worker vs. multiple workers, deterministic vs. stochastic policy, network architectures), and we do not expect GeoffPAC to outperform all applications of OPPG. However, this comparison does suggest GOPPG sheds light on how to improve applications of OPPG.
6 Related Work
There have been many applications of OPPG, e.g., DPG (Silver et al., 2014), DDPG (Lillicrap et al., 2015), ACER (Wang et al., 2016), EPG (Ciosek and Whiteson, 2017), and IMPALA (Espeholt et al., 2018). In particular, Gu et al. (2017) propose IPG to unify on- and off-policy policy gradients. IPG is a mix of the gradients of the on-policy objective and the excursion objective; to compute the gradients of the on-policy objective, IPG does need on-policy samples. In this paper, the counterfactual objective is a mix of objectives, and we do not need on-policy samples to compute the policy gradient of the counterfactual objective. Mixing J_π and J_μ directly in IPG style is a possibility for future work.
There have been other policy-based off-policy algorithms. Maei (2018) provides an unbiased sample for ∇J_μ, assuming the value function is linear; theoretical results are provided without an empirical study. Imani et al. (2018) eliminate the linearity assumption and provide a thorough empirical study. We therefore conduct our comparison with Imani et al. (2018) instead of Maei (2018). In another line of work, the policy entropy is used for reward shaping, and the target policy can then be derived from the value function directly (O'Donoghue et al., 2016; Nachum et al., 2017a; Schulman et al., 2017a). This line of work includes deep energy-based RL (Haarnoja et al., 2017, 2018), where a value function is learned off-policy and the policy is derived from the value function directly, and path consistency learning (Nachum et al., 2017a, b), where gradients are computed to satisfy certain path consistencies. This line of work is orthogonal to this paper, where we compute the policy gradients of a given objective directly in an off-policy manner.
Besides the stochastic approximation approaches to learning the covariate shift (Hallak and Mannor, 2017; Gelada and Bellemare, 2019), a closed-form solution can be obtained in a Reproducing Kernel Hilbert Space for policy evaluation by using a minimax loss (Liu et al., 2018). All three works are value-based methods. To our best knowledge, we are the first to use the covariate shift in policy-based methods and to estimate its policy gradient via emphatic learning.
7 Conclusions
In this paper, we introduced the counterfactual objective, unifying the excursion objective and the alternative life objective in the continuing RL setting. We further provided the Generalized Off-Policy Policy Gradient (GOPPG) Theorem and the corresponding GeoffPAC algorithm. GOPPG is the first example where a nontrivial interest function is used, and GeoffPAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
There have been numerous applications of OPPG, including DDPG, ACER, IPG, EPG, and IMPALA. We expect GOPPG to shed light on improving those extensions. Theoretically, a convergence analysis of GeoffPAC involving the compatible function assumption (Sutton et al., 2000) or multi-timescale stochastic approximation (Borkar, 2009) is also worth further investigation.
Acknowledgments
SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. Special thanks to Richard S. Sutton, who gave this project its initial impetus.
References
 Asadi and Williams (2016) Asadi, K. and Williams, J. D. (2016). Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.
 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.
 Borkar (2009) Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint. Springer.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
 Ciosek and Whiteson (2017) Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.
 Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
 Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deeprl with importance weighted actorlearner architectures. arXiv preprint arXiv:1802.01561.
 Gelada and Bellemare (2019) Gelada, C. and Bellemare, M. G. (2019). Offpolicy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
 Ghiassian et al. (2018) Ghiassian, S., Patterson, A., White, M., Sutton, R. S., and White, A. (2018). Online offpolicy prediction.
 Gu et al. (2017) Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017). Interpolated policy gradient: Merging onpolicy and offpolicy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems.

 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 Hallak and Mannor (2017) Hallak, A. and Mannor, S. (2017). Consistent online offpolicy evaluation. In Proceedings of the 34th International Conference on Machine Learning.
 Hallak et al. (2016) Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2016). Generalized emphatic temporal difference learning: Biasvariance analysis. In Proceedins of 30th AAAI Conference on Artificial Intelligence.
 Imani et al. (2018) Imani, E., Graves, E., and White, M. (2018). An offpolicy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems.
 Levin and Peres (2017) Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times. American Mathematical Soc.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Lin (1992) Lin, L.J. (1992). Selfimproving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
 Liu et al. (2018) Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinitehorizon offpolicy estimation. In Advances in Neural Information Processing Systems.
 Maei (2018) Maei, H. R. (2018). Convergent actorcritic algorithms under offpolicy training and function approximation. arXiv preprint arXiv:1802.07842.
 Marbach and Tsitsiklis (2001) Marbach, P. and Tsitsiklis, J. N. (2001). Simulationbased optimization of markov reward processes. IEEE Transactions on Automatic Control.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Humanlevel control through deep reinforcement learning. Nature.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems.
 Nachum et al. (2017a) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems.
 Nachum et al. (2017b) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trustpcl: An offpolicy trust region method for continuous control. arXiv preprint arXiv:1707.01891.
 Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
 O’Donoghue et al. (2016) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and qlearning. arXiv preprint arXiv:1611.01626.
 Osband et al. (2018) Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.
 Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Offpolicy temporaldifference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.
 Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Schulman et al. (2017a) Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence between policy gradients and soft qlearning. arXiv preprint arXiv:1704.06440.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning.
 Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
 Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.
 Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction (2nd Edition). MIT press.
 Sutton et al. (2009) Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
 Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of offpolicy temporaldifference learning. The Journal of Machine Learning Research.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
 Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.
 Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224.
 White (2017) White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
 White (2018) White, M. (2018). Offpolicy learning. Reinforcement Learning Summer School https://dlrlsummerschool.ca/wpcontent/uploads/2018/09/whiteoffpolicylearningrlss2018.pdf.
 Yu (2015) Yu, H. (2015). On convergence of emphatic temporaldifference learning. In Conference on Learning Theory.
Appendix A Assumptions and Proofs
a.1 Assumptions
a.2 Proof of Lemma 1
Proof. From Proposition 1.7 in Levin and Peres (2017), there exists an integer k such that all the entries of P_π^k are strictly positive. With ⪰ representing elementwise comparison between matrices, let δ be the minimum element of P_π^k; we have δ > 0 and P_π^k ⪰ δ 1 1^T ⪰ δ 1 d_π^T. Theorem 4.9 in Levin and Peres (2017) then implies that ‖P_π^t(s, ·) − d_π‖_{TV} ≤ C α^t for all s, where ‖·‖_{TV} represents the total variation norm, C > 0 is a constant, and α ∈ (0, 1). Neither C nor α depends on s. Uniform convergence follows easily.
a.3 Proof of Lemma 2
Proof. Expand the expectation of F_t using the law of total expectation and the Markov property, then apply Bayes' rule and the definition of c; the claimed limit follows.