Generalized Off-Policy Actor-Critic

by Shangtong Zhang, et al.
University of Oxford

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.



1 Introduction

Reinforcement learning (RL) algorithms based on the policy gradient theorem (Sutton et al., 2000; Marbach and Tsitsiklis, 2001) have recently enjoyed great success in various domains, e.g., achieving human-level performance on Atari games (Mnih et al., 2016). The original policy gradient theorem is on-policy and used to optimize the on-policy objective. However, in many cases, we would prefer to learn off-policy to improve data efficiency (Lin, 1992) and exploration (Osband et al., 2018). To this end, the Off-Policy Policy Gradient (OPPG) Theorem (Degris et al., 2012; Maei, 2018; Imani et al., 2018) was developed and has been widely used (Silver et al., 2014; Lillicrap et al., 2015; Wang et al., 2016; Gu et al., 2017; Ciosek and Whiteson, 2017; Espeholt et al., 2018).

Ideally, an off-policy algorithm should optimize the off-policy analogue of the on-policy objective. In the continuing RL setting, this analogue would be the performance of the target policy in expectation w.r.t. the stationary distribution of the target policy, which is referred to as the alternative life objective (White, 2018; Ghiassian et al., 2018). This objective corresponds to the performance of the target policy when deployed. However, OPPG optimizes a different objective, the performance of the target policy in expectation w.r.t. the stationary distribution of the behavior policy. This objective is referred to as the excursion objective (White, 2018; Ghiassian et al., 2018), as it corresponds to the excursion setting (Sutton et al., 2016). Unfortunately, the excursion objective can be misleading about the performance of the target policy when deployed, as we illustrate in Section 3.

It is infeasible to optimize the alternative life objective directly in the off-policy continuing setting. Instead, we propose to optimize the counterfactual objective, which approximates the alternative life objective. In the excursion setting, an agent in the stationary distribution of the behavior policy considers a hypothetical excursion that follows the target policy. The return from this hypothetical excursion is an indicator of the performance of the target policy. The excursion objective measures this return w.r.t. the stationary distribution of the behavior policy, using samples generated by executing the behavior policy. By contrast, evaluating the alternative life objective requires samples from the stationary distribution of the target policy, to which the agent does not have access. In the counterfactual objective, we use a new parameter $\hat{\gamma}$ to control how counterfactual the objective is, akin to Gelada and Bellemare (2019). With $\hat{\gamma} = 0$, the counterfactual objective uses the stationary distribution of the behavior policy to measure the performance of the target policy, recovering the excursion objective. With $\hat{\gamma} = 1$, the counterfactual objective is fully decoupled from the behavior policy and uses the stationary distribution of the target policy to measure the performance of the target policy, recovering the alternative life objective. As in the excursion objective, the excursion is never actually executed and the agent always follows the behavior policy.

Our contributions are threefold. First, we introduce the counterfactual objective. We motivate this objective empirically with an example MDP that highlights the difference between the alternative life objective and the excursion objective. We also motivate it theoretically, by proving that the counterfactual objective can recover both the excursion objective and the alternative life objective smoothly via manipulating $\hat{\gamma}$. Second, we prove the Generalized Off-Policy Policy Gradient (GOPPG) Theorem, which gives the policy gradient of the counterfactual objective. Third, using an emphatic approach (Sutton et al., 2016) to compute an unbiased sample for this policy gradient, we develop the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We evaluate Geoff-PAC empirically in challenging robot simulation tasks with neural network function approximators. Geoff-PAC significantly outperforms the actor-critic algorithms proposed by Degris et al. (2012) and Imani et al. (2018), and to our best knowledge, Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

2 Background

We use a time-indexed capital letter (e.g., $X_t$) to denote a random variable. We use a bold capital letter (e.g., $\mathbf{X}$) to denote a matrix and a bold lowercase letter (e.g., $\mathbf{x}$) to denote a column vector. If $x: \mathcal{S} \to \mathbb{R}$ is a scalar function defined on a finite set $\mathcal{S}$, we use its corresponding bold lowercase letter to denote its vector form, i.e., $\mathbf{x} = [x(s_1), \dots, x(s_{|\mathcal{S}|})]^\top$. We use $\mathbf{I}$ to denote the identity matrix and $\mathbf{1}$ to denote an all-one column vector.

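We consider an infinite horizon MDP (Puterman, 2014) consisting of a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a bounded reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. We consider a transition-based discount function (White, 2017) $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ for unifying continuing tasks and episodic tasks. At time step $t$, an agent at state $S_t$ takes an action $A_t$ according to a policy $\pi: \mathcal{A} \times \mathcal{S} \to [0, 1]$. The agent then proceeds to a new state $S_{t+1}$ according to $p$ and gets a reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$. The return of $\pi$ at time step $t$ is $G_t = \sum_{i=0}^{\infty} \Gamma_t^{i-1} R_{t+1+i}$, where $\Gamma_t^{i-1} = \prod_{j=0}^{i-1} \gamma(S_{t+j}, A_{t+j}, S_{t+j+1})$ and $\Gamma_t^{-1} = 1$. We use $v_\pi$ to denote the value function of $\pi$, which is defined as $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Like White (2017), we assume $v_\pi(s)$ exists for all $s$. We use $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ to denote the state-action value function of $\pi$. We use $\mathbf{P}_\pi$ to denote the transition matrix induced by $\pi$, i.e., $\mathbf{P}_\pi[s, s'] = \sum_a \pi(a \mid s) p(s' \mid s, a)$. We assume $\mathbf{P}_\pi$ is ergodic and use $\mathbf{d}_\pi$ to denote the corresponding stationary distribution. We define $\mathbf{D}_\pi = \mathrm{diag}(\mathbf{d}_\pi)$.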
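The return under a transition-based discount can be illustrated with a short sketch. It assumes the usual backward recursion (the return equals the reward plus the discount of the transition times the next return); the state names, the `episodic_discount` function, and the trajectory are hypothetical and only show how a zero discount on a terminating transition cuts the return off.

```python
from typing import Callable, Sequence, Tuple

# One transition: (state, action, reward, next_state). Names are illustrative.
Transition = Tuple[str, str, float, str]


def discounted_return(
    trajectory: Sequence[Transition],
    discount: Callable[[str, str, str], float],
) -> float:
    """Return from the first step of a finite trajectory.

    Works backwards with G = reward + discount(s, a, s') * G_next.
    """
    g = 0.0
    for state, action, reward, next_state in reversed(trajectory):
        g = reward + discount(state, action, next_state) * g
    return g


def episodic_discount(state: str, action: str, next_state: str) -> float:
    # Hypothetical task: discount 0.5 everywhere, 0 on transitions into "end".
    return 0.0 if next_state == "end" else 0.5


trajectory = [
    ("a", "go", 1.0, "b"),
    ("b", "go", 2.0, "end"),  # terminating transition: discount is 0
    ("end", "go", 8.0, "a"),  # next episode: must not contribute to the return
]
first_return = discounted_return(trajectory, episodic_discount)
print(first_return)
```

With a constant discount function the same routine covers the continuing case, which is the sense in which a transition-based discount unifies the two settings.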

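In the off-policy setting, an agent aims to learn a target policy $\pi$ but follows a behavior policy $\mu$. We use the same assumption of coverage as Sutton and Barto (2018), i.e., $\forall (s, a),\ \pi(a \mid s) > 0 \implies \mu(a \mid s) > 0$. We assume $\mathbf{P}_\mu$ is ergodic and use $\mathbf{d}_\mu$ to denote its stationary distribution. Similarly, $\mathbf{D}_\mu = \mathrm{diag}(\mathbf{d}_\mu)$. We define $\rho(s, a) = \pi(a \mid s) / \mu(a \mid s)$, $\rho_t = \rho(S_t, A_t)$ and $\gamma_t = \gamma(S_{t-1}, A_{t-1}, S_t)$.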

Typically, there are two kinds of tasks in RL, prediction and control.

Prediction: In prediction, we are interested in finding the value function $v_\pi$ of a given policy $\pi$. Temporal Difference (TD) learning (Sutton, 1988) is perhaps the most popular algorithm for prediction. TD enjoys convergence guarantees in both on- and off-policy tabular settings. TD can also be combined with linear function approximation. The update rule for on-policy linear TD is $\theta_{t+1} = \theta_t + \alpha \Delta_t$, where $\alpha$ is a step size and $\Delta_t = [R_{t+1} + \gamma_{t+1} V(S_{t+1}) - V(S_t)] \nabla_\theta V(S_t)$ is an incremental update. Here we use $V$ to denote an estimate of $v_\pi$ parameterized by $\theta$. Tsitsiklis and Van Roy (1997) prove the convergence of on-policy linear TD. In off-policy linear TD, the update $\Delta_t$ is weighted by $\rho_t$. The divergence of off-policy linear TD is well documented (Tsitsiklis and Van Roy, 1997). To address this issue, Gradient TD (GTD, Sutton et al. 2009) was proposed. Instead of bootstrapping from the prediction of a successor state like TD, GTD computes the gradient of the projected Bellman error directly. GTD is a true stochastic gradient method and enjoys convergence guarantees. However, GTD is a two-time-scale method, involving two sets of parameters and two learning rates, which makes it hard to use in practice. To address this issue, Emphatic TD (ETD, Sutton et al. 2016) was proposed.
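A minimal sketch of one importance-weighted linear TD step may make the update concrete. It assumes the standard one-step rule (TD error times feature vector, scaled by the importance ratio); the function and argument names are illustrative and not taken from the paper.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def off_policy_linear_td_step(
    theta: List[float],
    x: List[float],        # features of the current state
    x_next: List[float],   # features of the successor state
    reward: float,
    discount: float,       # discount attached to this transition
    rho: float,            # importance ratio pi(a|s) / mu(a|s); 1.0 gives on-policy TD
    step_size: float,
) -> List[float]:
    # The TD error bootstraps from the prediction at the successor state.
    td_error = reward + discount * dot(theta, x_next) - dot(theta, x)
    # For a linear estimate, the gradient w.r.t. theta is the feature vector x.
    return [w + step_size * rho * td_error * xi for w, xi in zip(theta, x)]


theta = off_policy_linear_td_step(
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
    reward=1.0, discount=0.5, rho=2.0, step_size=0.25,
)
print(theta)
```

Note that this plain reweighting is exactly the variant whose divergence is documented; the sketch shows the mechanics only and carries no convergence guarantee.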

ETD introduces an interest function to specify user preferences for different states. With function approximation, we typically cannot get accurate predictions for all states and must thus trade off between them. States are usually weighted by in the off-policy setting (e.g., GTD) but with the interest function, we can explicitly weight them by in our objective and/or weight the update at time via , where is the emphasis that accumulates previous interests in a certain way. In the simplest form of ETD, we have , where and is a constant. The update is weighted by . In practice, we usually set .
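The way the emphasis accumulates previous interests can be sketched as follows. This assumes the followon-trace recursion of Sutton et al. (2016) with the bootstrapping parameter set to zero, in which the emphasis equals the interest of the current state plus the discounted, importance-weighted emphasis of the previous step; the variable names are illustrative.

```python
from typing import List, Sequence


def emphasis_trace(
    interests: Sequence[float],  # interest of S_t for t = 0, 1, ...
    discounts: Sequence[float],  # discount on the transition into S_t
    rhos: Sequence[float],       # importance ratio of the action taken at S_t
) -> List[float]:
    """M_t = i(S_t) + discount_t * rho_{t-1} * M_{t-1}, with M_{-1} = 0."""
    trace: List[float] = []
    prev_m, prev_rho = 0.0, 0.0
    for interest, discount, rho in zip(interests, discounts, rhos):
        m = interest + discount * prev_rho * prev_m
        trace.append(m)
        prev_m, prev_rho = m, rho
    return trace


trace = emphasis_trace([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0])
print(trace)
```

An importance ratio of zero at one step wipes out the accumulated emphasis at the next, so the trace falls back to the interest of the current state alone.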

Inspired by ETD, Hallak and Mannor (2017) propose to weight $\Delta_t$ via $\rho_t c(S_t)$ in the Consistent Off-Policy TD (COP-TD) algorithm, where $c(s) = d_\pi(s) / d_\mu(s)$ is the density ratio, which is also known as the covariate shift (Gelada and Bellemare, 2019). To learn $c$ via stochastic approximation, Hallak and Mannor (2017) propose the COP operator. However, the COP operator does not have a unique fixed point, and extra normalization and projection are used to ensure convergence (Hallak and Mannor, 2017). To address this limitation, Gelada and Bellemare (2019) further propose the $\hat{\gamma}$-discounted COP operator.

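Gelada and Bellemare (2019) define a new transition matrix $\mathbf{P}_{\hat{\gamma}} = \hat{\gamma} \mathbf{P}_\pi + (1 - \hat{\gamma}) \mathbf{1} \mathbf{d}_\mu^\top$, where $\hat{\gamma} \in [0, 1]$ is a constant. Following this matrix, an agent either proceeds to the next state according to $\mathbf{P}_\pi$ w.p. $\hat{\gamma}$ or gets reset to $\mathbf{d}_\mu$ w.p. $1 - \hat{\gamma}$. Gelada and Bellemare (2019) prove that $\mathbf{P}_{\hat{\gamma}}$ is ergodic and

$$\mathbf{d}_{\hat{\gamma}} = (1 - \hat{\gamma}) (\mathbf{I} - \hat{\gamma} \mathbf{P}_\pi^\top)^{-1} \mathbf{d}_\mu \qquad (1)$$

is the stationary distribution of $\mathbf{P}_{\hat{\gamma}}$ when $\hat{\gamma} < 1$. However, it is not clear whether $\lim_{\hat{\gamma} \to 1} \mathbf{d}_{\hat{\gamma}} = \mathbf{d}_\pi$ holds or not. With $c(s) = d_{\hat{\gamma}}(s) / d_\mu(s)$, Gelada and Bellemare (2019) prove that

$$\mathbf{c} = \hat{\gamma} \mathbf{D}_\mu^{-1} \mathbf{P}_\pi^\top \mathbf{D}_\mu \mathbf{c} + (1 - \hat{\gamma}) \mathbf{1}, \qquad (2)$$

yielding the learning rule

$$C(S_{t+1}) \leftarrow C(S_{t+1}) + \alpha \big[ \hat{\gamma} \rho_t C(S_t) + (1 - \hat{\gamma}) - C(S_{t+1}) \big], \qquad (3)$$

where $C$ is an estimate of $c$ and $\alpha$ is a step size. A semi-gradient is used when $C$ is a parameterized function (Gelada and Bellemare, 2019). For small $\hat{\gamma}$ (depending on the difference between $\pi$ and $\mu$), Gelada and Bellemare (2019) prove contraction for linear function approximation. For large $\hat{\gamma}$ or nonlinear function approximation, they provide an extra normalization loss for the sake of the constraint $\mathbf{d}_\mu^\top \mathbf{c} = \mathbf{1}^\top \mathbf{d}_{\hat{\gamma}} = 1$. Gelada and Bellemare (2019) use $\rho_t c(S_t)$ to weight the update $\Delta_t$ in Discounted COP-TD. They demonstrate empirical success in Atari games (Bellemare et al., 2013) with pixel inputs.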
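A tabular sketch of the covariate-shift learning rule is given below. It assumes the rule moves the estimate at the successor state toward a target built from the importance-weighted estimate at the current state, mixed with the constant 1 according to the reset probability; `cop_update` and its argument names are hypothetical.

```python
from typing import Dict


def cop_update(
    c: Dict[str, float],  # tabular estimate of the covariate shift, updated in place
    state: str,
    next_state: str,
    rho: float,           # importance ratio of the action taken in `state`
    gamma_hat: float,     # probability of following the target chain instead of resetting
    step_size: float,
) -> None:
    # Target: discounted, importance-weighted estimate at the current state,
    # plus the reset mass (1 - gamma_hat).
    target = gamma_hat * rho * c[state] + (1.0 - gamma_hat)
    c[next_state] += step_size * (target - c[next_state])


estimate = {"a": 1.0, "b": 1.0}
cop_update(estimate, "a", "b", rho=2.0, gamma_hat=0.5, step_size=0.5)
print(estimate)

# With gamma_hat = 0 the target is always 1, i.e. no correction at all.
no_shift = {"a": 1.0, "b": 1.0}
cop_update(no_shift, "a", "b", rho=2.0, gamma_hat=0.0, step_size=0.5)
print(no_shift)
```

The update has no gradient through the target, which corresponds to the semi-gradient treatment mentioned above when the estimate is a parameterized function.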

Control: In this paper, we focus on policy-based control. In the on-policy continuing setting, we seek to optimize the objective

$$J_\pi = \sum_s d_\pi(s) i(s) v_\pi(s), \qquad (4)$$

which is equivalent to optimizing the average reward if both $\gamma$ and $i$ are constant (White, 2017). We usually set $i(s) \equiv 1$. We assume $\pi$ is parameterized by $\theta$. In the rest of this paper, all gradients are taken w.r.t. $\theta$ unless otherwise specified, and we consider the gradient for only one component of $\theta$ for simplicity.

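In the off-policy continuing setting, Degris et al. (2012) propose to optimize the excursion objective

$$J_\mu = \sum_s d_\mu(s) i(s) v_\pi(s) \qquad (5)$$

instead of the alternative life objective $J_\pi$. We can compute the policy gradient as

$$\nabla J_\mu = \sum_s d_\mu(s) i(s) \sum_a \big( q_\pi(s, a) \nabla \pi(a \mid s) + \pi(a \mid s) \nabla q_\pi(s, a) \big). \qquad (6)$$

Degris et al. (2012) prove in the Off-Policy Policy Gradient (OPPG) theorem that we can ignore the term $\pi(a \mid s) \nabla q_\pi(s, a)$ without introducing bias for a tabular policy (see the Errata in Degris et al. (2012), also Imani et al. (2018) and Maei (2018)) when $i(s) \equiv 1$, yielding a gradient update $\rho_t q_\pi(S_t, A_t) \nabla \log \pi(A_t \mid S_t)$, where $S_t$ is sampled from $d_\mu$ and $A_t$ is sampled from $\mu(\cdot \mid S_t)$. Based on this, Degris et al. (2012) propose the Off-Policy Actor-Critic (Off-PAC) algorithm. For a policy using a general function approximator, Imani et al. (2018) propose a new OPPG theorem. They define

$$F_t^{(1)} = i(S_t) + \gamma_t \rho_{t-1} F_{t-1}^{(1)}, \quad M_t^{(1)} = (1 - \lambda_1) i(S_t) + \lambda_1 F_t^{(1)}, \quad Z_t^{(1)} = \rho_t M_t^{(1)} q_\pi(S_t, A_t) \nabla \log \pi(A_t \mid S_t),$$

where $\lambda_1 \in [0, 1]$ is a constant used to optimize the bias-variance trade-off and $F_{-1}^{(1)} = 0$, and prove that $Z_t^{(1)}$ is an unbiased sample of $\nabla J_\mu$ when $t \to \infty$ and $\lambda_1 = 1$ for a general interest function $i$. Based on this, Imani et al. (2018) propose the Actor-Critic with Emphatic weightings (ACE) algorithm. ACE is an emphatic approach where $M_t^{(1)}$ is the emphasis to reweight the update.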
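The emphatically weighted actor update can be sketched for a tabular softmax policy. The sketch assumes the emphasis is a mix of the current interest and a followon trace, controlled by a trade-off constant called `lam` here, and that the true action value is supplied by the caller; the class and method names are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import List


def softmax(prefs: List[float]) -> List[float]:
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(prefs: List[float], action: int) -> List[float]:
    """Gradient of log pi(action) w.r.t. the action preferences of one state."""
    pi = softmax(prefs)
    return [(1.0 if a == action else 0.0) - pi[a] for a in range(len(prefs))]


class EmphaticActor:
    """Keeps a followon trace and returns the emphasis-weighted update direction."""

    def __init__(self, lam: float) -> None:
        self.lam = lam        # trade-off constant; lam = 0 ignores the trace
        self.followon = 0.0   # trace value from the previous step
        self.prev_rho = 0.0   # importance ratio from the previous step

    def update_direction(self, prefs: List[float], action: int, rho: float,
                         q_value: float, interest: float = 1.0,
                         discount: float = 1.0) -> List[float]:
        self.followon = interest + discount * self.prev_rho * self.followon
        emphasis = (1.0 - self.lam) * interest + self.lam * self.followon
        self.prev_rho = rho
        return [rho * emphasis * q_value * g for g in grad_log_softmax(prefs, action)]


plain = EmphaticActor(lam=0.0)
first = plain.update_direction([0.0, 0.0], action=0, rho=2.0, q_value=3.0)

emphatic = EmphaticActor(lam=1.0)
emphatic.update_direction([0.0, 0.0], action=0, rho=2.0, q_value=3.0)
second = emphatic.update_direction([0.0, 0.0], action=1, rho=1.0, q_value=1.0, discount=0.5)
print(first, second)
```

With `lam = 0` and unit interest the direction reduces to the importance ratio times the action value times the score function, which is the form of the Off-PAC update; with `lam = 1` the full trace is used.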

3 The Counterfactual Objective

Figure 1: (a) The two-circle MDP. Rewards are 0 unless specified on the edge. (b) The probability of transitioning to B from A under the target policy during training. (c) The influence of and on the final solution found by Geoff-PAC.

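We now introduce the counterfactual objective

$$J_{\hat{\gamma}} = \sum_s d_{\hat{\gamma}}(s) \hat{i}(s) v_\pi(s), \qquad (7)$$

where $\hat{i}$ is a user-defined interest function. Similarly, we can set $\hat{i}$ to 1 for the continuing setting but we proceed with a general $\hat{i}$. When $\hat{\gamma} = 1$, $J_{\hat{\gamma}}$ recovers the alternative life objective $J_\pi$. When $\hat{\gamma} = 0$, $J_{\hat{\gamma}}$ recovers the excursion objective $J_\mu$. To motivate the counterfactual objective $J_{\hat{\gamma}}$, we first present the two-circle MDP (Figure 1(a)) to highlight the difference between $J_\pi$ and $J_\mu$.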

In the two-circle MDP, an agent only needs to make a decision in state A. The behavior policy proceeds to B or C randomly with equal probability. The discount factor is set to for all transitions. We consider a continuing setting and set . Obviously, and hardly change w.r.t.  due to discounting, and we have and . To maximize , the target policy would prefer transitioning to state C to maximize . However, the policy maximizing (i.e., maximizing the average reward) would prefer transitioning to state B, which is what we usually want in an on-policy setting. Hence, maximizing the excursion objective gives an unexpected solution. This effect can also occur with for larger if the path is longer. With function approximation, the discrepancy can be magnified due to state aliasing, where we may want to make a trade-off between different states according to instead of (White, 2018).

One solution to this problem is to set the interest function in in a clever way. However, it is not clear how to achieve this without domain knowledge. Imani et al. (2018) simply set to 1. Another solution might be to optimize directly in off-policy learning, if one could use importance sampling ratios to fully correct to as Precup et al. (2001) propose for value-based methods in the episodic setting. However, this solution suffers from high variance and is infeasible for the continuing setting (Sutton et al., 2016).

In this paper, we propose to optimize $J_{\hat{\gamma}}$ instead. As we prove below, we have $\lim_{\hat{\gamma} \to 1} J_{\hat{\gamma}} = J_\pi$, indicating that when $\hat{\gamma}$ approaches 1, we can get arbitrarily close to $J_\pi$. Furthermore, we show empirically that a small $\hat{\gamma}$ (e.g., 0.6 in the two-circle MDP) is enough to generate a different solution from maximizing $J_\mu$.
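How the objective moves between the two extremes can be checked numerically on a toy chain. The sketch below assumes the reset interpretation (follow the target chain with probability `gamma_hat`, otherwise restart from the behavior distribution) and finds the resulting stationary distribution by fixed-point iteration. The two-state chain, the behavior distribution and the value vector are hypothetical placeholders, not the two-circle MDP.

```python
from typing import List, Optional

Matrix = List[List[float]]


def discounted_stationary(p_pi: Matrix, d_mu: List[float], gamma_hat: float,
                          iterations: int = 2000) -> List[float]:
    """Iterate d <- gamma_hat * P_pi^T d + (1 - gamma_hat) * d_mu."""
    n = len(d_mu)
    d = list(d_mu)
    for _ in range(iterations):
        d = [
            gamma_hat * sum(d[s] * p_pi[s][s_next] for s in range(n))
            + (1.0 - gamma_hat) * d_mu[s_next]
            for s_next in range(n)
        ]
    return d


def counterfactual_objective(p_pi: Matrix, d_mu: List[float], v_pi: List[float],
                             gamma_hat: float,
                             interest: Optional[List[float]] = None) -> float:
    d = discounted_stationary(p_pi, d_mu, gamma_hat)
    weights = interest if interest is not None else [1.0] * len(d)
    return sum(ds * w * v for ds, w, v in zip(d, weights, v_pi))


P_PI = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical target-policy chain
D_MU = [0.5, 0.5]                # hypothetical behavior stationary distribution
V_PI = [0.0, 6.0]                # placeholder values, not derived from rewards

j_excursion = counterfactual_objective(P_PI, D_MU, V_PI, 0.0)
j_half = counterfactual_objective(P_PI, D_MU, V_PI, 0.5)
j_alternative = counterfactual_objective(P_PI, D_MU, V_PI, 1.0)
print(j_excursion, j_half, j_alternative)
```

In this toy case the behavior distribution overweights the high-value state, so the excursion-style value is the largest and the objective decreases toward the alternative-life value as `gamma_hat` grows.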

Lemma 1

Assuming is ergodic, the sequence converges to uniformly when , for , where is a constant and

Proof. The pointwise convergence for each is a standard conclusion for ergodic MDPs (Theorem 4.9 in Levin and Peres 2017). To prove uniform convergence, we need a -independent bound on the distance between and . Details are provided in supplementary materials.

Theorem 1

Assuming is ergodic and , then ,

Proof. First, from (1), . It follows easily that . Second, for each , , together with the uniform convergence from Lemma 1, using the Moore-Osgood Theorem to interchange limits yields

It follows easily that and .

4 Generalized Off-Policy Policy Gradient

In this section, we derive an estimator for and show in Proposition 1 that it is unbiased. Our (standard) assumptions are given in supplementary materials. The OPPG theorem (Imani et al., 2018) leaves us the freedom to choose the interest function in . In this paper, we set , which, to our best knowledge, is the first time that a non-trivial interest is used. Hence, depends on and we cannot invoke OPPG directly as . However, we can still invoke the remaining parts of OPPG:


where . We now compute the gradient .

Theorem 2 (Generalized Off-Policy Policy Gradient Theorem)


Proof. We first use the product rule of calculus and plug in :

follows directly from (8). To show , we take gradients on both sides of (2). We have . Solving this linear system of leads to

With , follows easily.

Now we use an emphatic approach to provide an unbiased sample of . We define

Here functions as an intrinsic interest (in contrast with the user-defined extrinsic interest ) and is a sample for b. accumulates previous interests and translates b into g. is for bias-variance trade-off similar to Sutton et al. (2016); Imani et al. (2018). We now define

and proceed to show that is an unbiased sample of when .

Lemma 2

With , we have for .

Proof. The proof uses techniques similar to those of Sutton et al. (2016) and Hallak and Mannor (2017) and is provided in supplementary materials.

Proposition 1

With a fixed , , we have

Proof. follows directly from Proposition 1 in Imani et al. (2018); involves Lemma 2 and other conditional independence. Details are provided in supplementary materials.

So far, we discussed the policy gradient for a single dimension of the policy parameter , so are all scalars. When we compute policy gradients for the whole in parallel, remain scalars while become vectors of the same size as . This is because our intrinsic interest “function” is a multi-dimensional random variable, instead of a deterministic scalar function like . We, therefore, generalize the concept of interest.

So far, we also assumed access to the true covariate shift and the true value function . We can plug in their estimation and , yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. The covariate shift estimation can be learned via the learning rule in (3). The value estimation can be learned by any off-policy prediction algorithm, e.g., one-step off-policy TD (Sutton and Barto, 2018), GTD, (Discounted) COP-TD or V-trace (Espeholt et al., 2018). Pseudocode of Geoff-PAC is provided in supplementary materials.

We now discuss two potential practical issues with Geoff-PAC. First, GOPPG requires . In practice, this means has been executed for a long time and can be satisfied by a warm-up before training. Second, GOPPG provides an unbiased sample for a fixed policy . Once is updated, will be invalidated as well as . As their update rule does not have a learning rate, we cannot simply use a larger learning rate for as we would do for . This issue also appeared in Imani et al. (2018). In principle, we could store previous transitions in a replay buffer (Lin, 1992) and replay them for a certain number of steps after is updated. In this way, we can satisfy the requirement and get the up-to-date . In practice, we found this unnecessary. When we use a small learning rate for , we assume changes slowly and ignore this invalidation effect.

5 Experimental Results

Our experiments aim to answer the following questions. 1) Can Geoff-PAC find the same solution as on-policy policy gradient algorithms in the two-circle MDP as promised? 2) How does the excursion length influence the solution? 3) Can Geoff-PAC scale up to challenging tasks like robot simulation in Mujoco with neural network function approximators? 4) Can the counterfactual objective in Geoff-PAC translate into performance improvement over Off-PAC and ACE? 5) How does Geoff-PAC compare with other downstream extensions of OPPG, e.g., DDPG?

5.1 Two-circle MDP

We implemented a tabular version of ACE and Geoff-PAC for the two-circle MDP. The behavior policy was random, and we monitored the probability from A to B under the target policy . In Figure 1(b), we plot this probability during training. The curves are averaged over 30 runs and the shaded regions indicate standard error. We set the trade-off parameters so that both ACE and Geoff-PAC are unbiased. For Geoff-PAC, was set to . ACE converges to the correct policy that maximizes the excursion objective as expected, while Geoff-PAC converges to the policy that maximizes the alternative life objective, the policy we want in on-policy training. Figure 1(c) shows how manipulating and can influence the final solution. In this two-circle MDP, has little influence on the final solution, while manipulating can significantly change the final solution.
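The averaging behind such curves can be sketched as below. It assumes the standard error is the sample standard deviation across runs divided by the square root of the number of runs, which the text does not spell out; the function name and the sample data are illustrative.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def mean_and_standard_error(runs: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-checkpoint mean and standard error across independent runs.

    `runs[k][j]` is the metric of run k at evaluation checkpoint j.
    """
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    means: List[float] = []
    errors: List[float] = []
    for scores in zip(*runs):  # iterate over checkpoints
        means.append(mean(scores))
        errors.append(stdev(scores) / math.sqrt(n))
    return means, errors


runs = [[1.0, 2.0], [3.0, 2.0]]
means, errors = mean_and_standard_error(runs)
print(means, errors)
```

The shaded region of a curve would then span the mean minus and plus the standard error at each checkpoint.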

5.2 Robot Simulation

Figure 2: Comparison among Off-PAC, ACE and Geoff-PAC. Black dashed lines are random agents.
Figure 3: Comparison between DDPG and Geoff-PAC

Evaluation: We benchmarked Off-PAC, ACE and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI gym (Brockman et al., 2016). As all the original tasks are episodic, we adopted techniques similar to those of White (2017) to compose continuing tasks. We set the discount function to 0.99 for all non-termination transitions and to 0 for all termination transitions. The agent was teleported back to the initial states upon termination. The interest function was always 1. This setting complies with the common training scheme for Mujoco tasks (Lillicrap et al., 2015; Asadi and Williams, 2016). However, we interpret the tasks as continuing tasks. As a consequence, $J_\pi$, instead of episodic return, is the proper metric to measure the performance of a policy $\pi$. The behavior policy $\mu$ is a fixed uniformly random policy, the same as in Gelada and Bellemare (2019). The data generated by $\mu$ is significantly different from that of any meaningful policy in those tasks. Thus, this setting exhibits a high degree of off-policyness. We monitored $J_\pi$ periodically during training. To evaluate $J_\pi$, states were sampled according to $\pi$, and $v_\pi$ was approximated via Monte Carlo return. Evaluation based on the commonly used total undiscounted episodic return criterion and more discussion about this criterion are provided in supplementary materials. The curves under the two criteria are almost identical.

Implementation: Although emphatic algorithms have enjoyed great theoretical success (Yu, 2015; Hallak et al., 2016; Sutton et al., 2016; Imani et al., 2018), their empirical success is still limited to simple domains (e.g., simple hand-crafted Markov chains, cart-pole balancing) with linear function approximation. To our best knowledge, this is the first time that emphatic algorithms are evaluated in challenging robot simulation tasks with neural network function approximators. To stabilize training, we adopted the A3C (Mnih et al., 2016) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer. All three algorithms share the same architecture and the same parameterization. We first tuned hyperparameters for Off-PAC. ACE and Geoff-PAC inherited common hyperparameters from Off-PAC. For DDPG, we used the same architecture and hyperparameters as Lillicrap et al. (2015). More details are provided in supplementary materials.

Results: We first studied the influence of on ACE and the influence of on Geoff-PAC in HalfCheetah. The results are reported in supplementary materials. We found ACE was not sensitive to and set for all experiments. For Geoff-PAC, we found produced good empirical results and used this combination for all remaining tasks. All curves are averaged over 10 independent runs and shaded regions indicate standard error. Figure 2 compares Geoff-PAC, ACE, and Off-PAC. Geoff-PAC significantly outperforms ACE and Off-PAC in three out of five tasks. The performance on Walker and Reacher is similar. This performance improvement supports our claim that optimizing can better approximate than optimizing . We also report the performance of a random agent for reference. Figure 3 compares Geoff-PAC and DDPG. Geoff-PAC outperforms DDPG in Hopper and Swimmer. DDPG with a uniformly random policy exhibits high instability in HalfCheetah, Walker, and Hopper. This is expected because DDPG fully ignores the discrepancy between and . As training progresses, this discrepancy gets larger and finally yields a performance drop. This is not a fair comparison in that many design choices for DDPG and Geoff-PAC are different (e.g., one worker vs. multiple workers, deterministic vs. stochastic policy, network architectures), and we do not expect Geoff-PAC to outperform all applications of OPPG. However, this comparison does suggest GOPPG sheds light on how to improve applications of OPPG.

6 Related Work

There have been many applications of OPPG, e.g., DPG (Silver et al., 2014), DDPG (Lillicrap et al., 2015), ACER (Wang et al., 2016), EPG (Ciosek and Whiteson, 2017), and IMPALA (Espeholt et al., 2018). Particularly, Gu et al. (2017) propose IPG to unify on- and off-policy policy gradients. IPG is a mix of the gradients from the on-policy objective and the excursion objective. To compute the gradients of the on-policy objective, IPG does need on-policy samples. In this paper, the counterfactual objective is a mix of objectives, and we do not need on-policy samples to compute the policy gradient of the counterfactual objective. Mixing and directly in IPG-style is a possibility for future work.

There have been other policy-based off-policy algorithms. Maei (2018) provide an unbiased sample for , assuming the value function is linear. Theoretical results are provided without empirical study. Imani et al. (2018) eliminate the linear assumption and provide a thorough empirical study. We therefore conduct our comparison with Imani et al. (2018) instead of Maei (2018). In another line of work, the policy entropy is used for reward shaping. The target policy can then be derived from the value function directly (O’Donoghue et al., 2016; Nachum et al., 2017a; Schulman et al., 2017a). This line of work includes the deep energy-based RL (Haarnoja et al., 2017, 2018), where a value function is learned off-policy and the policy is derived from the value function directly, and path consistency learning (Nachum et al., 2017a, b), where gradients are computed to satisfy certain path consistencies. This line of work is orthogonal to this paper, where we compute the policy gradients of a given objective directly in an off-policy manner.

Besides the stochastic approximation approaches to learn the covariate shift (Hallak and Mannor, 2017; Gelada and Bellemare, 2019), a closed-form solution can be obtained for the case of Reproducing Kernel Hilbert Space for policy evaluation by using a minimax loss (Liu et al., 2018). All three works are value-based methods. To our best knowledge, we are the first to use the covariate shift for policy-based methods and estimate the policy gradient of the covariate shift via emphatic learning.

7 Conclusions

In this paper, we introduced the counterfactual objective unifying the excursion objective and the alternative life objective in the continuing RL setting. We further provided the Generalized Off-Policy Policy Gradient Theorem and corresponding Geoff-PAC algorithm. GOPPG is the first example that a non-trivial interest function is used, and Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

There have been numerous applications of OPPG including DDPG, ACER, IPG, EPG and IMPALA. We expect GOPPG to shed light on improving those extensions. Theoretically, a convergence analysis of Geoff-PAC involving the compatible function assumption (Sutton et al., 2000) or multi-timescale stochastic approximation (Borkar, 2009) is also worth further investigation.


SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. Special thanks to Richard S. Sutton, who gave this project its initial impetus.


Appendix A Assumptions and Proofs

A.1 Assumptions

We use the same standard assumptions as Yu (2015) and Imani et al. (2018). We also assume exists for all .

A.2 Proof of Lemma 1

Proof. From Proposition 1.7 in Levin and Peres (2017), there exists an integer such that all the entries of are strictly positive. By definition,

where all the entries of X are non-negative. With representing elementwise comparison between matrices, we have

Let be the minimum element in , we have and

This leads to . Theorem 4.9 in Levin and Peres (2017) implies that

for all where represents the total variation norm. Neither nor depends on . Uniform convergence follows easily.

A.3 Proof of Lemma 2


(Law of total expectation and Markov property)
(Bayes’ rule and definition of )