Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

01/27/2019 ∙ by Carles Gelada, et al. ∙ Google

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.'s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.





Central to reinforcement learning is the idea that an agent should learn from experience. While many algorithms learn in a purely online fashion, sample-efficient methods typically make use of past data, viewed either as a fixed dataset or stored in a replay memory (Lin 1993; Mnih et al. 2015). Because this past data may not be generated according to the policy currently under evaluation, the agent is said to be learning off-policy (Sutton and Barto 2018).

By now it is well documented that off-policy learning may carry a significant cost when combined with function approximation. Early results have shown that estimating the value function off-policy, using Bellman updates, may diverge (Baird 1995; Tsitsiklis and Van Roy 1997). More recently, value divergence was perhaps the most significant issue dealt with in the design of the DQN agent (Mnih et al. 2015), and remains a source of concern in deep reinforcement learning (van Hasselt, Guez, and Silver 2016).

Further, under off-policy learning, the quality of the Bellman fixed point suffers, as studied by Kolter (2011) and Munos (2003). The value function error can be unboundedly large even if the value function can be perfectly approximated. Hence, even in the case where convergence to the fixed point with off-policy data occurs, solutions can be of poor quality. Thus, the existing TD learning algorithms with convergence guarantees under off-policy data (Maei et al. 2009; Sutton et al. 2009) can still suffer from off-policy issues.

This paper studies the covariate shift method for dealing with the off-policy problem. The covariate shift method, studied by Hallak and Mannor (2017) and Sutton, Mahmood, and White (2016), reweights online updates according to the ratio of the target and behavior stationary distributions. Under optimal conditions, the covariate shift method recovers convergent behavior with linear approximation, breaking what Sutton and Barto (2018) call the “deadly triad” of reinforcement learning. We argue the method is particularly appealing in the context of replay memories, where the reweighting can be replaced by a reprioritization scheme similar to that of Schaul et al. (2016).

We improve on Hallak and Mannor’s (2017) COP-TD algorithm, which has provable guarantees but is difficult to implement in a deep reinforcement learning setting. First, we introduce a discount factor into their update rule to obtain a more stable algorithm. Second, we develop an alternative normalization scheme that can be combined with deep networks, avoiding the projection step necessary in the original algorithm. We perform an empirical study of the two methods and their variants on the game Pong from the Arcade Learning Environment and find that our improvements give rise to significant benefits for off-policy learning.


We study the standard RL setting in which an agent interacts with an environment by observing a state, selecting an action, receiving a reward, and observing the next state. We model this process with a Markov Decision Process $\langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$. Here, $\mathcal{X}$, $\mathcal{A}$ denote the state and action spaces and $P$ is the transition function. Throughout we will assume that $\mathcal{X}$ and $\mathcal{A}$ are finite and write $n := |\mathcal{X}|$. A policy $\pi$ maps a state to a distribution over actions, $R$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.

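We are interested in the policy evaluation problem, where we seek to learn the value function $V^\pi$ of a policy $\pi$ from samples. The value function is the expected sum of discounted rewards from a state $x$ when following policy $\pi$:

$$V^\pi(x) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \,\Big|\, x_0 = x \Big],$$

where each action is drawn from the policy $\pi$, i.e. $a_t \sim \pi(\cdot \mid x_t)$, and states are drawn from the transition function: $x_{t+1} \sim P(\cdot \mid x_t, a_t)$. We combine the policy and transition function into a state-to-state transition function $P_\pi$, whose entries are

$$P_\pi(x' \mid x) := \sum_{a \in \mathcal{A}} \pi(a \mid x) P(x' \mid x, a).$$

Let $R_\pi(x) := \mathbb{E}_{a \sim \pi(\cdot \mid x)} R(x, a)$ be the expected reward under $\pi$. One of the key properties of the value function is that it satisfies the Bellman equation:

$$V^\pi(x) = R_\pi(x) + \gamma \sum_{x' \in \mathcal{X}} P_\pi(x' \mid x) V^\pi(x').$$

In vector notation (Puterman 1994), this becomes

$$V^\pi = R_\pi + \gamma P_\pi V^\pi,$$

where $P_\pi \in \mathbb{R}^{n \times n}$, $R_\pi \in \mathbb{R}^n$, and $V^\pi \in \mathbb{R}^n$. The value function is in fact the fixed point of the Bellman operator $T_\pi$, defined as

$$T_\pi V := R_\pi + \gamma P_\pi V.$$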

The Bellman operator describes a single step of dynamic programming (Bellman 1957) or bootstrap; the process $V_{k+1} := T_\pi V_k$ converges to $V^\pi$. More interestingly for us, the operator also describes the expected behavior of learning rules such as temporal-difference learning (Sutton 1988) and consequently their learning dynamics (Tsitsiklis and Van Roy 1997). In the sequel, whenever we analyze the behavior of operators it is with this relationship in mind.
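As a small illustration of the bootstrap process described above, the following sketch repeatedly applies the Bellman operator to a two-state chain. The chain, rewards, and discount factor (`P_PI`, `R_PI`, `GAMMA`) are hypothetical values chosen for illustration, and transition matrices are assumed to be row-stochastic.

```python
def bellman_operator(values, rewards, transitions, gamma):
    """Apply T_pi once: (T_pi V)(x) = R_pi(x) + gamma * sum_x' P_pi(x'|x) V(x')."""
    n = len(values)
    return [
        rewards[x] + gamma * sum(transitions[x][y] * values[y] for y in range(n))
        for x in range(n)
    ]


def evaluate_policy(rewards, transitions, gamma, sweeps=500):
    """Iterate V_{k+1} = T_pi V_k starting from V_0 = 0."""
    values = [0.0] * len(rewards)
    for _ in range(sweeps):
        values = bellman_operator(values, rewards, transitions, gamma)
    return values


# Hypothetical two-state chain; each row of P_PI sums to one.
P_PI = [[0.9, 0.1], [0.2, 0.8]]
R_PI = [1.0, 0.0]
GAMMA = 0.9

v_pi = evaluate_policy(R_PI, P_PI, GAMMA)
# Bellman residual: close to zero once the iterates have converged.
residual = max(
    abs(v - tv) for v, tv in zip(v_pi, bellman_operator(v_pi, R_PI, P_PI, GAMMA))
)
print(v_pi, residual)
```

Because the operator contracts by a factor of `GAMMA` per sweep, a few hundred sweeps are more than enough for this toy problem.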

In this paper we consider the process of learning $V^\pi$ from samples drawn from $P$ and a behavior policy $\mu \neq \pi$. Following standard usage, we call this process off-policy learning (Sutton and Barto 2018).

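Let $d \in \mathbb{R}^n$. We write $D_d \in \mathbb{R}^{n \times n}$ for the corresponding diagonal matrix. For a matrix $A \in \mathbb{R}^{n \times n}$, the $A$-weighted squared seminorm of a vector $v \in \mathbb{R}^n$ is $\|v\|_A^2 := v^\top A v$. We specialize this notation to vectors $d$ in $\mathbb{R}^n$ as $\|v\|_d^2 := \|v\|_{D_d}^2$. We write $e \in \mathbb{R}^n$ for the vector of all ones and $\Delta(\mathcal{X})$ for the simplex over states ($d \in \Delta(\mathcal{X})$ if and only if $d \ge 0$ and $e^\top d = 1$). Finally, recall that $d \in \Delta(\mathcal{X})$ is the stationary distribution of a transition function $P$ if and only if

$$d^\top = d^\top P.$$

This distribution is unique when $P$ defines a Markov chain with a single recurrent class (called a unichain; Meyn and Tweedie 2012). Throughout we will write $d_\pi$ and $d_\mu$ for the stationary distributions of $P_\pi$ and $P_\mu$, respectively.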

Off-Policy Learning with Linear Approximation

In most practical applications the size of the state space precludes so-called tabular representations, which learn the value of each state separately. Instead, one must approximate the value function. One common scheme is linear function approximation, which uses a mapping from states to features $\phi : \mathcal{X} \to \mathbb{R}^k$. The approximate value function at $x$ is then the inner product of a feature vector with a vector of weights $\theta \in \mathbb{R}^k$:

$$\hat V(x) := \phi(x)^\top \theta. \qquad (1)$$

If $\Phi \in \mathbb{R}^{n \times k}$ denotes the matrix of row-feature vectors, (1) becomes, in vector notation:

$$\hat V = \Phi \theta.$$

The semi-gradient update rule for TD learning (Sutton and Barto 2018) learns an approximation of $V^\pi$ from sample transitions. Given a starting state $x$, a successor state $x' \sim P_\pi(\cdot \mid x)$, and a step-size parameter $\alpha > 0$, this update is

$$\theta \leftarrow \theta + \alpha \big[ R_\pi(x) + \gamma \phi(x')^\top \theta - \phi(x)^\top \theta \big] \phi(x). \qquad (2)$$

While (2) does not correspond to a proper gradient descent procedure (see e.g. Barnard 1993), it can be shown to converge, as we shall now see.
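The semi-gradient rule (2) can be sketched as a single function operating on feature vectors. This is a minimal illustration rather than a full agent: the function name `td_update` and the one-hot features in the usage lines are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td_update(theta, phi_x, phi_next, reward, gamma, alpha):
    """One semi-gradient TD(0) step with linear features.

    The bootstrap target reward + gamma * phi(x')^T theta is treated as a
    constant, so only phi(x) appears as the update direction.
    """
    td_error = reward + gamma * dot(phi_next, theta) - dot(phi_x, theta)
    return [w + alpha * td_error * f for w, f in zip(theta, phi_x)]


# Hypothetical transition between two one-hot encoded states.
theta = td_update(
    theta=[0.0, 0.0],
    phi_x=[1.0, 0.0],
    phi_next=[0.0, 1.0],
    reward=1.0,
    gamma=0.9,
    alpha=0.1,
)
print(theta)
```

Only the weight associated with the starting state moves, which is what makes the rule a "semi" gradient: the dependence of the target on `theta` is ignored.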

The expected behavior of the semi-gradient update rule is described by the projected Bellman operator, denoted $\Pi_d T_\pi$ for some distribution $d \in \Delta(\mathcal{X})$ (Tsitsiklis and Van Roy 1997). The projected Bellman operator is the combination of the usual Bellman operator with a projection $\Pi_d$ in $d$-weighted norm onto the span of $\Phi$. Typically, the learning rule (2) is studied in an online setting, where samples correspond to an agent sequentially experiencing the environment. In the simplest case where an agent follows a single behavior policy $\mu$, this corresponds to $d = d_\mu$.

The stationary point $\theta^*$ of (2), if it exists, is the solution of the projected Bellman equation

$$\Phi \theta^* = \Pi_d T_\pi \Phi \theta^*.$$

When the projection is performed under the stationary distribution $d_\pi$, (2) converges to this fixed point provided $\alpha$ is taken to satisfy the Robbins-Monro conditions and other mild assumptions (see Tsitsiklis and Van Roy 1997). Taking $d \neq d_\pi$, however, may lead to divergence of the weight vector in (2). A sign of the importance of this issue can be seen in Sutton and Barto’s (2018) choice to dub the combination of off-policy learning, function approximation, and bootstrapping the “deadly triad”.

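A prerequisite to guarantee the convergence of (2) to $\theta^*$ is that the iteration

$$V_{k+1} := \Pi_d T_\pi V_k \qquad (3)$$

should also converge for any initial condition $V_0 \in \mathbb{R}^n$. Tsitsiklis and Van Roy (1997) proved convergence when $d = d_\pi$ by showing that the projected Bellman operator is a contraction in $d_\pi$-weighted norm. That is, for any $V, V' \in \mathbb{R}^n$,

$$\| \Pi_{d_\pi} T_\pi V - \Pi_{d_\pi} T_\pi V' \|_{d_\pi} \le \gamma \| V - V' \|_{d_\pi},$$

from which an application of Banach’s fixed point theorem allows us to conclude that $V_k \to \Phi \theta^*$. More formally, the result follows from noting that the induced operator norm of $P_\pi$ in $d_\pi$-weighted norm is at most 1. The lack of a similar result when $d \neq d_\pi$ explains the divergence, and is by now well documented in the literature (Baird 1995).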

Independent of the convergence issues raised by off-policy learning, the fixed point of the Bellman equation with linear function approximation is also affected. A metric for the quality of a value function $\hat V$ is $\|\hat V - V^\pi\|_{d_\pi}$, the expected value prediction error when sampling states from the stationary distribution of the policy evaluated. Under some conditions, we can bound the quality of the fixed point under off-policy data as a constant factor times the optimal prediction error.

Theorem 1.

[Based on munos03error munos03error] Let be some arbitrary distribution. Suppose that and there is a fixed point to the projected Bellman equation . Then its approximation error in -weighted norm is at most

Furthermore, this error is minimized when .

Theorem 1 is interesting because it suggests that is also the optimal in the sense that it yields the smallest approximation bound. Kolter2011TheFP Kolter2011TheFP showed that when Theorem 1 does not apply (because ) it is possible to construct examples where the fixed point error is unbounded, even if (i.e. cases where a perfect solution exists). Thus, no general bound on the quality of the off-policy fixed point exists.

Not only do we expect that improvements in off-policy learning should lead to more stable learning behavior, but also to improved quality of the value functions which, in the control setting, should translate to increases in performance. In this paper we will study the covariate shift approach to off-policy learning, where updates in (2) are reweighted so as to induce a projection under $d_\pi$.

The Covariate Shift Approach

Suppose that the stationary distributions $d_\pi$ and $d_\mu$ are known, and that states are updated according to the distribution $d_\mu$, with actions drawn as $a \sim \mu(\cdot \mid x)$. We use importance sampling (e.g. Precup, Sutton, and Singh 2000) to define the update rule

$$\theta \leftarrow \theta + \alpha \frac{\pi(a \mid x)}{\mu(a \mid x)} \big[ R(x, a) + \gamma \phi(x')^\top \theta - \phi(x)^\top \theta \big] \phi(x),$$

which, where as before $x' \sim P(\cdot \mid x, a)$, is equivalent in expectation to applying the semi-gradient update rule (2) under the sampling distribution $d_\mu$. Further multiplying the update term by $d_\pi(x) / d_\mu(x)$, we recover (in expectation) the semi-gradient update rule for learning $V^\pi$ under the sampling distribution $d_\pi$ (Hallak and Mannor 2017). Thus, provided we reweight updates correctly, we obtain a provably convergent off-policy algorithm.

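The COP-TD learning rule proposed by Hallak and Mannor (2017) learns the ratio $d_\pi / d_\mu$ from samples. Although much of the original work is concerned with the combined learning dynamics of the value and the ratio, we will focus on the process by which this ratio is learned.

Similar to temporal difference learning, COP-TD estimates $d_\pi / d_\mu$ by bootstrapping from a previous prediction. Given a step-size $\alpha > 0$, a ratio vector $c \in \mathbb{R}^n$ and a sample transition $(x, a, x')$ where $x \sim d_\mu$, $a \sim \mu(\cdot \mid x)$, and $x' \sim P(\cdot \mid x, a)$, COP-TD performs the following update:

$$c(x') \leftarrow c(x') + \alpha \Big[ \frac{\pi(a \mid x)}{\mu(a \mid x)} c(x) - c(x') \Big]. \qquad (4)$$

Note that this update rule learns “in reverse” compared to TD learning. The expected behavior of the update rule is captured by the COP operator $Y$:

$$(Y c)(x') := \mathbb{E}_{x \sim d_\mu,\, a \sim \mu} \Big[ \frac{\pi(a \mid x)}{\mu(a \mid x)} c(x) \,\Big|\, x' \Big].$$

In vector notation, this operator is:

$$Y c = D_{d_\mu}^{-1} P_\pi^\top D_{d_\mu} c. \qquad (5)$$

Any multiple of $d_\pi / d_\mu$ is a fixed point of $Y$: $Y (\beta\, d_\pi / d_\mu) = \beta\, d_\pi / d_\mu$, for $\beta \in \mathbb{R}$. Hallak and Mannor (2017), under the assumption that the transition matrix $P_\pi$ has a full set of real eigenvectors, give a partial proof that the iterates $c_{k+1} := Y c_k$ converge to such a fixed point. Our first result is to provide an alternative proof of convergence that does not require this assumption.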

Theorem 2.

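Suppose that $P_\pi$ defines an ergodic Markov chain on the state space $\mathcal{X}$, and let $c_0 \in \mathbb{R}^n$ have positive entries. Then the process $c_{k+1} := Y c_k$ converges to $\beta\, d_\pi / d_\mu$, where $\beta$ is a positive scalar.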

Corollary 1.

Suppose that the conditions of Theorem 2 are met. Define the normalized COP operator

$$\bar Y c := \frac{Y c}{d_\mu^\top Y c}.$$

Then the unique fixed point of the operator $\bar Y$ is the ratio $d_\pi / d_\mu$, to which the process $c_{k+1} := \bar Y c_k$ converges.
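To make the normalized COP iteration concrete, the sketch below runs it in the tabular case on a hypothetical two-state problem. It assumes row-stochastic transition matrices and computes stationary distributions by power iteration; the matrices `P_PI` and `P_MU` are illustrative choices, not taken from the experiments.

```python
def stationary(transitions, sweeps=2000):
    """Power iteration for d^T = d^T P on a row-stochastic, ergodic matrix."""
    n = len(transitions)
    d = [1.0 / n] * n
    for _ in range(sweeps):
        d = [sum(d[x] * transitions[x][y] for x in range(n)) for y in range(n)]
    return d


def cop_operator(c, p_pi, d_mu):
    """(Y c)(x') = sum_x d_mu(x) P_pi(x'|x) c(x) / d_mu(x')."""
    n = len(c)
    return [
        sum(d_mu[x] * p_pi[x][y] * c[x] for x in range(n)) / d_mu[y]
        for y in range(n)
    ]


def normalized_cop(c, p_pi, d_mu):
    """Apply Y, then rescale so that the d_mu-weighted mean of the ratio is 1."""
    yc = cop_operator(c, p_pi, d_mu)
    scale = sum(d * v for d, v in zip(d_mu, yc))
    return [v / scale for v in yc]


# Hypothetical target and behavior chains over two states.
P_PI = [[0.9, 0.1], [0.2, 0.8]]
P_MU = [[0.5, 0.5], [0.5, 0.5]]
d_pi = stationary(P_PI)
d_mu = stationary(P_MU)

c = [1.0, 1.0]
for _ in range(200):
    c = normalized_cop(c, P_PI, d_mu)

true_ratio = [p / m for p, m in zip(d_pi, d_mu)]
print(c, true_ratio)
```

In this example the iterates approach the ratio $d_\pi / d_\mu$ at a rate governed by the second eigenvalue of `P_PI`, which is the dependence on mixing time discussed later in the text.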

COP-TD with Linear Function Approximation

The covariate shift method is called for when the value function is approximated. Under these circumstances, one might expect that we also need to learn an approximate ratio $\hat c$. Hallak and Mannor (2017) consider the linear approximation

$$\hat c(x) := \phi(x)^\top w,$$

where $w \in \mathbb{R}^k$. (In practice, we may avoid negative $\hat c$’s by clipping them at 0.) This gives rise to a semi-gradient update rule similar to (2) but implementing (4):

$$w \leftarrow w + \alpha \Big[ \frac{\pi(a \mid x)}{\mu(a \mid x)} \phi(x)^\top w - \phi(x')^\top w \Big] \phi(x'),$$

and also followed by a projection step on the $d_\mu$-weighted simplex, defined by the set of weights $w$ such that $\Phi w \ge 0$ and $d_\mu^\top \Phi w = 1$.

The projection step ensures that the approximate ratio corresponds to some distribution ratio $d / d_\mu$ for $d \in \Delta(\mathcal{X})$. The combined process is summarized by the normalized COP operator, whose repeated application converges to some approximate ratio.

One interesting fact is that the semi-gradient update rule, which corresponds to a $d_\mu$-weighted projection, is by itself insufficient to guarantee the good behavior of the algorithm.

Lemma 1.

Let be a symmetric COP-TD operator and be the projection onto in norm. If is not in the span of , then is the only solution to

Lemma 1 argues that the normalization step is not only a convenience but is in fact necessary for the process to converge to anything meaningful. This is further validated by numerical experiments in more general settings, where we observe that repeated application of the unnormalized projected operator either converges to 0 or diverges.

A Practical COP-TD

In this paper we are concerned with the application of COP-TD to practical scenarios, where approximating the ratio is a must. As the following observations suggest, however, there are a number of limitations to COP-TD.

Lack of contraction factor. The operator $Y$ is not in general a contraction mapping. Hence, while the iterative process converges, it may do so at a slow rate, with greater variations in the sample-based case, and more importantly may be unstable when combined with function approximation.

Hard-to-satisfy projection step. In the approximate case, we saw that it is necessary to combine the COP operator with a projection onto the $d_\mu$-weighted simplex. Although it is possible to approximate this projection step in an online, sample-based manner for linear function approximation (Hallak and Mannor 2017 recommend constraining the weights to the simplex generated by a sufficiently large sample), no counterpart exists for more general classes of function approximators, making COP-TD hard to combine with neural networks.

In what follows we address these two issues in turn.

The Discounted COP Learning Rule

While repeated applications of the COP operator converge, the operator is not in general a contraction mapping, and its convergence profile is tied to the (usually unknown) mixing time of the Markov chain described by $P_\pi$. Our main contribution is the $\hat\gamma$-discounted COP-TD learning rule, which recovers COP-TD for $\hat\gamma = 1$.

Definition 1.

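Let $c \in \mathbb{R}^n$. For a step-size $\alpha > 0$, discount factor $\hat\gamma \in [0, 1]$, and sample $(x, a, x')$ drawn respectively from $d_\mu$, $\mu(\cdot \mid x)$ and $P(\cdot \mid x, a)$, the $\hat\gamma$-discounted COP-TD learning rule is

$$c(x') \leftarrow c(x') + \alpha \Big[ \hat\gamma \frac{\pi(a \mid x)}{\mu(a \mid x)} c(x) + (1 - \hat\gamma) - c(x') \Big]. \qquad (6)$$

The corresponding operator is

$$Y_{\hat\gamma} c := \hat\gamma\, Y c + (1 - \hat\gamma) e.$$

By inspection, it is clear that $Y_1 = Y$. However, as we will see, the discounted COP-TD learning rule has several desirable properties compared to its undiscounted counterpart. We begin by characterizing the discounted COP operator.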
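The sample-based rule of Definition 1 can be sketched in tabular form as follows. The function name and the argument `rho` (the action probability ratio $\pi(a \mid x) / \mu(a \mid x)$ for the sampled action) are naming assumptions for this illustration.

```python
def discounted_cop_td_update(c, x, x_next, rho, gamma_hat, alpha):
    """One tabular discounted COP-TD step on the successor state's ratio.

    c:         list of current ratio estimates, one per state
    x, x_next: indices of the sampled state and its successor
    rho:       pi(a|x) / mu(a|x) for the sampled action
    gamma_hat: COP discount factor; gamma_hat = 1 gives the undiscounted rule
    """
    target = gamma_hat * rho * c[x] + (1.0 - gamma_hat)
    updated = list(c)
    updated[x_next] += alpha * (target - c[x_next])
    return updated


# Hypothetical sample: transition from state 0 to state 1 with rho = 2.
c_discounted = discounted_cop_td_update(
    [1.0, 1.0], x=0, x_next=1, rho=2.0, gamma_hat=0.9, alpha=0.1
)
c_undiscounted = discounted_cop_td_update(
    [1.0, 1.0], x=0, x_next=1, rho=2.0, gamma_hat=1.0, alpha=0.1
)
print(c_discounted, c_undiscounted)
```

Note that it is the estimate at the successor state that moves, reflecting the "in reverse" direction of learning, and that the discounted target is pulled towards 1 by the `(1 - gamma_hat)` term.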

Definition 2.

For a given , we define the discounted reset transition function as:

where is the matrix whose columns are all .

The discounted reset transition function, which we denote $P_{\hat\gamma}$, can be understood as a process which either transitions as usual with probability $\hat\gamma$, or resets to the stationary distribution $d_\mu$ with the remaining probability $1 - \hat\gamma$. This is analogous to the perspective of the discount factor as a probability of terminating [White2017], and is related to the constraint that arises in the dual formulation of the value function [Wang et al.2008].

We denote by $d_{\hat\gamma}$ the stationary distribution of $P_{\hat\gamma}$. As an aside, the inclusion of the reset guarantees the ergodicity of the Markov chain defined by $P_{\hat\gamma}$.

Proposition 1.

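The stationary distribution $d_{\hat\gamma}$ is given by

$$d_{\hat\gamma} = (1 - \hat\gamma) \sum_{t=0}^{\infty} \hat\gamma^t (P_\pi^\top)^t d_\mu,$$

where the sum

$$\sum_{t=0}^{\infty} \hat\gamma^t (P_\pi^\top)^t$$

is convergent for $\hat\gamma < 1$.

Put another way, $d_{\hat\gamma}$ describes an exponentially weighted sum of $t$-step deviations from the behavior policy’s stationary distribution $d_\mu$, where the $t$-step deviation corresponds to applying the transition function $P_\pi$ $t$ times.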

Lemma 2.

For $\hat\gamma < 1$, the ratio $d_{\hat\gamma} / d_\mu$ is the unique fixed point of the operator $Y_{\hat\gamma}$.

Theorem 3.

Let $c_0 \in \mathbb{R}^n$. For $\hat\gamma < 1$ the process $c_{k+1} := Y_{\hat\gamma} c_k$ converges to $d_{\hat\gamma} / d_\mu$, where $d_{\hat\gamma}$ is the stationary distribution of the transition function $P_{\hat\gamma}$ corresponding to the given $\hat\gamma$.

One of the most appealing properties of the discounted operator (for $\hat\gamma < 1$) is that it requires neither normalization nor positive initial values to guarantee convergence. As we shall see, this greatly simplifies the learning process.
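The following tabular sketch illustrates these properties numerically. Starting from an arbitrary vector with a negative entry and applying no normalization, the discounted COP iterates are compared against the series expression for the discounted stationary distribution. The two-state chain, the uniform `D_MU`, and `GAMMA_HAT = 0.9` are hypothetical choices, and transition matrices are assumed to be row-stochastic.

```python
def discounted_cop(c, p_pi, d_mu, gamma_hat):
    """(Y_g c)(x') = g * sum_x d_mu(x) P_pi(x'|x) c(x) / d_mu(x') + (1 - g)."""
    n = len(c)
    return [
        gamma_hat * sum(d_mu[x] * p_pi[x][y] * c[x] for x in range(n)) / d_mu[y]
        + (1.0 - gamma_hat)
        for y in range(n)
    ]


def discounted_stationary(p_pi, d_mu, gamma_hat, terms=500):
    """Truncated series (1 - g) * sum_t g^t (P_pi^T)^t d_mu."""
    n = len(d_mu)
    term = list(d_mu)  # (P_pi^T)^t d_mu, starting at t = 0
    total = [0.0] * n
    weight = 1.0 - gamma_hat
    for _ in range(terms):
        total = [s + weight * v for s, v in zip(total, term)]
        term = [sum(term[x] * p_pi[x][y] for x in range(n)) for y in range(n)]
        weight *= gamma_hat
    return total


# Hypothetical target chain and behavior stationary distribution.
P_PI = [[0.9, 0.1], [0.2, 0.8]]
D_MU = [0.5, 0.5]
GAMMA_HAT = 0.9

c = [-3.0, 5.0]  # arbitrary start, including a negative entry
for _ in range(500):
    c = discounted_cop(c, P_PI, D_MU, GAMMA_HAT)

d_hat = discounted_stationary(P_PI, D_MU, GAMMA_HAT)
expected_ratio = [d / m for d, m in zip(d_hat, D_MU)]
print(c, expected_ratio)
```

In this example the learned ratio sits between 1 (the value for $\hat\gamma = 0$) and the undiscounted ratio $d_\pi / d_\mu$, which is one way to picture the trade-off that the discount factor introduces.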

Discounted COP with Linear Function Approximation

The appeal of the COP-TD learning rule is that it can be applied online. The same remains true for our discounted COP learning rule (6). Naturally, when combined with function approximation the same issue arises: can our learning process itself be guaranteed to converge? The answer is yes, provided the discount factor $\hat\gamma$ is taken to be small enough.

To begin, let us assume sample transitions are drawn as $x \sim d_\mu$, $a \sim \mu(\cdot \mid x)$, $x' \sim P(\cdot \mid x, a)$, as before. Because $d_\mu$ is the stationary distribution, $x' \sim d_\mu$ also. The process we study is therefore described by the projected discounted COP operator $\Pi_{d_\mu} Y_{\hat\gamma}$.

Lemma 3.

The induced operator norm of the COP operator is upper bounded by a constant , in the sense that

Further, the series can be bounded by a constant,

The term is a concentration coefficient similar to those studied by munos03error munos03error. Intuitively, it measures the discrepancy in stationary distributions between two states that are “close” according to , in the sense that is reachable from in steps. When , the sum simplifies to and this term is 1.

We can make use of the concentration coefficient to provide a safe value of below which the discounted COP learning rule is convergent. Although most of our work concerns 1-step updates, we provide a slightly more general result on -step methods here, based on known contraction results (sutton18reinforcement sutton18reinforcement) and the existing multi-step extension of COP-TD (hallak17consistent hallak17consistent).

Theorem 4.

Consider the -step discounted COP operator . Then for any ,

and in particular for , is a contraction mapping. Since is a bounded series, the exponential factor is guaranteed to dominate. As a result, there exists a value of for which the projected -step discounted COP operator is a contraction mapping.

Theorem 4 shows that we can avoid the usual divergence issues with the learning rule (6) by taking a sufficiently small $\hat\gamma$. While these results are not altogether surprising (they mirror the case of value function approximation), we emphasize that there is no equivalent guarantee in the undiscounted case.

More generally, we are unlikely to be in the worst-case scenario achieving the concentration coefficient and, as our empirical evaluation will show, divergence does not seem to be a problem even with large . Yet, one may wonder whether it is relevant at all to learn an approximation to . Using Theorem 1 we argue that since the bound is continuous in the learning distribution we can expect improved performance even when the covariate shift is approximated for or where a prediction error due to function approximation occurs.

Taken as a whole, our results suggest that incorporating the discount factor should improve the behavior of the COP-TD algorithm in practice.

Soft Ratio Normalization

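Suppose we are given a function $c_\theta$ differentiable w.r.t. its parameters $\theta$ for which we would like that

$$\mathbb{E}_{x \sim d_\mu} [ c_\theta(x) ] = 1.$$

A common approach in deep reinforcement learning settings is to treat this as an additional loss to be minimized. In this section we also follow this approach, and consider minimizing the normalization loss

$$L(\theta) := \frac{1}{2} \Big( \mathbb{E}_{x \sim d_\mu} [ c_\theta(x) ] - 1 \Big)^2. \qquad (7)$$

The gradient of this loss is

$$\nabla_\theta L(\theta) = \Big( \mathbb{E}_{x \sim d_\mu} [ c_\theta(x) ] - 1 \Big)\, \mathbb{E}_{x \sim d_\mu} [ \nabla_\theta c_\theta(x) ]. \qquad (8)$$

We seek an unbiased estimate of this gradient. However, we cannot recover such an estimate with a single sample $x \sim d_\mu$, in a classic case of the double-sampling problem (Baird 1995). In particular, it is not hard to see that, in general,

$$\mathbb{E}_{x \sim d_\mu} \big[ (c_\theta(x) - 1) \nabla_\theta c_\theta(x) \big] \neq \nabla_\theta L(\theta).$$

However, we can obtain such an estimate by considering two independent samples $x_1, x_2$ drawn from $d_\mu$. The quantity

$$(c_\theta(x_1) - 1) \nabla_\theta c_\theta(x_2)$$

is an unbiased estimate of the loss gradient (8). In fact, as the following theorem states, we can do better by allowing each sample to play both roles in the estimate, and averaging the results.

Theorem 5.

Consider a differentiable function $c_\theta$ and the loss function (7). Given $m \ge 2$ independent samples $x_1, \dots, x_m$ drawn from $d_\mu$,

$$\frac{1}{m} \sum_{i=1}^{m} \Big( \frac{1}{m-1} \sum_{j \neq i} c_\theta(x_j) - 1 \Big) \nabla_\theta c_\theta(x_i)$$

is an unbiased estimate of $\nabla_\theta L(\theta)$.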
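The double-sampling issue and its remedy can be checked exactly on a small example. The sketch below assumes a linear ratio model $c_\theta(x) = \theta^\top \phi(x)$, so that the gradient of the output is simply $\phi(x)$, and a loss of the form $\frac{1}{2}(\mathbb{E}[c_\theta(x)] - 1)^2$. It enumerates every possible batch of independent samples to compute the expectation of a leave-one-out estimator, and compares it with the exact gradient. The distribution `D_MU`, the features, and `THETA` are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_sample_gradient(theta, batch):
    """Leave-one-out estimate: each sample's gradient (its feature vector)
    is paired with the mean ratio of the *other* samples in the batch."""
    m = len(batch)
    outputs = [dot(theta, phi) for phi in batch]
    total = sum(outputs)
    grad = [0.0] * len(theta)
    for phi, out in zip(batch, outputs):
        others_mean = (total - out) / (m - 1)
        for k, feature in enumerate(phi):
            grad[k] += (others_mean - 1.0) * feature / m
    return grad


def exact_gradient(theta, probs, features):
    """(E[c(x)] - 1) * E[grad c(x)] computed from the full distribution."""
    mean_ratio = sum(p * dot(theta, phi) for p, phi in zip(probs, features))
    mean_grad = [
        sum(p * phi[k] for p, phi in zip(probs, features))
        for k in range(len(theta))
    ]
    return [(mean_ratio - 1.0) * g for g in mean_grad]


def expected_estimate(theta, probs, features, m):
    """Exact expectation of the estimator over all i.i.d. batches of size m."""
    expectation = [0.0] * len(theta)
    for states in product(range(len(probs)), repeat=m):
        weight = 1.0
        for s in states:
            weight *= probs[s]
        estimate = cross_sample_gradient(theta, [features[s] for s in states])
        expectation = [e + weight * g for e, g in zip(expectation, estimate)]
    return expectation


# Hypothetical three-state distribution and two-dimensional features.
D_MU = [0.5, 0.3, 0.2]
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
THETA = [2.0, 0.5]

truth = exact_gradient(THETA, D_MU, FEATURES)
two_sample = expected_estimate(THETA, D_MU, FEATURES, m=2)
three_sample = expected_estimate(THETA, D_MU, FEATURES, m=3)
# Naive single-sample estimate E[(c(x) - 1) * grad c(x)], which is biased.
naive = [
    sum(p * (dot(THETA, phi) - 1.0) * phi[k] for p, phi in zip(D_MU, FEATURES))
    for k in range(len(THETA))
]
print(truth, two_sample, three_sample, naive)
```

The enumeration is only feasible for tiny problems; in practice the estimator would be applied to a single minibatch, where independence between samples is what removes the bias.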

In our experimental section we will see that the normalization loss plays an important role in making COP-TD practical.

Figure 1: with 5 seeds per run for 150 iterations. Left. Comparing discount factors in Pong. Using a discount factor gives a significant performance improvement. Right. Comparing normalization weights in Pong. Using normalization helps learning, but a large normalization weight causes divergence in the values.
Figure 2: with 3 seeds for 150 iterations. Performance of discounted COP-TD with a small target update period of 1000 and on 5 Atari 2600 games.

Experimental Results

In this section we provide empirical evidence demonstrating that our method yields useful benefits in an off-policy, deep reinforcement learning setting. In our experiments we use the Arcade Learning Environment (ALE) (bellemare13arcade bellemare13arcade), an RL interface to Atari 2600 games. We consider the single-GPU agent setup pioneered by mnih15human mnih15human. In this setup, the agent uses a replay memory (implemented as a windowed buffer) to store past experience, which it trains on continuously. As a result, much of the agent’s learning carries an off-policy flavor.

We focus on a fixed behavior policy, specifically the uniformly random policy. We are interested in learning as good a control policy as we can. That is, at each step the target policy is the greedy policy with respect to the predicted $Q$-values. While the theory we developed here applies to the policy evaluation case, we believe this setup to be a more practical and more stringent test of the idea. We emphasize that on the ALE, the uniformly random policy generates data that is significantly different from any learned policy; as a result, our experiments exhibit a high degree of off-policyness. To the best of our knowledge, we are the first to consider such a drastic setting.

Figure 3: with 3 seeds for 50 iterations. 4-way performance comparison using the discounted COP-TD loss as an auxiliary task and TD error prioritization as in (schaul16prioritized schaul16prioritized), blue line corresponds to the corrected agent with at iteration 50.
Figure 4: From same runs shown in Figure 1, left. Average predicted ratio in evaluation episode for a set of in the game of Pong.


Our baseline is the C51 distributional reinforcement learning agent [Bellemare, Dabney, and Munos2017], and we use published hyperparameters unless otherwise noted. We augment the C51 network by adding an extra head, the ratio model, to the final convolutional layer, whose role is to predict the ratio. The ratio model consists of a two-layer fully-connected network with a ReLU hidden layer of size 512. Whenever a correction term is used as a sampling priority or to compute a bootstrapping target, we clip negative outputs to 0. The target network includes the ratio model.

In initial experiments we found that multiplicatively reweighting the loss function using covariate shifts hurt performance, likely due to larger gradient variance. Instead, to reweight sample transitions we use a prioritized replay memory (Schaul et al. 2016) where priorities correspond to the approximate ratios of our model, which in expectation recovers the reweighting. These adjusted sampling priorities result in large portions of the dataset being mostly ignored (i.e. those unlikely under policy $\pi$); hence, the effective size of the data set is reduced and we risk overfitting. In our experiments we mitigated this effect by taking a larger replay memory size (10 million frames) than usual.
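The idea of replacing multiplicative reweighting by prioritized sampling can be sketched as follows. This is a simplified illustration and not the replay memory used in the experiments: the function names are hypothetical, priorities are taken to be the predicted ratios clipped at 0, and the fallback to uniform sampling when every priority is zero is an assumption of this sketch.

```python
import random


def sampling_probabilities(ratios):
    """Turn predicted ratios into sampling probabilities.

    Negative predictions are clipped to 0. If every priority is zero we
    fall back to uniform sampling (an assumption made for this sketch).
    """
    clipped = [max(r, 0.0) for r in ratios]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(ratios)] * len(ratios)
    return [c / total for c in clipped]


def sample_batch(transitions, ratios, batch_size, rng):
    """Sample transitions with probability proportional to their clipped ratio."""
    probs = sampling_probabilities(ratios)
    return rng.choices(transitions, weights=probs, k=batch_size)


# Hypothetical replay contents and predicted ratios.
stored = ["a", "b", "c", "d"]
predicted = [2.0, -1.0, 1.0, 1.0]
probs = sampling_probabilities(predicted)
batch = sample_batch(stored, predicted, batch_size=1000, rng=random.Random(0))
print(probs, batch[:10])
```

Transitions whose predicted ratio is clipped to zero are never drawn, which illustrates the reduction in effective dataset size mentioned above.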

Identical to C51, the target policy is the -greedy policy with respect to the expected value of the distribution output of the target network. We set . The ratio model is trained by adding the squared loss


to the usual distributional loss of the agent, where is a hyperparameter trading off the two losses. In experiments where we also normalize the ratio model, a third loss (with corresponding weight hyperparameter) is also added. Preliminary experiments showed that learning the ratio with prioritized sampling led to stability issues, hence we train the ratio model by sampling transitions uniformly from the replay memory. Each training step samples two independent transition batches, prioritized and uniform for the value function and covariate shift respectively.

Since the training is done “backwards in time”, no valid transition exists that would update the correction of an initial state $x_0$. This is similar to how there is no valid transition that updates the value of the terminal state in an episodic MDP. However, the distribution of any initial state is policy-independent, and so its ratio is 1. As a result, we modify the loss (9) for initial states by replacing the bootstrap target with 1. A more detailed analysis of our method in the episodic case is provided in the supplementary material.

Figure 5: Sample states (frames) encountered under the random policy, predicted either as relatively less likely under $\pi$ (low $c$) or relatively more likely under $\pi$ (high $c$). The experiment clipped the corrections at 0.0025, which was later found to be unnecessary.

Discounting and Normalization

We first study the effect of using the discounted COP update rule and/or normalization in the context of the game of Pong. In Pong, the random agent achieves an average score close to -21, the minimum (bellemare13arcade bellemare13arcade). Figure 1, left, compares the learning curves of various values of the discount factor , the agent with no corrections and the random baseline using a ratio loss weight . For discount factors not too large (all except ) better performance compared to the uncorrected baseline is achieved. Using normalization instead (Figure 1, right) also improves performance. However, for high values of the normalization weight, we observed unstable and sometimes divergent behavior. The runs which can be seen to stop mid plot had diverged in their outputs (Figure 6, appendix). We speculate that the reason for the divergence is higher variance in the loss function, and that a smaller step size might reduce such instabilities. Since using a discount factor proved more stable, has a slight performance advantage and is better understood theoretically than normalization, we will center the rest of the empirical evaluation around it.

In Figure 2 we report results for five Atari games chosen on the basis that a random agent playing these games explores the state space at least sufficiently to provide useful data for off-policy learning. We run C51 with discounted corrections with and . We observe performance improvements in Seaquest, Breakout and Pong, no noticeable difference in Asterix and a small loss in Space Invaders.

Auxiliary tasks and Prioritization

One might wonder if the performance benefits observed are really due to sampling from a more on-policy distribution. Auxiliary tasks in the form of extra prediction losses have successfully been used to learn better representations and aid the learning of the value function [Jaderberg et al.2016], [Aytar et al.2018]. To validate that the gains originate from correcting the off-policy data distribution as opposed to better representations, we show a modification of the previous experiment where the covariate shift was learned but not used. We also compare how our proposed prioritization scheme compares to the one originally proposed by (schaul16prioritized schaul16prioritized) who used a function of the TD error to set the priorities. A four-way comparison of auxiliary tasks and TD error prioritization is shown in Figure 3 where . We note that neither using the covariate shift prediction as an auxiliary task nor using TD error based prioritized sampling seemed to make any difference in all the games except SpaceInvaders. Interestingly, the covariate shift auxiliary task in SpaceInvaders helped when uniform sampling was used but hurt under prioritization.

The Effect of the Discount Factor

To better understand the effect of $\hat\gamma$ on the learned ratios, Figure 4 shows the average predicted ratio over evaluation episodes (where the $\epsilon$-greedy policy is used instead of the uniform random policy) in the game of Pong for the same set of runs shown in Figure 1, left. Too large a discount factor causes a decrease in performance (see Figure 1, left), and one might expect that divergence of the ratios would be the cause. Surprisingly, we observe that the predicted ratios show no signs of divergence. We emphasize that the average episode ratio decays monotonically with $\hat\gamma$, hinting that there is a tendency for the ratios to collapse to 0 which overcomes any potential divergence issues.

Qualitative Evaluation of the Learned Ratios

As an additional experiment, we qualitatively assessed the ratio learned by our deep network. We generated 100,000 sample states by executing the random behavior policy on Pong. From these, we selected the top and bottom 50 states according to the ratio ($c$ value) predicted by an agent trained under the regime of the previous section for 50 million frames. Recall that $c(x) > 1$ means the network believes the state $x$ is more likely under $\pi$ than $\mu$, while when $c(x) < 1$ the converse is true.

Figure 5 shows the outcome of this experiment for the top 3 states in terms of $c$-value, and 3 low-$c$ states; additional frames are provided in the supplemental. While our results remain qualitative, we see a clear trend in the selected images. States that are assigned low $c$ correspond to those in which the opponent is about to score a point (2nd and 3rd images). The network also assigns a low $c$ to a state in which the opponent has scored a high number of points (18 out of a possible total of 21) compared to the agent’s (0 out of 21). This is indeed an unlikely state under $\pi$: if the trained agent ties the computer opponent, on average, then we expect its score to roughly match that of the opponent.

By contrast, states that are likely under $\pi$ are those in which the agent successfully returns the ball. These are naturally unlikely situations under $\mu$, which plays randomly, but likely under the more successful policy $\pi$, which has learned to avoid the negative reward associated with failing to return the ball.

From this qualitative evidence we conclude that our model learns to clearly distinguish likely and unlikely sample transitions. We believe these results are particularly significant given the relative scarcity of off-policy methods of this kind in deep reinforcement learning.


In this paper we revisited Hallak and Mannor’s (2017) COP-TD algorithm and extended it to the deep reinforcement learning setting. While these results on the Atari 2600 suite of games remain preliminary, they demonstrate the practicality of learning the covariate shift in complex settings. We believe our results further open the door to increased sample efficiency in deep reinforcement learning.

We emphasize that the instabilities observed when learning the covariate shift under prioritized sampling point to the importance of the data distribution used to learn the ratios. Which distribution is optimal will be the focus of future work. The covariate shift method is a “backward” off-policy method, in the sense that it corrects a mismatch between distributions based on past transitions. It would be interesting to combine our method with “forward” off-policy methods such as Retrace (Munos et al. 2016), which have also yielded good results on the Atari 2600 (Gruslys et al. 2018). It would also be interesting to understand whether overfitting occurs due to a smaller effective replay size, and how this can be addressed. Finally, an exciting avenue would be extending the method to the more general case where multiple policies have generated off-policy data, which would allow COP-TD to be applied in the standard control setting.


We would like to thank Dale Schuurmans, James Martens, Ivo Danihelka, and Danilo J. Rezende for insightful discussions. We also thank Jacob Buckman, Saurabh Kumar, Robert Dadashi and Nicolas Le Roux for reviewing and improving the draft.


  • [Aytar et al.2018] Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching Youtube. CoRR abs/1805.11592.
  • [Baird1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995), 30–37.
  • [Barnard1993] Barnard, E. 1993. Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics.
  • [Bellemare et al.2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.
  • [Bellemare, Dabney, and Munos2017] Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning.
  • [Bellman1957] Bellman, R. E. 1957. Dynamic programming. Princeton, NJ: Princeton University Press.
  • [Gruslys et al.2018] Gruslys, A.; Dabney, W.; Azar, M. G.; Piot, B.; Bellemare, M.; and Munos, R. 2018. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In International Conference on Learning Representations.
  • [Hallak and Mannor2017] Hallak, A., and Mannor, S. 2017. Consistent on-line off-policy evaluation.
  • [Jaderberg et al.2016] Jaderberg, M.; Mnih, V.; Czarnecki, W.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
  • [Kolter2011] Kolter, J. Z. 2011. The fixed points of off-policy TD. In NIPS.
  • [Lin1993] Lin, L. 1993. Scaling up reinforcement learning for robot control. In Machine Learning: Proceedings of the Tenth International Conference, 182–189.
  • [Maei et al.2009] Maei, H. R.; Szepesvári, C.; Bhatnagar, S.; Precup, D.; Silver, D.; and Sutton, R. S. 2009. Convergent temporal-difference learning with arbitrary smooth function approximation. In NIPS.
  • [Meyn and Tweedie2012] Meyn, S. P., and Tweedie, R. L. 2012. Markov chains and stochastic stability.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • [Munos et al.2016] Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. G. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems.
  • [Munos2003] Munos, R. 2003. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning.
  • [Precup, Sutton, and Singh2000] Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility traces for off-policy policy evaluation. In Proceedings of the International Conference on Machine Learning.
  • [Puterman1994] Puterman, M. L. 1994. Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.
  • [Schaul et al.2016] Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2016. Prioritized experience replay. In International Conference on Learning Representations.
  • [Sutton and Barto2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press, 2nd edition.
  • [Sutton et al.2009] Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML.
  • [Sutton, Mahmood, and White2016] Sutton, R. S.; Mahmood, A. R.; and White, M. 2016. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research.
  • [Sutton1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1):9–44.
  • [Tsitsiklis and Van Roy1997] Tsitsiklis, J. N., and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690.
  • [van Hasselt, Guez, and Silver2016] van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • [Wang et al.2008] Wang, T.; Lizotte, D.; Bowling, M.; and Schuurmans, D. 2008. Dual representations for dynamic programming. Journal of Machine Learning Research 1–29.
  • [White2017] White, M. 2017. Unifying task specification in reinforcement learning.

Supplementary Material

See 1


Under the spectral norm assumption that , the matrix is invertible. Hence,


Where the second step uses the submultiplicative property of the matrix norm, proving the inequality. Further, to show that minimizes the error upper bound, we just need to see that both and are minimized when , where . ∎

See 2


Let be fixed, write and , such that the entries of sum to 1. We expand the definition of :

By standard convergence results from Markov chain theory [Meyn and Tweedie2012], we know that for any vector with nonnegative entries and for which . If is the vector of ones, we have

where is bounded and as . Hence

See 1


First we show that if vectors co-linear with are not in the span of , cannot be a fixed point of . By contradiction,

We now show that for a and , cannot be a fixed point. Using the fact that the spectral norm of a symmetric matrix is its largest eigenvalue, and that is a matrix similar to a transition matrix and thus has the same set of eigenvalues, the largest of which is $1$, then . In particular, and . Again, by contradiction,

See 1


Trivially, since the operator is linear, the normalized COP operator . Using Theorem 2 we can state that and the normalization term ensures that . ∎

See 1


By definition, . First note that as defined is a distribution over : and since ,

We make use of the fact that to write

See 2


We will prove that is a fixed point of . Its uniqueness will be guaranteed by noting that the -step operator is a contraction mapping (Theorem 4 below) and invoking Banach’s fixed point theorem.
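As a numerical illustration of this fixed-point argument, the following minimal sketch iterates a tabular discounted COP operator on a small Markov chain. It assumes the operator takes the form $c \mapsto \hat\gamma D_\mu^{-1} P_\pi^\top D_\mu c + (1 - \hat\gamma) e$, with $e$ the vector of ones; the three-state transition matrix `P_pi`, the distribution `d_mu`, and all function names are hypothetical choices made only for this example.

```python
import numpy as np


def cop_matrix(P_pi, d_mu):
    """Return Y = D_mu^{-1} P_pi^T D_mu (assumed form of the COP operator)."""
    # Column j of P_pi^T is scaled by d_mu[j]; row i is divided by d_mu[i].
    return (P_pi.T * d_mu) / d_mu[:, None]


def discounted_cop_fixed_point(P_pi, d_mu, gamma_hat, c0=None, iters=2000):
    """Iterate c <- gamma_hat * Y c + (1 - gamma_hat) * e starting from c0."""
    Y = cop_matrix(P_pi, d_mu)
    c = np.ones_like(d_mu) if c0 is None else np.asarray(c0, dtype=float)
    for _ in range(iters):
        c = gamma_hat * (Y @ c) + (1.0 - gamma_hat)
    return c


# Hypothetical 3-state chain: target-policy transitions, behavior distribution.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6],
                 [0.1, 0.1, 0.8]])
d_mu = np.array([0.5, 0.3, 0.2])

c_a = discounted_cop_fixed_point(P_pi, d_mu, gamma_hat=0.9)
c_b = discounted_cop_fixed_point(P_pi, d_mu, gamma_hat=0.9, c0=[5.0, 0.1, 2.0])
print(c_a, c_b)      # the two starting points end at the same ratios
print(d_mu @ c_a)    # average of the ratios under d_mu
```

In this sketch both starting points reach the same vector, consistent with uniqueness of the fixed point, and the resulting ratios average to 1 under `d_mu` without an explicit projection step.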

First note that for any , if denotes the vector of ratios then , and in particular . We write

See 3


as ,

See 3