1 Off-Policy Temporal-Difference Algorithms
A central challenge in reinforcement learning is to learn the long-term consequences of making decisions according to a specific strategy. This problem is known as the prediction problem, and involves an agent learning a value function for a given policy, often using some form of temporal-difference learning [sutton1988learning]. One way of doing this is to simply execute the policy, observe the effects, and update an approximate value function—an approach known as on-policy learning. However, there are many applications where this approach can be expensive (e.g., advertising [bottou2013counterfactual]), dangerous (e.g., medicine [liao2021off]), or inefficient (e.g., robotics [smart2002effective] and educational applications [koedinger2013new]).
An alternative approach where the agent learns about a target policy that is different from the behaviour policy being executed is known as off-policy learning. Off-policy algorithms allow the agent to learn from experience generated by old policies (experience replay [lin1992self; schaul2015prioritized]), exploratory policies (Q-learning [watkins1992q]), human demonstrations [smart2002effective], non-learning controllers, or even random behaviour. They also enable offline learning for algorithms that are too computationally demanding to run online [levine2020offline]. Perhaps most importantly for the development of AI, off-policy algorithms allow an agent to learn about many possible ways of behaving, in parallel, from a single stream of experience [sutton2011horde; white2015developing; klissarov2021flexible].
There are several algorithms for learning value functions off-policy that have been proven to converge with function approximation, including the family of Gradient TD algorithms [sutton2009fast; hackman2013faster; ghiassian2020gradient], Emphatic TD [sutton2016anemphatic], Tree Backup [precup2000eligibility; touati2018convergent; ghiassian2022online], Retrace [munos2016safe], ABQ [mahmood2017multi], and others [dann2014policy; white2016investigating; geist2014off; ghiassian2022online]. Aside from Tree Backup and ABQ, all the aforementioned algorithms use importance sampling [rubinstein2016simulation] to correct for the difference in probability assigned to actions by the target and behaviour policies. However, the variance introduced by importance sampling is a key limitation of these algorithms [precup2000eligibility; liu2020understanding].
A variety of importance sampling variants have been developed for off-policy learning in an attempt to reduce variance: per-decision [precup2000eligibility], weighted [precup2000eligibility], discounting-aware [sutton2018reinforcement; mahmood2017incremental], and stationary state distribution [hallak2017consistent; liu2018breaking] importance sampling. However, when inspecting the update rules for most algorithms that learn value functions off-policy, the placement of importance sampling ratios does not correspond to any of the known importance sampling variants.
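To make the placement of the ratios concrete, the following sketch (ours; the function names and trajectory values are made up for illustration) contrasts ordinary and per-decision importance sampling corrections on a single recorded trajectory, where `rhos[t]` denotes the ratio for the action taken at step `t`:

```python
from math import prod

def ordinary_is_return(rewards, rhos, gamma=1.0):
    """Scale the whole discounted return by the product of ALL ratios."""
    g = sum(gamma**t * r for t, r in enumerate(rewards))
    return prod(rhos) * g

def per_decision_is_return(rewards, rhos, gamma=1.0):
    """Scale each reward only by the ratios of the actions that precede it."""
    return sum(gamma**t * prod(rhos[:t + 1]) * r
               for t, r in enumerate(rewards))

# One recorded trajectory: reward R_{t+1} follows action A_t with ratio rho_t.
rewards = [1.0, 0.0, 2.0]
rhos = [0.5, 2.0, 1.0]
print(ordinary_is_return(rewards, rhos))      # all three ratios scale every reward
print(per_decision_is_return(rewards, rhos))  # R_1 is scaled by rho_0 only, etc.
```

Both estimators are unbiased for the target-policy return; the per-decision version simply avoids multiplying each reward by ratios for actions taken after it was received.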
In this extended abstract, we investigate this inconsistency and empirically compare the performance of several existing algorithms with versions that strictly implement per-decision importance sampling. We find that the per-decision versions almost always perform worse, and show how scaling the entire TD error can be interpreted as a control variate, often reducing variance and improving performance.
2 Importance Sampling Placement
The goal of off-policy value function learning is to estimate the expected sum of future rewards (referred to as the value) that would be received when executing target policy $\pi$ from each state, using observed rewards generated by executing behaviour policy $b$. However, the behaviour policy may choose actions with different probabilities than the target policy. To correct for this discrepancy, the observed sum of rewards can be scaled by the relative probability of taking each action in the trajectory under the target and behaviour policies, known as the importance sampling ratio and denoted $\rho_t \doteq \pi(A_t \mid S_t) / b(A_t \mid S_t)$. However, each reward only needs to be scaled by the importance sampling ratios that precede it in the trajectory, as rewards cannot depend on decisions made in the future. This is the idea behind the Per-Decision Importance Sampling (PDIS)-corrected return
$$G^{\rho}_t \doteq \rho_t \big( R_{t+1} + \gamma G^{\rho}_{t+1} \big),$$
whose expectation under the behaviour policy is equal to $v_\pi(S_t)$, and whose variance is often lower than scaling each reward by the importance sampling ratios for all actions in the trajectory, as is done in ordinary importance sampling [precup2000eligibility]. The recursive nature of the PDIS return gives rise to an off-policy Bellman equation:
$$v_\pi(s) = \mathbb{E}_b\!\left[ \rho_t \big( R_{t+1} + \gamma v_\pi(S_{t+1}) \big) \,\middle|\, S_t = s \right],$$
which yields an off-policy Bellman Error by subtracting $v_\pi(s)$ from both sides and replacing the true value function with an approximate value function $\hat{v}(s, \mathbf{w})$ parameterized by a weight vector $\mathbf{w}$:
$$\mathrm{BE}(s) \doteq \mathbb{E}_b\!\left[ \rho_t \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \big) \,\middle|\, S_t = s \right] - \hat{v}(s, \mathbf{w}).$$
Samples of the Bellman Error are known as the Temporal-Difference (TD) error
$$\delta^{\mathrm{PD}}_t \doteq \rho_t \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \big) - \hat{v}(S_t, \mathbf{w}),$$
and form the basis for off-policy temporal-difference learning methods that use importance sampling. However, most algorithm update rules in the literature use a different TD error,
$$\rho_t \delta_t = \rho_t \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \big),$$
that also multiplies $\hat{v}(S_t, \mathbf{w})$ by the importance sampling ratio for the action that follows it, despite this not actually being necessary. So what is the rationale for scaling $\hat{v}(S_t, \mathbf{w})$ by $\rho_t$? After all, the value estimate for state $S_t$ does not need to be corrected with the importance sampling ratio for the action that follows it, and doing so is contrary to the motivation for using per-decision importance sampling: only the terms in the return that need to be corrected should be.
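The two placements differ only in whether the current value estimate is scaled. A minimal sketch (our own notation, not taken from any particular codebase) makes the distinction explicit:

```python
def td_error_per_decision(rho, r, v_next, v_curr, gamma=0.99):
    """PDIS TD error: only the reward and bootstrap term are corrected."""
    return rho * (r + gamma * v_next) - v_curr

def td_error_full(rho, r, v_next, v_curr, gamma=0.99):
    """TD error used by most algorithms: the WHOLE error is scaled by rho."""
    return rho * (r + gamma * v_next - v_curr)

# With rho = 1 (on-policy) the two coincide; off-policy they differ by
# exactly (rho - 1) * v_curr.
```

For example, with `rho=2.0`, `r=1.0`, `v_next=v_curr=1.0`, and `gamma=0.5`, the per-decision error is 2.0 while the fully corrected error is 1.0; the gap equals `(rho - 1) * v_curr`.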
It turns out that scaling $\hat{v}(S_t, \mathbf{w})$ by $\rho_t$ can be interpreted as applying a control variate to the update target of the TD error. Control variates use the error in an estimate of a quantity with a known expected value to reduce the error in an estimate of a correlated quantity with an unknown expected value. Given an estimator $X$ with an unknown expected value $\mathbb{E}[X]$, we can subtract a control variate $Y$ and add its known mean $\mathbb{E}[Y]$ to obtain a new estimator with the same expected value, but with lower variance if $Y$ is correlated with $X$:
$$Z \doteq X - Y + \mathbb{E}[Y].$$
We can see that $Z$ remains unbiased, as the expected value of $-Y + \mathbb{E}[Y]$ is simply zero. Analyzing its variance yields:
$$\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y),$$
so we would expect the variance of $Z$ to be reduced relative to $X$ when there is a strong covariance between $X$ and $Y$. Here $X = \rho_t \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \big)$ is the per-decision update target and $Y = \rho_t \hat{v}(S_t, \mathbf{w})$, whose conditional mean $\mathbb{E}_b[Y \mid S_t] = \hat{v}(S_t, \mathbf{w})$ is known; substituting into $Z$ and subtracting $\hat{v}(S_t, \mathbf{w})$ recovers $\rho_t \delta_t$. As both are estimates of the value of $S_t$ and $\rho_t$ appears in both terms, it's likely there exists a strong covariance between the two, especially when the value estimates are fairly accurate.
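A quick Monte-Carlo check of this argument on a hypothetical one-step problem (two actions with made-up target/behaviour probabilities, rewards, and value estimates; none of these numbers come from the paper):

```python
import random
from statistics import fmean, pvariance

random.seed(0)
gamma = 0.9
pi, b = [0.9, 0.1], [0.5, 0.5]  # target and behaviour action probabilities
r = [1.0, 0.0]                  # reward for each action
v_next = [2.0, 0.0]             # bootstrap estimate after each action
v_curr = 2.6                    # fairly accurate value estimate for S_t

pd, full = [], []
for _ in range(100_000):
    a = 0 if random.random() < b[0] else 1  # sample from the behaviour policy
    rho = pi[a] / b[a]
    target = rho * (r[a] + gamma * v_next[a])
    pd.append(target - v_curr)          # per-decision TD error
    full.append(target - rho * v_curr)  # whole TD error scaled by rho

# Both estimators share the same mean (the Bellman error), but scaling the
# whole TD error has much lower variance when v_curr is accurate.
print(fmean(pd), fmean(full))
print(pvariance(pd), pvariance(full))
```

In this setting the sample means agree (both estimate the same Bellman error) while the variance of the fully scaled error is far smaller, consistent with the control-variate interpretation.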
To test this hypothesis we conducted an experiment on the Collision task, a small environment with eight states and two actions, shown in Figure 1 [ghiassian2021empirical]. Under the target policy, the agent always takes the forward action; under the behaviour policy, it takes the forward action in the first four states and takes the forward and turnaway actions in the four rightmost states.
We compared two versions of 10 different off-policy prediction learning algorithms: GTD($\lambda$), GTD2($\lambda$), Proximal GTD2($\lambda$), HTD($\lambda$), Emphatic TD($\lambda$), Emphatic TD($\lambda, \beta$), Off-policy TD($\lambda$), Vtrace($\lambda$), Tree Backup($\lambda$), and ABTD($\zeta$). One version used the fully corrected TD error $\rho_t \delta_t$ in the update rule, and the other used the per-decision TD error $\delta^{\mathrm{PD}}_t$. We checked 19 values of the first step-size parameter, $\alpha$, for all algorithms, and for Gradient-TD algorithms we additionally tried 15 values of the second step-size parameter, $\eta$. For Emphatic TD($\lambda, \beta$) we tried all combinations of the first step-size parameter and $\beta$. We set the bootstrapping parameter $\lambda$ to 0. In all experiments, we initialized the weight vector at the beginning of each run, ran each experiment for 20,000 time steps, and performed 50 independent runs. All the results presented are averages over the 50 runs and show the standard error over runs as a shaded region.
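For intuition, here is a rough stand-in for this kind of comparison: tabular off-policy TD(0) on a made-up eight-state chain with forward/turnaway actions. This is our own illustrative sketch, not the actual Collision task, parameter sweep, or codebase; all constants are invented.

```python
import random

N, GAMMA, ALPHA = 8, 0.9, 0.1  # made-up constants for the sketch

def step(s, a):
    """Action 0 = forward, action 1 = turnaway; reaching the end pays 1."""
    if a == 1 or s == N - 1:
        return 0, (1.0 if a == 0 else 0.0)  # restart the chain
    return s + 1, 0.0

def run(full_correction, steps=20_000, seed=0):
    rng = random.Random(seed)
    v = [0.0] * N  # tabular value estimates
    s = 0
    for _ in range(steps):
        # Behaviour: always forward in the first half, 50/50 in the second.
        a = 0 if s < N // 2 or rng.random() < 0.5 else 1
        # Target always takes forward, so rho is 1, 2, or 0.
        rho = 1.0 if s < N // 2 else (2.0 if a == 0 else 0.0)
        s2, r = step(s, a)
        if full_correction:
            delta = rho * (r + GAMMA * v[s2] - v[s])   # rho * delta
        else:
            delta = rho * (r + GAMMA * v[s2]) - v[s]   # per-decision delta
        v[s] += ALPHA * delta
        s = s2
    return v

v_full = run(True)
v_pd = run(False)
print(v_full)
print(v_pd)
```

Both placements share the same tabular fixed point here; the interesting quantity in the paper's experiment is how quickly and stably each version reaches it under different step-sizes.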
The learning curves for the best algorithm instances (the parameter setting that resulted in the smallest area under the learning curve) for all algorithms are shown in Figure 1(a). We see that in all cases, using $\rho_t \delta_t$ performed better than using $\delta^{\mathrm{PD}}_t$: the blue curve plateaued sooner than the red curve, and the difference was statistically significant.
The parameter sensitivity curves for all algorithms are shown in Figure 1(b). For algorithms with more than one parameter, we plotted the sensitivity curve that included the best algorithm instance: we first found the algorithm instance with the smallest area under the curve, then fixed all other parameters and plotted the results as a function of the step-size parameter. For all algorithms, when the whole TD error was corrected, the parameter sensitivity curve was wider, meaning that it was easier to choose a good step-size for the algorithm.
These results make it clear that correcting the whole TD error should be preferred over partially correcting the TD error when designing and implementing off-policy value function learning algorithms. Correcting the whole TD error led to better performance for every algorithm involved, and also reduced every algorithm’s sensitivity to the step-size parameter, making it easier to select good step-sizes.