# Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent's value estimates during learning--before terminal outcomes are observed--we must use a different estimation target called the λ-return, which truncates the return with the agent's own estimate of the value function. Temporal difference learning methods estimate the expected λ-return for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally one could attempt to estimate higher-order moments of the λ-return. This paper is about estimating the variance of the λ-return. Prior work has shown that given estimates of the variance of the λ-return, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the λ-return are complex and not well understood empirically. We contribute a method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λ-return. Empirically our new approach behaves at least as well as existing approaches, but is generally more robust.

## 1 Introduction

Conventionally in reinforcement learning, the agent estimates the expected value of the return (the discounted sum of future rewards) as an intermediate step to finding an optimal policy. Given a trajectory of experience, the agent can average the returns observed from each state. To estimate the value function online, while the trajectory unfolds, we update the agent's value estimates towards the expected λ-return. The λ-return has the same expected value as the return, but can be estimated online using a memory trace. Algorithms that estimate the expected value of the λ-return are called temporal-difference learning methods. The first moment, however, is not the only statistic that can be estimated. In addition to the expected value, we could estimate the variance of the λ-return.

An estimate of the variance of the λ-return can be used in several ways to improve estimation and decision-making. Sato2002, Ghavamzadeh, Tamar2012, and Tamar2013b use an estimate of the variance of the λ-return to design algorithms that account for risk in decision making. Specifically, they formulate the agent's objective as maximizing reward while minimizing the variance of the λ-return. White2016b estimated the variance of the λ-return to automatically adapt the trace-decay parameter, $\lambda$, used in learning updates. This resulted in faster learning for the agent, but more importantly removed the need to tune $\lambda$ by hand.

The variance can be estimated directly or indirectly. Indirect estimation involves estimating the first moment (the value ) and second moment () of the return and taking their difference as: . Sobel1982 were the first to formulate a Bellman operators for . Later Tamar2016,Tamar2013b,Ghavamzadeh, extended Sobel1982’s approach to estimating the variance for to . Finally, White2016b introduced an estimation method called VTD, that supports off-policy learning Sutton2009,Maei2011, state-dependent discounts and state-dependent trace-decay parameters. An alternative approach is to estimate the variance of the -return directly. This has been considered by Tamar2012, but they were unable to derive a Bellman operator—instead giving a Bellman-like operator—and considered only cost-to-go problems.

In this paper, we show that one can use temporal-difference learning, an online method for estimating value functions Sutton1988, to estimate the variance directly. Our new method supports off-policy learning, state-dependent discounts, and state-dependent trace-decay parameters. We introduce a new Bellman operator for the variance of the λ-return, and further prove that even for a value function that does not satisfy the Bellman operator for the expected λ-return, the error in this recursive formulation is proportional to the error in the value function approximation. Interestingly, the Bellman operator for the second moment requires an unbiased estimate of the λ-return White2016b; our Bellman operator for the variance avoids this term, and so has a simpler update. Both our direct method and VTD can be seen as a network of two TD estimators running sequentially (Figure 1).

Our goal is to understand the empirical properties of the direct and indirect approaches for estimating variance, as neither has yet been thoroughly studied. In general, we found that direct estimation is just as good as VTD, and in many cases better. Specifically, we observe that the direct approach is better behaved in the early stages of learning before the value function has converged. Further, we observe that the variance of the estimates can be higher for VTD under several circumstances: (1) when there is a mismatch in step-size between the value estimator and the variance estimator, (2) when traces are used with the value estimator, (3) when estimating the variance of the off-policy return, and (4) when there is error in the value estimate. Overall, we conclude that the direct approach to estimating the variance is both simpler and better behaved than VTD.

## 2 The MDP Setting

We model the agent's interaction with the environment as a finite Markov decision process (MDP) consisting of a finite set of states $\mathcal{S}$, a finite set of actions $\mathcal{A}$, and a transition model defining the probability of transitioning from state $s$ to $s'$ when taking action $a$. In the policy evaluation setting considered in this paper, the agent follows a fixed policy $\pi$ that provides the probability of taking action $a$ in state $s$. At each timestep the agent receives a random reward $R_{t+1}$, dependent only on $(S_t, A_t, S_{t+1})$.

The return is the discounted sum of future rewards

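$$
\begin{aligned}
G_t &= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \dots \\
&= R_{t+1} + \gamma_{t+1} G_{t+1}.
\end{aligned} \tag{1}
$$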
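The recursive form in Equation (1) suggests computing returns backwards through a finished trajectory. The following is a minimal sketch rather than the authors' code; the function name `discounted_returns` and the list-based trajectory layout are assumptions made for illustration.

```python
def discounted_returns(rewards, discounts):
    """Compute G_t for every step of one finished trajectory.

    rewards[k]   holds R_{k+1}, the reward received after step k.
    discounts[k] holds gamma_{k+1}, the discount of the state entered at step k
                 (0.0 on the transition into a terminal state).
    Both names and the list layout are illustrative assumptions.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # the return after the final step is zero
    for k in reversed(range(len(rewards))):
        # Recursive form of Equation (1): G_t = R_{t+1} + gamma_{t+1} * G_{t+1}
        g_next = rewards[k] + discounts[k] * g_next
        returns[k] = g_next
    return returns
```

For example, `discounted_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.0])` gives `[1.75, 1.5, 1.0]`.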

The discount function $\gamma$ provides a variable level of discounting depending on the state Sutton2011. The value of a state, $j(s)$, is defined as the expected return from state $s$ under a particular policy $\pi$:

$$
j(s) = \mathbb{E}_\pi[G_t \mid S_t = s]. \tag{2}
$$

We use $j$ to indicate the true value function and $J$ the estimate. The TD-error is the difference between the one-step approximation and the current estimate:

$$
\delta_t = R_{t+1} + \gamma_{t+1} J_t(S_{t+1}) - J_t(S_t). \tag{3}
$$

The λ-return

$$
G^\lambda_t = R_{t+1} + \gamma_{t+1}(1-\lambda_{t+1}) J_t(S_{t+1}) + \gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1}
$$

provides a bias-variance trade-off by incorporating $J$, which is a potentially lower-variance but biased estimate of the return. This trade-off is determined by a state-dependent trace-decay parameter, $\lambda$. When $J_t(S_{t+1})$ is equal to the expected return from $S_{t+1}$, the λ-return is unbiased. Beneficially, however, the expected value $J_t(S_{t+1})$ is lower-variance than the sample $G_{t+1}$. If $J$ is inaccurate, however, some bias is introduced. Therefore, when $\lambda = 0$, the λ-return is lower-variance but can be biased. When $\lambda = 1$, the λ-return equals the Monte Carlo return (Equation (1)); in this case, the update target exhibits more variance, but no bias. In the tabular setting evaluated in this paper, $\lambda$ does not affect the fixed point solution of the value estimate, only the rate at which learning occurs. It does, however, affect the observed variance of the return, which we estimate. The λ-return is implemented using traces as in the following TD($\lambda$) algorithm, shown with accumulating traces:

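$$
\begin{aligned}
E_t(s) &\leftarrow \begin{cases} \gamma_t\lambda_t E_{t-1}(s) + 1 & s = S_t \\ \gamma_t\lambda_t E_{t-1}(s) & \forall s \in \mathcal{S},\ s \neq S_t \end{cases} \\
J_{t+1}(S_t) &\leftarrow J_t(S_t) + \alpha\,\delta_t E_t(S_t)
\end{aligned} \tag{4}
$$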
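The trace and value updates of Equation (4) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the three-state deterministic chain, the function name `td_lambda_chain`, and all parameter values are hypothetical. The sketch also applies the value update to every state in proportion to its trace, which is the usual accumulating-trace form; that choice is an assumption here.

```python
def td_lambda_chain(n_states=3, episodes=500, alpha=0.1, lam=0.9):
    """Tabular TD(lambda) with accumulating traces on a toy deterministic chain.

    Hypothetical environment: states 0..n_states-1 in a row, each step moves
    right with reward 1, and the state after the last one is terminal.
    """
    terminal = n_states
    gamma = [1.0] * n_states + [0.0]  # gamma(s); zero at the terminal state
    J = [0.0] * (n_states + 1)        # value estimates, J[terminal] stays 0
    for _ in range(episodes):
        E = [0.0] * (n_states + 1)    # traces are reset at the start of an episode
        s = 0
        while s != terminal:
            s_next, r = s + 1, 1.0
            # Trace update of Equation (4): decay every trace, bump the current state.
            decay = gamma[s] * lam
            E = [decay * e for e in E]
            E[s] += 1.0
            # TD error of Equation (3).
            delta = r + gamma[s_next] * J[s_next] - J[s]
            # Assumption: update every non-terminal state in proportion to its trace.
            for i in range(n_states):
                J[i] += alpha * delta * E[i]
            s = s_next
    return J
```

On this chain the true values are 3, 2 and 1 for states 0, 1 and 2, so the returned estimates should settle close to those numbers.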

## 3 Estimating the Variance of the Return

When estimating the variance, we have both a value estimator and a variance estimator. The value estimator provides an estimate of the expected λ-return, which is the policy evaluation problem. The variance estimator provides an estimate of the variance of the λ-return. We show below how we can similarly use any TD method to learn the variance estimator, such as TD with accumulating traces (Equation 4).

Because we have two separate TD estimators, one each for $J$ and $V$, they can select different trace-decay parameters for learning. In fact, as done by White2016b, the value estimator can use a different trace-decay parameter than the λ-return for which we are estimating the variance. This is because the λ-return is defined for any given value function, regardless of how that value function is estimated. There are three possible trace-decay parameters: 1) the $\lambda$ of the λ-return for which variance is being estimated, 2) that used by the traces of the value estimator ($\kappa$), 3) that used by the traces of the variance estimator ($\bar\kappa$).

We summarize the notation here for easy reference. Variables without the bar refer to the value estimator and variables with bars refer to the variance estimator.

- $j$: true value function of the target policy $\pi$.
- $J$: estimate of $j$.
- $R$: reward used in the value function estimate.
- $\bar{R}$: meta-reward used in the variance estimate.
- $\lambda$: bias-variance parameter of the target λ-return.
- $\kappa$: trace-decay parameter of the value estimator.
- $\bar\kappa$: trace-decay parameter of the secondary estimator.
- $\gamma$: discounting function used by the value estimator.
- $\bar\gamma$: discounting function used by the variance estimator.
- $\delta_t$: TD error of the value function at time $t$.
- $\bar\delta_t$: TD error of the variance estimator at time $t$.
- $M$: estimate of the second moment.
- $v$: true variance of the return.
- $V$: estimate of $v$.

Our direct algorithm, shown here, uses TD(0) to estimate variance. For an expanded implementation with traces in the off-policy setting see Appendix A.

Direct Variance Algorithm

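$$
\begin{aligned}
\bar\gamma_{t+1} &\leftarrow \gamma_{t+1}^2\lambda_{t+1}^2 \\
\bar{R}_{t+1} &\leftarrow \delta_t^2 \\
\bar\delta_t &\leftarrow \bar{R}_{t+1} + \bar\gamma_{t+1} V_t(s') - V_t(s) \\
V_{t+1}(s) &\leftarrow V_t(s) + \bar\alpha\,\bar\delta_t
\end{aligned} \tag{5}
$$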
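To make the update in Equation (5) concrete, the sketch below runs it alongside a TD(0) value estimator on a small toy problem. Everything about the environment is an assumption chosen for illustration: a two-step chain (0 → 1 → terminal) whose rewards are +1 or -1 with equal probability, with $\gamma = \lambda = 1$ before termination. Because the two rewards are independent with unit variance, the true variances are 2 for state 0 and 1 for state 1. The function name and step-sizes are likewise hypothetical.

```python
import random

def direct_variance_td0(episodes=5000, alpha=0.01, alpha_bar=0.01, seed=0):
    """Direct variance estimation (Equation 5) on a hypothetical two-step chain."""
    rng = random.Random(seed)
    gamma = [1.0, 1.0, 0.0]  # gamma(s) for states 0, 1 and the terminal state 2
    lam = [1.0, 1.0, 1.0]    # lambda(s) of the target lambda-return
    J = [0.0, 0.0, 0.0]      # value estimates
    V = [0.0, 0.0, 0.0]      # variance estimates
    for _ in range(episodes):
        for s in (0, 1):
            s_next = s + 1
            r = rng.choice((-1.0, 1.0))
            delta = r + gamma[s_next] * J[s_next] - J[s]    # TD error of the value estimator
            gamma_bar = (gamma[s_next] * lam[s_next]) ** 2  # meta-discount
            reward_bar = delta ** 2                         # meta-reward
            delta_bar = reward_bar + gamma_bar * V[s_next] - V[s]
            V[s] += alpha_bar * delta_bar                   # variance update
            J[s] += alpha * delta                           # value update (TD(0))
    return J, V
```

With a constant step-size the estimates keep fluctuating slightly, so they should be read as approximately 2 and 1 rather than exact.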

An alternative to this direct method is to instead estimate the second moment. The variant shown here is equivalent to on-policy VTD with no traces, , and the step-size for the second set of weights set to 0. Further, the Tamar TD(0) algorithm ([Tamar2016]) can be recovered from Equation 6 by using . This algorithm does not impose that the variance be non-negative.

Second Moment Algorithm (VTD)

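$$
\begin{aligned}
\bar\gamma_{t+1} &\leftarrow \gamma_{t+1}^2\lambda_{t+1}^2 \\
\bar{R}_{t+1} &\leftarrow (R_{t+1} + \gamma_{t+1} J_{t+1}(s'))^2 - \bar\gamma_{t+1} J_{t+1}(s')^2 \\
\bar\delta_t &\leftarrow \bar{R}_{t+1} + \bar\gamma_{t+1} M_t(s') - M_t(s) \\
M_{t+1}(s) &\leftarrow M_t(s) + \bar\alpha\,\bar\delta_t \\
V_{t+1}(s) &= M_{t+1}(s) - J_{t+1}(s)^2
\end{aligned} \tag{6}
$$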
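For comparison, Equation (6) can be sketched in the same style. The environment is again a hypothetical two-step chain (0 → 1 → terminal, rewards of +1 or -1 with equal probability, $\gamma = \lambda = 1$ before termination), and the function name and step-sizes are assumptions. Note that the value estimate is updated first, since the meta-reward in Equation (6) is written in terms of $J_{t+1}$.

```python
import random

def second_moment_td0(episodes=5000, alpha=0.01, alpha_bar=0.01, seed=0):
    """Second-moment estimation (Equation 6) on a hypothetical two-step chain."""
    rng = random.Random(seed)
    gamma = [1.0, 1.0, 0.0]  # gamma(s) for states 0, 1 and the terminal state 2
    lam = [1.0, 1.0, 1.0]    # lambda(s) of the target lambda-return
    J = [0.0, 0.0, 0.0]      # value estimates
    M = [0.0, 0.0, 0.0]      # second-moment estimates
    for _ in range(episodes):
        for s in (0, 1):
            s_next = s + 1
            r = rng.choice((-1.0, 1.0))
            # Value estimator first, so the meta-reward sees the updated J.
            J[s] += alpha * (r + gamma[s_next] * J[s_next] - J[s])
            gamma_bar = (gamma[s_next] * lam[s_next]) ** 2
            reward_bar = (r + gamma[s_next] * J[s_next]) ** 2 - gamma_bar * J[s_next] ** 2
            delta_bar = reward_bar + gamma_bar * M[s_next] - M[s]
            M[s] += alpha_bar * delta_bar
    # The variance is recovered indirectly and is not forced to be non-negative.
    V = [M[s] - J[s] ** 2 for s in range(3)]
    return J, M, V
```

On this toy chain the recovered variances should again be close to 2 and 1, but they depend on both `M` and `J`, so a lag between the two estimators shows up directly in `V`.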

## 4 Derivation of the Direct Method

The derivation of the direct method follows from characterizing the Bellman operator for the variance of the λ-return. Theorem 1 gives a Bellman equation for the variance $v$. It has precisely the form of a TD target with meta-reward $\bar{R}_{t+1} = \delta_t^2$ and discounting function $\bar\gamma_{t+1} = \gamma_{t+1}^2\lambda_{t+1}^2$. Therefore, we can conveniently estimate $v$ using TD methods. Further, we show that even when the value function does not satisfy the Bellman equation, this results only in a proportional error in the variance estimator. We first show the result for the on-policy setting, for simplicity; the more general off-policy algorithm is provided in Appendix A.

This result provides the first general Bellman operator directly for the variance. The Bellman operators for the variance are general, in that they allow for either the episodic or continuing setting, by using variable . Interestingly, by directly estimating variance, we avoid a second term in the cumulant, that is present in approaches that estimate the second moment Tamar2013b,Tamar2016,White2016b. While Tamar2012 also developed an approach to directly estimate the variance, their method defined a non-linear Bellman operator and is restricted to cost-to-go problems. Follow-up work moved to estimating the second-moment instead Tamar2013b,Tamar2016, but with simplifying assumptions that only considered expected reward from a state and assuming . The work developing VTD generalizes to any , but does not characterize error when using an inaccurate value function.

To have a well-defined solution to the fixed point, we need the discount to be less than one for some transition White2017,Yu2015. This corresponds to assuming that the policy is proper, for the cost-to-go setting Tamar2016.

###### Assumption 1.

The policy reaches a state $s$ where $\gamma(s) < 1$ in a finite number of steps.

###### Theorem 1.

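For any $s \in \mathcal{S}$,

$$
\begin{aligned}
j(s) &= \mathbb{E}[R_{t+1} + \gamma_{t+1}\, j(S_{t+1}) \mid S_t = s] \\
v(s) &= \mathbb{E}[\delta_t^2 + \gamma_{t+1}^2\lambda_{t+1}^2\, v(S_{t+1}) \mid S_t = s]
\end{aligned} \tag{7}
$$

Further, for approximate value function $J$, if there is an $\epsilon(s)$ bounding the squared error of the value estimate, $(J(s) - j(s))^2 \le \epsilon(s)$, and the covariance term, $\left|\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (j(S_{t+1}) - J(S_{t+1})) \mid S_t = s]\right| \le \epsilon(s)$, then

$$
\left| V(s) - \mathbb{E}[\delta_t^2 + \gamma_{t+1}^2\lambda_{t+1}^2 V(S_{t+1}) \mid S_t = s] \right| \le 3\epsilon(s)
$$

###### Proof.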

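First we expand $G^\lambda_t - j(S_t)$, from which we recover a series with the form of a return.

$$
\begin{aligned}
G^\lambda_t - j(S_t) &= R_{t+1} + \gamma_{t+1}(1-\lambda_{t+1})\, j(S_{t+1}) - j(S_t) + \gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1} \\
&= R_{t+1} + \gamma_{t+1}\, j(S_{t+1}) - j(S_t) + \gamma_{t+1}\lambda_{t+1}(G^\lambda_{t+1} - j(S_{t+1})) \\
&= \delta_t + \gamma_{t+1}\lambda_{t+1}(G^\lambda_{t+1} - j(S_{t+1}))
\end{aligned} \tag{8}
$$

The variance of $G^\lambda_t$ is therefore

$$
\begin{aligned}
v(s) &= \mathbb{E}[(G^\lambda_t - \mathbb{E}[G^\lambda_t \mid S_t = s])^2 \mid S_t = s] \\
&= \mathbb{E}[(G^\lambda_t - j(s))^2 \mid S_t = s] \\
&= \mathbb{E}[(\delta_t + \gamma_{t+1}\lambda_{t+1}(G^\lambda_{t+1} - j(S_{t+1})))^2 \mid S_t = s] \\
&= \mathbb{E}[\delta_t^2 \mid S_t = s] + \mathbb{E}[\gamma_{t+1}^2\lambda_{t+1}^2 (G^\lambda_{t+1} - j(S_{t+1}))^2 \mid S_t = s] \\
&\quad + 2\,\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s]
\end{aligned} \tag{9}
$$

Equation (7) follows from Lemma 1 in the appendix, showing $\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s] = 0$, and from the law of total expectation, which turns the second term into $\mathbb{E}[\gamma_{t+1}^2\lambda_{t+1}^2\, v(S_{t+1}) \mid S_t = s]$.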
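Now consider the case where we estimate the variance of the λ-return of an approximate value function $J$.

$$
\begin{aligned}
V(s) &= \mathbb{E}[(G^\lambda_t - j(s) + J(s) - J(s))^2 \mid S_t = s] \\
&= \mathbb{E}[(G^\lambda_t - J(s))^2 \mid S_t = s] + (J(s) - j(s))^2 + 2\,\mathbb{E}[G^\lambda_t - J(s) \mid S_t = s]\,(J(s) - j(s)).
\end{aligned}
$$

The expectation in this last term simplifies to

$$
\begin{aligned}
\mathbb{E}[G^\lambda_t - J(s) \mid S_t = s] &= \mathbb{E}[G^\lambda_t - j(s) \mid S_t = s] + j(s) - J(s) \\
&= j(s) - J(s)
\end{aligned}
$$

giving $V(s) = \mathbb{E}[(G^\lambda_t - J(s))^2 \mid S_t = s] - (J(s) - j(s))^2$. We can use the same recursive form, therefore, as (9), giving

$$
\begin{aligned}
V(s) &= \mathbb{E}[\delta_t^2 + \gamma_{t+1}^2\lambda_{t+1}^2 V(S_{t+1}) \mid S_t = s] \\
&\quad + 2\,\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - J(S_{t+1})) \mid S_t = s] - (J(s) - j(s))^2
\end{aligned}
$$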

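For the second term,

$$
\begin{aligned}
\left|\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - J(S_{t+1})) \mid S_t = s]\right|
&= \left|\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s] + \mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (j(S_{t+1}) - J(S_{t+1})) \mid S_t = s]\right| \\
&= \left|\mathbb{E}[\gamma_{t+1}\lambda_{t+1}\delta_t (j(S_{t+1}) - J(S_{t+1})) \mid S_t = s]\right| \\
&\le \epsilon(s),
\end{aligned}
$$

where the second equality follows from Lemma 1 and the last step from the assumption about bounded covariance terms. Therefore,

$$
\left| V(s) - \mathbb{E}[\delta_t^2 + \gamma_{t+1}^2\lambda_{t+1}^2 V(S_{t+1}) \mid S_t = s] \right| \le 2\epsilon(s) + (J(s) - j(s))^2 \le 3\epsilon(s). \qquad ∎
$$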
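Equation (7) can be sanity-checked numerically on a problem small enough to enumerate. The sketch below is only an illustration under assumed values: a hypothetical two-step Markov reward process (0 → 1 → terminal) with rewards of 0 or 2, $\gamma = 0.9$ and $\lambda = 0.5$ on entering state 1, and $\gamma = 0$ on termination. It computes $v$ once from the Bellman equation and once by enumerating every equally likely reward sequence and taking the variance of the resulting λ-returns.

```python
from itertools import product

# Hypothetical two-step Markov reward process, used only for illustration:
# 0 -> 1 -> terminal (state 2); every reward is 0 or 2 with equal probability.
REWARDS = (0.0, 2.0)
GAMMA = {1: 0.9, 2: 0.0}  # gamma(s') of the state being entered
LAM = {1: 0.5, 2: 0.0}    # lambda(s') of the state being entered

def true_values():
    """Solve the Bellman equation for j backwards from the terminal state."""
    j = {2: 0.0}
    for s in (1, 0):
        j[s] = sum(r + GAMMA[s + 1] * j[s + 1] for r in REWARDS) / len(REWARDS)
    return j

def bellman_variance(j):
    """Variance from Equation (7): v(s) = E[delta^2 + gamma^2 lambda^2 v(s')]."""
    v = {2: 0.0}
    for s in (1, 0):
        nxt = s + 1
        meta_discount = (GAMMA[nxt] * LAM[nxt]) ** 2
        v[s] = sum((r + GAMMA[nxt] * j[nxt] - j[s]) ** 2 + meta_discount * v[nxt]
                   for r in REWARDS) / len(REWARDS)
    return v

def enumerated_variance(j):
    """Variance of the lambda-return from state 0 over all reward sequences."""
    returns = []
    for r1, r2 in product(REWARDS, repeat=2):
        g1 = r2 + GAMMA[2] * ((1 - LAM[2]) * j[2] + LAM[2] * 0.0)
        g0 = r1 + GAMMA[1] * ((1 - LAM[1]) * j[1] + LAM[1] * g1)
        returns.append(g0)
    mean = sum(returns) / len(returns)
    return sum((g - mean) ** 2 for g in returns) / len(returns)
```

Both routes give $1 + 0.9^2 \cdot 0.5^2 = 1.2025$ for state 0, up to floating-point rounding: the squared TD error is always 1 here, and the successor variance of 1 is discounted by $\gamma^2\lambda^2$.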

## 5 Experiments

The primary purpose of these experiments is to demonstrate that both the direct method and VTD can approximate the true variance under various conditions in the tabular setting. We consider two domains. The first is a deterministic chain, in Figure 2, which is useful for basic evaluation and gives results which are easy to interpret. The second is a more complex MDP, in Figure 3, with different discount and trace-decay parameters in each state. For all experiments the algorithm of Equation (4) is used as the value estimator. Unless otherwise stated, traces are not used ($\kappa = \bar\kappa = 0$). For each experimental setting 30 separate experiments were run and the estimates averaged, with standard deviation shown as shaded regions in the plots. The true values were determined by Monte Carlo estimation and are shown as dashed lines in the figures. Unless otherwise stated, the estimates are all initialized to zero.
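The paper does not spell out its Monte Carlo procedure, so the following is only a sketch of one way reference variances of the λ-return could be obtained: sample many trajectories, compute each λ-return by backward recursion using known values, and take the sample variance. The toy chain (0 → 1 → terminal, rewards of 0 or 2, $\gamma = 0.9$ and $\lambda = 0.5$ on entering state 1), the function names, and the sample size are all assumptions.

```python
import random

def lambda_returns(rewards, next_values, gammas, lams):
    """Backward recursion for the lambda-return along one trajectory.

    Index k holds R_{k+1}, J(S_{k+1}), gamma_{k+1} and lambda_{k+1}.
    """
    out = [0.0] * len(rewards)
    g_next = 0.0
    for k in reversed(range(len(rewards))):
        # G_t = R_{t+1} + gamma_{t+1} * ((1 - lambda_{t+1}) * J(S_{t+1}) + lambda_{t+1} * G_{t+1})
        g_next = rewards[k] + gammas[k] * ((1 - lams[k]) * next_values[k] + lams[k] * g_next)
        out[k] = g_next
    return out

def mc_variance_of_lambda_return(n_episodes=20000, seed=0):
    """Sample variance of the lambda-return from state 0 of a hypothetical chain."""
    rng = random.Random(seed)
    next_values = [1.0, 0.0]  # true value of state 1, then of the terminal state
    gammas = [0.9, 0.0]
    lams = [0.5, 0.0]
    samples = []
    for _ in range(n_episodes):
        rewards = [rng.choice((0.0, 2.0)), rng.choice((0.0, 2.0))]
        samples.append(lambda_returns(rewards, next_values, gammas, lams)[0])
    mean = sum(samples) / len(samples)
    return sum((g - mean) ** 2 for g in samples) / len(samples)
```

For this toy chain the exact variance is 1.2025, so the sampled estimate should land close to that value and tighten as `n_episodes` grows.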

We look at the effects of relative step-size between the value estimator and the variance estimators in Section 5.1. In Section 5.2 we use the complex domain to show that both algorithms can estimate the variance with state-dependent $\gamma$ and $\lambda$. In Section 5.3 we evaluate the two algorithms' responses to errors in the value estimate. Section 5.4 looks at the effect of using traces in the estimation method. Finally, in Section 5.5 we examine the off-policy setting.

### 5.1 The Effect of Step-size

We use the chain MDP to investigate the impact of step-size choice. In Figure 4 all step-sizes are the same, and both algorithms behave similarly. For Figure 4 the step-size of the value estimate, $\alpha$, is greater than that of the variance estimators, $\bar\alpha$. The direct algorithm smoothly approaches the correct value, while VTD first dips well below zero. This is to be expected as the estimates are initialized to zero and the variance is calculated as $V(s) = M(s) - J(s)^2$. If the second moment lags behind the value estimate then the variance will be negative. In Figure 4 the step-size for the variance estimators is larger than for the value estimator. While both methods overshoot the target, VTD has greater overshoot. For both cases of unequal step-size we see higher variance in the variance estimates for VTD.

Figure 5 explores this further. Here the value estimator is initialized to the true values and updates are turned off ($\alpha = 0$). The variance estimators are initialized to zero and learn with a step-size $\bar\alpha$ chosen simply to match the step-sizes used in the previous experiments. Despite being given the true values, the VTD algorithm produces higher variance in its estimates, suggesting that VTD is dependent on the value estimator tracking.

This sensitivity to step-size is shown in Figure 6. All estimates are initialized to their true values. For each ratio we computed the average variance of the 30 runs of 2000 episodes. We can see that the direct method is largely insensitive to step-size ratio, but that VTD has higher mean squared error (MSE) except when the step-sizes are equal. This result holds for the other experimental settings of this paper, including the complex MDP, but further results are omitted for brevity.

These results beg the question, would there ever be a situation where different step-sizes between value and variance estimators is justified? Methods which automatically set the step-sizes may produce different values which are specific to the performance of each estimator. One such algorithm is ADADELTA, which adapts the step-size based on the TD error of the estimator Zeiler2012. Figure 7 shows that using a separate ADADELTA step-size calculation for each estimator results in higher variance for VTD as expected (ADADELTA: ), given that the value estimator and VTD produce different TD errors.

### 5.2 Estimating the Variance for State-dependent γ and λ

One of the contributions of VTD was the generalization to support state-based $\gamma$ and $\lambda$. Here we evaluate the complex MDP from Figure 3 (in the on-policy setting), which was designed for this scenario and which has a stochastic policy, is continuing, and has multiple possible actions from each state. Figure 8 shows that both algorithms estimate the variance with similar results. This experiment was run with all step-sizes equal.

### 5.3 Variable Error in the Value Estimates

The derivation of our direct algorithm assumes access to the true value function. The experiments of the previous sections demonstrate that both methods are robust under this assumption, in the sense that the value function was estimated from data and used to estimate the variance. It remains unclear, however, how well these methods perform when the value estimates converge to biased solutions.

To examine this we again use the complex MDP shown by Figure 3. True values for the value functions and variance estimates are calculated from Monte Carlo simulation of 10,000,000 timesteps. For each run of the experiment each state of the value estimator was initialized to the true value plus an error (

) drawn from a uniform distribution:

, where (the maximum value in this domain is 1.55082409). The value estimate was held constant throughout the run . The experiment consisted of 120 runs of 80,000 timesteps. To look at the steady-state response of the algorithms we use only the last 10,000 timesteps in our calculations. Figure 9 plots the average variance estimate for each state. Additionally we show the average standard deviation of the estimates in the shaded regions. Sweeps over step-size were conducted, , and the MSE evaluated for each state. Each data point is for the step-size with the lowest MSE for that error ratio and state. While the average estimate is closer to the true values for VTD, the variance of the estimates is much larger. Further, the average estimates for VTD are either unchanged or move negative, while those of the direct algorithm tend toward positive bias.

For Figure 10 the MSE is summed over all states. Again, for each error ratio the MSE was compared over the same step-sizes as before and for each point the smallest MSE is plotted.

These results suggest the direct algorithm is less affected by error in the value estimate $J$.

### 5.4 Experiments with Traces

In this section we briefly look at the behavior of the complex domain when traces are used. For Figure 11 traces are used for the variance estimators, but not for policy evaluation ($\kappa = 0$), and the step-sizes are all equal (0.01). Here we see no significant difference between VTD and the direct algorithm. For Figure 11 we look at the opposite scenario, where traces are used for policy evaluation, but not in the variance estimators ($\bar\kappa = 0$). Here we do see a difference: the VTD method shows more variance in its estimates, particularly for States 0 and 3.

### 5.5 Experiments in an Off-policy Setting

In the off-policy setting the agent follows a behavior policy, but is estimating the value of a target policy $\pi$. The ratio between these two policies is called the importance sampling ratio, $\rho$, and is used to modify the value function update.

We evaluate two different off-policy scenarios on the complex MDP. In the first scenario we estimate the variance under the target policy from off-policy samples. That is, we estimate the variance that would be observed if we were following the target policy. In this scenario $\bar\rho = \rho$ and $\eta = 1$. Figure 12 shows that both methods are able to achieve the same results in this setting.

In the second off-policy setting we estimate the variance of the off-policy return, which is the return being used to update the value estimator and is simply the multiplication of the λ-return by $\rho$. In this scenario $\bar\rho = 1$ and $\eta = \rho$. Figure 13 shows that both algorithms successfully estimate the variance in this setting. However, despite having the same step-size as the value estimator, VTD produces higher variance in its estimates, as is most clearly seen in State 3.

## 6 Discussion

Both the direct method and VTD effectively estimate the variance across a range of settings, but the direct method is simpler and more robust. This simplicity alone makes the direct method preferable. The higher variance in estimates produced by VTD is likely due to the inherently larger target which VTD uses in its learning updates: ; we show more explicitly how this affects the updates of VTD in Appendix D. One would expect the differences between the two approaches to be most pronounced for domains with larger returns than those demonstrated here. Our focus was simple MDPs. In such settings we can define clear experiments where the properties of these variance estimation algorithms can be carefully evaluated isolated from additional effects like state-aliasing due to function approximation. Consider the task of helicopter hovering formalized as a reinforcement learning task Ng2004. In the most well-known variants of this problem the agent receives massive negative reward for crashing the helicopter (e.g., minus one million). In such problems the magnitude and variance of the return is large. In such cases, estimating the second moment may not be feasible from a statistical point of view, whereas the target of our direct variance estimate should be better behaved.

We focused on the tabular case, where each state is represented uniquely. Future work will investigate extending our theoretical characterization and experiments to the function approximation case. Our algorithm extends naturally with little modification. To extend the theory, there have been some promising results characterizing fixed points under the projected Bellman operator for the second moment Tamar2016. An extension to projected Bellman operators could also further help bound errors incurred from inaccuracies in the value function.

## 7 Conclusion

In this paper we introduced a simple method for estimating the variance of the λ-return using temporal-difference learning. Our approach is simpler than existing approaches, and appears to work better in practice. We performed an extensive empirical study. Our findings suggest that our new method outperforms VTD when: (1) there is a mismatch in step-size between the value estimator and the variance estimator, (2) traces are used with the value estimator, (3) estimating variances of the off-policy return, and (4) there is error in the value estimate.

## Acknowledgements

Funding for this work was provided by the Natural Sciences and Engineering Research Council of Canada, Alberta Innovates, and Google DeepMind.

## Appendix A Variance Estimation in the Off-Policy Setting

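Value estimates are made with respect to a target policy, $\pi$. If the behavior policy is the same as the target policy then we say that samples are collected on-policy; when they are not the same, the samples are collected off-policy. A common approach for off-policy learning algorithms is to weight each update by the importance sampling ratio, $\rho_t$, the probability of the action taken under the target policy divided by its probability under the behavior policy. Off-policy estimates are then implemented by multiplying the trace updates by $\rho_t$:

$$
E_t(s) \leftarrow \begin{cases} \rho_t(\gamma_t\lambda_t E_{t-1}(s) + 1) & s = S_t \\ \rho_t\gamma_t\lambda_t E_{t-1}(s) & \forall s \in \mathcal{S},\ s \neq S_t. \end{cases}
$$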

There are two different scenarios to be considered in the off-policy setting. The first scenario is estimating the variance of the (on-policy) λ-return of the target policy, while following a different behavior policy. In the second setting, the goal is to estimate the variance of the off-policy λ-return. The off-policy λ-return is

$$
G^\lambda_t = \rho_t\left(R_{t+1} + \gamma_{t+1}(1-\lambda_{t+1})\, j(S_{t+1}) + \gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1}\right),
$$

where the multiplication by the potentially large importance sampling ratios can significantly increase variance.

It is important to note that you would only ever estimate one or the other off-policy variance with a given estimator. Let $\rho$ be the weighting for the value estimator, and $\bar\rho$ the weighting for the variance estimator. If estimating the variance of the target return from off-policy samples (the first scenario), $\bar\rho = \rho$ and $\eta = 1$. If estimating the variance of the off-policy return, $\bar\rho = 1$ and $\eta = \rho$.

Here we present the resulting algorithms, which use TD($\lambda$) estimators with accumulating traces.

Direct Variance Algorithm

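$$
\begin{aligned}
\bar{R}_{t+1} &\leftarrow (\eta_t\delta_t + (\eta_t - 1) J_{t+1}(s))^2 \\
\bar\gamma_{t+1} &\leftarrow \gamma_{t+1}^2\lambda_{t+1}^2\eta_t^2 \\
\bar\delta_t &\leftarrow \bar{R}_{t+1} + \bar\gamma_{t+1} V_t(s') - V_t(s) \\
\bar{E}_t(s) &\leftarrow \begin{cases} \bar\rho_t(\bar\gamma_t\bar\kappa_t\bar{E}_{t-1}(s) + 1) & s = S_t \\ \bar\rho_t(\bar\gamma_t\bar\kappa_t\bar{E}_{t-1}(s)) & \forall s \in \mathcal{S},\ s \neq S_t \end{cases} \\
V_{t+1}(s) &\leftarrow V_t(s) + \bar\alpha\,\bar\delta_t\bar{E}_t(s)
\end{aligned} \tag{10}
$$

Variance is computed directly as $V_{t+1}(s)$.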
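A single application of the off-policy direct update in Equation (10) can be sketched as one function. This is an illustrative sketch, not the authors' code: the function name, the argument names, and the choice of plain Python lists updated in place are all assumptions. The caller is assumed to supply the value estimator's TD error and its estimate for the current state, and to pass back the returned meta-discount as `gamma_bar_prev` on the next call.

```python
def direct_variance_step(V, E_bar, s, s_next, delta, J_s, gamma_next, lam_next,
                         eta, rho_bar, gamma_bar_prev, kappa_bar, alpha_bar):
    """Apply one update of Equation (10) in place and return gamma_bar_{t+1}.

    V and E_bar are lists indexed by state; delta is the value estimator's TD
    error and J_s its estimate for the current state s. The argument names and
    the in-place list layout are illustrative assumptions.
    """
    reward_bar = (eta * delta + (eta - 1.0) * J_s) ** 2    # meta-reward
    gamma_bar = (gamma_next * lam_next * eta) ** 2         # meta-discount
    delta_bar = reward_bar + gamma_bar * V[s_next] - V[s]  # TD error of the variance estimator
    for i in range(len(E_bar)):                            # accumulating trace
        bump = 1.0 if i == s else 0.0
        E_bar[i] = rho_bar * (gamma_bar_prev * kappa_bar * E_bar[i] + bump)
    for i in range(len(V)):                                # variance update
        V[i] += alpha_bar * delta_bar * E_bar[i]
    return gamma_bar
```

With `eta = 1`, `rho_bar = 1` and `kappa_bar = 0` the meta-reward reduces to `delta ** 2` and only the current state is updated, which recovers the on-policy TD(0) form of Equation (5).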

Second Moment Algorithm

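$$
\begin{aligned}
\bar{G}_t &\leftarrow R_{t+1} + \gamma_{t+1}(1-\lambda_{t+1}) J_{t+1}(s') \\
\bar{R}_{t+1} &\leftarrow \eta_t^2\bar{G}_t^2 + 2\eta_t^2\gamma_{t+1}\lambda_{t+1}\bar{G}_t J_{t+1}(s') \\
\bar\gamma_{t+1} &\leftarrow \eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2 \\
\bar\delta_t &\leftarrow \bar{R}_{t+1} + \bar\gamma_{t+1} M_t(s') - M_t(s) \\
\bar{E}_t(s) &\leftarrow \begin{cases} \bar\rho_t(\bar\gamma_t\bar\kappa_t\bar{E}_{t-1}(s) + 1) & s = S_t \\ \bar\rho_t(\bar\gamma_t\bar\kappa_t\bar{E}_{t-1}(s)) & \forall s \in \mathcal{S},\ s \neq S_t \end{cases} \\
M_{t+1}(s) &\leftarrow M_t(s) + \bar\alpha\,\bar\delta_t\bar{E}_t(s)
\end{aligned} \tag{11}
$$

Variance is computed as $V_{t+1}(s) = M_{t+1}(s) - J_{t+1}(s)^2$.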

For convenience we summarize the variables used:

- $J$: estimated value function of the target policy $\pi$.
- $R$: reward used in the value function estimate.
- $\bar{R}$: meta-reward used in the variance estimate.
- $\lambda$: bias-variance parameter of the target λ-return.
- $\kappa$: trace-decay parameter of the value estimator.
- $\bar\kappa$: trace-decay parameter of the secondary estimator.
- $\gamma$: discounting function used by the $J$ estimator.
- $\bar\gamma$: discounting function used by the $V$ estimator.
- $\delta_t$: TD error of the value function at time $t$.
- $\bar\delta_t$: TD error of the variance estimator at time $t$.
- $M$: estimate of the second moment.
- $V$: estimate of the variance.
- $\bar\rho$: importance sampling ratio for estimating the variance of the target return from off-policy samples.
- $\eta$: importance sampling ratio used to estimate the variance of the off-policy return.

## Appendix B Bellman Operators for the Variance in the Off-Policy Setting

###### Lemma 1.

For $j$ satisfying the Bellman equation and any bounded function $b$,

$$
\mathbb{E}[b(S_t, A_t, R_{t+1}, S_{t+1})(G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s] = 0
$$

###### Proof.

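Let $b_t = b(S_t, A_t, R_{t+1}, S_{t+1})$. By the law of total expectation:

$$
\mathbb{E}[b_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s] = \mathbb{E}\big[\,\mathbb{E}[b_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t, A_t, R_{t+1}, S_{t+1}] \,\big|\, S_t = s\big]
$$

Given $S_t$, $A_t$, $R_{t+1}$, and $S_{t+1}$, $b_t$ is constant and can be moved outside of the expectation. Therefore,

$$
\mathbb{E}[b_t (G^\lambda_{t+1} - j(S_{t+1})) \mid S_t, A_t, R_{t+1}, S_{t+1}] = \mathbb{E}[b_t \mid S_t, A_t, R_{t+1}, S_{t+1}] \times \mathbb{E}[G^\lambda_{t+1} - j(S_{t+1}) \mid S_t, A_t, R_{t+1}, S_{t+1}]
$$

Because

$$
\mathbb{E}[G^\lambda_{t+1} - j(S_{t+1}) \mid S_t, A_t, R_{t+1}, S_{t+1}] = 0
$$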

the result follows. ∎

###### Theorem 2.

$$
v(s) = \mathbb{E}[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \lambda_{t+1}^2\gamma_{t+1}^2\eta_t^2\, v(S_{t+1}) \mid S_t = s]
$$

###### Proof.

The proof is similar to the proof of Theorem 1.

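$$
\begin{aligned}
v(s) &= \mathbb{E}[\{G^\lambda_t - j(S_t)\}^2 \mid S_t = s] \\
&= \mathbb{E}[\{\eta_t R_{t+1} + \eta_t\gamma_{t+1}(1-\lambda_{t+1})\, j(S_{t+1}) + \eta_t\gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1} - j(s)\}^2 \mid S_t = s] \\
&= \mathbb{E}[\{\eta_t R_{t+1} + \eta_t\gamma_{t+1}\, j(S_{t+1}) - \eta_t\, j(s) + \eta_t\, j(s) - \eta_t\gamma_{t+1}\lambda_{t+1}\, j(S_{t+1}) \\
&\qquad + \eta_t\gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1} - j(s)\}^2 \mid S_t = s] \\
&= \mathbb{E}[\{(\eta_t\delta_t + (\eta_t - 1)\, j(s)) + \eta_t\gamma_{t+1}\lambda_{t+1}(G^\lambda_{t+1} - j(S_{t+1}))\}^2 \mid S_t = s] \\
&= \mathbb{E}[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2 (G^\lambda_{t+1} - j(S_{t+1}))^2 \\
&\qquad + 2\eta_t\gamma_{t+1}\lambda_{t+1}(\eta_t\delta_t + (\eta_t - 1)\, j(s))(G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s] \\
&= \mathbb{E}[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2 (G^\lambda_{t+1} - j(S_{t+1}))^2 \\
&\qquad + 2\eta_t^2\gamma_{t+1}\lambda_{t+1}\delta_t (G^\lambda_{t+1} - j(S_{t+1})) + 2\eta_t\gamma_{t+1}\lambda_{t+1}(\eta_t - 1)\, j(s)(G^\lambda_{t+1} - j(S_{t+1})) \mid S_t = s]
\end{aligned}
$$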

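Using Lemma 1, with different fixed functions $b$, we can conclude that the last two terms are zero, giving

$$
v(s) = \mathbb{E}[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2 (G^\lambda_{t+1} - j(S_{t+1}))^2 \mid S_t = s]
$$

By the law of total expectation

$$
\begin{aligned}
v(s) &= \mathbb{E}\big[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \mathbb{E}[\eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2 (G^\lambda_{t+1} - j(s'))^2 \mid S_{t+1} = s'] \,\big|\, S_t = s\big] \\
&= \mathbb{E}[(\eta_t\delta_t + (\eta_t - 1)\, j(s))^2 + \eta_t^2\gamma_{t+1}^2\lambda_{t+1}^2\, v(S_{t+1}) \mid S_t = s],
\end{aligned}
$$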

completing the proof. ∎

Theorem 2 gives a Bellman equation for $v$ in the more general off-policy setting. The resulting TD algorithm uses the meta-reward $\bar{R}_{t+1}$ and discounting function $\bar\gamma_{t+1}$ of Equation (10).

The step-sizes generated by the ADADELTA algorithm in Figure 7 are shown in Figure 14. As we evaluate in the tabular case at each timestep only the step-size for the current state has any impact. Thus, the values shown here are the average step-size used over each episode.

## Appendix D Variability in Updates

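In this section, we show the effective update to $V$ on each timestep for each of the two algorithms in the on-policy setting. For notational clarity let $r = R_{t+1}$, $\gamma = \gamma_{t+1}$, $\bar\gamma = \bar\gamma_{t+1}$, and $\delta = \delta_t$.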

For the direct algorithm the change is just:

$$
\Delta V_t(s) = \bar\alpha\,(\delta^2 + \bar\gamma V_t(s') - V_t(s)). \tag{12}
$$

The updates for the VTD algorithm are much more complicated to compute and we will make some assumptions about the domain in order to simplify the derivation. First we compute the change in the second moment and value estimators separately.

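We first expand the term $\delta^2$:

$$
\begin{aligned}
\delta &= r + \gamma J_t(s') - J_t(s) \\
\delta^2 &= (r + \gamma J_t(s'))^2 - 2(r + \gamma J_t(s')) J_t(s) + J_t(s)^2.
\end{aligned}
$$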

Now we expand the change in the second moment estimate, $\Delta M(s)$. To simplify the expansion we make the assumption that at each transition the agent moves to a new state, i.e. $s' \neq s$ (this is not required for our algorithm, but simplifies the expansions below). This assumption holds for both of the domains examined in this paper. This allows us to substitute $J_{t+1}(s') = J_t(s')$, which greatly simplifies the updates.

 ΔM(s)= ¯α[(r+γJt+1(s′))2−¯γ2Jt+1(s′)2+¯γMt(s′)−Mt(s)] = ¯α[(r+γJt(s′))2−¯γ2Jt(s′)2+¯γMt(s′)−Mt(s)] = ¯α[(r+γJt(s′))2−2(r+γJt(s′))