## 1 Introduction

In reinforcement learning, the training data is produced by an adaptive learning agent’s interaction with its environment, which makes tuning the parameters of the learning process both challenging and essential for good performance. In the online setting we study here, the agent-environment interaction produces an unending stream of temporally correlated data. In this setting there is no testing-training split, and thus the agent’s learning process must be robust and adapt to new situations not considered by the human designer. Robustness is often critically related to the values of a small set of parameters that control the learning process (e.g., the step-size parameter). In real-world applications, however, we cannot expect to test a large range of these parameter values in all the situations the agent may face to ensure good performance, as is common practice in empirical studies. Unfortunately, safe values of these parameters are usually problem dependent. For example, in off-policy learning (e.g., learning from demonstrations), large importance sampling ratios can destabilize provably convergent gradient temporal difference learning methods when the parameters are not set in a very particular way white2015thesis . In such situations, we turn to meta-learning algorithms that can adapt the parameters of the agent continuously, based on the stream of experience and some notion of the agent’s own learning progress. These meta-learning approaches can potentially improve robustness, and also help the agent specialize to the current task, and thus improve learning speed.

Temporal difference learning methods make use of two important parameters: the step-size parameter and the trace-decay parameter. The step-size parameter is the same as that used in stochastic gradient descent, and there are algorithms available for adjusting this parameter online in reinforcement learning dabney2012adaptive . For the trace-decay parameter, on the other hand, we have no generally applicable meta-learning algorithms that are compatible with function approximation, incremental processing, and off-policy sampling.

The difficulty in adapting the trace-decay parameter, $\lambda$, mainly arises from the fact that it seemingly has multiple roles and also influences the fixed-point solution. This parameter was introduced in Samuel’s checker player samuel1959some , and later described as an interpolation parameter between offline TD(0) and Monte Carlo sampling (TD($\lambda = 1$)) by Sutton sutton1988learning . It has been empirically demonstrated that values of $\lambda$ between zero and one often perform the best in practice sutton1988learning ; sutton1998introduction ; vanseijen2014true . This trace parameter can also be viewed as a bias-variance trade-off parameter: $\lambda$ closer to one is less biased but likely to have higher variance, whereas $\lambda$ closer to zero is more biased but likely to have lower variance. However, it has also been described as a credit-assignment parameter singh1996reinforcement , as a method to encode the probability of transitions sutton1994onstep , as a way to incorporate the agent’s confidence in its value function estimates sutton1998introduction ; tesauro1992practical , and as an averaging of n-step returns sutton1998introduction . Selecting $\lambda$ is further complicated by the fact that $\lambda$ is a part of the problem definition: the solution to the Bellman fixed point equation is dependent on the choice of $\lambda$ (unlike the step-size parameter).

There are few approaches for setting $\lambda$, and most existing work is limited to special cases. For instance, several approaches have analyzed setting $\lambda$ for variants of TD that were introduced to simplify the analysis, including phased TD kearns2000bias and TD$^*(\lambda)$ schapire1996ontheworst . Though both provide valuable insights into the role of $\lambda$, the analysis does not easily extend to conventional TD algorithms. Sutton and Singh sutton1994onstep investigated tuning both the learning rate parameter and $\lambda$, and proposed two meta-learning algorithms. The first assumes the problem can be modeled by an acyclic MDP, and the other requires access to the transition model of the MDP. Singh and Dayan singh1996analytical and Kearns and Singh kearns2000bias contributed extensive simulation studies of the interaction between $\lambda$ and other agent parameters on a chain MDP, but again relied on access to the model and offline computation. The most recent study downey2010temporal explores a Bayesian variant of TD learning, but requires a batch of samples and can only be used off-line. Finally, Konidaris et al. konidaris2011td introduce TD$_\gamma$ as a method to remove the $\lambda$ parameter altogether. Their approach, however, has not been extended to the off-policy setting and their full algorithm is too computationally expensive for incremental estimation, while their incremental variant introduces a sensitive meta-parameter. Although this long history of prior work has helped develop our intuitions about $\lambda$, the available solutions are still far from the use cases outlined above.

This paper introduces a new objective based on locally optimizing bias-variance, which we use to develop an efficient, incremental algorithm for learning state-based $\lambda$. We use a forward-backward analysis sutton1998introduction to derive an incremental algorithm to estimate the variance of the return. Using this estimate, we obtain a closed-form estimate of $\lambda$ on each time-step. Finally, we empirically demonstrate the generality of the approach with a suite of on-policy and off-policy experiments. Our results show that our new algorithm, $\lambda$-greedy, is consistently amongst the best performing, adapting as the problem changes, whereas any fixed approach works well in some settings and poorly in others.

## 2 Background

We model the agent’s interaction with an unknown environment as a discrete time Markov Decision Process (MDP). An MDP is characterized by a finite set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $r$, and a generalized state-based discount $\gamma: \mathcal{S} \to [0,1]$, which encodes the level of discounting per-state (e.g., a common setting is a constant discount for all states). On each of a discrete number of timesteps, $t = 1, 2, 3, \ldots$, the agent observes the current state $S_t$, selects an action $A_t$ according to its target policy $\pi$, and the environment transitions to a new state $S_{t+1}$ and emits a reward $R_{t+1}$. The state transitions are governed by the transition function $P$, where $P(s, a, s')$ denotes the probability of transitioning from $s$ to $s'$ due to action $a$. At timestep $t$, the future rewards are summarized by the Monte Carlo (MC) return $G_t$, defined by the infinite discounted sum

$$G_t = R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \ldots = R_{t+1} + \gamma_{t+1} G_{t+1}.$$

The agent’s objective is to estimate the expected return or value function, $v_\pi$, defined as $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, A_t \sim \pi]$. We estimate the value function using the standard framework of linear function approximation. We assume the state of the environment at time $t$ can be characterized by a fixed-length feature vector $\mathbf{x}_t \in \mathbb{R}^n$, where $n \ll |\mathcal{S}|$; implicitly, $\mathbf{x}_t$ is a function of the random variable $S_t$. The agent uses a linear estimate of the value of $S_t$: the inner product of $\mathbf{x}_t$ and a modifiable set of weights $\mathbf{w} \in \mathbb{R}^n$, $\hat{v}(S_t, \mathbf{w}) = \mathbf{x}_t^\top \mathbf{w}$, with mean-squared error (MSE) $\sum_{s \in \mathcal{S}} d(s)\,(v_\pi(s) - \hat{v}(s, \mathbf{w}))^2$, where $d$ encodes the distribution over states induced by the agent’s behavior in the MDP.

Instead of estimating the expected value of $G_t$, we can estimate a $\lambda$-return that is expected to have lower variance

$$G^\lambda_t = R_{t+1} + \gamma_{t+1}\left[(1 - \lambda_{t+1})\,\mathbf{x}_{t+1}^\top \mathbf{w} + \lambda_{t+1} G^\lambda_{t+1}\right],$$

where the trace decay function $\lambda: \mathcal{S} \to [0,1]$ specifies the trace parameter as a function of state. The trace parameter averages the estimate of the return, $\mathbf{x}_{t+1}^\top \mathbf{w}$, and the $\lambda$-return starting on the next step, $G^\lambda_{t+1}$. When $\lambda = 1$, $G^\lambda_t$ becomes the MC return $G_t$, and the value function can be estimated by averaging rollouts from each state. When $\lambda = 0$, $G^\lambda_t$ becomes equal to the one-step $\lambda$-return, $R_{t+1} + \gamma_{t+1}\mathbf{x}_{t+1}^\top \mathbf{w}$, and the value function can be estimated by the linear TD(0) algorithm. The $\lambda$-return when $0 < \lambda < 1$ is often easier to estimate than MC, and yields more accurate predictions than using the one-step return. The intuition is that for large $\lambda$, the estimate is high-variance due to averaging possibly long trajectories of noisy rewards, but has less bias because the initial biased estimates of the value function participate less in the computation of the return. In the case of low $\lambda$, the estimate has lower variance because fewer potentially noisy rewards participate in $G^\lambda_t$, but there is more bias due to the increased role of the initial value function estimates. We further discuss the intuition for this parameter in the next section.
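The recursive definition of the $\lambda$-return can be made concrete with a short backward pass over a finite trajectory. This is a minimal sketch, assuming an episodic trajectory whose return after the final step is zero; the function name and array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def lambda_returns(rewards, values_next, gammas, lambdas):
    """Compute the lambda-return for every step of a finite trajectory.

    rewards[t]     : reward received on the transition out of step t
    values_next[t] : current value estimate of the state reached at t+1
    gammas[t]      : discount applied on entering the state at t+1
    lambdas[t]     : trace parameter of the state at t+1
    Assumption: the return after the final step is zero (episodic case).
    """
    T = len(rewards)
    g = np.zeros(T)
    g_next = 0.0
    for t in reversed(range(T)):
        # Mix the bootstrapped estimate and the lambda-return from the next step.
        g[t] = rewards[t] + gammas[t] * (
            (1.0 - lambdas[t]) * values_next[t] + lambdas[t] * g_next
        )
        g_next = g[t]
    return g
```

With all trace parameters at zero the result is the one-step return, and with all at one it is the Monte Carlo return, matching the two limiting cases described in the text.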

The generalization to state-based $\gamma$ and $\lambda$ has not yet been widely considered, though the concept was introduced more than a decade ago sutton1995td ; sutton1999between and the generalization has been shown to be useful sutton1995td ; maei2010gq ; modayil2014multi ; sutton2015anemphatic . The Bellman operator can be generalized to include state-based $\gamma$ and $\lambda$ (see (sutton2015anemphatic, Equation 29)), where the choice of $\lambda$ per-state influences the fixed point. Time-based $\lambda$, on the other hand, would not result in a well-defined fixed point. Therefore, to ensure a well-defined fixed point, we will design an objective and algorithm to learn a state-based $\lambda$.

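This paper considers both on- and off-policy policy evaluation. In the more conventional on-policy learning setting, we estimate $v_\pi$ based on samples generated while selecting actions according to the target policy $\pi$. In the off-policy case, we estimate $v_\pi$ based on samples generated while selecting actions according to the behavior policy $\mu$, with $\pi \neq \mu$. In order to learn in both these settings we use the GTD($\lambda$) algorithm maei2011gradient specified by the following update equations:

$$\begin{aligned}
\delta_t &= R_{t+1} + \gamma_{t+1}\mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t \\
\mathbf{e}_t &= \rho_t\left(\gamma_t \lambda_t \mathbf{e}_{t-1} + \mathbf{x}_t\right) \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\left(\delta_t \mathbf{e}_t - \gamma_{t+1}(1 - \lambda_{t+1})(\mathbf{e}_t^\top \mathbf{h}_t)\,\mathbf{x}_{t+1}\right) \\
\mathbf{h}_{t+1} &= \mathbf{h}_t + \alpha_h\left(\delta_t \mathbf{e}_t - (\mathbf{x}_t^\top \mathbf{h}_t)\,\mathbf{x}_t\right)
\end{aligned}$$

with step-sizes $\alpha, \alpha_h > 0$ and arbitrary initial $\mathbf{w}_0, \mathbf{h}_0$ (e.g., the zero vector). The importance sampling ratio $\rho_t = \pi(S_t, A_t) / \mu(S_t, A_t)$ facilitates learning about rewards as if they were generated by following $\pi$, instead of $\mu$. This ratio can be very large if $\mu(S_t, A_t)$ is small, which can compound and destabilize learning.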
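As an illustration, one step of a gradient TD update with eligibility traces can be sketched as follows. This follows a commonly stated form of GTD($\lambda$) with state-dependent discount and trace parameters; the exact form in maei2011gradient may differ in details, and all argument names here are assumptions made for the example.

```python
import numpy as np

def gtd_lambda_step(w, h, e, x, x_next, reward, rho,
                    gamma, gamma_next, lam, lam_next, alpha, alpha_h):
    """One GTD(lambda)-style update (a sketch, not a reference implementation).

    w, h   : main and auxiliary weight vectors
    e      : eligibility trace from the previous step
    rho    : importance sampling ratio pi(a|s) / mu(a|s) for the action taken
    gamma, lam           : discount and trace parameter of the current state
    gamma_next, lam_next : discount and trace parameter of the next state
    """
    # The trace is scaled by rho, so repeated large ratios compound over time.
    e = rho * (gamma * lam * e + x)
    delta = reward + gamma_next * (x_next @ w) - x @ w
    # Gradient correction term vanishes when lam_next == 1.
    correction = gamma_next * (1.0 - lam_next) * (e @ h) * x_next
    w_new = w + alpha * (delta * e - correction)
    h_new = h + alpha_h * (delta * e - (x @ h) * x)
    return w_new, h_new, e
```

In the on-policy case `rho` is 1 on every step and the update reduces to a TD($\lambda$) update plus the correction term.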

## 3 Objective for trace adaptation

To obtain an objective for selecting $\lambda$, we need to clarify its role. Although $\lambda$ was not introduced with the goal of trading off bias and variance sutton1988learning , several algorithms and significant theory have developed its role as such kearns2000bias ; schapire1996ontheworst . Other roles have been suggested; however, as we discuss below, each of them can still be thought of as a bias-variance trade-off.

The parameter $\lambda$ has been described as a credit assignment parameter, which allows TD($\lambda$) to perform multi-step updates on each time step. On each update, $\lambda$ controls the amount of credit assigned to previous transitions, using the eligibility trace $\mathbf{e}_t$. For $\lambda$ close to 1, TD($\lambda$) assigns more credit for the current reward to previous transitions, resulting in updates to many states along the current trajectory. Conversely, for $\lambda = 0$, the eligibility trace is cleared and no credit is assigned back in time, performing a single-step TD(0) update. In fact, this intuition can still be thought of as a bias-variance trade-off. In terms of credit assignment, we ideally always want to send maximal credit ($\lambda = 1$), decayed by $\gamma$, for the current reward, which is also unbiased. In practice, however, this often leads to high variance, and thus we mitigate the variance by choosing $\lambda$ less than one, which speeds learning overall but introduces bias.

Another interpretation is that $\lambda$ should be set to reflect confidence in value function estimates tesauro1992practical ; sutton1998introduction . If our confidence in the value estimate of state $s$ is high, then $\lambda(s)$ should be close to 0, meaning we trust the estimates provided by $\hat{v}$. If our confidence is low, suspecting that $\hat{v}$ may be inaccurate, then $\lambda(s)$ should be close to 1, meaning we trust observed rewards more. For example, in states that are indistinguishable with function approximation (i.e., aliased states), we should not trust $\hat{v}$ as much. This intuition similarly translates to bias-variance. If $\hat{v}$ is accurate, then decreasing $\lambda(s)$ does not incur (much) bias, but can significantly decrease the variance since $\hat{v}$ gives the correct value. If $\hat{v}$ is inaccurate, then the increased bias is not worth the reduced variance, so $\lambda(s)$ should be closer to 1 to use actual (potentially high-variance) samples.

Finally, a less commonly discussed interpretation is that $\lambda$ acts as a parameter that simulates a form of experience replay (or model-based simulation of trajectories). One can imagine that sending back information in eligibility traces is like simulating experience from a model, where the model could be a set of trajectories, as in experience replay lin1992self . If $\lambda = 1$, the traces are longer and each update gets more trajectory information, or experience replay. If a trajectory from a point, however, was unlikely (e.g., a rare transition), we may not want to use that information. Such an approach was taken by Sutton and Singh sutton1994onstep , where $\lambda$ was set to the transition probabilities. Even in this model-based interpretation, the goal in setting $\lambda$ becomes one of mitigating variance, without incurring too much bias.

Optimizing this bias-variance trade-off, however, is difficult because $\lambda$ affects the return we are approximating. Jointly optimizing for $\lambda$ across all time-steps is generally not feasible. One strategy is to take a batch approach, where the optimal $\lambda$ is determined after seeing all the data downey2010temporal . Our goal, however, is to develop approaches for the online setting, where future states, actions, rewards and the influence of $\lambda$ have yet to be observed.

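We propose to take a greedy approach: on each time step select $\lambda_{t+1}$ to optimize the bias-variance trade-off for only this step. This greedy objective corresponds to minimizing the mean-squared error between the unbiased $\lambda = 1$ return $G_t$ and the estimate $\hat{G}_t$ with $\lambda_{t+1} \in [0, 1]$ on this step and $\lambda = 1$ into the future after $t+1$:

$$\hat{G}_t = \rho_t\left(R_{t+1} + \gamma_{t+1}\left[(1 - \lambda_{t+1})\,\mathbf{x}_{t+1}^\top \mathbf{w}_t + \lambda_{t+1} G_{t+1}\right]\right).$$

Notice that $\hat{G}_t$ interpolates between the current value estimate and the unbiased MC return, and so is not recursive. Picking $\lambda_{t+1} = 1$ gives an unbiased estimate, since then we would be estimating $\hat{G}_t = \rho_t(R_{t+1} + \gamma_{t+1} G_{t+1}) = G_t$, the importance-weighted MC return. We greedily decide how $\lambda_{t+1}$ should be set on this step to locally optimize the mean-squared error (i.e., bias-variance). This greedy decision is made given both $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$, which are both available when choosing $\lambda_{t+1}$. To simplify notation in this section, we assume that $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$ are both given in the expectations below.

To minimize the mean-squared error in terms of $\lambda_{t+1}$,

$$\mathrm{MSE}(\lambda_{t+1}) = \mathbb{E}\left[(\hat{G}_t - \mathbb{E}[G_t])^2\right] = \left(\mathbb{E}[\hat{G}_t] - \mathbb{E}[G_t]\right)^2 + \mathrm{Var}[\hat{G}_t],$$

we will consider the two terms that compose the mean-squared error: the squared bias term and the variance term.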

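Let us begin by rewriting the bias. Since $\rho_t$, $\gamma_{t+1}$, and $\mathbf{x}_{t+1}$ are fixed when choosing $\lambda_{t+1}$,

$$\mathbb{E}[\hat{G}_t] - \mathbb{E}[G_t] = \rho_t \gamma_{t+1}(1 - \lambda_{t+1})\left(\mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbb{E}[G_{t+1}]\right).$$

For convenience, define

$$\mathrm{err}(\mathbf{w}, \mathbf{x}_{t+1}) = \mathbb{E}[G_{t+1}] - \mathbf{x}_{t+1}^\top \mathbf{w} \tag{1}$$

as the difference between the expected return and the current approximate value from state $S_{t+1}$ using weights $\mathbf{w}$. Using this definition, we can rewrite

$$\mathbb{E}[\hat{G}_t] - \mathbb{E}[G_t] = -\rho_t \gamma_{t+1}(1 - \lambda_{t+1})\,\mathrm{err}(\mathbf{w}_t, \mathbf{x}_{t+1}),$$

giving

$$\mathrm{Bias}^2 = \rho_t^2 \gamma_{t+1}^2 (1 - \lambda_{t+1})^2\,\mathrm{err}(\mathbf{w}_t, \mathbf{x}_{t+1})^2.$$

For the variance term, we will assume that the noise in the reward $R_{t+1}$ given $\mathbf{x}_t$ and $\mathbf{x}_{t+1}$ is independent of the other dynamics mannor2004bias , with variance $\sigma_r^2$. Again, since $\rho_t$, $\gamma_{t+1}$, and $\mathbf{x}_{t+1}$ are fixed,

$$\mathrm{Var}[\hat{G}_t] = \rho_t^2 \sigma_r^2 + \rho_t^2 \gamma_{t+1}^2 \lambda_{t+1}^2\,\mathrm{Var}[G_{t+1}].$$

Finally, we can drop the constant $\rho_t^2 \sigma_r^2$ in the objective, and drop the $\rho_t^2 \gamma_{t+1}^2$ in both the bias and variance terms as it only scales the objective, giving the optimization

$$\min_{\lambda_{t+1} \in [0,1]} \; (1 - \lambda_{t+1})^2\,\mathrm{err}(\mathbf{w}_t, \mathbf{x}_{t+1})^2 + \lambda_{t+1}^2\,\mathrm{Var}[G_{t+1}].$$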

We can take the gradient of this optimization to find a closed-form solution

$$\lambda_{t+1} = \frac{\mathrm{err}(\mathbf{w}_t, \mathbf{x}_{t+1})^2}{\mathrm{Var}[G_{t+1}] + \mathrm{err}(\mathbf{w}_t, \mathbf{x}_{t+1})^2} \tag{2}$$

which is always feasible, unless both the variance and error are zero (in which case, any choice of $\lambda_{t+1}$ is equivalent). Though the importance sampling ratio $\rho_t$ does not affect the choice of $\lambda$ on the current time step, it can have a dramatic effect on $\lambda$ into the future via the eligibility trace. For example, when the target and behavior policy are strongly mismatched, $\rho_t$ can be large, which multiplies into the eligibility trace $\mathbf{e}_t$. If several steps have large $\rho_t$, then $\mathbf{e}_t$ can get very large. In this case, the equation in (2) would select a small $\lambda$, significantly decreasing variance.
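The closed form can be checked with a short worked derivation. Here $e$ is shorthand for the error term in (1) and $v$ for the variance of the next-step return; both abbreviations are introduced only for this sketch.

```latex
\begin{align*}
J(\lambda) &= (1 - \lambda)^2 e^2 + \lambda^2 v \\
\frac{dJ}{d\lambda} &= -2(1 - \lambda)\,e^2 + 2\lambda v = 0 \\
\lambda\,(e^2 + v) &= e^2 \\
\lambda^{*} &= \frac{e^2}{e^2 + v}
\end{align*}
```

Since $J''(\lambda) = 2(e^2 + v) \ge 0$, the stationary point is a minimum, and because $e^2 \ge 0$ and $v \ge 0$ it always lies in $[0, 1]$, so the constraint on $\lambda$ is never active.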

## 4 Trace adaptation algorithm

To approximate the solution to our proposed optimization, we need a way to approximate the error and the variance terms in equation (2). To estimate the error, we need an estimate of the expected return from each state, $\mathbb{E}[G_t]$. To estimate the variance, we need to obtain an estimate of $\mathbb{E}[G_t^2]$, and then can use $\mathrm{Var}[G_t] = \mathbb{E}[G_t^2] - \mathbb{E}[G_t]^2$. The estimation of the expected return is in fact the problem tackled by this paper, and one could use a TD algorithm, learning a weight vector $\mathbf{w}^{err}$ to obtain the approximation $\mathbf{x}_t^\top \mathbf{w}^{err}$ to $\mathbb{E}[G_t]$. This approach may seem problematic, as this sub-step appears to be solving the same problem we originally aimed to solve. However, as in many meta-parameter optimization approaches, this approximation can be inaccurate and still adequately guide selection of $\lambda$. We discuss this further in the experimental results section.

Similarly, we would like to estimate $\mathbb{E}[G_t^2]$ with $\mathbf{x}_t^\top \mathbf{w}^{sq}$ by learning $\mathbf{w}^{sq}$; estimating the variance or the second moment of the return, however, has not been extensively studied. Sobel sobel1982thevariance provides a Bellman equation for the variance of the $\lambda$-return for a single fixed value of $\lambda$. There is also an extensive literature on risk-averse MDP learning, where the variance of the return is often used as a measure morimura2010parametric ; mannor2011mean ; prashanth2013actor ; tamar2013temporal ; however, an explicit way to estimate the variance of the return for other values of $\lambda$ is not given. There has also been some work on estimating the variance of the value function mannor2004bias ; white2010interval , for general $\lambda$; though related, this is different than estimating the variance of the $\lambda$-return.

In the next section, we provide a derivation for a new algorithm called variance temporal difference learning (VTD), to approximate the second moment of the return for any state-based $\lambda$. The general VTD updates are given at the end of Section 5.2.
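For $\lambda$-greedy, we use VTD to estimate the variance, with the complete algorithm summarized in Algorithm 1. We opt for simple meta-parameter settings, so that no additional parameters are introduced. We use the same step-size $\alpha$ that is used for the main weights to update $\mathbf{w}^{err}$ and $\mathbf{w}^{sq}$. In addition, we set the initial weights $\mathbf{w}^{err}_0$ and $\mathbf{w}^{sq}_0$ to reflect *a priori* estimates of error and variance. As a reasonable rule-of-thumb, $\mathbf{w}^{err}_0$ should be set larger than $\mathbf{w}^{sq}_0$, to reflect that initial value estimates are inaccurate. This results in an estimate of variance, $\mathbf{x}^\top \mathbf{w}^{sq} - (\mathbf{x}^\top \mathbf{w}^{err})^2$, that is capped at zero until $\mathbf{x}^\top \mathbf{w}^{sq}$ becomes larger than $(\mathbf{x}^\top \mathbf{w}^{err})^2$.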
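The selection rule described in Sections 3 and 4 can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's Algorithm 1: `w_err` and `w_sq` are hypothetical names for the weight vectors estimating the expected return and its second moment, and returning 1.0 when both error and variance are zero is an arbitrary choice, since any value is equivalent in that case.

```python
import numpy as np

def greedy_lambda(x_next, w, w_err, w_sq, eps=1e-12):
    """Pick the trace parameter for the next state from the closed form
    err^2 / (var + err^2), using learned estimates of both quantities.

    x_next : feature vector of the next state
    w      : main value-function weights
    w_err  : weights estimating the expected return (hypothetical name)
    w_sq   : weights estimating the second moment of the return (hypothetical name)
    """
    g_hat = x_next @ w_err                      # estimated expected return
    err_sq = (g_hat - x_next @ w) ** 2          # squared error of current value estimate
    var = max(0.0, x_next @ w_sq - g_hat ** 2)  # variance estimate, capped at zero
    denom = err_sq + var
    if denom <= eps:
        # Error and variance both (numerically) zero: every lambda is equivalent.
        return 1.0
    return err_sq / denom
```

While the second-moment estimate is still below the squared first-moment estimate, the capped variance is zero and the rule returns 1, so early learning relies on sampled returns until the variance estimate becomes informative.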

## 5 Approximating the second moment of the return

In this section, we derive the general VTD algorithm to approximate the second moment of the $\lambda$-return. Though we will set $\bar{\lambda} = 1$ in our algorithm, we nonetheless provide the more general algorithm as the only model-free variance estimation approach for general $\lambda$-returns.

The key novelty is in determining a Bellman operator for the squared return, which then defines a fixed-point objective, called the Var-MSPBE. With this Bellman operator and recursive form for the squared return, we derive a gradient TD algorithm, called VTD, for estimating the second moment. To avoid confusion with parameters for the main algorithm, as a general rule throughout the document, the additional parameters used to estimate the second moment have a bar. For example, $\gamma$ is the discount for the main problem, and $\bar{\gamma}$ is the discount for the second moment.

### 5.1 Bellman operator for squared return

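The recursive form for the squared-return is

$$(G^\lambda_t)^2 = \rho_t^2\left(\bar{G}_t + \gamma_{t+1}\lambda_{t+1} G^\lambda_{t+1}\right)^2 = \bar{R}_{t+1} + \bar{\gamma}_{t+1}\,(G^\lambda_{t+1})^2$$

where for a given $\lambda$ and $\mathbf{w}$,

$$\begin{aligned}
\bar{G}_t &= R_{t+1} + \gamma_{t+1}(1 - \lambda_{t+1})\,\mathbf{x}_{t+1}^\top \mathbf{w} \\
\bar{R}_{t+1} &= \rho_t^2 \bar{G}_t^2 + 2\rho_t^2 \gamma_{t+1}\lambda_{t+1}\,\bar{G}_t\,G^\lambda_{t+1} \\
\bar{\gamma}_{t+1} &= \rho_t^2 \gamma_{t+1}^2 \lambda_{t+1}^2.
\end{aligned}$$

The $\mathbf{w}$ are the weights for the $\lambda$-return, and not the weights $\bar{\mathbf{w}}$ we will learn for approximating the second moment. For further generality, we introduce a meta-parameter $\bar{\lambda}$ to get a $\bar{\lambda}$-squared-return $\bar{G}^{\bar{\lambda}}_t$, where for $\bar{\lambda} = 1$, $\bar{G}^{\bar{\lambda}}_t = (G^\lambda_t)^2$. This meta-parameter $\bar{\lambda}$ plays the same role for estimating $(G^\lambda_t)^2$ as $\lambda$ for estimating $G_t$.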
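The recursion for the squared return can be verified numerically by expanding the square of an importance-weighted $\lambda$-return. The sketch below assumes the off-policy form in which the ratio multiplies the whole one-step target; the function and variable names are illustrative and not from the paper.

```python
def squared_return_terms(reward, v_next, g_next, rho, gamma_next, lam_next):
    """Return the 'reward' and 'discount' of the squared-return recursion.

    reward     : reward on this transition
    v_next     : value estimate of the next state
    g_next     : lambda-return from the next state
    rho        : importance sampling ratio for this transition
    gamma_next : discount of the next state
    lam_next   : trace parameter of the next state
    """
    # Part of the target that does not depend on the next lambda-return.
    g_bar = reward + gamma_next * (1.0 - lam_next) * v_next
    # Squared constant part plus the cross term of the expanded square.
    r_bar = rho**2 * g_bar**2 + 2.0 * rho**2 * gamma_next * lam_next * g_bar * g_next
    # Coefficient on the squared next-step return.
    gamma_bar = rho**2 * gamma_next**2 * lam_next**2
    return r_bar, gamma_bar


# Direct computation of the lambda-return for one transition.
reward, v_next, g_next = 1.0, 2.0, 3.0
rho, gamma_next, lam_next = 1.5, 0.9, 0.5
g = rho * (reward + gamma_next * ((1.0 - lam_next) * v_next + lam_next * g_next))
r_bar, gamma_bar = squared_return_terms(reward, v_next, g_next, rho, gamma_next, lam_next)
residual = g**2 - (r_bar + gamma_bar * g_next**2)
```

The residual is zero up to floating-point error. Note that the effective discount `gamma_bar` exceeds one whenever the product of ratio, discount, and trace parameter does, which is the source of the contraction issue discussed next.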

We can define a generalized Bellman operator $\bar{T}$ for the squared-return, using the above recursive form. The goal is to obtain the fixed point $\bar{T}\bar{v} = \bar{v}$, where a fixed point exists if the operator is a contraction. For the first moment, the Bellman operator is known to be a contraction tsitsiklis1997ananalysis . This result, however, does not immediately extend here because, though $\bar{R}_{t+1}$ is a valid finite reward, $\bar{\gamma}_{t+1}$ does not satisfy $\bar{\gamma}_{t+1} \le 1$, because $\rho_t$ can be large.

We can nonetheless define such a Bellman operator for the $\bar{\lambda}$-squared-return and determine if a fixed point exists. Interestingly, $\bar{\gamma}_{t+1}$ can in fact be larger than $1$, and we can still obtain a contraction. To define the Bellman operator, we use a recent generalization that enables the discount to be defined as a function of $(s, a, s')$ white2016transition , rather than just as a function of $s'$. We first define $\bar{v}$, the expected $\bar{\lambda}$-squared-return

where

(3) | ||||

Using similar equations to the generalized Bellman operator white2016transition , we can define

where is a matrix with on the diagonal, for all

. The infinite sum is convergent if the maximum singular value of

is less than 1, in which case the fixed-point solution is finite. Otherwise, the value is infinite, and one can see that the variance of the return is in fact infinite!

We can naturally investigate when the second moment of the return is guaranteed to be finite. This condition on $\bar{\gamma}$ should facilitate identifying theoretical conditions on the target and behavior policies that enable finite variance of the return. This theoretical characterization is outside of the scope of this work, but we can reason about different settings that provide a well-defined, finite fixed point. First, clearly setting $\lambda(s) = 0$ for every state ensures a finite second moment, given finite value estimates, regardless of policy mismatch. For the on-policy setting, where $\rho_t = 1$, we have $\bar{\gamma}_{t+1} \le 1$, and so a well-defined fixed point exists under standard assumptions (see white2016transition ). For the off-policy setting, if $\bar{\gamma}_{t+1} = \rho_t^2 \gamma_{t+1}^2 \lambda_{t+1}^2 \le 1$ on every transition, this is similarly the case. Otherwise, a solution may still exist, provided the maximum singular value of the same matrix is less than one; we hypothesize that this property is unlikely to hold if there is a large mismatch between the target and behavior policy, causing many large $\rho_t$. An important future avenue is to understand the required similarity between $\pi$ and $\mu$ to enable finite variance of the return, for any given $\lambda$. Interestingly, the $\lambda$-greedy algorithm should adapt to such infinite variance settings, where (2) will set $\lambda = 0$.

### 5.2 VTD derivation

In this section, we propose Var-MSPBE, the mean-squared projected Bellman error (MSPBE) objective for the $\bar{\lambda}$-squared-return, and derive VTD to optimize this objective. Given the definition of the generalized Bellman operator $\bar{T}$, the derivation parallels GTD($\lambda$) for the first moment maei2011gradient . The main difference is in obtaining unbiased estimates of parts of the objective; we will therefore focus the results on this novel aspect, summarized in the two theorems and corollary below.

Define the error of the estimate to the future -squared-return

and, as in previous work sutton2009fast ; maei2011gradient , we define the MSPBE that corresponds to

To obtain the gradient of the objective, we prove that we can obtain an unbiased sample of (a forward view) using a trace of the past (a backward view). The equivalence is simpler if we assume that we have access to an estimate of the first moment of the -return. For our setting, we do in fact have such an estimate, because we simultaneously learn . We include the more general expectation equivalence in Theorem 2, with all proofs in the appendix.

###### Theorem 1

For a given unbiased estimate of

,
define

Then

###### Theorem 2

where

###### Corollary 3

For for all ,

where

To derive the VTD algorithm, we take the gradient of the Var-MSPBE. As this again parallels GTD($\lambda$), we include the derivation in the appendix for completeness and provide only the final result here.

As with previous gradient TD algorithms, we will learn an auxiliary set of weights to estimate a part of this objective: . To obtain such an estimate, notice that corresponds to an LMS solution, where the goal is to obtain that estimates . Therefore, we can use an LMS update for , giving the final set of update equations for VTD:

For $\lambda$-greedy, we set $\bar{\lambda} = 1$, causing the term with the auxiliary weights to be multiplied by $1 - \bar{\lambda} = 0$, and so removing the need to approximate the auxiliary weights.
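To illustrate how the second moment could be learned once the auxiliary weights drop out, the sketch below applies a plain linear TD update with an accumulating trace to the squared-return recursion, treating the squared-return "reward" and "discount" as the reward and discount of an ordinary prediction problem. It is an assumption-laden illustration, not the paper's VTD update equations: it takes the return whose variance is estimated to be the MC return, substitutes a learned estimate `g_hat_next` for the unknown next-step return in the cross term, and uses hypothetical names throughout.

```python
import numpy as np

def second_moment_step(w_sq, z, x, x_next, reward, g_hat_next,
                       rho, gamma_next, gamma_bar_prev, alpha):
    """One TD-style update of second-moment weights (illustrative sketch).

    w_sq           : weights estimating the second moment of the return
    z              : accumulating trace for the second-moment learner
    g_hat_next     : estimate of the expected return from the next state
    gamma_bar_prev : squared-return discount computed on the previous step
                     (use 0.0 at the start of a trajectory)
    """
    # Reward and discount of the squared-return recursion with lambda = 1.
    r_bar = rho**2 * reward**2 + 2.0 * rho**2 * gamma_next * reward * g_hat_next
    gamma_bar = rho**2 * gamma_next**2
    # Accumulating trace, decayed by the discount that led into the current state.
    z = gamma_bar_prev * z + x
    delta_bar = r_bar + gamma_bar * (x_next @ w_sq) - x @ w_sq
    w_sq = w_sq + alpha * delta_bar * z
    # gamma_bar is returned so the caller can pass it as gamma_bar_prev next step.
    return w_sq, z, gamma_bar
```

Because `gamma_bar` scales with the squared ratio, a run of large importance sampling ratios inflates both the trace and the target, which is consistent with the text's observation that the second moment can become infinite under strong policy mismatch.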

## 6 Related work

There has been a significant effort to empirically investigate $\lambda$, typically using batch off-line computing and model-based techniques. Sutton and Singh sutton1994onstep investigated tuning both $\alpha$ and $\lambda$. They proposed three algorithms: the first two assume the underlying MDP has no cycles, and the third makes use of an estimate of the transition probabilities and is thus of most interest in tabular domains. Singh and Dayan singh1996analytical provided analytical expressions for bias and variance, given the model. They suggest that there is a largest feasible step-size $\alpha$, below which bias converges to zero and variance converges to a non-zero value, and above which bias and/or variance may diverge. Downey and Sanner downey2010temporal used a Bayesian variant of TD learning, requiring a batch of samples and off-line computation, but did provide an empirical demonstration of optimally setting $\lambda$ after obtaining all the samples. Kearns and Singh kearns2000bias compute a bias-variance error bound for a modification of TD called phased TD. In each discrete phase the algorithm is given $n$ trajectories from each state. Because we have $n$ trajectories in each state, the effective learning rate is $1/n$, removing the complexities of sample averaging in the conventional online TD-update. The error bounds are useful for, among other things, computing a new $\lambda$ value for each phase which outperforms any fixed $\lambda$ value, empirically demonstrating the utility of changing $\lambda$.

There has also been a significant effort to theoretically characterize $\lambda$. Most notably, the work of Schapire and Warmuth schapire1996ontheworst contributed a finite sample analysis of incremental TD-style algorithms. They analyze a variant of TD called TD$^*(\lambda)$, which, although still linear and incremental, computes value estimates quite differently. The resulting finite sample bound is particularly interesting, as it does not rely on model assumptions, using only access to a sequence of feature vectors, rewards and returns. Unfortunately, the bound cannot be analytically minimized to produce an optimal $\lambda$ value. They did simulate their bound, further verifying the intuition that $\lambda$ should be larger if the best linear predictor is inaccurate, small if accurate, and an intermediate value otherwise. Li li2008aworst later derived similar bounds for another gradient descent algorithm, called residual gradient. This algorithm, however, does not utilize eligibility traces and converges to a different solution than TD methods when function approximation is used sutton2009fast .

Another approach involves removing the $\lambda$ parameter altogether, in an effort to improve robustness. Konidaris et al. konidaris2011td introduced a new TD method called TD$_\gamma$. Their work defines a plausible set of assumptions implicitly made when constructing the $\lambda$-returns, and then relaxes one of those assumptions. They derive an exact (but computationally expensive) algorithm, TD$_\gamma$, that no longer depends on a choice of $\lambda$ and performs well empirically in a variety of policy learning benchmarks. The incremental approximation to TD$_\gamma$ also performs reasonably well, but appears to be somewhat sensitive to the choice of its meta-parameter, and often requires large values to obtain good performance. This can be problematic, as the complexity then grows with the length of the trajectories, rather than linearly in the feature vector size. Nonetheless, TD$_\gamma$ constitutes a reasonable way to reduce parameter sensitivity in the on-policy setting. Garcia and Serre garcia2001from proposed a variant of Q-learning, for which the optimal value of $\lambda$ can be computed online. Their analysis, however, was restricted to the tabular case. Finally, Mahmood et al. mahmood2014weighted introduced weighted importance sampling for off-policy learning; though indirect, this is a strategy for enabling larger $\lambda$ to be selected, without destabilizing off-policy learning.

This related work has helped shape our intuition on the role of $\lambda$, and, in special cases, provided effective strategies for adapting