Context-Dependent Upper-Confidence Bounds for Directed Exploration

by Raksha Kumaraswamy, et al.
University of Alberta

Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like ϵ-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches -- because they summarize past interactions -- with the computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values.



1 Introduction

Exploration is crucial in reinforcement learning, as the data gathering process significantly impacts the optimality of the learned policies and values. The agent needs to balance the amount of time taking exploratory actions to learn about the world, versus taking actions to maximize cumulative rewards. If the agent explores insufficiently, it could converge to a suboptimal policy; exploring too conservatively, however, results in many suboptimal decisions. The goal of the agent is data-efficient exploration: to minimize how many samples are wasted in exploration, particularly exploring parts of the world that are known, while still ensuring convergence to the optimal policy.

To achieve such a goal, directed exploration strategies are key. Undirected strategies, where random actions are taken such as in ϵ-greedy, are a common default. In small domains these methods are guaranteed to find an optimal policy (Singh et al., 2000), because the agent is guaranteed to visit the entire space, but may take many steps to do so, as undirected exploration can interfere with improving policies in incremental control. In this paper we explore the idea of constructing confidence intervals around the agent’s value estimates. The agent can use these learned confidence intervals to select actions with the highest upper confidence bound, ensuring that the actions selected are either of high value or have highly uncertain values. This optimistic approach is promising for directed exploration, but as yet there are few such methods that are model-free, incremental and computationally efficient.

Directed exploration strategies have largely been explored under the framework of “optimism in the face of uncertainty” Kaelbling et al. (1996). These can generally be categorized into count-based approaches and confidence-based approaches. Count-based approaches estimate the “known-ness” of a state, typically by maintaining counts for finite state-spaces (Kearns and Singh, 2002; Brafman and Tennenholtz, 2003; Strehl and Littman, 2004; Strehl et al., 2006; Szita and Szepesvari, 2010) and extensions on counting for continuous states (Kakade et al., 2003; Jong and Stone, 2007; Nouri and Littman, 2009; Li et al., 2009; Pazis and Parr, 2013; Kawaguchi, 2016; Ostrovski et al., 2017; Martin et al., 2017). Confidence interval estimates, on the other hand, depend on variance of the target, not just on visitation frequency for states. Confidence-based approaches can be more data-efficient for exploration, because the agent can better direct exploration where the estimates are less accurate. The majority of confidence-based approaches compute confidence intervals on model parameters, both for finite state-spaces (Kaelbling, 1993; Wiering and Schmidhuber, 1998; Kearns and Singh, 2002; Brafman and Tennenholtz, 2003; Auer and Ortner, 2006; Bartlett and Tewari, 2009; Jaksch et al., 2010; Szita and Szepesvari, 2010; Osband et al., 2013) and continuous state-spaces (Jung and Stone, 2010; Ortner and Ryabko, 2012; Grande et al., 2014; Abbasi-Yadkori and Szepesvari, 2014; Osband and Van Roy, 2017). There is early work quantifying uncertainty for value estimates directly for finite state-spaces (Meuleau and Bourgine, 1999), describing the difficulties with extending the local measures of uncertainty from the bandit literature to RL, since there are long-term dependencies.

These difficulties suggest why using confidence intervals directly on value estimates for exploration in RL has been less explored, until recently. More approaches are now being developed that maintain confidence intervals on the value function for continuous state-spaces, by maintaining a distribution over value functions (Grande et al., 2014; Osband et al., 2016b), or by maintaining a randomized set of value functions from which to sample (White and White, 2010; Osband et al., 2016b, a; Plappert et al., 2017; Moerland et al., 2017). Though significant steps forward, these approaches have limitations particularly in terms of computational efficiency. Delayed Gaussian Process Q-learning (DGPQ) (Grande et al., 2014) requires updating two Gaussian processes, which is cubic in the number of basis vectors for the Gaussian process. RLSVI (Osband et al., 2016b) is relatively efficient, maintaining a Gaussian distribution over parameters with Thompson sampling to get randomized values. Their staged approach for finite-horizon problems, however, does not allow for value estimates to be updated online, as the value function is fixed per episode to gather an entire trajectory of data.

Moerland et al. (2017), on the other hand, sample a new parameter vector from the posterior distribution each time an action is considered, which is expensive. The bootstrapping approaches can be efficient, as they simply have to store several value functions, either for training on a bootstrapped subset of samples—such as in Bootstrapped DQN (Osband et al., 2016a)—or for maintaining a moving bootstrap around the changing parameters themselves, for UCBootstrap (White and White, 2010). For both of these approaches, however, it is unclear how many value functions would be required, which could be large depending on the problem.

In this paper, we provide an incremental, model-free exploration algorithm with fast converging upper confidence bounds, called UCLS: Upper-Confidence Least-Squares. We derive the upper confidence bounds for Least-Squares Temporal Difference learning (LSTD), taking advantage of the fact that LSTD has an efficient summary of past interaction to facilitate computation of confidence intervals. Importantly, these upper confidence bounds have context-dependent variance, where variance is dependent on state rather than a global estimate, focusing exploration on states with higher variance. Computing confidence intervals for action-values in RL has remained an open problem, and we provide the first theoretically sound result for obtaining upper confidence bounds for policy evaluation under function approximation, without making strong assumptions on the noise. We demonstrate in several simulated domains that UCLS outperforms DGPQ, UCBootstrap, and RLSVI. We also empirically show the benefit of UCLS over a simplified version that uses a global variance estimate, rather than context-dependent variance.

2 Background

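We focus on the problem of learning an optimal policy for a Markov decision process, from on-policy interaction. A Markov decision process consists of $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the set of states; $\mathcal{A}$ is the set of actions; $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ provides the transition probabilities; $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function; and $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition-based discount function which enables either continuing or episodic problems to be specified (White, 2017). On each step, the agent selects action $A_t$ in state $S_t$, and transitions to $S_{t+1}$, according to $P$, receiving reward $R_{t+1} = R(S_t, A_t, S_{t+1})$ and discount $\gamma_{t+1} = \gamma(S_t, A_t, S_{t+1})$. For a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $\sum_{a \in \mathcal{A}} \pi(s, a) = 1$ for all $s \in \mathcal{S}$, the value at a given state $s$, taking action $a$, is the expected discounted sum of future rewards, with actions selected according to $\pi$ into the future,

$$Q^\pi(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma_{t+1} \sum_{a' \in \mathcal{A}} \pi(S_{t+1}, a') \, Q^\pi(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big].$$

For problems in which $Q^\pi(s, a)$ can be stored in a table, a fixed point for the action-values $Q^\pi$ exists for a given $\pi$. In most domains, $Q^\pi$ must be approximated by $Q^\pi_{\mathbf{w}}$, parametrized by $\mathbf{w} \in \mathbb{R}^d$.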

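In the case of linear function approximation, state-action features $\mathbf{x}(s, a) \in \mathbb{R}^d$ are used to approximate action-values $Q^\pi_{\mathbf{w}}(s, a) = \mathbf{x}(s, a)^\top \mathbf{w}$. The weights $\mathbf{w}$ can be learned with a stochastic approximation algorithm, called temporal difference (TD) learning (Sutton, 1988). The TD update (Sutton, 1988) processes samples one at a time, $\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \mathbf{z}_t$, with $\delta_t = R_{t+1} + \gamma_{t+1} \mathbf{x}_{t+1}^\top \mathbf{w} - \mathbf{x}_t^\top \mathbf{w}$ for $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$. The eligibility trace $\mathbf{z}_t = \mathbf{x}_t + \gamma_t \lambda \mathbf{z}_{t-1}$ facilitates multi-step updates via an exponentially weighted memory of previous feature activations decayed by $\gamma_t$ and $\lambda \in [0, 1]$. Alternatively, we can directly compute the weight vector found by TD using least-squares temporal difference learning (LSTD) (Bradtke and Barto, 1996). The LSTD solution is more data-efficient, and can avoid the need to tune TD’s stepsize parameter $\alpha$. The LSTD update can be efficiently computed incrementally without approximation or storing the data (Bradtke and Barto, 1996; Boyan, 2002), by maintaining a matrix $\mathbf{A}_t$ and vector $\mathbf{b}_t$,

$$\mathbf{A}_t = \mathbf{A}_{t-1} + \frac{1}{t}\left(\mathbf{z}_t (\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1})^\top - \mathbf{A}_{t-1}\right), \qquad \mathbf{b}_t = \mathbf{b}_{t-1} + \frac{1}{t}\left(\mathbf{z}_t R_{t+1} - \mathbf{b}_{t-1}\right).$$

The value function approximation at time step $t$ is the weights $\mathbf{w}_t$ that satisfy the linear system $\mathbf{A}_t \mathbf{w}_t = \mathbf{b}_t$. In practice, the inverse of the matrix $\mathbf{A}_t$ is maintained using a Sherman-Morrison update, with a small regularizer $\eta$ added to the matrix to guarantee invertibility (Szepesvari, 2010).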
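The incremental LSTD bookkeeping described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a constant discount, keeps a running sample average of the matrix `A` and vector `b`, and solves the resulting 2x2 system directly instead of maintaining the inverse with Sherman-Morrison. The two-state cycle, the reward values and all function names are hypothetical choices made for the example.

```python
def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def lstd_update(A, b, z, x, x_next, reward, gamma, lam, t):
    """One incremental LSTD(lambda) step using running sample averages.

    Assumes a constant discount `gamma` (the paper allows a
    transition-based discount).
    """
    d = len(x)
    # Eligibility trace: decayed memory of past features plus current ones.
    z = [gamma * lam * z[i] + x[i] for i in range(d)]
    diff = [x[i] - gamma * x_next[i] for i in range(d)]
    target = outer(z, diff)
    step = 1.0 / t
    A = [[A[i][j] + step * (target[i][j] - A[i][j]) for j in range(d)]
         for i in range(d)]
    b = [b[i] + step * (reward * z[i] - b[i]) for i in range(d)]
    return A, b, z


def solve_2x2(A, b):
    """Solve A w = b for a 2x2 system with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def run_two_state_cycle(num_steps=100, gamma=0.5, lam=0.0):
    """Evaluate a deterministic cycle 0 -> 1 -> 0 with one-hot features."""
    features = [[1.0, 0.0], [0.0, 1.0]]
    rewards = [1.0, 0.0]  # reward received when leaving state 0 / state 1
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    z = [0.0, 0.0]
    state = 0
    for t in range(1, num_steps + 1):
        next_state = 1 - state
        A, b, z = lstd_update(A, b, z, features[state], features[next_state],
                              rewards[state], gamma, lam, t)
        state = next_state
    return solve_2x2(A, b)


weights = run_two_state_cycle()
```

With one-hot features the learned weights are the state values themselves, so the result can be checked against the Bellman equations of the cycle, whose solution is 4/3 for the first state and 2/3 for the second.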

One approach to ensure systematic exploration is to initialize the agent’s value estimates optimistically. The action-value function must be initialized to predict the maximum possible return (or greater) from each state and action. For example, for cost-to-goal problems, with -1 per step, the values can be initialized to zero. For continuing problems, with constant discount $\gamma_c < 1$, the values can be initialized to $R_{\max}/(1 - \gamma_c)$, if the maximum reward $R_{\max}$ is known. For fixed features that are non-negative and encode locality (such as tile coding or radial basis functions), the weights $\mathbf{w}$ can be simply set to $R_{\max}/(1 - \gamma_c)$, to make $Q^\pi_{\mathbf{w}}$ optimistic.

More generally, however, it can be problematic to use optimistic initialization. Optimistic initialization assumes the beginning of time is special: a period when systematic exploration should be performed, after which the agent should more or less exploit its current knowledge. Many problems are non-stationary (or at least benefit from a tracking approach due to aliasing caused by function approximation) and benefit from continual exploration. Further, unlike for fixed features, it is unclear how to set and maintain initial values at $R_{\max}/(1 - \gamma_c)$ for learned features, such as with neural networks. Optimistic initialization is also not straightforward for algorithms like LSTD, which completely overwrite the estimate $\mathbf{w}$ on each step with a closed-form solution. In fact, we have found that this issue with LSTD has been obfuscated, because the regularizer has inadvertently played a role in providing optimism (see Appendix A). Rather, to use optimism in LSTD for control, we need to explicitly compute upper confidence bounds.

Confidence intervals around action-values, then, provide another mechanism for exploration in reinforcement learning. Consider action selection with explicit confidence intervals around mean estimates $\hat{Q}(S_t, a)$, with estimated radius $\hat{U}(S_t, a)$. The action selection is greedy with respect to these optimistic values, $\arg\max_a \hat{Q}(S_t, a) + \hat{U}(S_t, a)$, each of which provides a high-confidence upper bound on the best possible value for that action. The use of upper confidence bounds on value estimates for exploration has been well-studied and motivated theoretically in online learning (Chu et al., 2011). In reinforcement learning, there have only been a few specialized proofs for particular algorithms using optimistic estimates (Grande et al., 2014; Osband et al., 2016b), but the result can be expressed more generally by using the idea of stochastic optimism. We extract the central argument by Osband et al. (2016b) to provide a general Optimistic Values Theorem in Appendix B. In particular, similar to online learning, we can guarantee that greedy-action selection according to upper confidence values will converge to the optimal policy if three conditions hold: the confidence interval radius shrinks to zero; the algorithm used to estimate action-values for a policy converges to the corresponding action-values; and the upper confidence estimates are stochastically optimistic, meaning that they remain above the optimal action-values in expectation.
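The optimistic action selection described above can be illustrated with a short sketch. It assumes the mean estimates and confidence radii for the current state are already available as plain lists indexed by action; the function name and the numbers are hypothetical.

```python
def select_ucb_action(mean_values, radii):
    """Return the index of the action with the largest upper confidence value.

    `mean_values[a]` is the estimated action-value and `radii[a]` the
    estimated confidence-interval radius for action `a`.
    """
    upper = [q + u for q, u in zip(mean_values, radii)]
    # max over indices breaks ties in favour of the lowest action index.
    return max(range(len(upper)), key=lambda a: upper[a])


# The second action has a lower mean but a wider interval, so it is selected.
chosen = select_ucb_action([1.0, 0.8, 0.2], [0.1, 0.5, 0.3])
```

When every radius is zero the rule reduces to ordinary greedy selection on the mean estimates, which is why the radii must shrink for the policy to settle.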

Motivated by this result, we pursue principled ways to compute upper confidence bounds for the general, online reinforcement learning setting. We make a step towards computing such values incrementally, under function approximation, by providing upper confidence bounds for value estimates made by LSTD, for a fixed policy. We approximate these bounds to create a new algorithm for control—called Upper-Confidence-Least-Squares (UCLS).

3 Estimating Upper Confidence Bounds for Policy Evaluation using LSTD

Consider the goal of obtaining a confidence interval around value estimates learned incrementally by LSTD for a fixed policy $\pi$. The value estimate is $\mathbf{x}^\top \mathbf{w}_t$ for state-action features $\mathbf{x}$ for the current state and action. We would like to guarantee, with probability $1 - p$ for a small $p > 0$, that the confidence interval around this estimate contains the value $\mathbf{x}^\top \mathbf{w}^*$ given by the optimal $\mathbf{w}^*$. To estimate such an interval without parametric assumptions, we use Chebyshev’s inequality which, unlike other concentration inequalities like Hoeffding or Bernstein, does not require independent samples.
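As a rough illustration of how Chebyshev's inequality turns a variance into an interval radius, the sketch below computes the radius `sqrt(variance / p)` that holds with probability at least `1 - p`, then checks the coverage empirically on a skewed, non-Gaussian distribution. The exponential distribution, the seed and the helper names are assumptions made for the example; the samples here happen to be independent, although the inequality itself does not need that.

```python
import math
import random


def chebyshev_radius(variance, p):
    """Radius r such that P(|X - E[X]| >= r) <= p by Chebyshev's inequality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.sqrt(variance / p)


def empirical_coverage(samples, mean, radius):
    """Fraction of samples that fall strictly inside mean +/- radius."""
    inside = sum(1 for s in samples if abs(s - mean) < radius)
    return inside / len(samples)


rng = random.Random(0)
# Exponential(1) has mean 1 and variance 1, and is far from Gaussian.
samples = [rng.expovariate(1.0) for _ in range(10000)]
radius = chebyshev_radius(1.0, 0.05)
coverage = empirical_coverage(samples, 1.0, radius)
```

The bound is distribution-free, so the observed coverage is typically well above the guaranteed `1 - p`; that looseness is the price of not making parametric assumptions.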

To use this inequality, we need to determine the variance of the estimate $\mathbf{x}^\top \mathbf{w}_t$; the variance of the estimate, given $\mathbf{x}$, is due to the variance of the weights. Let $\mathbf{w}^*$ be the fixed point solution for the projected Bellman operator for the $\lambda$-return (the TD fixed point), for a fixed policy $\pi$. To characterize the noise for this optimal estimator, let $\nu_t$ be the TD-error for the optimal weights $\mathbf{w}^*$, where

$$R_{t+1} = (\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1})^\top \mathbf{w}^* + \nu_t \quad \text{with} \quad \mathbb{E}[\nu_t \mathbf{z}_t] = \mathbf{0}. \tag{2}$$

The expectation is taken across all states weighted by the sampling distribution, typically the stationary distribution or in the off-policy case the stationary distribution of the behaviour policy. We know that $\mathbb{E}[\nu_t \mathbf{z}_t] = \mathbf{0}$, by the definition of the Projected Bellman Error fixed point.

This noise $\nu_t$ is incurred from the variability in the reward, the variability in the transition dynamics and potentially the capabilities of the function approximator. A common assumption, when using linear regression for contextual bandits (Li et al., 2010) and for reinforcement learning (Osband et al., 2016b), is that the variance of the target is a constant value $\sigma^2$ for all contexts $\mathbf{x}$. Such an assumption, however, is likely to produce larger confidence intervals than necessary. For example, consider a one-state world with two actions, where one action has a high variance reward and the other has a lower variance reward (see Appendix A, Figure 4). A global sample variance will encourage both actions to be taken many times. For data-efficient exploration, however, the agent should take the high-variance action more, and only needs a few samples from the low-variance action.
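The one-state, two-action example can be made concrete with a small sketch that compares per-action (context-dependent) variance with a single pooled (global) variance of the noise. The reward samples and helper names below are made up for illustration and are not taken from the paper's appendix.

```python
def mean(values):
    return sum(values) / len(values)


def per_action_variance(rewards_by_action):
    """Variance of the reward noise estimated separately for each action."""
    result = {}
    for action, rewards in rewards_by_action.items():
        m = mean(rewards)
        result[action] = mean([(r - m) ** 2 for r in rewards])
    return result


def pooled_variance(rewards_by_action):
    """One global variance: residuals around each action's mean, pooled."""
    squared_residuals = []
    for rewards in rewards_by_action.values():
        m = mean(rewards)
        squared_residuals.extend((r - m) ** 2 for r in rewards)
    return mean(squared_residuals)


# Hypothetical samples: both actions have mean reward 1, but very
# different spread.
rewards_by_action = {
    "low_variance": [0.9, 1.1, 0.9, 1.1],
    "high_variance": [-2.0, 4.0, -2.0, 4.0],
}
local = per_action_variance(rewards_by_action)
pooled = pooled_variance(rewards_by_action)
```

Because the pooled estimate is dominated by the noisy action, a confidence radius built from it stays wide for the low-variance action as well, which is what drives the unnecessary extra samples of that action.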

We derive a confidence interval for LSTD, in Theorem 1. We also derive the confidence interval assuming a global variance in Corollary 1, to provide a comparison. We compare to using this global-variance upper confidence bound in our experiments, and show that it results in significantly worse performance than using a context-dependent variance. Note that we do not assume $\mathbf{A}_t$ is invertible; if we did, the big-O term in the bound below would disappear. We include this term for precision of the result, even though we will not estimate it, because for smaller $t$, $\mathbf{A}_t$ is unlikely to be invertible. However, we expect this big-O term to get small quickly, and be dominated by the other terms. In our algorithm, therefore, we ignore the big-O term.

Theorem 1.

Let and where is the pseudoinverse of . Let reflect the degree to which is not invertible; it is zero when is invertible. Assume that the following are all finite: , and all state-action features . With probability at least , given state-action features ,


Proof:  First we compute the mean and variance for our learned parameters. Because ,

This estimate has a small amount of bias, that vanishes asymptotically. But, for a finite sample,

Further, because may not be invertible, there is an additional error term which will vanish with enough samples, i.e., once can be guaranteed to be invertible.

For covariance, because

the covariance of the weights is

The goal for computing variances is to use a concentration inequality. Chebyshev’s inequality states that for a random variable $X$, if $\mathbb{E}[X]$ and $\mathbb{V}[X]$ are bounded, then for any $\epsilon > 0$:

$$\Pr\left(|X - \mathbb{E}[X]| \ge \epsilon\right) \le \frac{\mathbb{V}[X]}{\epsilon^2}.$$

If we set $\epsilon = \sqrt{\mathbb{V}[X]/p}$, then this gives

$$\Pr\left(|X - \mathbb{E}[X]| \ge \sqrt{\mathbb{V}[X]/p}\right) \le p.$$

Footnote 1: Bernstein’s inequality cannot be used here because we do not have independent samples. Rather, we characterize the behaviour of the random variable of interest using its variance, but cannot use bounds that assume it is the sum of independent random variables. The bound with Chebyshev will be loose, but we can better control the looseness of the bound with the selection of $p$ and the constant in front of the square root.

Now we have characterized the variance of the weights, but what we really want is to characterize the variance of the value estimates. Notice that the variance of the value-estimate, for state-action is

Therefore, the variance of the estimate is characterized by the variance of the weights. With high probability,


where Equation 4 uses Chebyshev’s inequality, and the last step is a rewriting of Equation 4 using the definitions and .

To simplify (5), we need to determine an upper bound for the general formula where . Because , we know that . Therefore, the extremal points for , and , both result in an upper bound of . Taking the derivative of the objective, gives a single stationary point in-between , with . The value at this point evaluates to be . Therefore, this objective is upper-bounded by .

Now for , the term involving should quickly disappear, since it is only due to the potential lack of invertibility of . This term is equal to , which results in the additional in the bound.

Corollary 1.

Assume that are i.i.d., with mean zero and bounded variance . Let and assume that the following are finite: , , and all state-action features . With probability at least , given state-action features ,


Proof:  The result follows similarly to above, with some simplifications due to global-variance:

4 UCLS: Estimating upper confidence bounds for LSTD in control

In this section, we present Upper Confidence Least Squares (UCLS), a control algorithm, which incrementally estimates the upper confidence bounds provided in Theorem 1, for guiding on-policy exploration. The upper confidence bounds are sound without requiring i.i.d. assumptions; however, they are derived for a fixed policy. In control, the policy is slowly changing, and so instead we will be slowly tracking this upper bound. The general strategy, like policy iteration, is to slowly estimate both the value estimates and the upper confidence bounds, under a changing policy that acts greedily with respect to the upper confidence bounds. Tracking these upper bounds incurs some approximations; we identify and address potential issues here. The complete pseudocode for UCLS is given in the Appendix (Algorithm 2).

Footnote 2: We do not characterize the regret of UCLS, and instead, similarly to policy iteration, rely on a sound update under a fixed policy to motivate incrementally estimating these values as if the policy is fixed and then acting according to them. The only model-free algorithm that achieves a regret bound is RLSVI, but that bound is restricted to the finite horizon, batch, tabular setting. It would be a substantial breakthrough to provide such a regret bound, and is beyond the scope of this work.

First, we are not evaluating one fixed policy; rather, the policy is changing. The estimates $\mathbf{A}_t$ and $\mathbf{b}_t$ will therefore be out-of-date. As is common for LSTD with control, we use an exponential moving average, rather than a sample average, to estimate $\mathbf{A}_t$, $\mathbf{b}_t$ and the upper confidence bound. The exponential moving average uses $\mathbf{A}_t = \mathbf{A}_{t-1} + \beta\left(\mathbf{z}_t (\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1})^\top - \mathbf{A}_{t-1}\right)$, for some $\beta \in [0, 1]$. If $\beta = 1/t$, then this reduces to the standard sample average; otherwise, for a fixed $\beta$, more recent samples have a higher weight in the average. Because an exponential average is unbiased, the result in Theorem 1 would still hold, and in practice the update will be more effective for the control setting.
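The relationship between the two averaging schemes can be checked with a scalar sketch: the same incremental update gives the sample average when the step size is `1/t`, and an exponential moving average when the step size is fixed. The sample values and the fixed step size of 0.5 are arbitrary illustrative choices, not settings from the paper.

```python
def moving_average(samples, step_size):
    """Incrementally average `samples`; `step_size(t)` gives the weight at step t."""
    estimate = 0.0
    for t, sample in enumerate(samples, start=1):
        beta = step_size(t)
        # Move the estimate a fraction beta of the way towards the new sample.
        estimate += beta * (sample - estimate)
    return estimate


samples = [1.0, 2.0, 3.0, 10.0]
# A step size of 1/t reproduces the ordinary sample mean.
sample_avg = moving_average(samples, lambda t: 1.0 / t)
# A fixed step size weights the most recent sample (10.0) more heavily.
ema = moving_average(samples, lambda t: 0.5)
```

With a fixed step size the starting value of zero is never fully forgotten, only discounted geometrically, which is one reason the initialization of tracked quantities matters in the control setting.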

Second, we cannot obtain samples of the noise $\nu_t$, which is the TD-error for the optimal value function parameters $\mathbf{w}^*$ (see Equation (2)). Instead, we use $\delta_t$ as a proxy. This proxy results in an upper bound that is too conservative (too loose), because $\delta_t$ is likely to be larger than $\nu_t$. This is likely to ensure sufficient exploration, but may cause more exploration than is needed. The moving average update


should also help mitigate this issue, as older are likely larger than more recent ones.

Third, the covariance matrix estimating

could underestimate covariances, depending on a skewed distribution over states and depending on the initialization. This is particularly true in early learning, where the distribution over states is skewed to be higher near the start state; a sample average can result in underestimates in as yet unvisited parts of the space. To see why, let

. The covariance estimate corresponds to feature and . The agent begins in a certain region of the space, and so features that only become active outside of this region will be zero, providing samples . As a result, the covariance is artificially driven down in unvisited regions of the space, because the covariance accumulates updates of 0. Further, if the initialization to the covariance is an underestimate, a visited state with high variance will artificially look more optimistic than an unvisited state.

We propose two simple approaches to this issue: updating based on locality and adaptively adjusting the initialization to . Each covariance estimate for features and should only be updated if the sampled outer-product is relevant, with the agent in the region where and are active. To reflect this locality, each is updated with the only if the eligibility traces is non-zero for and . To adaptively update the initialization, the maximum observed is stored, as , and the initialization to each is retroactively updated using

where is the number of times has been updated. This update is equivalent to having initialized . We provide a more stable retroactive update to , in the pseudocode in Algorithm 2, that is equivalent to this update.

Fourth, to improve the computational complexity of the algorithm, we propose an alternative, incremental strategy for estimating $\mathbf{w}$, that takes advantage of the fact that we already need to estimate the inverse of $\mathbf{A}$ for the upper bound. In order to do so, we make use of the summarized information in $\mathbf{A}$ to improve the update, but avoid directly computing $\mathbf{A}^{-1}$ as it may be poorly conditioned. Instead, we maintain an approximation $\mathbf{B} \approx \mathbf{A}^{-1}$ that uses a simple gradient descent update, to minimize $\|\mathbf{A}\mathbf{B} - \mathbf{I}\|_F^2$. If $\mathbf{B}$ is the inverse of $\mathbf{A}$, then this loss is zero; otherwise, minimizing it provides an approximate inverse. This estimate is useful for two purposes in the algorithm. First, it is clearly needed to estimate the upper confidence bound. Second, it also provides a preconditioner for the iterative update $\mathbf{w} \leftarrow \mathbf{w} + \alpha \mathbf{B} (\mathbf{b} - \mathbf{A}\mathbf{w})$, for preconditioner $\mathbf{B}$. The optimal preconditioner is in fact the inverse of $\mathbf{A}$, if it exists. We use $\mathbf{B} + \eta \mathbf{I}$ for a small $\eta > 0$ to ensure that the preconditioner is full rank. Developing this stable update for LSTD required significant empirical investigation into alternatives; in addition to providing a more practical UCLS algorithm, we hope it can improve the use of LSTD in other applications.
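The idea of approximating a matrix inverse by gradient descent can be sketched as follows. This is an illustration under assumptions, not the update used in UCLS: the exact loss is not recoverable from the text here, so the sketch assumes a squared Frobenius norm of `A B - I`, uses a fixed matrix rather than one that changes over time, and picks the step size and iteration count by hand for this particular 2x2 example.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def approximate_inverse(A, step_size=0.3, num_steps=200):
    """Gradient descent on 0.5 * ||A B - I||_F^2 with respect to B.

    Converges when step_size is below 2 divided by the largest
    eigenvalue of A^T A.
    """
    n = len(A)
    B = [[0.0] * n for _ in range(n)]
    A_t = transpose(A)
    eye = identity(n)
    for _ in range(num_steps):
        AB = matmul(A, B)
        residual = [[AB[i][j] - eye[i][j] for j in range(n)] for i in range(n)]
        grad = matmul(A_t, residual)  # gradient of the loss with respect to B
        B = [[B[i][j] - step_size * grad[i][j] for j in range(n)]
             for i in range(n)]
    return B


A = [[2.0, 1.0], [0.0, 1.0]]
B = approximate_inverse(A)
```

Each iteration costs only matrix products, and stopping early still yields a usable approximate inverse, which is the property that makes this style of update attractive when the exact inverse is poorly conditioned.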

5 Experiments

We conducted several experiments to investigate the benefits of UCLS’s directed exploration against other methods that use confidence intervals for action selection, to evaluate sensitivity of UCLS’s performance with respect to its key parameter $p$, and to contrast the advantage contextual variance estimates offer over global variance estimates in control. Our experiments were intentionally conducted in small (though carefully selected) simulation domains so that we could conduct extensive parameter sweeps, hundreds of runs for averaging, and compare numerous state-of-the-art exploration algorithms (many of which are computationally expensive on larger domains). We believe that such experiments constitute a significant contribution, because effectively using confidence bounds for model-free exploration in RL is still in its infancy and not yet at the large-scale demonstration stage, with much work to be done. This point is highlighted nicely below as we demonstrate that several recently proposed exploration methods fail on these simple domains.

5.1 Algorithms

We compare UCLS to DGPQ (Grande et al., 2014), UCBootstrap (White and White, 2010), our extension of LSPI-Rmax to an incremental setting (Li et al., 2009) and RLSVI (Osband et al., 2016b). In-depth descriptions of each algorithm and implementation details can be found in the Appendix. These algorithms are chosen because they either keep confidence intervals explicitly, as in UCBootstrap, or implicitly as in DGPQ and RLSVI. In addition, we included LSPI-Rmax as a natural alternative approach to using LSTD to maintain optimistic value estimates.

We also include Sarsa with ϵ-greedy, with ϵ optimized over an extensive parameter sweep. Though ϵ-greedy is not a generally practical algorithm, particularly in larger worlds, we include it as a baseline. We do not include Sarsa with optimistic initialization, because even though it has been a common heuristic, it is not a general strategy for exploration. Optimistic initialization can converge to suboptimal solutions if initial optimism fades too quickly (White and White, 2010). Further, initialization only happens once, at the beginning of learning. If the world changes, then an agent relying on systematic exploration due to its initialization may not react, because it no longer explores. For completeness in comparing to previous work using optimistic initialization, we include such results in Appendix G.

Figure 1: A comparison of speed of learning in Sparse Mountain Car, Puddle World and River Swim. In plots (a) and (b) lower on y-axis are better, whereas in (c) curves higher along y-axis are better. Sparse Mountain Car and Puddle World are episodic problems with a fixed experience budget. Thus the length of the lines in plots (a) and (b) indicate how many episodes each algorithm completed over 50,000 steps, and the height on the y-axis indicates the quality of the learned policy—lower indicates better performance. Note RLSVI did not show significant learning after 50,000 steps. The RLSVI result in Puddle World uses a budget of 1 million.

5.2 Environments

Sparse Mountain Car is a version of classic mountain car problem Sutton and Barto (1998), only differing in the reward structure. The agent only receives a reward of at the goal and otherwise, and a discounted, episodic of . The start state is sampled from the range with velocity zero. This domain is used to highlight how exploration techniques perform when the reward signal is sparse, and thus initializing the value function to zero is not optimistic.

Puddle World is a continuous state 2-dimensional world with with 2 puddles: (1) to , and (2) to - with radius 0.1 and the goal is the region . The agent receives a reward of on each time step, where denotes the distance between the agent’s position and the center of the puddle, and an undiscounted, episodic of . The agent can select an action to move , . The agent’s initial state is uniformly sampled from . This domain highlights a common difficulty for traditional exploration methods: high magnitude negative rewards, which often cause the agent to erroneously decrease its value estimates too quickly.

River Swim is a standard continuing exploration benchmark Szita and Lorincz (2008) inspired by a fish trying to swim upriver, with high reward (+1) upstream which is difficult to reach and, a lower but still positive reward (+0.005), which is easily reachable downstream. We extended this domain to continuous states in , with a stochastic displacement of when taking an action up or down, with low-probability of success for up. The starting position is sampled uniformly in , and .

5.3 Experimental Setup

We investigate a learning regime where the agents are allowed a fixed budget of interaction steps with the environment, rather than allowing a finite number of episodes of unlimited length. Our primary concern is early learning performance, thus each experiment is restricted to 50,000 steps, with an episode cutoff (in Sparse Mountain Car and Puddle World) at 10,000 steps. In this regime, an agent that spends a significant time exploring the world during the first episode may not be able to complete many episodes; the cutoff makes exploration easier given the strict budget on experience. In contrast, in the more common framework of allowing a fixed number of episodes, an agent can consume many steps during the first few episodes exploring, which is difficult to detect in the final performance results. We average over 100 runs in River Swim and 200 runs for the other domains. For all the algorithms that utilize eligibility traces we set $\lambda$ to be 0.9. For algorithms which use exponential averaging, $\beta$ is set to 0.001, and the regularizer is set to be 0.0001. The parameters for UCLS are fixed. RLSVI’s weights are recalculated using all experienced transitions at the beginning of an episode in Puddle World and Sparse Mountain Car, and every 5,000 steps in River Swim. The parameters of competitors, where necessary, are selected as the best from a large parameter sweep.

All the algorithms except DGPQ use the same representation: (1) Sparse Mountain Car - 8 tilings of 8x8, hashed to a memory space of 512, (2) River Swim - 4 tilings of granularity 32, hashed to a memory space of 128, and (3) Puddle World - 5 tilings of granularity 5x5, hashed to a memory space of 128. DGPQ uses its own kernel-based representation with normalized state information.

5.4 Results & Analysis

Our first experiment simply compares UCLS against other control algorithms in all the domains. Figure 1 shows the early learning results across all three domains. In all three domains UCLS achieves the best final performance. In Sparse Mountain Car, UCLS learns faster than the other methods, while in River Swim DGPQ learns faster initially. UCBootstrap and UCLS learn at a similar rate in Puddle World, which is a cost-to-goal domain. UCBootstrap, and bootstrapping approaches generally, can suffer from insufficient optimism, as they rely on sufficiently optimistic or diverse initialization strategies (White and White, 2010; Osband et al., 2016a). LSPI-Rmax and RLSVI do not perform well in any of the domains. DGPQ does not perform as well as UCLS in Puddle World, and exhibits high variance compared with the other methods. In Puddle World, UCLS goes on to finish 1200 episodes in the allotted budget of steps, whereas in River Swim both UCLS and DGPQ get close to the optimal policy by the end of the experiment.

The DGPQ algorithm uses the maximum reward (Rmax) to initialize the Gaussian processes. In Sparse Mountain Car, this effectively converts the problem back into the traditional -1 per-step formulation. In this traditional variant of Mountain Car, UCLS significantly outperforms DGPQ (Appendix G). Sarsa with ε-greedy learns well in Puddle World, as it is a cost-to-goal problem in which Sarsa uses optimistic initialization by default; it is therefore reported in the Appendix.

Figure 2: The effect of the confidence parameter on the policy, in River Swim, using context-dependent variance (UCLS) and global variance (GV-UCB). The values for are .

Next we investigated the impact of the confidence level on the performance of UCLS in River Swim. The confidence interval radius grows as the confidence parameter shrinks, so a smaller value should correspond to a higher rate of exploration. In Figure 2, smaller values resulted in a slower convergence rate, but all values eventually reach the optimal policy.

Finally, we investigate the benefit of using contextual variance estimates over global variance estimates within UCLS. In Figure 2, we also show the effect of various values of the confidence parameter on the performance of the algorithm resulting from Corollary 1, which we call Global Variance-UCB (GV-UCB) (see Appendix E.1 for more details about this algorithm). For this range of values, UCLS still converges to the optimal policy, albeit at different rates. Using a global variance estimate (GV-UCB), on the other hand, results in significant over-estimates of variance, and therefore in poor performance.

6 Conclusion and Discussion

This paper develops a sound upper confidence bound on the value estimates for least-squares temporal difference learning (LSTD), without making i.i.d. assumptions about noise distributions. In particular, we allow for context-dependent noise, where variability could be due to noise in rewards, transition dynamics or even limitations of the function approximator. We then introduce an algorithm, called UCLS, that estimates these upper confidence bounds incrementally, for policy iteration. We demonstrate empirically that UCLS requires far fewer exploration steps to find high-quality policies compared to several baselines, across domains chosen to highlight different exploration difficulties.

The goal of this paper is to provide an incremental, model-free, data-efficient, directed exploration strategy. The upper confidence bounds for action-values for fixed policies are one of the few available under function approximation, and so a step towards exploration with optimistic values in the general case. A next step is to theoretically show that using these upper bounds for exploration ensures stochastic optimism, and so converges to optimal policies.

One promising aspect of UCLS is that it uses least-squares to efficiently summarize past experience, but is not tied to a specific state representation. Though we considered a fixed representation for UCLS, it is feasible that an analysis for the non-stationary case could also be used for the setting where the representation is being adapted over time. If the representation drifts slowly, then UCLS may be able to similarly track the upper confidence bounds. Recent work has shown that combining deep Q-learning with least-squares can result in significant performance gains over vanilla DQN (Levine et al., 2017). We expect that combining deep networks and UCLS could result in even larger gains, and that this is a natural direction for future work.

7 Acknowledgements

We would like to thank Bernardo Ávila Pires and Jian Qian for their helpful comments.


References

  • Abbasi-Yadkori and Szepesvari [2014] Y. Abbasi-Yadkori and C. Szepesvari. Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm. In Uncertainty in Artificial Intelligence, 2014.
  • Auer and Ortner [2006] P. Auer and R. Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. Advances in Neural Information Processing Systems, 2006.
  • Bartlett and Tewari [2009] P. L. Bartlett and A. Tewari. REGAL - A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs. In Conference on Uncertainty in Artificial Intelligence, 2009.
  • Boyan [2002] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.
  • Bradtke and Barto [1996] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
  • Brafman and Tennenholtz [2003] R. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 2003.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual Bandits with Linear Payoff Functions. In International Conference on Artificial Intelligence and Statistics, 2011.
  • Grande et al. [2014] R. Grande, T. Walsh, and J. How. Sample Efficient Reinforcement Learning with Gaussian Processes. International Conference on Machine Learning, 2014.
  • Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research, 2010.
  • Jong and Stone [2007] N. Jong and P. Stone. Model-based exploration in continuous state spaces. Abstraction, Reformulation, and Approximation, 2007.
  • Jung and Stone [2010] T. Jung and P. Stone. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In Machine Learning: ECML PKDD, 2010.
  • Kaelbling [1993] L. P. Kaelbling. Learning in embedded systems. MIT Press, 1993.
  • Kaelbling et al. [1996] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.
  • Kakade et al. [2003] S. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In International Conference on Machine Learning, 2003.
  • Kawaguchi [2016] K. Kawaguchi. Bounded Optimal Exploration in MDP. In AAAI Conference on Artificial Intelligence, 2016.
  • Kearns and Singh [2002] M. J. Kearns and S. P. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.
  • Lagoudakis and Parr [2003] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 2003.
  • Levine et al. [2017] N. Levine, T. Zahavy, D. J. Mankowitz, A. Tamar, and S. Mannor. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3138–3148, 2017.
  • Li et al. [2009] L. Li, M. Littman, and C. Mansley. Online exploration in least-squares policy iteration. In International Conference on Autonomous Agents and Multiagent Systems, 2009.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference, 2010.
  • Martin et al. [2017] J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter. Count-Based Exploration in Feature Space for Reinforcement Learning. In International Joint Conference on Artificial Intelligence, 2017.
  • Meuleau and Bourgine [1999] N. Meuleau and P. Bourgine. Exploration of Multi-State Environments - Local Measures and Back-Propagation of Uncertainty. Machine Learning, 1999.
  • Meyer [1973] C. D. Meyer, Jr. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.
  • Miller [1981] K. S. Miller. On the inverse of the sum of matrices. Mathematics Magazine, 54(2):67–72, 1981.
  • Moerland et al. [2017] T. M. Moerland, J. Broekens, and C. M. Jonker. Efficient exploration with Double Uncertain Value Networks. In Advances in Neural Information Processing Systems, 2017.
  • Nouri and Littman [2009] A. Nouri and M. L. Littman. Multi-resolution Exploration in Continuous Spaces. In Advances in Neural Information Processing Systems, 2009.
  • Ortner and Ryabko [2012] R. Ortner and D. Ryabko. Online Regret Bounds for Undiscounted Continuous Reinforcement Learning. In Advances in Neural Information Processing Systems, 2012.
  • Osband and Van Roy [2017] I. Osband and B. Van Roy. Why is Posterior Sampling Better than Optimism for Reinforcement Learning? In International Conference on Machine Learning, 2017.
  • Osband et al. [2013] I. Osband, D. Russo, and B. Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. In Advances in Neural Information Processing Systems, 2013.
  • Osband et al. [2016a] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016a.
  • Osband et al. [2016b] I. Osband, B. Van Roy, and Z. Wen. Generalization and Exploration via Randomized Value Functions. In International Conference on Machine Learning, 2016b.
  • Ostrovski et al. [2017] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-Based Exploration with Neural Density Models. In International Conference on Machine Learning, 2017.
  • Pazis and Parr [2013] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In AAAI Conference on Artificial Intelligence, 2013.
  • Plappert et al. [2017] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter Space Noise for Exploration. 2017.
  • Singh et al. [2000] S. P. Singh, T. S. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.
  • Strehl and Littman [2004] A. Strehl and M. Littman. Exploration via model based interval estimation. In International Conference on Machine Learning, 2004.
  • Strehl et al. [2006] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.
  • Sutton et al. [2008] R. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Conference on Uncertainty in Artificial Intelligence, 2008.
  • Sutton [1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
  • Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
  • Szepesvari [2010] C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
  • Szita and Lorincz [2008] I. Szita and A. Lorincz. The many faces of optimism. In International Conference on Machine Learning, 2008.
  • Szita and Szepesvari [2010] I. Szita and C. Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In International Conference on Machine Learning, 2010.
  • van Seijen and Sutton [2015] H. van Seijen and R. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, 2015.
  • White [2017] M. White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.
  • White and White [2010] M. White and A. White. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, 2010.
  • Wiering and Schmidhuber [1998] M. A. Wiering and J. Schmidhuber. Efficient Model-Based Exploration. In Simulation of Adaptive Behavior From Animals to Animats, 1998.

Appendix A Issues with LSTD for control

LSTD is a more data-efficient algorithm than its incremental counterpart TD, and typically performs quite well in policy evaluation. This is primarily because TD uses each sample only once, for a stochastic update with a tuned stepsize parameter. In the case of control, LSTD performs surprisingly well even without ε-greedy exploration or an explicit optimism strategy. We highlight here the inadvertent use of the regularization parameter as a form of optimism for LSTD in control, and empirically show when this strategy fails, leading us to UCLS as a sound approach to using LSTD in control.

In practice, the inverted matrix $A^{-1}$ is often directly maintained using a Sherman-Morrison update, with a small regularizer $\eta$ added to the matrix to guarantee invertibility [Szepesvari, 2010].
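The Sherman-Morrison update mentioned above can be sketched as follows. This is a generic rank-one inverse update written with NumPy, not the authors' implementation; `sherman_morrison_update` is a hypothetical name. For LSTD one would typically take `u` to be the eligibility trace and `v` the difference between the current and discounted next feature vectors, which is an assumption about the setup rather than something stated in this section.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(d^2) operations."""
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    denom = 1.0 + v @ Au           # must be non-zero for the update to be valid
    return A_inv - np.outer(Au, vA) / denom

# Usage sketch: start from A = eta * I, so that A^{-1} = I / eta.
eta = 1.0  # hypothetical regularizer value, chosen only for illustration
A_inv = np.eye(3) / eta
A_inv = sherman_morrison_update(A_inv, np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0]))
```

The update avoids the cubic cost of re-inverting the matrix after every sample, which is what keeps the per-step cost quadratic in the number of features.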

There are two objectives that can be solved when dealing with an ill-conditioned system $A w = b$. The most common is to use Tikhonov regularization, solving $(A^\top A + \eta I) w = A^\top b$ for a regularizer $\eta$, referred to here as LSTD-out.

Another approach is to solve the system $(A + \eta I) w = b$.

The second approach is implicitly what is solved when a Sherman-Morrison update is used for $A^{-1}$, with a small regularizer added to the matrix to guarantee invertibility. This approach is referred to here as LSTD-in. When $\eta = 0$, both approaches are solving $A w = b$, which may have infinitely many solutions if $A$ is not full rank. While the Tikhonov regularization strategy is more common, the second approach is useful because it enables the incremental Sherman-Morrison update, which maintains $A^{-1}$ directly.
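The difference between the two regularized systems can be illustrated with a short NumPy sketch. It assumes the standard forms $(A^\top A + \eta I) w = A^\top b$ for LSTD-out and $(A + \eta I) w = b$ for LSTD-in; the function names `lstd_out` and `lstd_in` are hypothetical.

```python
import numpy as np

def lstd_out(A, b, eta):
    """Tikhonov-regularized solution of A w = b: (A^T A + eta I) w = A^T b."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + eta * np.eye(d), A.T @ b)

def lstd_in(A, b, eta):
    """Solution of the shifted system (A + eta I) w = b."""
    d = A.shape[0]
    return np.linalg.solve(A + eta * np.eye(d), b)

# With eta = 0 and an invertible A, both reduce to solving A w = b.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 4.0])
w_out = lstd_out(A, b, 0.0)
w_in = lstd_in(A, b, 0.0)
```

For non-zero `eta` the two solutions generally differ, since one regularizes the normal equations and the other shifts the matrix itself.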

Another choice in regularizing the ill-conditioned system is in how the regularizer $\eta$ decays over time. A small fixed $\eta$ can be used as a constant regularizer, even as the number of samples increases, because the true $A$ may be ill-conditioned. However, more regularization could also be used at the beginning and then decayed over time. The incremental Sherman-Morrison update implicitly decays $\eta$ proportionally to $1/t$.

Figure 4: One-state world, where the optimal action (right) has high-variance; the reward here is uniformly sampled from within the set . LSTD, with and large, fails in this world, unlike the cost-to-goal problems.

We conducted an empirical study using LSTD without an ε-greedy exploration strategy in two domains: Mountain Car and a new One-State world. One-State world, depicted in Figure 4, simulates a typical setting where sufficient exploration is needed: one outcome with low variance and lower expected value, and one outcome with high variance and higher expected value. An algorithm that does not explore sufficiently is likely to settle on the suboptimal action, whose low-variance outcome is more immediately rewarding. This world simulates a larger continuous navigation task from White and White [2010]. We include results for both systems described above, and consider either a fading regularization parameter (shown by -F) or a constant one (shown by -C).

Figure 3 shows results for the four different LSTD strategies in Mountain Car. The Tikhonov regularization, with , is unable to learn an optimal policy in this domain, whereas with either a constant or fading regularizer, the agent can learn an optimal policy. This is surprising, considering we use neither randomized exploration nor optimistic initialization. The parameter sensitivity curve, shown in plot c, indicates and needs to be sufficiently large as time passes in order to find an optimal policy.

Next, we show that neither regularization strategy with a fading regularizer is effective in the One-State world. The optimal strategy is to take the Right action, to get an expected reward of under a higher variance for obtaining rewards. All of the LSTD variants fail for this domain, because the regularizer no longer plays a role in encouraging exploration. To verify that a directed exploration strategy helps, we experiment with ε-greedy exploration, with , decayed by a factor of every 100 steps (shown in Figure 5). With ε-greedy, and small values of and , the policy converges to the optimal action, whereas it fails to with higher values of and .
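The decaying ε-greedy schedule used in this experiment can be sketched as below. The initial ε and the decay factor are not recoverable from this section, so they are left as explicit arguments; only the 100-step decay interval comes from the text. The names `decayed_epsilon` and `epsilon_greedy` are hypothetical.

```python
import random

def decayed_epsilon(initial_epsilon, decay_factor, step, decay_every=100):
    """Epsilon after `step` steps, multiplied by `decay_factor` every `decay_every` steps."""
    return initial_epsilon * decay_factor ** (step // decay_every)

def epsilon_greedy(action_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=action_values.__getitem__)
```

For example, an agent could call `epsilon_greedy(q_values, decayed_epsilon(eps0, factor, t), random.Random(0))` at step `t`, where `q_values`, `eps0`, and `factor` are placeholders for the agent's own quantities.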

These results suggest that the regularizer's role in exploration has obscured our understanding of how to use LSTD for control. LSTD with sufficient optimism does seem to reach optimal solutions and, unlike Sutton et al. [2008], we did not find any issues with forgetting. This further explains why there have been previous results with small ε for LSTD in cost-to-goal problems that nonetheless still obtained the optimal policy [van Seijen and Sutton, 2015]. Therefore, in developing UCLS, we more explicitly add optimism to LSTD, and ensure the regularizer is used strictly as a regularization parameter (to ensure well-conditioned updates).

Figure 5: -sensitivity in 1-State world with various LSTD updates. Sarsa with optimistic initialization is used as a baseline. The y-axis represents percentage optimal behaviour, where optimal behaviour is choosing to go right, in 20k steps (averaged over 30 runs). Sarsa with optimistic initialization is highly sensitive to the step-size chosen. With other stepsizes (not shown in figure), it reduces its values too quickly, and fails a significant percentage of the time. The best stepsize is chosen here to show near-optimal performance is possible in the domain.

Appendix B Optimistic Values Theorem

The use of upper confidence bounds on value estimates for exploration has been well-studied and motivated theoretically in online learning [Chu et al., 2011]. For reinforcement learning, though, there are only specialized proofs for particular algorithms using optimistic estimates [Grande et al., 2014, Osband et al., 2016b]. To better motivate and appreciate the use of upper confidence bounds for reinforcement learning, we extract the key argument from Osband et al. [2016b], which uses the idea of stochastic optimism.

Under function approximation, it may not be possible to obtain the optimal policy exactly. Instead, our criterion is to obtain the optimal policy according to the following formulation, assuming greedy-action selection from action-values. Let be the action-values for the optimal policy, under the chosen density over states and actions


This optimization does not preclude being related to the trajectory of optimal policy, but generically allows specification of any density, such as one putting all weight on a set of start states or such as one that is uniform across states and actions to ensure optimality from any point in the space. The optimal policy in this setting is the policy that corresponds to acting greedily w.r.t. ; depending on the function space , this may only be an approximately optimal policy. The design of the agent is directed towards this goal, though we do not explicitly optimize this objective.

Let be the estimated action-values plus the confidence interval radius on time step , to get the estimated upper confidence bound which the agent uses to select actions. Let be the policy induced by greedy action selection on .
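As a small illustration of the action-selection rule just described, the sketch below adds a confidence radius to each estimated action-value and acts greedily on the result. The function name and the list-based interface are hypothetical; how the estimates and radii are computed is outside the scope of this sketch.

```python
def optimistic_greedy_action(estimates, radii):
    """Return the index of the action maximizing estimate + confidence radius."""
    upper_bounds = [q + r for q, r in zip(estimates, radii)]
    return max(range(len(upper_bounds)), key=upper_bounds.__getitem__)
```

An action with a lower estimate can still be selected when its radius is large, which is how uncertainty directs exploration.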

Assumption 1 (Stochastic Optimism).

At some point , the action-values at every step are stochastically optimistic: , with expectation according to a specified density .

Assumption 2 (Shrinking Confidence Interval Radius).

The confidence interval radius goes to zero: for some non-negative function with .

Assumption 3 (Convergent Action Values).

The estimated action-values approach the true action-values for policy : for some non-negative function with .

These assumptions are heavily dependent on the distribution utilized to evaluate the expectation. If the expectations are w.r.t. the stationary distribution induced by the optimal policy (), it is easy to see that they could be satisfied - as the density is non-zero only for the optimal state-action pairs. In contrast, if the density is a uniform density over the space, then these assumptions may not be satisfied.

Given the three key assumptions, the theorem below is straightforward to prove. However, these three conditions are fundamental, and do not imply each other. Therefore, this result highlights what would need to be shown, to obtain the Optimistic Values Theorem. For example, Assumption 1 and 2 do not imply Assumption 3, because the confidence interval radius could decrease to zero, and still be stochastically optimistic and an over-estimate of values that correspond to a suboptimal policy. Assumption 1 and 3 do not imply Assumption 2, because could converge to the policy corresponding to acting greedily w.r.t. , but may never fade away. Then, could still be stochastically optimistic, but the policy could be suboptimal because it is acting greedily according to inaccurate, inflated estimates of value .

Theorem 2 (Optimistic Values Theorem).

Under Assumptions 1, 2 and 3,

Proof:  Consider the regret across states and actions

because by Assumption 1. By Assumptions 2 and 3,

completing the proof.
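The argument can be written out as follows. The notation is assumed here for illustration, since the original symbols are not recoverable from this section: $Q^*$ denotes the optimal action-values, $\tilde{Q}_t$ the upper confidence bound, $\hat{Q}_t$ the estimated action-values, $Q^{\pi_t}$ the true action-values of the induced policy, $f$ and $g$ the non-negative functions from Assumptions 2 and 3, and $\mathbb{E}_d$ the expectation under the specified density. This is a sketch of the reasoning, not a verbatim reconstruction of the original equations.

```latex
\begin{align*}
\mathbb{E}_d\!\left[Q^*(S,A) - Q^{\pi_t}(S,A)\right]
 &= \mathbb{E}_d\!\left[Q^*(S,A) - \tilde{Q}_t(S,A)\right]
  + \mathbb{E}_d\!\left[\tilde{Q}_t(S,A) - Q^{\pi_t}(S,A)\right] \\
 &\le \mathbb{E}_d\!\left[\tilde{Q}_t(S,A) - Q^{\pi_t}(S,A)\right]
  && \text{(Assumption 1)} \\
 &= \mathbb{E}_d\!\left[\tilde{Q}_t(S,A) - \hat{Q}_t(S,A)\right]
  + \mathbb{E}_d\!\left[\hat{Q}_t(S,A) - Q^{\pi_t}(S,A)\right] \\
 &\le f(t) + g(t) \;\longrightarrow\; 0 \quad \text{as } t \to \infty
  && \text{(Assumptions 2 and 3)}
\end{align*}
```

The first inequality drops a term that is non-positive under stochastic optimism; the second bounds the remaining gap by the shrinking radius and the estimation error.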

This result is intentionally abstract, where the three assumptions could be satisfied in a variety of ways. These assumptions have been verified for one algorithm, called RLSVI, under a tabular setting using a finite-horizon specification [Osband et al., 2016b], which simplifies ensuring stochastic optimism (Assumption 1). We hypothesize that the last two assumptions could be addressed with a two-timescale analysis, with confidence interval radius updating more slowly than . This would reflect an iterative approach, where the optimistic values are essentially held fixed—such as is done in Delayed Q-learning [Grande et al., 2014]—and estimated, before then adjusting the optimistic values. The updates to , then, would be updated on a faster timescale, converging to , and the upper confidence radius updating on a slower timescale.

Algorithm 1 GetOptimisticAction() (pseudocode not recoverable from the source)

Algorithm 2 UCLS() (pseudocode largely not recoverable; surviving steps: obtain the initial state-action features for any action; repeatedly take an action and observe the outcome, adjusting the initialization when required; until the agent is done interacting with the environment)

Appendix C Estimating Upper Confidence Bounds for Policy Evaluation using linear TD

Recall that the TD update [Sutton, 1988] processes one sample at a time as to estimate the solution to the least-squares system in an incremental manner. This is feasible as the following holds: