1 Introduction
Exploration is crucial in reinforcement learning, as the data gathering process significantly impacts the optimality of the learned policies and values. The agent needs to balance the time spent taking exploratory actions to learn about the world against taking actions to maximize cumulative rewards. If the agent explores insufficiently, it could converge to a suboptimal policy; exploring too conservatively, however, results in many suboptimal decisions. The goal of the agent is data-efficient exploration: to minimize how many samples are wasted in exploration, particularly exploring parts of the world that are known, while still ensuring convergence to the optimal policy.
To achieve such a goal, directed exploration strategies are key. Undirected strategies, where random actions are taken such as in $\epsilon$-greedy, are a common default. In small domains these methods are guaranteed to find an optimal policy (Singh et al., 2000), because the agent is guaranteed to visit the entire space, but it may take many steps to do so, as undirected exploration can interfere with improving policies in incremental control. In this paper we explore the idea of constructing confidence intervals around the agent's value estimates. The agent can use these learned confidence intervals to select actions with the highest upper confidence bound, ensuring that the selected actions are either of high value or have highly uncertain values. This optimistic approach is promising for directed exploration, but as yet there are few such methods that are model-free, incremental and computationally efficient.
Directed exploration strategies have largely been explored under the framework of "optimism in the face of uncertainty" (Kaelbling et al., 1996). These can generally be categorized into count-based approaches and confidence-based approaches. Count-based approaches estimate the "knownness" of a state, typically by maintaining counts for finite state-spaces (Kearns and Singh, 2002; Brafman and Tennenholtz, 2003; Strehl and Littman, 2004; Strehl et al., 2006; Szita and Szepesvari, 2010) and extensions on counting for continuous states (Kakade et al., 2003; Jong and Stone, 2007; Nouri and Littman, 2009; Li et al., 2009; Pazis and Parr, 2013; Kawaguchi, 2016; Ostrovski et al., 2017; Martin et al., 2017). Confidence interval estimates, on the other hand, depend on variance of the target, not just on visitation frequency for states. Confidence-based approaches can be more data-efficient for exploration, because the agent can better direct exploration where the estimates are less accurate. The majority of confidence-based approaches compute confidence intervals on model parameters, both for finite state-spaces (Kaelbling, 1993; Wiering and Schmidhuber, 1998; Kearns and Singh, 2002; Brafman and Tennenholtz, 2003; Auer and Ortner, 2006; Bartlett and Tewari, 2009; Jaksch et al., 2010; Szita and Szepesvari, 2010; Osband et al., 2013) and continuous state-spaces (Jung and Stone, 2010; Ortner and Ryabko, 2012; Grande et al., 2014; Abbasi-Yadkori and Szepesvari, 2014; Osband and Van Roy, 2017). There is early work quantifying uncertainty for value estimates directly for finite state-spaces (Meuleau and Bourgine, 1999), describing the difficulties with extending the local measures of uncertainty from the bandit literature to RL, since there are long-term dependencies.
These difficulties suggest why using confidence intervals directly on value estimates for exploration in RL has been less explored, until recently. More approaches are now being developed that maintain confidence intervals on the value function for continuous state-spaces, by maintaining a distribution over value functions (Grande et al., 2014; Osband et al., 2016b), or by maintaining a randomized set of value functions from which to sample (White and White, 2010; Osband et al., 2016b, a; Plappert et al., 2017; Moerland et al., 2017). Though significant steps forward, these approaches have limitations, particularly in terms of computational efficiency. Delayed Gaussian Process Q-learning (DGPQ) (Grande et al., 2014) requires updating two Gaussian processes, which is cubic in the number of basis vectors for the Gaussian process. RLSVI (Osband et al., 2016b) is relatively efficient, maintaining a Gaussian distribution over parameters with Thompson sampling to get randomized values. Their staged approach for finite-horizon problems, however, does not allow for value estimates to be updated online, as the value function is fixed per episode to gather an entire trajectory of data. Moerland et al. (2017), on the other hand, sample a new parameter vector from the posterior distribution each time an action is considered, which is expensive. The bootstrapping approaches can be efficient, as they simply have to store several value functions, either for training on a bootstrapped subset of samples, such as in Bootstrapped DQN (Osband et al., 2016a), or for maintaining a moving bootstrap around the changing parameters themselves, for UCBootstrap (White and White, 2010). For both of these approaches, however, it is unclear how many value functions would be required, which could be large depending on the problem.
In this paper, we provide an incremental, model-free exploration algorithm with fast converging upper confidence bounds, called UCLS: Upper-Confidence Least-Squares. We derive the upper confidence bounds for Least-Squares Temporal Difference learning (LSTD), taking advantage of the fact that LSTD has an efficient summary of past interaction to facilitate computation of confidence intervals. Importantly, these upper confidence bounds have context-dependent variance, where variance is dependent on state rather than a global estimate, focusing exploration on states with higher variance. Computing confidence intervals for action-values in RL has remained an open problem, and we provide the first theoretically sound result for obtaining upper confidence bounds for policy evaluation under function approximation, without making strong assumptions on the noise. We demonstrate in several simulated domains that UCLS outperforms DGPQ, UCBootstrap, and RLSVI. We also empirically show the benefit of using UCLS over a simplified version that uses a global variance estimate, rather than context-dependent variance.
2 Background
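We focus on the problem of learning an optimal policy for a Markov decision process, from on-policy interaction. A Markov decision process consists of $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where $\mathcal{S}$ is the set of states; $\mathcal{A}$ is the set of actions; $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ provides the transition probabilities; $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function; and $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition-based discount function which enables either continuing or episodic problems to be specified (White, 2017). On each step, the agent selects action $A_t$ in state $S_t$, and transitions to $S_{t+1}$, according to $P$, receiving reward $R_{t+1} = R(S_t, A_t, S_{t+1})$ and discount $\gamma_{t+1} = \gamma(S_t, A_t, S_{t+1})$. For a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0,1]$, where $\sum_{a \in \mathcal{A}} \pi(s,a) = 1$ for all $s \in \mathcal{S}$, the value at a given state $s$, taking action $a$, is the expected discounted sum of future rewards, with actions selected according to $\pi$ into the future,
$$q^\pi(s,a) = \mathbb{E}\Big[ R_{t+1} + \gamma_{t+1} \sum_{a' \in \mathcal{A}} \pi(S_{t+1}, a')\, q^\pi(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big].$$
For problems in which $q^\pi$ can be stored in a table, a fixed point for the action-values exists for a given $\pi$. In most domains, $q^\pi$ must be approximated by $\hat{q}_w$, parametrized by $w \in \mathbb{R}^d$.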
In the case of linear function approximation, state-action features $x(s,a) \in \mathbb{R}^d$ are used to approximate action-values $\hat{q}_w(s,a) = x(s,a)^\top w$. The weights $w$ can be learned with a stochastic approximation algorithm, called temporal difference (TD) learning (Sutton, 1988). The TD update (Sutton, 1988) processes samples one at a time, $w_{t+1} = w_t + \alpha \delta_t z_t$, with $\delta_t = R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t$ for $x_t = x(S_t, A_t)$. The eligibility trace $z_t = x_t + \gamma_t \lambda z_{t-1}$ facilitates multi-step updates via an exponentially weighted memory of previous feature activations decayed by $\gamma_t$ and $\lambda$. Alternatively, we can directly compute the weight vector found by TD using least-squares temporal difference learning (LSTD) (Bradtke and Barto, 1996). The LSTD solution is more data-efficient, and can avoid the need to tune TD's step-size parameter $\alpha$. The LSTD update can be efficiently computed incrementally without approximation or storing the data (Bradtke and Barto, 1996; Boyan, 2002), by maintaining a matrix $A_t$ and vector $b_t$,
$$A_t = A_{t-1} + \frac{1}{t}\left( z_t (x_t - \gamma_{t+1} x_{t+1})^\top - A_{t-1} \right), \qquad b_t = b_{t-1} + \frac{1}{t}\left( z_t R_{t+1} - b_{t-1} \right). \qquad (1)$$
The value function approximation at time step $t$ is the weights $w_t$ that satisfy the linear system $A_t w_t = b_t$. In practice, the inverse of the matrix $A_t$ is maintained using a Sherman-Morrison update, with a small regularizer $\eta$ added to the matrix to guarantee invertibility (Szepesvari, 2010).
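As a minimal sketch of how the inverse can be maintained incrementally, the following Python/NumPy class applies a Sherman-Morrison rank-one update on each transition. The class name `IncrementalLSTD`, its argument names, and the choice to accumulate unnormalized sums rather than the running average are assumptions made for illustration; the unnormalized sums solve the same linear system up to the scaling of the regularizer. It is not the authors' implementation.

```python
import numpy as np


class IncrementalLSTD:
    """LSTD(lambda) with the matrix inverse maintained by Sherman-Morrison."""

    def __init__(self, n_features, lam=0.9, eta=1e-4):
        self.lam = lam
        # Inverse of the regularized starting matrix eta * I (assumed form).
        self.A_inv = np.eye(n_features) / eta
        self.b = np.zeros(n_features)
        self.z = np.zeros(n_features)  # eligibility trace

    def update(self, x, reward, x_next, gamma, gamma_next):
        """Process one transition and return the current weight vector.

        gamma is the discount entering the current state (decays the trace);
        gamma_next is the discount on the transition to the next state.
        """
        # Decay previous feature activations, then add the current features.
        self.z = gamma * self.lam * self.z + x
        d = x - gamma_next * x_next
        # Sherman-Morrison for the rank-one change A <- A + z d^T.
        # Assumes the denominator is not numerically zero.
        Az = self.A_inv @ self.z
        dA = d @ self.A_inv
        denom = 1.0 + d @ Az
        self.A_inv -= np.outer(Az, dA) / denom
        self.b += reward * self.z
        return self.A_inv @ self.b
```

Each update costs $O(d^2)$ in the number of features, compared to $O(d^3)$ for solving the linear system from scratch on every step.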
One approach to ensure systematic exploration is to initialize the agent's value estimates optimistically. The action-value function must be initialized to predict the maximum possible return (or greater) from each state and action. For example, for cost-to-goal problems, with a reward of $-1$ per step, the values can be initialized to zero. For continuing problems, with constant discount $\gamma_c$, the values can be initialized to $R_{\max}/(1-\gamma_c)$, if the maximum reward $R_{\max}$ is known. For fixed features that are non-negative and encode locality, such as tile coding or radial basis functions, the weights $w$ can be simply set to $R_{\max}/(1-\gamma_c)$, to make the estimate $\hat{q}$ optimistic.
More generally, however, it can be problematic to use optimistic initialization. Optimistic initialization assumes the beginning of time is special: a period when systematic exploration should be performed, after which the agent should more or less exploit its current knowledge. Many problems are non-stationary, or at least benefit from a tracking approach due to aliasing caused by function approximation, and benefit from continual exploration. Further, unlike for fixed features, it is unclear how to set and maintain initial values at $R_{\max}/(1-\gamma_c)$ for learned features, such as with neural networks. Optimistic initialization is also not straightforward for algorithms like LSTD, which completely overwrite the estimate $w$ on each step with a closed-form solution. In fact, we have found that this issue with LSTD has been obfuscated, because the regularizer $\eta$ has inadvertently played a role in providing optimism (see Appendix A). Rather, to use optimism in LSTD for control, we need to explicitly compute upper confidence bounds.
Confidence intervals around action-values, then, provide another mechanism for exploration in reinforcement learning. Consider action selection with explicit confidence intervals around mean estimates $\hat{q}(s,a)$, with estimated radius $\hat{U}(s,a)$. The action selection is greedy w.r.t. these optimistic values, $\arg\max_a \hat{q}(s,a) + \hat{U}(s,a)$, which provide a high-confidence upper bound on the best possible value for that action. The use of upper confidence bounds on value estimates for exploration has been well-studied and motivated theoretically in online learning (Chu et al., 2011). In reinforcement learning, there have only been a few specialized proofs for particular algorithms using optimistic estimates (Grande et al., 2014; Osband et al., 2016b), but the result can be expressed more generally by using the idea of stochastic optimism. We extract the central argument by Osband et al. (2016b) to provide a general Optimistic Values Theorem in Appendix B. In particular, similar to online learning, we can guarantee that greedy-action selection according to upper confidence values will converge to the optimal policy, if the confidence interval radius shrinks to zero, if the algorithm to estimate action-values for a policy converges to the corresponding action-values, and if upper confidence estimates are stochastically optimistic, that is, remain above the optimal action-values in expectation.
Motivated by this result, we pursue principled ways to compute upper confidence bounds for the general, online reinforcement learning setting. We make a step towards computing such values incrementally, under function approximation, by providing upper confidence bounds for value estimates made by LSTD, for a fixed policy. We approximate these bounds to create a new algorithm for control, called Upper-Confidence Least-Squares (UCLS).
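The optimistic action selection described in this section can be sketched in a few lines of Python. The function name `ucb_action` and the example numbers are hypothetical; the sketch assumes the mean estimates and confidence radii for the current state have already been computed by some other procedure.

```python
import numpy as np


def ucb_action(q_hat, radius):
    """Return the index of the action maximizing estimate + confidence radius."""
    q_hat = np.asarray(q_hat, dtype=float)
    radius = np.asarray(radius, dtype=float)
    return int(np.argmax(q_hat + radius))


# Action 0 has the higher mean estimate, but action 1 is far more uncertain,
# so the optimistic value of action 1 (0.8 + 0.5) exceeds that of action 0.
chosen = ucb_action(q_hat=[1.0, 0.8], radius=[0.05, 0.5])
```

As the radii shrink toward zero, this rule reduces to ordinary greedy action selection on the mean estimates.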
3 Estimating Upper Confidence Bounds for Policy Evaluation using LSTD
Consider the goal of obtaining a confidence interval around value estimates learned incrementally by LSTD for a fixed policy $\pi$. The value estimate is $x^\top w$ for state-action features $x$ for the current state and action. We would like to guarantee, with probability $1-p$ for a small $p > 0$, that the confidence interval around this estimate contains the value given by the optimal weights $w^*$. To estimate such an interval without parametric assumptions, we use Chebyshev's inequality which, unlike other concentration inequalities like Hoeffding or Bernstein, does not require independent samples.
To use this inequality, we need to determine the variance of the estimate; given the features $x$, this variance is due to the variance of the weights. Let $w^*$ be the fixed-point solution for the projected Bellman operator for the $\lambda$-return (the TD fixed point) for a fixed policy $\pi$. To characterize the noise for this optimal estimator, consider the TD-error for the optimal weights $w^*$, where
(2) 
The expectation is taken across all states weighted by the sampling distribution, typically the stationary distribution or, in the off-policy case, the stationary distribution of the behaviour policy. We know that , by the definition of the Projected Bellman Error fixed point.
This noise is incurred from the variability in the reward, the variability in the transition dynamics and potentially the capabilities of the function approximator. A common assumption, when using linear regression for contextual bandits (Li et al., 2010) and for reinforcement learning (Osband et al., 2016b), is that the variance of the target is a constant value for all contexts. Such an assumption, however, is likely to produce larger confidence intervals than necessary. For example, consider a one-state world with two actions, where one action has a high variance reward and the other has a lower variance reward (see Appendix A, Figure 4). A global sample variance will encourage both actions to be taken many times. For data-efficient exploration, however, the agent should take the high-variance action more, and only needs a few samples from the low-variance action.
We derive a confidence interval for LSTD, in Theorem 1. We also derive the confidence interval assuming a global variance in Corollary 1, to provide a comparison. We compare to using this global-variance upper confidence bound in our experiments, and show that it results in significantly worse performance than using a context-dependent variance. Note that we do not assume that the LSTD matrix $A_T$ is invertible; if we did, the big-O term in (3) below would disappear. We include this term for preciseness of the result, even though we will not estimate it, because for a small number of samples $T$, $A_T$ is unlikely to be invertible. However, we expect this big-O term to get small quickly, and be dominated by the other terms. In our algorithm, therefore, we ignore the big-O term.
Theorem 1.
Let and where is the pseudoinverse of . Let reflect the degree to which is not invertible; it is zero when is invertible. Assume that the following are all finite: , and all state-action features . With probability at least , given state-action features ,
(3) 
Proof: First we compute the mean and variance for our learned parameters. Because ,
This estimate has a small amount of bias, that vanishes asymptotically. But, for a finite sample,
Further, because may not be invertible, there is an additional error term which will vanish with enough samples, i.e., once can be guaranteed to be invertible.
For covariance, because
the covariance of the weights is
The goal for computing variances is to use a concentration inequality. Chebyshev's inequality (see footnote 1) states that for a random variable $X$, if $\mathbb{E}[X]$ and $\mathbb{V}[X]$ are bounded, then for any $\epsilon > 0$:
$$\Pr\left( |X - \mathbb{E}[X]| \ge \epsilon \right) \le \frac{\mathbb{V}[X]}{\epsilon^2}.$$
If we set $\epsilon = \sqrt{\mathbb{V}[X]/p}$, then this gives
$$\Pr\left( |X - \mathbb{E}[X]| < \sqrt{\mathbb{V}[X]/p} \right) \ge 1 - p.$$
Footnote 1: Bernstein's inequality cannot be used here because we do not have independent samples. Rather, we characterize the behaviour of the random variable $X$ using the variance of $X$, but cannot use bounds that assume $X$ is the sum of independent random variables. The bound with Chebyshev will be loose, but we can better control the looseness of the bound with the selection of $p$ and the constant in front of the square root.
Now we have characterized the variance of the weights, but what we really want is to characterize the variance of the value estimates. Notice that the variance of the value estimate, for state-action features $x$, is
Therefore, the variance of the estimate is characterized by the variance of the weights. With high probability,
(4)  
(5) 
where Equation 4 uses Chebyshev’s inequality, and the last step is a rewriting of Equation 4 using the definitions and .
To simplify (5), we need to determine an upper bound for the general formula where . Because , we know that . Therefore, the extremal points for , and , both result in an upper bound of . Taking the derivative of the objective gives a single stationary point in between , with . The value at this point evaluates to be . Therefore, this objective is upper-bounded by .
Now for , the term involving should quickly disappear, since it is only due to the potential lack of invertibility of . This term is equal to , which results in the additional in the bound.
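To make the Chebyshev step used in the proof concrete, the following Python sketch computes the radius $\sqrt{\mathbb{V}[X]/p}$ and checks it empirically on a skewed distribution. The function name `chebyshev_radius` and the choice of an Exponential(1) test distribution are assumptions for illustration only; the theorem's bound contains further terms that are not reproduced here.

```python
import numpy as np


def chebyshev_radius(variance, p):
    """Radius r such that Pr(|X - E[X]| >= r) <= p, by Chebyshev's inequality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return np.sqrt(variance / p)


# Empirical check on a non-Gaussian distribution: Exponential(1) has
# mean 1 and variance 1.
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=100_000)
p = 0.1
radius = chebyshev_radius(1.0, p)
# Fraction of samples falling outside the interval; Chebyshev guarantees
# that the true probability of this event is at most p.
violations = np.mean(np.abs(samples - 1.0) >= radius)
```

The observed violation rate is typically far below $p$, which reflects the looseness of Chebyshev's inequality mentioned in the footnote: the bound makes no distributional assumptions and is therefore conservative.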
Corollary 1.
Assume that are i.i.d., with mean zero and bounded variance . Let and assume that the following are finite: , , and all state-action features . With probability at least , given state-action features ,
(6) 
Proof: The result follows similarly to above, with some simplifications due to the global variance:
4 UCLS: Estimating upper confidence bounds for LSTD in control
In this section, we present Upper Confidence Least Squares (UCLS) (see footnote 2), a control algorithm, which incrementally estimates the upper confidence bounds provided in Theorem 1, for guiding on-policy exploration. The upper confidence bounds are sound without requiring i.i.d. assumptions; however, they are derived for a fixed policy. In control, the policy is slowly changing, and so instead we will be slowly tracking this upper bound. The general strategy, like policy iteration, is to slowly estimate both the value estimates and the upper confidence bounds, under a changing policy that acts greedily with respect to the upper confidence bounds. Tracking these upper bounds incurs some approximations; we identify and address potential issues here. The complete pseudocode for UCLS is given in the Appendix (Algorithm 2).
Footnote 2: We do not characterize the regret of UCLS, and instead, similarly to policy iteration, rely on a sound update under a fixed policy to motivate incrementally estimating these values as if the policy is fixed and then acting according to them. The only model-free algorithm that achieves a regret bound is RLSVI, but that bound is restricted to the finite horizon, batch, tabular setting. It would be a substantial breakthrough to provide such a regret bound, and is beyond the scope of this work.
First, we are not evaluating one fixed policy; rather, the policy is changing. The estimates $A$ and $b$ will therefore be out-of-date. As is common for LSTD with control, we use an exponential moving average, rather than a sample average, to estimate $A$, $b$ and the upper confidence bound. The exponential moving average uses $A_t = A_{t-1} + \beta\left( z_t (x_t - \gamma_{t+1} x_{t+1})^\top - A_{t-1} \right)$, for some $\beta \in [0,1]$. If $\beta = 1/t$, then this reduces to the standard sample average; otherwise, for a fixed $\beta$, more recent samples have a higher weight in the average. Because an exponential average is unbiased, the result in Theorem 1 would still hold, and in practice the update will be more effective for the control setting.
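The relationship between the two averages can be checked numerically. The sketch below applies an update of the form $A \leftarrow A + \beta (z d^\top - A)$, where `d` stands in for the feature difference in the LSTD matrix update; the function name `average_update` and the variable names are illustrative assumptions.

```python
import numpy as np


def average_update(A, z, d, beta):
    """One step of the running average A <- A + beta * (z d^T - A)."""
    return A + beta * (np.outer(z, d) - A)


rng = np.random.default_rng(1)
zs = rng.normal(size=(20, 3))
ds = rng.normal(size=(20, 3))

# With beta = 1/t the recursion reproduces the plain sample average.
A_sample = np.zeros((3, 3))
for t in range(1, 21):
    A_sample = average_update(A_sample, zs[t - 1], ds[t - 1], 1.0 / t)

# With a fixed beta, older outer products are discounted geometrically,
# so the estimate tracks the most recent samples.
A_tracking = np.zeros((3, 3))
for t in range(1, 21):
    A_tracking = average_update(A_tracking, zs[t - 1], ds[t - 1], 0.5)
```

With a fixed step the contribution of a sample that is $k$ steps old is scaled by $(1-\beta)^k$, which is what lets the estimate follow a slowly changing policy.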
Second, we cannot obtain samples of the noise, which is the TD-error for the optimal value function parameters (see Equation (2)). Instead, we use the TD-error under the current weights, $\delta_t$, as a proxy. This proxy results in an upper bound that is too conservative (too loose) because $\delta_t$ is likely to be larger than the noise under the optimal parameters. This is likely to ensure sufficient exploration, but may cause more exploration than is needed. The moving average update
(7) 
should also help mitigate this issue, as older TD-errors are likely larger than more recent ones.
Third, the covariance matrix estimate could underestimate covariances, depending on a skewed distribution over states and depending on the initialization. This is particularly true in early learning, where the distribution over states is skewed to be higher near the start state; a sample average can result in underestimates in as yet unvisited parts of the space. To see why, consider the covariance estimate for a pair of features $i$ and $j$. The agent begins in a certain region of the space, and so features that only become active outside of this region will be zero, providing samples equal to zero. As a result, the covariance is artificially driven down in unvisited regions of the space, because the covariance accumulates updates of 0. Further, if the initialization to the covariance is an underestimate, a visited state with high variance will artificially look more optimistic than an unvisited state.
We propose two simple approaches to this issue: updating based on locality and adaptively adjusting the initialization of the covariance. Each covariance estimate for features $i$ and $j$ should only be updated if the sampled outer product is relevant, with the agent in the region where $i$ and $j$ are active. To reflect this locality, each estimate is updated with the sampled outer product only if the eligibility trace is nonzero for both $i$ and $j$. To adaptively update the initialization, the maximum observed value is stored, and the initialization of each estimate is retroactively updated using
where is the number of times has been updated. This update is equivalent to having initialized . We provide a more stable retroactive update to , in the pseudocode in Algorithm 2, that is equivalent to this update.
Fourth, to improve the computational complexity of the algorithm, we propose an alternative, incremental strategy for estimating , that takes advantage of the fact that we already need to estimate the inverse of for the upper bound. In order to do so, we make use of the summarized information in to improve the update, but avoid directly computing as it may be poorly conditioned. Instead, we maintain an approximation that uses a simple gradient descent update, to minimize . If is the inverse of , then this loss is zero; otherwise, minimizing it provides an approximate inverse. This estimate is useful for two purposes in the algorithm. First, it is clearly needed to estimate the upper confidence bound. Second, it also provides a preconditioner for the iterative update , for preconditioner . The optimal preconditioner is in fact the inverse of , if it exists. We use for a small to ensure that the preconditioner is full rank. Developing this stable update for LSTD required significant empirical investigation into alternatives; in addition to providing a more practical UCLS algorithm, we hope it can improve the use of LSTD in other applications.
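The idea of maintaining an approximate inverse by gradient descent, and using it as a preconditioner, can be sketched as follows. The exact loss and update used by UCLS are not recoverable from the text above, so this sketch makes explicit assumptions: the loss is taken to be $\tfrac{1}{2}\|AB - I\|_F^2$, the iterative weight update is taken to be a preconditioned Richardson step, and the names `approx_inverse_step`, `preconditioned_step`, `B` and `eta` are hypothetical.

```python
import numpy as np


def approx_inverse_step(B, A, step_size):
    """One gradient step on 0.5 * ||A B - I||_F^2 with respect to B."""
    residual = A @ B - np.eye(A.shape[0])
    return B - step_size * (A.T @ residual)


def preconditioned_step(w, A, b, B, step_size=1.0, eta=1e-4):
    """One preconditioned step toward the solution of A w = b."""
    # Adding eta * I keeps the preconditioner full rank (assumed form).
    P = B + eta * np.eye(A.shape[0])
    return w + step_size * (P @ (b - A @ w))


A = np.array([[2.0, 1.0], [0.0, 1.5]])
b = np.array([1.0, 3.0])

# Step size below 2 / sigma_max(A)^2, since sigma_max(A) <= ||A||_F.
step = 1.0 / np.linalg.norm(A, "fro") ** 2
B = np.zeros_like(A)
for _ in range(500):
    B = approx_inverse_step(B, A, step)

w = np.zeros(2)
for _ in range(50):
    w = preconditioned_step(w, A, b, B)
```

When `B` is close to the true inverse, a single preconditioned step nearly solves the system; when `B` is still a rough approximation, the iteration degrades gracefully rather than relying on a possibly ill-conditioned explicit inverse.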
5 Experiments
We conducted several experiments to investigate the benefits of UCLS' directed exploration against other methods that use confidence intervals for action selection, to evaluate sensitivity of UCLS's performance with respect to its key parameter $p$, and to contrast the advantage contextual variance estimates offer over global variance estimates in control. Our experiments were intentionally conducted in small, though carefully selected, simulation domains so that we could conduct extensive parameter sweeps, hundreds of runs for averaging, and compare numerous state-of-the-art exploration algorithms (many of which are computationally expensive on larger domains). We believe that such experiments constitute a significant contribution, because effectively using confidence bounds for model-free exploration in RL is still in its infancy, not yet at the large-scale demonstration state, with much work to be done. This point is highlighted nicely below as we demonstrate that several recently proposed exploration methods fail on these simple domains.
5.1 Algorithms
We compare UCLS to DGPQ (Grande et al., 2014), UCBootstrap (White and White, 2010), our extension of LSPI-Rmax to an incremental setting (Li et al., 2009) and RLSVI (Osband et al., 2016b). In-depth descriptions of each algorithm and implementation details can be found in the Appendix. These algorithms are chosen because they either keep confidence intervals explicitly, as in UCBootstrap, or implicitly as in DGPQ and RLSVI. In addition, we included LSPI-Rmax as a natural alternative approach to using LSTD to maintain optimistic value estimates.
We also include Sarsa with $\epsilon$-greedy, with $\epsilon$ optimized over an extensive parameter sweep. Though $\epsilon$-greedy is not a generally practical algorithm, particularly in larger worlds, we include it as a baseline. We do not include Sarsa with optimistic initialization, because even though it has been a common heuristic, it is not a general strategy for exploration. Optimistic initialization can converge to suboptimal solutions if initial optimism fades too quickly (White and White, 2010). Further, initialization only happens once, at the beginning of learning. If the world changes, then an agent relying on systematic exploration due to its initialization may not react, because it no longer explores. For completeness in comparing to previous work using optimistic initialization, we include such results in Appendix G.
5.2 Environments
Sparse Mountain Car is a version of the classic mountain car problem (Sutton and Barto, 1998), only differing in the reward structure. The agent only receives a reward of at the goal and otherwise, and a discounted, episodic of . The start state is sampled from the range with velocity zero. This domain is used to highlight how exploration techniques perform when the reward signal is sparse, and thus initializing the value function to zero is not optimistic.
Puddle World is a continuous-state 2-dimensional world with 2 puddles: (1) to , and (2) to , with radius 0.1, and the goal is the region . The agent receives a reward of on each time step, where denotes the distance between the agent's position and the center of the puddle, and an undiscounted, episodic of . The agent can select an action to move , . The agent's initial state is uniformly sampled from . This domain highlights a common difficulty for traditional exploration methods: high magnitude negative rewards, which often cause the agent to erroneously decrease its value estimates too quickly.
River Swim is a standard continuing exploration benchmark (Szita and Lorincz, 2008) inspired by a fish trying to swim upriver, with a high reward (+1) upstream which is difficult to reach, and a lower but still positive reward (+0.005) which is easily reachable downstream. We extended this domain to continuous states in , with a stochastic displacement of when taking an action up or down, with low probability of success for up. The starting position is sampled uniformly in , and .
5.3 Experimental Setup
We investigate a learning regime where the agents are allowed a fixed budget of interaction steps with the environment, rather than allowing a finite number of episodes of unlimited length. Our primary concern is early learning performance, thus each experiment is restricted to 50,000 steps, with an episode cutoff (in Sparse Mountain Car and Puddle World) at 10,000 steps. In this regime, an agent that spends a significant time exploring the world during the first episode may not be able to complete many episodes; the cutoff makes exploration easier given the strict budget on experience. In contrast, in the more common framework of allowing a fixed number of episodes, an agent can consume many steps during the first few episodes exploring, which is difficult to detect in the final performance results. We average over 100 runs in River Swim and 200 runs for the other domains. For all the algorithms that utilize eligibility traces we set $\lambda$ to be 0.9. For algorithms which use exponential averaging, $\beta$ is set to 0.001, and the regularizer $\eta$ is set to be 0.0001. The parameters for UCLS are fixed. RLSVI's weights are recalculated using all experienced transitions at the beginning of an episode in Puddle World and Sparse Mountain Car, and every 5,000 steps in River Swim. The parameters of competitors, where necessary, are selected as the best from a large parameter sweep.
All the algorithms except DGPQ use the same representation: (1) Sparse Mountain Car: 8 tilings of 8x8, hashed to a memory space of 512, (2) River Swim: 4 tilings of granularity 32, hashed to a memory space of 128, and (3) Puddle World: 5 tilings of granularity 5x5, hashed to a memory space of 128. DGPQ uses its own kernel-based representation with normalized state information.
5.4 Results & Analysis
Our first experiment simply compares UCLS against other control algorithms in all the domains. Figure 1 shows the early learning results across all three domains. In all three domains UCLS achieves the best final performance. In Sparse Mountain Car, UCLS learns faster than the other methods, while in River Swim DGPQ learns faster initially. UCBootstrap and UCLS learn at a similar rate in Puddle World, which is a cost-to-goal domain. UCBootstrap, and bootstrapping approaches generally, can suffer from insufficient optimism, as they rely on sufficiently optimistic or diverse initialization strategies (White and White, 2010; Osband et al., 2016a). LSPI-Rmax and RLSVI do not perform well in any of the domains. DGPQ does not perform as well as UCLS in Puddle World, and exhibits high variance compared with the other methods. In Puddle World, UCLS goes on to finish 1200 episodes in the allotted budget of steps, whereas in River Swim both UCLS and DGPQ get close to the optimal policy by the end of the experiment.
The DGPQ algorithm uses the maximum reward (Rmax) to initialize the Gaussian processes. In Sparse Mountain Car this effectively converts the problem back into the traditional $-1$ per-step formulation. In this traditional variant of Mountain Car, UCLS significantly outperforms DGPQ (Appendix G). Sarsa with $\epsilon$-greedy learns well in Puddle World as it is a cost-to-goal problem in which by default Sarsa uses optimistic initialization, and therefore is reported in the Appendix.
Next we investigated the impact of the confidence level $p$ on the performance of UCLS in River Swim. The confidence interval radius grows as $p$ shrinks, so smaller $p$ should correspond to a higher rate of exploration. In Figure 2, smaller $p$ resulted in a slower convergence rate, but all values eventually reach the optimal policy.
Finally, we investigate the benefit of using contextual variance estimates over global variance estimates within UCLS. In Figure 2, we also show the effect of various $p$ values on the performance of the algorithm resulting from Corollary 1, which we call Global Variance-UCB (GV-UCB) (see Appendix E.1 for more details about this algorithm). For this range of $p$, UCLS still converges to the optimal policy, albeit at different rates. Using global variance estimates (GV-UCB), on the other hand, results in significant overestimates of variance, resulting in poor performance.
6 Conclusion and Discussion
This paper develops a sound upper confidence bound on the value estimates for least-squares temporal difference learning (LSTD), without making i.i.d. assumptions about noise distributions. In particular, we allow for context-dependent noise, where variability could be due to noise in rewards, transition dynamics or even limitations of the function approximator. We then introduce an algorithm, called UCLS, that estimates these upper confidence bounds incrementally, for policy iteration. We demonstrate empirically that UCLS requires far fewer exploration steps to find high-quality policies compared to several baselines, across domains chosen to highlight different exploration difficulties.
The goal of this paper is to provide an incremental, model-free, data-efficient, directed exploration strategy. The upper confidence bounds for action-values for fixed policies are one of the few available under function approximation, and so a step towards exploration with optimistic values in the general case. A next step is to theoretically show that using these upper bounds for exploration ensures stochastic optimism, and so converges to optimal policies.
One promising aspect of UCLS is that it uses least-squares to efficiently summarize past experience, but is not tied to a specific state representation. Though we considered a fixed representation for UCLS, it is feasible that an analysis for the non-stationary case could be used as well for the setting where the representation is being adapted over time. If the representation drifts slowly, then UCLS may be able to similarly track the upper confidence bounds. Recent work has shown that combining deep Q-learning with least-squares can result in significant performance gains over vanilla DQN (Levine et al., 2017). We expect that combining deep networks and UCLS could result in even larger gains, and is a natural direction for future work.
7 Acknowledgements
We would like to thank Bernardo Ávila Pires and Jian Qian for their helpful comments.
References

 Abbasi-Yadkori and Szepesvari [2014] Y. Abbasi-Yadkori and C. Szepesvari. Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm. In Uncertainty in Artificial Intelligence, 2014.
 Auer and Ortner [2006] P. Auer and R. Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. Advances in Neural Information Processing Systems, 2006.
 Bartlett and Tewari [2009] P. L. Bartlett and A. Tewari. REGAL - A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs. In Conference on Uncertainty in Artificial Intelligence, 2009.
 Boyan [2002] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.
 Bradtke and Barto [1996] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
 Brafman and Tennenholtz [2003] R. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 2003.
 Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual Bandits with Linear Payoff Functions. In International Conference on Artificial Intelligence and Statistics, 2011.
 Grande et al. [2014] R. Grande, T. Walsh, and J. How. Sample Efficient Reinforcement Learning with Gaussian Processes. International Conference on Machine Learning, 2014.
 Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research, 2010.
 Jong and Stone [2007] N. Jong and P. Stone. Model-based exploration in continuous state spaces. Abstraction, Reformulation, and Approximation, 2007.
 Jung and Stone [2010] T. Jung and P. Stone. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In Machine Learning: ECML PKDD, 2010.
 Kaelbling [1993] L. P. Kaelbling. Learning in embedded systems. MIT press, 1993.
 Kaelbling et al. [1996] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.
 Kakade et al. [2003] S. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In International Conference on Machine Learning, 2003.
 Kawaguchi [2016] K. Kawaguchi. Bounded Optimal Exploration in MDP. In AAAI Conference on Artificial Intelligence, 2016.
 Kearns and Singh [2002] M. J. Kearns and S. P. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.
 Lagoudakis and Parr [2003] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 2003.
 Levine et al. [2017] N. Levine, T. Zahavy, D. J. Mankowitz, A. Tamar, and S. Mannor. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3138–3148, 2017.
 Li et al. [2009] L. Li, M. Littman, and C. Mansley. Online exploration in least-squares policy iteration. In International Conference on Autonomous Agents and Multiagent Systems, 2009.
 Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference, 2010.
 Martin et al. [2017] J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter. Count-Based Exploration in Feature Space for Reinforcement Learning. In International Joint Conference on Artificial Intelligence, 2017.
 Meuleau and Bourgine [1999] N. Meuleau and P. Bourgine. Exploration of Multi-State Environments - Local Measures and Back-Propagation of Uncertainty. Machine Learning, 1999.
 Meyer [1973] C. D. Meyer, Jr. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.
 Miller [1981] K. S. Miller. On the inverse of the sum of matrices. Mathematics magazine, 54(2):67–72, 1981.
 Moerland et al. [2017] T. M. Moerland, J. Broekens, and C. M. Jonker. Efficient exploration with Double Uncertain Value Networks. In Advances in Neural Information Processing Systems, 2017.
 Nouri and Littman [2009] A. Nouri and M. L. Littman. Multiresolution Exploration in Continuous Spaces. In Advances in Neural Information Processing Systems, 2009.
 Ortner and Ryabko [2012] R. Ortner and D. Ryabko. Online Regret Bounds for Undiscounted Continuous Reinforcement Learning. In Advances in Neural Information Processing Systems, 2012.
 Osband and Van Roy [2017] I. Osband and B. Van Roy. Why is Posterior Sampling Better than Optimism for Reinforcement Learning? In International Conference on Machine Learning, 2017.
 Osband et al. [2013] I. Osband, D. Russo, and B. Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. In Advances in Neural Information Processing Systems, 2013.
 Osband et al. [2016a] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016a.
 Osband et al. [2016b] I. Osband, B. Van Roy, and Z. Wen. Generalization and Exploration via Randomized Value Functions. In International Conference on Machine Learning, 2016b.
 Ostrovski et al. [2017] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-Based Exploration with Neural Density Models. In International Conference on Machine Learning, 2017.
 Pazis and Parr [2013] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In AAAI Conference on Artificial Intelligence, 2013.
 Plappert et al. [2017] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter Space Noise for Exploration. arXiv.org, 2017.
 Singh et al. [2000] S. P. Singh, T. S. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.
 Strehl and Littman [2004] A. Strehl and M. Littman. Exploration via model based interval estimation. In International Conference on Machine Learning, 2004.
 Strehl et al. [2006] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.
 Sutton et al. [2008] R. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Conference on Uncertainty in Artificial Intelligence, 2008.
 Sutton [1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
 Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
 Szepesvari [2010] C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.
 Szita and Lorincz [2008] I. Szita and A. Lorincz. The many faces of optimism. In International Conference on Machine Learning, 2008.
 Szita and Szepesvari [2010] I. Szita and C. Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In International Conference on Machine Learning, 2010.
 van Seijen and Sutton [2015] H. van Seijen and R. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, 2015.
 White [2017] M. White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.
 White and White [2010] M. White and A. White. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, 2010.
 Wiering and Schmidhuber [1998] M. A. Wiering and J. Schmidhuber. Efficient Model-Based Exploration. In Simulation of Adaptive Behavior From Animals to Animats, 1998.
Appendix A Issues with LSTD for control
LSTD is a more data-efficient algorithm than its incremental counterpart TD, and typically performs quite well in policy evaluation. This is primarily because TD uses each sample only once, for a stochastic update with a tuned step-size parameter. In the case of control, LSTD performs surprisingly well even without $\epsilon$-greedy exploration or an explicit optimism strategy. We highlight here the inadvertent use of the regularization parameter as a form of optimism for LSTD in control, and empirically show when this strategy fails, leading us to UCLS as a sound approach to using LSTD in control.
In practice, the inverted matrix is often directly maintained using a Sherman-Morrison update, with a small regularizer added to the matrix to guarantee invertibility [Szepesvari, 2010]. There are two objectives that can be solved when dealing with an ill-conditioned system. The most common is to use Tikhonov regularization, referred to here as LSTDout. Another approach is to solve the system with the regularizer added directly to the matrix. This second approach is implicitly what is solved when a Sherman-Morrison update is used to maintain the inverse, and it is referred to here as LSTDin. When the regularizer is zero, both approaches solve the original system, which may have infinitely many solutions if the matrix is not full rank. While the Tikhonov regularization strategy is more common, the second approach is useful because it enables the incremental Sherman-Morrison update, which maintains the inverse directly.
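For concreteness, the two regularized systems can be sketched as follows. The notation here is an assumption made for illustration and is not necessarily that of the main text: $A$ and $b$ denote the LSTD matrix and vector, $w$ the weight vector, and $\eta \ge 0$ is a hypothetical name for the regularizer. The first line is the standard Tikhonov (ridge) form; the second adds the regularizer to the matrix itself, as in the description of LSTDin.

```latex
\begin{align*}
\text{LSTDout (Tikhonov):}\quad
  & \min_{w} \; \|A w - b\|_2^2 + \eta \|w\|_2^2
    \;\Longleftrightarrow\; (A^\top A + \eta I)\, w = A^\top b, \\
\text{LSTDin:}\quad
  & (A + \eta I)\, w = b .
\end{align*}
```

With $\eta = 0$, the second form is exactly $A w = b$ and the first reduces to its normal equations, which have the same solutions whenever $A w = b$ is solvable.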
Another choice in regularizing the ill-conditioned system is how the regularizer decays over time. A small fixed regularizer can be kept constant, even as the number of samples increases, because the true matrix may be ill-conditioned. However, more regularization could also be used at the beginning and then decayed over time. The incremental Sherman-Morrison update implicitly decays the influence of the regularizer as samples accumulate.
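The incremental inverse maintenance described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function names, the list-of-lists matrix representation and the symbols `eta` (regularizer) and `gamma` (discount) are all hypothetical, and the matrix is assumed to accumulate the usual LSTD rank-one terms x (x - gamma x')^T.

```python
def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A_inv = A^{-1}, without re-inverting."""
    n = len(A_inv)
    # Column vector A^{-1} u
    Au = [sum(A_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Row vector v^T A^{-1}
    vA = [sum(v[k] * A_inv[k][j] for k in range(n)) for j in range(n)]
    # Scalar 1 + v^T A^{-1} u; zero means the updated matrix is singular
    denom = 1.0 + sum(v[k] * Au[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("update would make the matrix singular")
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(n)]
            for i in range(n)]


def init_inverse(n, eta):
    """Inverse of the initial matrix eta * I (assumed initialization)."""
    return [[(1.0 / eta if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def lstd_in_step(A_inv, b, x, x_next, reward, gamma):
    """Process one sample: A += x (x - gamma x')^T and b += reward * x."""
    d = [xi - gamma * xn for xi, xn in zip(x, x_next)]
    A_inv = sherman_morrison_update(A_inv, x, d)
    b = [bi + reward * xi for bi, xi in zip(b, x)]
    return A_inv, b


def weights(A_inv, b):
    """Solve for the weights as w = A^{-1} b."""
    return [sum(row[k] * b[k] for k in range(len(b))) for row in A_inv]
```

In this sketch the initial term eta * I is never removed, but the accumulated sample terms grow with every step, so the relative weight of the regularizer shrinks over time; this is one way to read the implicit decay mentioned above.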
We conducted an empirical study using LSTD without an $\epsilon$-greedy exploration strategy in two domains: Mountain Car and a new OneState world. The OneState world, depicted in Figure 4, simulates a typical setting where sufficient exploration is needed: one outcome with low variance and lower expected value, and one outcome with high variance and higher expected value. An algorithm that does not explore sufficiently is likely to settle on the suboptimal action, which has the more immediately rewarding low-variance outcome. This world simulates a larger continuous navigation task from White and White [2010]. We include results for both systems described above and consider either a fading regularization parameter (shown by F) or a constant one (shown by C).
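The structure of such a one-state problem can be illustrated with a small sketch. The reward values and probabilities below are hypothetical placeholders chosen only to reproduce the qualitative property described above (a low-variance action with lower mean, a high-variance action with higher mean); they are not the values used in Figure 4.

```python
import random

# Hypothetical (probability, reward) outcomes per action; not the paper's values.
OUTCOMES = {
    "left": [(1.0, 0.1)],               # deterministic, small reward
    "right": [(0.9, 0.0), (0.1, 2.0)],  # usually nothing, rarely a large reward
}


def mean(action):
    """Expected reward of an action."""
    return sum(p * r for p, r in OUTCOMES[action])


def variance(action):
    """Variance of the reward of an action."""
    m = mean(action)
    return sum(p * (r - m) ** 2 for p, r in OUTCOMES[action])


def sample_reward(action, rng=random):
    """Draw one reward for the chosen action."""
    u = rng.random()
    cumulative = 0.0
    for p, r in OUTCOMES[action]:
        cumulative += p
        if u < cumulative:
            return r
    return OUTCOMES[action][-1][1]  # guard against floating-point round-off
```

With these placeholder numbers, the right action is better in expectation, yet most individual samples of it return nothing, so an agent that stops trying it after a few unlucky draws settles on the left action.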
Figure 3 shows results for the four different LSTD strategies in Mountain Car. The Tikhonov regularization, with , is unable to learn an optimal policy in this domain, whereas with either a constant or a fading regularizer, the agent can learn an optimal policy. This is surprising, considering we use neither randomized exploration nor optimistic initialization. The parameter sensitivity curve, shown in plot c, indicates and needs to be sufficiently large as time passes in order to find an optimal policy.
Next, we show that neither regularization strategy with a fading regularizer is effective in the OneState world. The optimal strategy is to take the Right action, to get an expected reward of under a higher variance for obtaining rewards. All of the LSTD variants fail for this domain, because the regularizer no longer plays a role in encouraging exploration. To verify that a directed exploration strategy helps, we experiment with $\epsilon$-greedy exploration, with , decayed by a factor of every 100 steps (shown in Figure 5). With $\epsilon$-greedy, and small values of and , the policy converges to the optimal action, whereas it fails to with higher values of and .
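The exploration schedule used in this check can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the initial exploration rate and decay factor are left as parameters because their values are not restated here; only the 100-step interval comes from the text.

```python
import random


def decayed_epsilon(epsilon0, decay, step, interval=100):
    """Exploration rate after multiplying by `decay` once every `interval` steps."""
    return epsilon0 * decay ** (step // interval)


def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    best = max(values)
    return values.index(best)  # first maximizer, so ties go to the lowest index
```

For example, `decayed_epsilon(0.5, 0.5, 250)` applies the decay twice, since two full 100-step intervals have elapsed by step 250.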
These results suggest that the regularizer's role in exploration has obscured our understanding of how to use LSTD for control. LSTD with sufficient optimism does seem to reach optimal solutions and, unlike Sutton et al. [2008], we did not find any issues with forgetting. This further explains why there have been previous results with small for LSTD in cost-to-goal problems, that nonetheless still obtained the optimal policy [van Seijen and Sutton, 2015]. Therefore, in developing UCLS, we add optimism to LSTD more explicitly, and ensure that the regularizer is strictly used as a regularization parameter (to ensure well-conditioned updates).
Appendix B Optimistic Values Theorem
The use of upper confidence bounds on value estimates for exploration has been well studied and motivated theoretically in online learning [Chu et al., 2011]. For reinforcement learning, though, there are only specialized proofs for particular algorithms using optimistic estimates [Grande et al., 2014, Osband et al., 2016b]. To better motivate and appreciate the use of upper confidence bounds for reinforcement learning, we extract the key argument from Osband et al. [2016b], which uses the idea of stochastic optimism.
Under function approximation, it may not be possible to obtain the optimal policy exactly. Instead, our criterion is to obtain the optimal policy according to the following formulation, assuming greedy-action selection from action-values. We consider the action-values for the optimal policy, under the chosen density over states and actions
(8) 
This optimization does not preclude the density being related to the trajectory of the optimal policy, but generically allows specification of any density, such as one putting all weight on a set of start states, or one that is uniform across states and actions to ensure optimality from any point in the space. The optimal policy in this setting is the policy that corresponds to acting greedily w.r.t. these optimal action-values; depending on the function space, this may only be an approximately optimal policy. The design of the agent is directed towards this goal, though we do not explicitly optimize this objective.
On each time step, the estimated action-values plus the confidence interval radius give the estimated upper confidence bound, which the agent uses to select actions. The agent's policy on that step is the one induced by greedy action selection on this upper confidence bound.
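The action selection rule just described can be sketched as below. This is a minimal illustration with hypothetical names: `estimates` holds the estimated action-values in the current state and `radii` the corresponding confidence interval radii, however they are obtained.

```python
def ucb_action(estimates, radii):
    """Return the index of the action with the highest upper confidence bound."""
    upper = [q + c for q, c in zip(estimates, radii)]
    # max over indices returns the first maximizer, so ties go to the lowest index
    return max(range(len(upper)), key=upper.__getitem__)
```

An action can therefore be chosen either because its estimated value is high or because its radius is large: `ucb_action([1.0, 0.8], [0.0, 0.5])` prefers the second action despite its lower estimate, while shrinking that radius to 0.1 returns the choice to the first.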
Assumption 1 (Stochastic Optimism).
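After some time step, the optimistic action-values at every subsequent step are stochastically optimistic: in expectation, they are at least as large as the optimal action-values, with expectation according to a specified density over states and actions.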
Assumption 2 (Shrinking Confidence Interval Radius).
The confidence interval radius goes to zero: in expectation, it is upper bounded by some non-negative function of the time step that converges to zero.
Assumption 3 (Convergent Action Values).
The estimated action-values approach the true action-values for the current policy: in expectation, their difference is upper bounded by some non-negative function of the time step that converges to zero.
These assumptions are heavily dependent on the distribution used to evaluate the expectation. If the expectations are w.r.t. the stationary distribution induced by the optimal policy, it is easy to see that they could be satisfied, as the density is non-zero only for the optimal state-action pairs. In contrast, if the density is a uniform density over the space, then these assumptions may not be satisfied.
Given the three key assumptions, the theorem below is straightforward to prove. However, these three conditions are fundamental, and do not imply each other. Therefore, this result highlights what would need to be shown to obtain the Optimistic Values Theorem. For example, Assumptions 1 and 2 do not imply Assumption 3, because the confidence interval radius could decrease to zero while the values remain stochastically optimistic and an overestimate of values that correspond to a suboptimal policy. Assumptions 1 and 3 do not imply Assumption 2, because the action-value estimates could converge for the policy being followed, while the confidence interval radius never fades away. Then, the upper confidence bounds could still be stochastically optimistic, but the policy could be suboptimal because it is acting greedily according to inaccurate, inflated estimates of value.
Theorem 2 (Optimistic Values Theorem).
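Under Assumptions 1, 2 and 3, the regret across states and actions, in expectation under the specified density, converges to zero.
Proof: Consider the regret across states and actions, that is, the expected gap between the optimal action-values and the true action-values of the current policy. This regret is upper bounded by the expected gap between the upper confidence bounds and the true action-values of the current policy, because the upper confidence bounds are at least as large as the optimal action-values in expectation by Assumption 1. By Assumptions 2 and 3, this upper bound splits into the confidence interval radius and the estimation error, both of which go to zero,
completing the proof.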
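The chain of inequalities behind this argument can be sketched symbolically. The notation is an assumption made for illustration and may differ from that of the main text: $Q^*$ denotes the optimal action-values, $\tilde{Q}_t$ the upper confidence bounds, $\hat{Q}_t$ the estimated action-values, $Q^{\pi_t}$ the true action-values of the current policy, $d$ the specified density, and $f$, $g$ the vanishing non-negative functions of Assumptions 2 and 3.

```latex
\begin{align*}
\mathbb{E}_d\!\left[ Q^*(S,A) - Q^{\pi_t}(S,A) \right]
  &\le \mathbb{E}_d\!\left[ \tilde{Q}_t(S,A) - Q^{\pi_t}(S,A) \right]
     && \text{(Assumption 1)} \\
  &= \mathbb{E}_d\!\left[ \tilde{Q}_t(S,A) - \hat{Q}_t(S,A) \right]
   + \mathbb{E}_d\!\left[ \hat{Q}_t(S,A) - Q^{\pi_t}(S,A) \right] \\
  &\le f(t) + g(t) \;\longrightarrow\; 0 \quad \text{as } t \to \infty
     && \text{(Assumptions 2 and 3)}
\end{align*}
```

The middle step only adds and subtracts $\hat{Q}_t$, which is why the radius and the estimation error can be controlled separately.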
This result is intentionally abstract, where the three assumptions could be satisfied in a variety of ways. These assumptions have been verified for one algorithm, called RLSVI, under a tabular setting using a finite-horizon specification [Osband et al., 2016b], which simplifies ensuring stochastic optimism (Assumption 1). We hypothesize that the last two assumptions could be addressed with a two-timescale analysis, with the confidence interval radius updating more slowly than the action-value estimates. This would reflect an iterative approach, where the optimistic values are essentially held fixed (such as is done in Delayed Q-learning [Grande et al., 2014]) and estimated, before then adjusting the optimistic values. The action-value estimates, then, would be updated on a faster timescale, converging to the action-values of the current policy, and the upper confidence radius would be updated on a slower timescale.
Appendix C Estimating Upper Confidence Bounds for Policy Evaluation using linear TD
Recall that the TD update [Sutton, 1988] processes one sample at a time to estimate the solution to the least-squares system in an incremental manner. This is feasible as the following holds:
Therefore,