1 Introduction
Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [16, 8, 15, 24, 25, 13, 10, 5, 11, 14]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP.
It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through ε-greedy or Boltzmann exploration. These simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high-dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite-state MDPs [see, e.g., 22]. This paper is inspired by the search for principled exploration algorithms that both (1) are compatible with practical function learning algorithms and (2) provide robust performance, at least when specialized to simple benchmarks like tabular MDPs.
Our focus will be on methods that generate exploration by planning with respect to randomized value function estimates. This idea was first proposed in a conference paper by [21] and is investigated more thoroughly in the journal paper [22]. It is inspired by work on posterior sampling for reinforcement learning (a.k.a. Thompson sampling) [26, 19], which could be interpreted as sampling a value function from a posterior distribution and following the optimal policy under that value function for some extended period of time before resampling. A number of papers have subsequently investigated approaches that generate randomized value functions in complex reinforcement learning problems [20, 6, 12, 9, 23, 27, 28]. Our theory will focus on a specific approach of [21, 22], dubbed randomized least squares value iteration (RLSVI), as specialized to tabular MDPs. The name is a play on the classic least-squares policy iteration algorithm (LSPI) of [17]. RLSVI generates a randomized value function (essentially) by judiciously injecting Gaussian noise into the training data and then applying LSPI to this noisy dataset. One could naturally apply the same template while using other value learning algorithms in place of LSPI.
This is a strikingly simple algorithm, but providing rigorous theoretical guarantees has proved challenging. One challenge is that, despite the appealing conceptual connections, there are significant subtleties to any precise link between RLSVI and posterior sampling. The issue is that posterior sampling based approaches are derived from a true Bayesian perspective in which one maintains beliefs over the underlying MDP. The approaches of [22, 27, 23, 6, 12, 28, 9] model only the value function, so Bayes' rule is not even well defined.¹ The work of [21, 22] uses stochastic dominance arguments to relate the value function sampling distribution of RLSVI to a correct posterior in a Bayesian model where the true MDP is randomly drawn. This gives substantial insight, but the resulting analysis is not entirely satisfying as a robustness guarantee. It bounds regret on average over MDPs with transition kernels drawn from a particular Dirichlet prior, but one may worry that hard reinforcement learning instances are extremely unlikely under this particular prior.
¹ The precise issue is that, even given a prior over value functions, there is no likelihood function. Given an MDP, there is a well-specified likelihood of transitioning from one state to another, but a value function does not specify a probabilistic data-generating model.
This paper develops a very different proof strategy and provides a worst-case regret bound for RLSVI applied to tabular finite-horizon MDPs. The crucial proof steps are to show that each randomized value function sampled by RLSVI has a significant probability of being optimistic (see Lemma 4) and then to show that from this property one can reduce regret analysis to concentration arguments pioneered by [13] (see Lemmas 6, 7). This approach is inspired by frequentist analysis of Thompson sampling for linear bandits [2] and especially the lucid description of [1]. However, applying these ideas in reinforcement learning appears to require novel analysis. The only prior extension of these proof techniques to tabular reinforcement learning was carried out by [3]. Reflecting the difficulty of such analyses, that paper does not provide regret bounds for a pure Thompson sampling algorithm; instead their algorithm samples many times from the posterior to form an optimistic model, as in the BOSS algorithm [4]. Also, unfortunately there is a significant error in that paper’s analysis and the correction has not yet been posted online, making a careful comparison difficult at this time.
The established regret bounds are not state of the art for tabular finite-horizon MDPs. A final step of the proof applies techniques of [13], introducing an extra factor in the bounds. I hope some smart reader can improve this by intelligently adapting the techniques of [5, 11]. However, the primary goal of the paper is not to give the tightest possible regret bound, but to broaden the set of exploration approaches known to satisfy polynomial worst-case regret bounds. To this author, it is both fascinating and beautiful that carefully adding noise to the training data generates sophisticated exploration, and proving this formally is worthwhile.
2 Problem formulation
We consider the problem of learning to optimize performance through repeated interactions with an unknown finite horizon MDP . The agent interacts with the environment across episodes. Each episode proceeds over periods, where for period of episode the agent is in state , takes action , observes the reward and, for , also observes next state . Let denote the history of interactions prior to episode . The Markov transition kernel encodes the transition probabilities, with
The reward distribution is encoded in , with
We usually instead refer to expected rewards encoded in a vector that satisfies . We then refer to an MDP , described in terms of its expected rewards rather than its reward distribution, as this is sufficient to determine the expected value accrued by any policy. The variable denotes a deterministic initial state, and we assume for every episode . At the expense of complicating some formulas, the entire paper could also be written assuming initial states are drawn from some distribution over , which is more standard in the literature.
A deterministic Markov policy is a sequence of functions, where each prescribes an action to play in each state. We let denote the space of all such policies. We use to denote the value function associated with policy in the subepisode consisting of periods . To simplify many expressions, we set . Then the value functions for are the unique solution to the Bellman equations
The optimal value function is .
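As a minimal sketch of the backward recursion behind these Bellman equations, the following code evaluates a deterministic Markov policy and computes optimal values for a tabular finite-horizon MDP. The array layout (`P[h, s, a, s']`, `R[h, s, a]`), the zero-based period index, and the function names are assumptions made for illustration; the terminal convention of zero value after the last period follows the text.

```python
import numpy as np

def evaluate_policy(P, R, policy):
    """Value functions of a deterministic Markov policy by backward induction.

    P[h, s, a, s2] : transition probabilities, shape (H, S, A, S)  (assumed layout)
    R[h, s, a]     : expected rewards, shape (H, S, A)
    policy[h, s]   : integer action played in state s at period h, shape (H, S)
    Returns V of shape (H + 1, S), with V[H] = 0 as the terminal convention.
    """
    H, S, A = R.shape
    V = np.zeros((H + 1, S))
    states = np.arange(S)
    for h in reversed(range(H)):
        a = policy[h]
        # Bellman equation: reward now plus expected value at the next period.
        V[h] = R[h, states, a] + P[h, states, a] @ V[h + 1]
    return V

def optimal_values(P, R):
    """Optimal value functions and a greedy optimal policy by backward induction."""
    H, S, A = R.shape
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R[h] + P[h] @ V[h + 1]      # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, policy
```

Evaluating the policy returned by `optimal_values` with `evaluate_policy` should reproduce the optimal values, and any other policy should have values no larger in every state and period.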
An episodic reinforcement learning algorithm Alg is a possibly randomized procedure that associates each history with a policy to employ throughout the next episode. Formally, a randomized algorithm can depend on random seeds drawn independently of the past from some prespecified distribution. Such an episodic reinforcement learning algorithm selects a policy to be employed throughout episode .
The cumulative expected regret incurred by Alg over episodes of interaction with the MDP is
where the expectation is taken over the random seeds used by a randomized algorithm and the randomness in the observed rewards and state transitions that influence the algorithm’s chosen policy. This expression captures the algorithm’s cumulative expected shortfall in performance relative to an omniscient benchmark, which knows and always employs the true optimal policy.
Of course, regret as formulated above depends on the MDP to which the algorithm is applied. Our goal is not to minimize regret under a particular MDP but to provide a guarantee that holds uniformly across a class of MDPs. This can be expressed more formally by considering a class containing all MDPs with states, actions, periods, and reward distributions bounded in [0,1]. Our goal is to bound the worst-case regret incurred by an algorithm throughout episodes of interaction with an unknown MDP in this class. We aim for a bound on worst-case regret that scales sublinearly in and has some reasonable polynomial dependence in the size of state space, action space, and horizon. We won’t explicitly maximize over in the analysis. Instead, we fix an arbitrary MDP and seek to bound regret in a way that does not depend on the particular transition probabilities or reward distributions under .
It is worth remarking that, as formulated, our algorithm knows , and but does not have knowledge of the number of episodes . Indeed, we study a so-called anytime algorithm that has good performance for all sufficiently long sequences of interaction.
Notation for empirical estimates.
We define to be the number of times action has been sampled in state , period . For every tuple with , we define the empirical mean reward and empirical transition probabilities up to period by
(1)  
(2) 
If was never sampled before episode , we define and .
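The empirical estimates described above can be sketched in code. This is only an illustration under stated assumptions: the trajectory format, the function name, and in particular the `n + 1` denominator and the zero estimates for unvisited tuples are choices made here for the sketch (they keep the estimates well defined when a tuple has no data) and should be checked against Equations (1) and (2).

```python
import numpy as np

def empirical_estimates(episodes, H, S, A):
    """Visit counts, empirical mean rewards, and empirical transition estimates.

    episodes : list of trajectories; each trajectory is a list of
               (s, a, r, s_next) tuples, one per period, with s_next = None
               at the final period (hypothetical data format).
    """
    n = np.zeros((H, S, A))
    reward_sum = np.zeros((H, S, A))
    next_counts = np.zeros((H, S, A, S))
    for trajectory in episodes:
        for h, (s, a, r, s_next) in enumerate(trajectory):
            n[h, s, a] += 1
            reward_sum[h, s, a] += r
            if s_next is not None:
                next_counts[h, s, a, s_next] += 1
    # Assumed normalization: divide by n + 1, so unvisited tuples get zeros.
    R_hat = reward_sum / (n + 1)
    P_hat = next_counts / (n[..., None] + 1)
    return n, R_hat, P_hat
```

With this normalization the rows of `P_hat` need not sum to one; a plain `n` denominator with a separate rule for unvisited tuples would be the alternative design.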
3 Randomized Least Squares Value Iteration
This section describes an algorithm called Randomized Least Squares Value Iteration (RLSVI). We describe RLSVI as specialized to a simple tabular problem in a way that is most convenient for the subsequent theoretical analysis. A mathematically equivalent definition, which defines RLSVI as estimating a value function on randomized training data, extends more gracefully. This interpretation is given at the end of the section and more carefully in [22].
At the start of episode , the agent has observed a history of interactions . Based on this, it is natural to consider an estimated MDP with empirical estimates of mean rewards and transition probabilities. These are precisely defined in Equation (2) and the surrounding text. We could use backward recursion to solve for the optimal policy and value functions under the empirical MDP, but applying this policy would not generate exploration.
RLSVI builds on this idea, but to induce exploration it judiciously adds Gaussian noise before solving for an optimal policy. We can define RLSVI concisely as follows. In episode it samples a random vector with independent components , where . We define , where is a tuning parameter and the denominator shrinks like the standard deviation of the average of i.i.d. samples. Given , we construct a randomized perturbation of the empirical MDP by adding the Gaussian noise to estimated rewards. RLSVI solves for the optimal policy under this MDP and applies it throughout the episode. This policy is, of course, greedy with respect to the (randomized) value functions under . The random noise in RLSVI should be large enough to dominate the error introduced by performing a noisy Bellman update using and . We set in the analysis, where functions of offer a coarse bound on quantities like the variance of an empirically estimated Bellman update. For , we denote this algorithm by .
RLSVI as regression on perturbed data.
To extend beyond simple tabular problems, it is fruitful to view RLSVI, as in Algorithm 1, as an algorithm that performs recursive least squares estimation on the state-action value function. Randomization is injected into these value function estimates by perturbing observed rewards and by regularizing to a randomized prior sample. This prior sample is essential, as otherwise there would be no randomness in the estimated value function in initial periods. This procedure is the LSPI algorithm of [17] applied with noisy data and a tabular representation. The paper [22] includes many experiments with non-tabular representations.
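The perturbed-MDP definition of RLSVI given at the start of this section can be sketched as follows: add independent Gaussian noise to the empirical rewards, then solve the perturbed empirical MDP by backward recursion and act greedily. The function name, array layout, and the specific noise variance `beta / (n + 1)` are assumptions for illustration; the text only states that the noise scale is governed by a tuning parameter and shrinks like the standard deviation of a sample average.

```python
import numpy as np

def rlsvi_policy(R_hat, P_hat, n, beta, rng):
    """Plan one episode with tabular RLSVI (illustrative sketch).

    R_hat[h, s, a]     : empirical mean rewards, shape (H, S, A)
    P_hat[h, s, a, s2] : empirical transition estimates, shape (H, S, A, S)
    n[h, s, a]         : visit counts, shape (H, S, A)
    beta               : noise tuning parameter (assumed variance beta / (n + 1))
    rng                : numpy.random.Generator
    """
    H, S, A = R_hat.shape
    sigma = np.sqrt(beta / (n + 1.0))
    # Independent Gaussian perturbation for every (period, state, action).
    w = rng.normal(loc=0.0, scale=sigma)
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)                 # value after the final period is zero
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # Empirical Bellman update plus Gaussian noise on the reward.
        Q[h] = R_hat[h] + w[h] + P_hat[h] @ V_next
        policy[h] = Q[h].argmax(axis=1)
        V_next = Q[h].max(axis=1)
    return policy, Q
```

Setting `beta = 0` removes the noise and recovers the greedy policy of the empirical MDP, which, as noted above, would not generate exploration.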
To understand this presentation of RLSVI, it is helpful to understand an equivalence between posterior sampling in a Bayesian linear model and fitting a regularized least squares estimate to randomly perturbed data. We refer to [22] for a full discussion of this equivalence and review the scalar case here. Consider Bayes updating of a scalar parameter based on noisy observations where . The posterior distribution has the closed form . We could generate a sample from this distribution by fitting a least squares estimate to noise-perturbed data. Sample where each is drawn independently and sample . Then
(3) 
satisfies . For more complex models, where exact posterior sampling is impossible, we may still hope estimation on randomly perturbed data generates samples that reflect uncertainty in a sensible way. As far as RLSVI is concerned, roughly the same calculation shows that in Algorithm 1 is equal to an empirical Bellman update plus Gaussian noise:
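The scalar equivalence reviewed earlier in this section can be checked numerically. The sketch below assumes a standard conjugate model chosen for illustration: a prior N(0, `prior_var`) on the parameter and observations equal to the parameter plus N(0, `noise_var`) noise. These names and the exact parameterization are assumptions, not notation from the text. A regularized least squares fit to noise-perturbed observations, shrunk toward a random prior draw, should then have the same mean and variance as the exact posterior.

```python
import numpy as np

def perturbed_least_squares_sample(y, noise_var, prior_var, rng):
    """One sample obtained by regularized least squares on perturbed data."""
    n = len(y)
    w = rng.normal(0.0, np.sqrt(noise_var), size=n)     # perturb each observation
    theta_tilde = rng.normal(0.0, np.sqrt(prior_var))   # random prior draw
    reg = noise_var / prior_var
    # Closed-form minimizer of
    #   sum_i (theta - (y_i + w_i))^2 + reg * (theta - theta_tilde)^2
    return (np.sum(y + w) + reg * theta_tilde) / (n + reg)

def exact_posterior(y, noise_var, prior_var):
    """Mean and variance of the Gaussian posterior under the assumed model."""
    n = len(y)
    reg = noise_var / prior_var
    return np.sum(y) / (n + reg), noise_var / (n + reg)
```

Drawing many samples from `perturbed_least_squares_sample` for fixed data and comparing their empirical mean and variance with `exact_posterior` gives a quick Monte Carlo check of the equivalence.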
4 Main result
Theorem 1 establishes that RLSVI satisfies a worst-case polynomial regret bound for tabular finite-horizon MDPs. It is worth contrasting RLSVI to ε-greedy exploration and Boltzmann exploration, which are both widely used randomization approaches to exploration. Those simple methods explore by directly injecting randomness into the action chosen at each timestep. Unfortunately, they can fail catastrophically even on simple examples with a finite state space, requiring a time to learn that scales exponentially in the size of the state space. Instead, RLSVI generates randomization by training value functions with randomly perturbed rewards. Theorem 1 confirms that this approach generates a sophisticated form of exploration fundamentally different from ε-greedy exploration and Boltzmann exploration. The notation Õ(·) ignores polylogarithmic factors in and .
Theorem 1.
Let denote the set of MDPs with horizon , states, actions, and rewards bounded in [0,1]. Then for a tuning parameter sequence with ,
This bound is not state of the art, and achieving the tightest bound is not the main goal of this paper. I conjecture that the extra factor can be removed from this bound through a careful analysis, making the dependence on , , and optimal. This conjecture is supported by numerical experiments and (informally) by a Bayesian regret analysis [22]. The extra factor appears to come from a step at the very end of the proof in Lemma 7, where we bound a certain norm as in the analysis style of [13]. For optimistic algorithms, some recent work has avoided directly bounding that norm, yielding a tighter regret guarantee [5, 11].
Remark 1.
Some translation is required to relate the dependence on with other literature. Many results are given in terms of the number of periods , which masks a factor of . Also, unlike e.g. [5], this paper treats time-inhomogeneous transition kernels. In some sense agents must learn about extra state/action pairs. Roughly speaking then, our result exactly corresponds to what one would get by applying the UCRL2 analysis [13] to a time-inhomogeneous finite-horizon problem.
5 Proof of Theorem 1
The proof follows from several lemmas. Some are (possibly complex) technical adaptations of ideas present in many regret analyses. Lemmas 4 and 6 are the main discoveries that prompted this paper. Throughout we use the following notation: for any MDP , let denote the value function corresponding to policy from the initial state . In this notation, for the true MDP we have .
A concentration inequality.
Through a careful application of Hoeffding’s inequality, one can give a high probability bound on the error in applying a Bellman update to the (nonrandom) optimal value function . Through this, and a union bound, Lemma 2 bounds the expected number of times the empirically estimated MDP falls outside the confidence set
where we define
This set is only a tool in the analysis and cannot be used by the agent since is unknown.
Lemma 2 (Validity of confidence sets).
From value function error to on policy Bellman error.
For some fixed policy , the next simple lemma expresses the gap between the value functions under two MDPs in terms of the differences between their Bellman operators. Results like this are critical to many analyses in the RL literature. Notice the asymmetric role of and . The value functions correspond to one MDP while the state trajectory is sampled in the other. We’ll apply the lemma twice: once where is the true MDP and is the estimated one used by RLSVI, and once where the roles are reversed.
Lemma 3.
Consider any policy and two MDPs and . Let and denote the respective value functions of under and . Then
where and the expectation is over the sampled state trajectory drawn from following in the MDP .
Proof.
Expanding this recursion gives the result. ∎
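A decomposition of this kind can be verified numerically. The sketch below assumes the standard form of the identity: the difference in the policy's initial-state value between two MDPs equals the expected sum, along a trajectory sampled in the second MDP, of the reward difference plus the transition difference applied to the first MDP's next-period value function. The function names, array layout, and the labels `M` and `M_bar` are assumptions for illustration. The expectation is computed exactly by propagating the state distribution rather than by sampling.

```python
import numpy as np

def policy_values(P, R, policy):
    """Backward induction for a deterministic Markov policy; V has shape (H + 1, S)."""
    H, S, A = R.shape
    V = np.zeros((H + 1, S))
    idx = np.arange(S)
    for h in reversed(range(H)):
        V[h] = R[h, idx, policy[h]] + P[h, idx, policy[h]] @ V[h + 1]
    return V

def bellman_error_sum(P, R, P_bar, R_bar, policy, s1):
    """Expected on-policy Bellman error, with states distributed as in M_bar.

    Values V come from M = (P, R); the trajectory follows `policy` in
    M_bar = (P_bar, R_bar), reflecting the asymmetric roles of the two MDPs.
    """
    H, S, A = R.shape
    V = policy_values(P, R, policy)
    idx = np.arange(S)
    d = np.zeros(S)
    d[s1] = 1.0                          # state distribution at the first period
    total = 0.0
    for h in range(H):
        a = policy[h]
        err = (R[h, idx, a] - R_bar[h, idx, a]) \
            + (P[h, idx, a] - P_bar[h, idx, a]) @ V[h + 1]
        total += d @ err
        d = d @ P_bar[h, idx, a]         # advance one period under M_bar
    return total
```

For any two MDPs of the same size, `policy_values(P, R, policy)[0, s1] - policy_values(P_bar, R_bar, policy)[0, s1]` should match `bellman_error_sum(P, R, P_bar, R_bar, policy, s1)` up to floating point error.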
Sufficient optimism through randomization.
There is always the risk that, based on noisy observations, an RL algorithm incorrectly forms a low estimate of the value function at some state. This may lead the algorithm to purposefully avoid that state, therefore failing to gather the data needed to correct its faulty estimate. To avoid such scenarios, nearly all provably efficient RL exploration algorithms build purposefully optimistic estimates. RLSVI does not do this, and instead generates a randomized value function. The following lemma is key to our analysis. It shows that, except in the rare event when it has grossly misestimated the underlying MDP, RLSVI has at least a constant chance of sampling an optimistic value function. Similar results can be proved for Thompson sampling with linear models [1]. Recall that is the unknown true MDP with optimal policy and is RLSVI’s noise-perturbed MDP under which is an optimal policy.
Lemma 4.
Let be an optimal policy for the true MDP . If , then
This result is more easily established through the following lemma, which avoids the need to carefully condition on the history at each step. We conclude with the proof of Lemma 4 after.
Lemma 5.
Fix any policy and vector with . Consider the MDP and alternative and obeying the inequality
for every and . Take to be a random vector with independent components where . Let denote the (random) value function of the policy under the MDP . Then
Proof.
To start, we consider an arbitrary deterministic vector (thought of as a possible realization of ) and evaluate the gap in value functions . We can rewrite this quantity by applying Lemma 3. Let denote a random sequence of states drawn by simulating the policy in the MDP from the deterministic initial state . Set for . Then
(4) 
where the expectation is taken over the sequence of states . Define for any deterministic , sequence of states and actions ,
Then (4) shows where the expectation integrates over the random sequence of states (which determine the sequence of actions ). Then,
where here integrates over the distribution of .
Let denote the variance of . Our remaining goal is to bound for arbitrary fixed sequences and . We have
Since the mean of is no more than one standard deviation below zero, . ∎
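The final step rests on a simple Gaussian fact: a normal random variable whose mean is at most one standard deviation below zero is nonnegative with probability at least Φ(−1), roughly 0.159, where Φ is the standard normal CDF. The short sketch below illustrates this fact only; the function names are hypothetical, and reading the constant in the lemma as Φ(−1) is an assumption made here.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_nonnegative(mean, std):
    """P(X >= 0) for X ~ N(mean, std^2) with std > 0."""
    return normal_cdf(mean / std)

# Worst case allowed by the argument: the mean sits exactly one standard
# deviation below zero, so the probability equals the CDF evaluated at -1.
lower_bound = prob_nonnegative(-1.0, 1.0)
```

Any mean above `-std` only increases this probability, which is why a constant lower bound on the chance of optimism is available.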
Reduction to bounding online prediction error.
The next lemma shows that the cumulative expected regret of RLSVI is bounded in terms of the total prediction error in estimating the value function of . The critical feature of the result is that it only depends on the algorithm being able to estimate the performance of the policies it actually employs and therefore gathers data about. From here, the regret analysis will follow from concentration arguments alone. For the purposes of analysis, we let denote an imagined second sample drawn from the same distribution as the perturbed MDP under RLSVI. More formally, let where is independent Gaussian noise. Conditioned on the history, has the same marginal distribution as , but it is statistically independent of the policy selected by RLSVI.
Lemma 6.
For an absolute constant , we have
Online prediction error bounds.
We complete the proof with concentration arguments. Set and to be the error in estimating the mean reward and transition vector corresponding to . The next result follows by bounding each term in Lemma 6. This is done by using Lemma 3 to expand the terms and . We focus our analysis on bounding . The other term can be bounded in an identical manner,² so we omit this analysis.
² In particular, an analogue of Lemma 7 holds where we replace with , with the value function corresponding to policy in the MDP , and the Gaussian noise with the fictitious noise terms .
Lemma 7.
Let . Then for any ,
The remaining lemmas complete the proof. At each stage, RLSVI adds Gaussian noise with standard deviation no larger than . Ignoring extremely low probability events, we expect, and hence . The proof of this Lemma makes this precise by applying appropriate maximal inequalities.
Lemma 8.
The next few lemmas are essentially a consequence of analysis in [13], and many subsequent papers. We give proof sketches in the appendix. The main idea is to apply known concentration inequalities to bound , or in terms of either or . The pigeonhole principle gives and .
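To illustrate the kind of pigeonhole bound used here, consider one standard instance, assumed for this sketch rather than taken from the lemma statements: if each visit to a state-action pair contributes one over the square root of its visit count so far (plus one), then the total over any visit sequence is at most twice the square root of the number of pairs times the number of visits. The function names are hypothetical.

```python
import math

def visit_sum(visits, num_pairs):
    """Sum of 1 / sqrt(count + 1) over a sequence of visited pair indices.

    visits    : iterable of integers in [0, num_pairs), one per visit
    num_pairs : number of distinct state-action pairs
    """
    counts = [0] * num_pairs
    total = 0.0
    for i in visits:
        total += 1.0 / math.sqrt(counts[i] + 1)
        counts[i] += 1
    return total

def pigeonhole_bound(num_visits, num_pairs):
    """Upper bound 2 * sqrt(num_pairs * num_visits) on visit_sum.

    Per pair, sum_{j=1..N} 1/sqrt(j) <= 2*sqrt(N); Cauchy-Schwarz then
    combines the per-pair bounds across all pairs.
    """
    return 2.0 * math.sqrt(num_pairs * num_visits)
```

The bound grows like the square root of the number of visits, which is the source of the sublinear dependence on the number of episodes in regret bounds of this type.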
Lemma 9.
Lemma 10.
Lemma 11.
Acknowledgments.
Much of my understanding of randomized value functions comes from a collaboration with Ian Osband, Ben Van Roy, and Zheng Wen. Mark Sellke and Chao Qin each noticed the same error in the proof of Lemma 6 in the initial draft of this paper. The lemma has now been revised. I am extremely grateful for their careful reading of the paper.
References
 Abeille et al. [2017] Marc Abeille, Alessandro Lazaric, et al. Linear thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

 Agrawal and Goyal [2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
 Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
 Asmuth et al. [2009] John Asmuth, Lihong Li, Michael L Littman, Ali Nouri, and David Wingate. A bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19–26. AUAI Press, 2009.
 Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
 Azizzadenesheli et al. [2018] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian deep q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
 Boucheron et al. [2013] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
 Burda et al. [2019] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. 2019.
 Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
 Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
 Fortunato et al. [2018] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. 2018.
 Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
 Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
 Kakade et al. [2003] Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
 Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
 Lagoudakis and Parr [2003] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.
 Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2018.
 Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 Osband et al. [2016a] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016a.
 Osband et al. [2016b] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016b.
 Osband et al. [2017] Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
 Osband et al. [2018] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
 Strehl et al. [2006] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
 Strehl et al. [2009] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
 Strens [2000] Malcolm Strens. A bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.
 Touati et al. [2018] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.
 Tziortziotis et al. [2019] Nikolaos Tziortziotis, Christos Dimitrakakis, and Michalis Vazirgiannis. Randomised bayesian least-squares policy iteration. arXiv preprint arXiv:1904.03535, 2019.
 Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
Appendix A Omitted Proofs
A.1 Proof of Lemma 2
See Lemma 2.
Proof.
The following construction is the standard way concentration inequalities are applied in bandit models and tabular reinforcement learning. See the discussion of what Lattimore and Szepesvári [18] calls a “stack of rewards” model in Subsection 4.6.
For every tuple , generate two i.i.d. sequences of random variables and . Here denotes the reward and denotes the state transition generated from the th time action is played in state , period . Set