A Markov Decision Process (MDP) is one of the most standard models studied in reinforcement learning. It can be denoted by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \mu_0, H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$ is the reward function (where $\mathbb{R}_{\geq 0}$ denotes the set of non-negative real numbers), $P$ is the transition, $\mu_0$ is an initial state distribution and $H$ is the horizon (i.e. the length of each episode). (In general, the reward can be stochastic; here, for simplicity, we assume the reward is deterministic and known throughout the paper, which is a common assumption in the literature, e.g., Jin et al., 2018, 2020; Kakade et al., 2020. Our method can be generalized to the infinite-horizon case; see Section LABEL:sec:practical_alg for the details.) A (potentially non-stationary) policy can be defined as $\pi = \{\pi_h\}_{h=1}^{H}$, where $\pi_h: \mathcal{S} \to \Delta(\mathcal{A})$. Following the standard notation, we define the value function $V_h^\pi(s)$ and the action-value function (i.e. the $Q$ function) $Q_h^\pi(s,a)$ as the expected cumulative rewards under transition $P$ when executing policy $\pi$ starting from $s$ (resp. $(s,a)$) at step $h$. With these two definitions at hand, it is straightforward to show the following Bellman equation:
$$Q_h^\pi(s,a) = r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[V_{h+1}^\pi(s')\right], \qquad V_h^\pi(s) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)}\left[Q_h^\pi(s,a)\right].$$
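As a concrete illustration of the Bellman equation, the following toy example evaluates a non-stationary policy on a small tabular MDP by backward induction. All sizes and numbers are made up for illustration and are not from the paper:

```python
import numpy as np

# A toy tabular MDP used to illustrate the finite-horizon Bellman equation.
S, A, H = 3, 2, 4                            # |state space|, |action space|, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(0.0, 1.0, size=(S, A))       # deterministic, known reward

def evaluate(pi):
    """Backward induction: Q_h(s,a) = r(s,a) + E_{s'~P(.|s,a)}[V_{h+1}(s')]."""
    V = np.zeros((H + 1, S))                 # V_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = r + P @ V[h + 1]              # Bellman equation
        V[h] = Q[h][np.arange(S), pi[h]]     # follow policy pi_h at step h
    return V, Q

pi = rng.integers(0, A, size=(H, S))         # a non-stationary policy
V, Q = evaluate(pi)
```

Since the per-step reward lies in $[0, 1]$, every entry of `V` is between $0$ and $H$.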
Reinforcement learning aims at finding the optimal policy $\pi^* = \arg\max_\pi V_1^\pi$. It is well known that in the tabular setting, when the state space and action space are finite, we can provably identify the optimal policy with optimism-based methods that are both sample-efficient and computationally efficient (e.g., Azar et al., 2017; Jin et al., 2018; Zhang et al., 2021), with complexity polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$. However, in practice, the cardinality of the state and action spaces can be large or even infinite, so we need to incorporate function approximation into the learning algorithm to deal with such cases. The linear MDP (Jin et al., 2020), or low-rank MDP (Agarwal et al., 2020; Modi et al., 2021), is the most well-known reinforcement learning model that can incorporate linear function approximation with theoretical guarantees, thanks to the following assumption on the transition:
$$P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle_{\mathcal{H}}, \qquad (1)$$
where $\phi: \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ and $\mu: \mathcal{S} \to \mathcal{H}$ are two feature maps and $\mathcal{H}$ is a Hilbert space. The most essential observation is that, for any policy $\pi$, $Q_h^\pi$ is linear w.r.t. $\phi$, due to the following observation (Jin et al., 2020):
$$Q_h^\pi(s,a) = r(s,a) + \int V_{h+1}^\pi(s') \langle \phi(s,a), \mu(s') \rangle \, ds' = r(s,a) + \left\langle \phi(s,a), \int V_{h+1}^\pi(s') \mu(s') \, ds' \right\rangle.$$
Hence $\phi$ serves as a sufficient representation for the estimation of $Q_h^\pi$: it can provide uncertainty estimation with standard linear model analysis and eventually leads to sample-efficient learning when $\phi$ is fixed and known to the agent (see Theorem 3.1 in Jin et al., 2020). However, we in general do not have such a representation in advance (one exception is the tabular MDP, where we can choose $\phi(s,a)$ such that each state-action pair has an exclusive non-zero element, with $\mu$ correspondingly defined, to make (1) hold), and we need to learn the representation from the data, which constrains the applicability of the algorithms derived with a fixed and known representation.
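The linearity of $Q_h^\pi$ in $\phi$ can be checked numerically. The snippet below builds a small synthetic linear MDP — a toy construction of ours, with made-up dimensions and a linear reward for simplicity — and verifies that the linear form agrees with the Bellman backup:

```python
import numpy as np

# Toy linear MDP: P(s'|s,a) = <phi(s,a), mu(s')> with d = 2 factors.
rng = np.random.default_rng(7)
S, A, H, d = 4, 2, 3, 2
phi = rng.dirichlet(np.ones(d), size=(S, A))    # phi(s,a) on the simplex
mu = rng.dirichlet(np.ones(S), size=d)          # mu[i] is a distribution over s'
P = phi @ mu                                    # valid transition: rows sum to 1
theta_r = rng.uniform(size=d)
r = phi @ theta_r                               # linear reward r(s,a) = <phi, theta_r>
pi = rng.integers(0, A, size=(H, S))

V = np.zeros(S)                                  # V_{H+1} = 0
for h in reversed(range(H)):
    w_h = theta_r + mu @ V                       # w_h = theta_r + integral of V_{h+1} mu
    Q = phi @ w_h                                # linear form: Q_h(s,a) = <phi(s,a), w_h>
    assert np.allclose(Q, r + P @ V)             # agrees with the Bellman backup
    V = Q[np.arange(S), pi[h]]
```

The identity holds exactly here because $P = \phi^\top \mu$, so the backup distributes over the inner product.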
2 Theoretical Guarantees
In this section, we provide theoretical results for tbd, showing that tbd can identify an informative representation and, as a result, a near-optimal policy in a sample-efficient way. We first define the notion of regret. Assume at episode $k$ the learner chooses the policy $\pi_k$ and observes a sequence $\{(s_{k,h}, a_{k,h}, r_{k,h})\}_{h=1}^{H}$. We define the regret of the first $K$ episodes (and define $V^* = V^{\pi^*}$) as:
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left[ V_1^{*}(s_{k,1}) - V_1^{\pi_k}(s_{k,1}) \right].$$
We want to provide a regret upper bound that is sublinear in $K$: as $K$ increases, we collect more data that helps us build a much more accurate estimate of the representation, which should decrease the per-step regret and make the overall regret scale sublinearly in $K$. As we consider a Thompson Sampling algorithm, we study the expected (Bayesian) regret $\mathbb{E}[\mathrm{Regret}(K)]$, which takes the prior into account.
Before we start, we state the assumptions we use to derive our theoretical results.
Assumptions on the environment
Assumptions on the function class
In practice, we generally approximate the transition with complicated function approximators, so we focus on the setting where we want to find the dynamics from a general function class $\mathcal{F}$ that need not be linear in any given feature map. This is helpful for MuJoCo benchmarks, whose raw states contain the angles, angular velocities and torques of the agent, for which we do not know how to construct a feature map that makes the transition linear. We first state some necessary definitions and assumptions on $\mathcal{F}$. [$\infty$-norm of functions] Define $\|f\|_\infty = \sup_{s,a} \|f(s,a)\|_2$. Notice that it is not the commonly used norm for functions, but it suits our purpose well. [Bounded Output] We assume that $\|f\|_\infty \le C$ for all $f \in \mathcal{F}$. [Realizability] We assume the ground truth dynamics function $f^* \in \mathcal{F}$. We then define the notion of covering number, which will be helpful in our algorithm derivation.
[Covering Number (Wainwright, 2019)] An $\epsilon$-cover of $\mathcal{F}$ with respect to a metric $\rho$ is a set $\mathcal{C} \subseteq \mathcal{F}$ such that for any $f \in \mathcal{F}$, there exists $f' \in \mathcal{C}$ with $\rho(f, f') \le \epsilon$. The $\epsilon$-covering number is the cardinality of the smallest $\epsilon$-cover, denoted as $N(\mathcal{F}, \epsilon, \rho)$. [Bounded Covering Number] We assume that $N(\mathcal{F}, \epsilon, \|\cdot\|_\infty) < \infty$ for all $\epsilon > 0$.
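To make the covering-number definition concrete, here is a toy sketch (our own illustration, not part of the algorithm) that greedily builds an $\epsilon$-cover of a finite function class, with each function represented by its values on a grid of state-action pairs:

```python
import numpy as np

# Greedy construction of an eps-cover under the sup-norm, on a finite
# class of functions represented by their values on 10 grid points.
rng = np.random.default_rng(1)
F = rng.uniform(-1.0, 1.0, size=(200, 10))   # 200 functions, 10 grid points

def greedy_cover(F, eps):
    """Return indices of a set C such that every f in F is within eps of C."""
    cover = []
    for i, f in enumerate(F):
        if not any(np.max(np.abs(f - F[j])) <= eps for j in cover):
            cover.append(i)                  # f is not covered yet: add it
    return cover

cover = greedy_cover(F, eps=0.5)             # every f is within 0.5 of the cover
```

The greedy cover is not the smallest one, but its size upper-bounds how fast the covering number grows as `eps` shrinks.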
Basically, the bounded-output assumption means that the transition dynamics never push the state far from the origin, which holds widely in practice. The realizability assumption guarantees that we can find the exact dynamics in $\mathcal{F}$; otherwise we will always suffer from the error induced by model mismatch. The bounded covering number assumption ensures that we can estimate the dynamics with small error when we have a sufficient number of observations.
Besides the bounded covering number, we also need an additional assumption of bounded eluder dimension, which is defined in the following. [$\epsilon$-dependency (Osband and Van Roy, 2014)] A state-action pair $(s,a)$ is $\epsilon$-dependent on $\{(s_1,a_1),\dots,(s_n,a_n)\}$ with respect to $\mathcal{F}$ if any pair of functions $f_1, f_2 \in \mathcal{F}$ satisfying $\sqrt{\sum_{i=1}^{n} \|f_1(s_i,a_i) - f_2(s_i,a_i)\|_2^2} \le \epsilon$ also satisfies $\|f_1(s,a) - f_2(s,a)\|_2 \le \epsilon$. Furthermore, $(s,a)$ is said to be $\epsilon$-independent of $\{(s_i,a_i)\}_{i=1}^{n}$ with respect to $\mathcal{F}$ if it is not $\epsilon$-dependent on it. [Eluder Dimension (Osband and Van Roy, 2014)] We define the eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ as the length of the longest sequence of elements in $\mathcal{S} \times \mathcal{A}$ such that, for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors.
Intuitively, the eluder dimension characterizes the number of samples we need to make our predictions on unseen data accurate. If the eluder dimension is unbounded, then we cannot make any meaningful prediction on unseen data, even if we have large amounts of samples. Hence, to make learning possible, we need the following bounded eluder dimension assumption. [Bounded Eluder Dimension] We assume $\dim_E(\mathcal{F}, \epsilon) < \infty$.
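The $\epsilon$-dependence test can be made concrete for a finite class. The sketch below is a toy illustration of ours (real function classes are infinite, so this only builds intuition); it checks whether a new state-action pair is $\epsilon$-dependent on its predecessors:

```python
import numpy as np
from itertools import combinations

# Each function in the finite class F is represented by its values on a
# finite set of state-action pairs (columns).
def eps_dependent(F, prev_idx, new_idx, eps):
    """Is the point new_idx eps-dependent on the points prev_idx w.r.t. F?"""
    for f1, f2 in combinations(F, 2):
        close_on_prev = np.sqrt(np.sum((f1[prev_idx] - f2[prev_idx]) ** 2)) <= eps
        if close_on_prev and abs(f1[new_idx] - f2[new_idx]) > eps:
            return False                    # witness pair found: eps-independent
    return True

rng = np.random.default_rng(2)
F = rng.uniform(-1.0, 1.0, size=(20, 5))    # 20 functions on 5 points
dep = eps_dependent(F, prev_idx=[0, 1, 2], new_idx=3, eps=0.5)
```

Two functions that agree on all predecessors but disagree on the new point witness independence: the predecessors did not pin the new prediction down.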
2.2 Main Result
where $\widetilde{O}(\cdot)$ represents the order up to logarithmic factors. For a finite-dimensional function class, the logarithmic covering number $\log N(\mathcal{F}, \epsilon, \|\cdot\|_\infty)$ and the eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ should both scale like $\widetilde{O}(d)$, hence our upper bound is sublinear in $K$. The proof is in Appendix LABEL:sec:technical_proof. Here we briefly sketch the proof idea.
We first construct an equivalent UCB algorithm (see Appendix LABEL:sec:ucb) and bound the regret $\mathrm{Regret}(K)$ for it. Then, by the conclusions from Russo and Van Roy (2013, 2014) and Osband and Van Roy (2014), we can directly translate the upper bound on $\mathrm{Regret}(K)$ for the UCB algorithm into an upper bound on $\mathbb{E}[\mathrm{Regret}(K)]$ for the TS algorithm.
With optimism, we know that for episode $k$, $V_1^{*}(s_{k,1}) \le \widetilde{V}_1^{\pi_k}(s_{k,1})$, where $\widetilde{V}^{\pi_k}$ is the value function of policy $\pi_k$ under the model $\widetilde{P}_k$ introduced in the UCB algorithm. Hence, the regret at episode $k$ can be bounded by $\widetilde{V}_1^{\pi_k}(s_{k,1}) - V_1^{\pi_k}(s_{k,1})$, which is the value difference of the policy $\pi_k$ under the two models $\widetilde{P}_k$ and $P$; this difference can in turn be bounded by the prediction error of $\widetilde{P}_k$ along the trajectories induced by $\pi_k$ (see Lemma LABEL:lem:simulation for the details), which means that when the estimated model $\widetilde{P}_k$ is close to the real model $P$, the policy obtained by planning on $\widetilde{P}_k$ will only suffer a small regret. With the Cauchy–Schwarz inequality, we only need to bound the sum of the squared confidence widths, which can be handled via Lemma LABEL:lem:width_sum_bound. With some additional technical steps, we can obtain the upper bound on $\mathrm{Regret}(K)$ for the UCB algorithm, and hence the upper bound on $\mathbb{E}[\mathrm{Regret}(K)]$ for the TS algorithm. ∎
Kernelized Non-linear Regulator
Notice that, for the linear function class $\mathcal{F} = \{f(s,a) = W \phi(s,a)\}$, where $\phi$ is a fixed and known feature map of a certain $d$-dimensional RKHS (note that the RKHS here is the Hilbert space that contains $f$ with a feature from some fixed and known kernel; it is different from the RKHS we introduced in Section LABEL:sec:algorithm, which uses the feature $\phi(s,a) = k((s,a), \cdot)$ where $k$ is the Gaussian kernel), when the feature and the parameters are bounded, the logarithmic covering number can be bounded by $\widetilde{O}(d)$, and the eluder dimension can be bounded by $\widetilde{O}(d)$ (see Appendix LABEL:sec:linear_case for the detail; notice that we provide a tighter bound on the eluder dimension compared with the one derived in Osband and Van Roy (2014)). Hence, for the linear function class, Theorem 2.2 can be translated into a regret upper bound of $\widetilde{O}(\sqrt{K})$ (with polynomial dependence on $d$ and $H$) for sufficiently large $K$, which matches the results of Kakade et al. (2020) (note that $T$ in Kakade et al. (2020) is the number of episodes, and their value scale can be viewed as $\widetilde{O}(1)$ when the per-step reward is bounded). Moreover, for the case of linear bandits when $H = 1$, our bound can be translated into a regret upper bound of $\widetilde{O}(d\sqrt{K})$, which matches the lower bound (Dani et al., 2008) up to logarithmic terms.
The kernelized nonlinear regulator only contains linear functions w.r.t. some known feature map, which constrains its application in practice. We instead consider general function approximation, which makes our algorithm applicable to more complicated models like deep neural networks. Meanwhile, the regret bound from Osband and Van Roy (2014) depends on a global Lipschitz constant for the value function, which can be hard to quantify either theoretically or empirically. Instead, our regret bound gets rid of such a dependency on the Lipschitz constant via the simulation lemma, which carefully exploits the noise structure.
3.1 Posterior Inference
Assume we have a parameterized model $f_\theta$. We have some candidate posterior inference methods (for Bayesian neural networks):
Stochastic Gradient Langevin Dynamics (Welling and Teh, 2011):
$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla \log p(x_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t I).$$
Some things to notice:
In theory we need step sizes with $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$, but in practice we can use other step-size schedules.
Generally we use a Gaussian prior on the weights, i.e. $\theta \sim \mathcal{N}(0, \sigma_p^2 I)$. The prior density and its gradient can be calculated correspondingly.
The likelihood is just $\prod_{(s,a,s') \in \mathcal{B}} \mathcal{N}(s'; f_\theta(s,a), \Sigma)$, where $\mathcal{B}$ is the mini-batch of the data. We can also add an inverse-Wishart prior on the output covariance $\Sigma$.
In practice we need several chains to have good performance.
Stein Spectral Gradient Estimator in Function Space (Sun et al., 2019): TODO: Illustrate the details later.
Function Space Particle Optimization (Wang et al., 2019): We initialize the same parametric model with different sets of parameters, denoted as $\{\theta^{(i)}\}_{i=1}^{n}$. TODO: Illustrate the details later.
Variational Inference in the weight space?
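Of the methods above, SGLD is the easiest to sketch end to end. The following is a minimal toy instance of the SGLD update with a Gaussian prior; the linear model standing in for the neural network, the unit observation covariance, and all hyperparameters are assumptions of ours for illustration:

```python
import numpy as np

# Minimal SGLD sketch for a model s' ~ N(f_theta(s,a), I) with prior
# theta ~ N(0, sigma_p^2 I); f_theta(x) = x @ theta stands in for a BNN.
rng = np.random.default_rng(3)
N, d = 500, 4
X = rng.normal(size=(N, d))                    # inputs (state-action features)
theta_true = rng.normal(size=d)
Y = X @ theta_true + 0.1 * rng.normal(size=N)  # next-state targets

sigma_p, batch, T = 10.0, 50, 2000
theta = np.zeros(d)
samples = []
for t in range(T):
    eps_t = 1e-3 / (1 + t) ** 0.51             # sum eps_t = inf, sum eps_t^2 < inf
    idx = rng.integers(0, N, size=batch)
    grad_prior = -theta / sigma_p ** 2         # gradient of log N(0, sigma_p^2 I)
    resid = Y[idx] - X[idx] @ theta
    grad_lik = (N / batch) * X[idx].T @ resid  # rescaled minibatch log-lik gradient
    noise = rng.normal(size=d) * np.sqrt(eps_t)
    theta = theta + 0.5 * eps_t * (grad_prior + grad_lik) + noise
    if t > T // 2:
        samples.append(theta.copy())           # keep the second half as draws
post_mean = np.mean(samples, axis=0)
```

A single chain is shown; as noted above, in practice several chains (different seeds) are run and their draws pooled.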
3.2 Value Iteration
Assume we have been provided a fixed model $f$. We want to identify the optimal action corresponding to $f$. Notice that we can treat $f$ as a simulator. TODO: shall we assume the reward is known? We use $\phi(s,a) = k((s,a), \cdot)$ as the feature, where $k$ is the Gaussian kernel with bandwidth $\sigma$. In practice we can use the random feature approximation, i.e. sample $\omega_i \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_i \sim \mathrm{Unif}[0, 2\pi]$, and set $\hat{\phi}_i(s,a) = \sqrt{2/m}\,\cos(\omega_i^\top [s; a] + b_i)$, where $m$ is the number of random features we want to use.
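The random feature approximation of the Gaussian kernel can be sketched and sanity-checked as follows (a standard random Fourier feature construction; the bandwidth and sizes are made up):

```python
import numpy as np

# Random Fourier feature approximation of the Gaussian (RBF) kernel
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
rng = np.random.default_rng(4)
d, m, sigma = 3, 2000, 1.0                      # input dim, #features, bandwidth

W = rng.normal(0.0, 1.0 / sigma, size=(m, d))   # omega_i ~ N(0, sigma^{-2} I)
b = rng.uniform(0.0, 2 * np.pi, size=m)         # b_i ~ Unif[0, 2*pi]

def phi(x):
    """Random feature map with phi(x)^T phi(y) ≈ k(x, y)."""
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = phi(x) @ phi(y)                        # close to `exact` for large m
```

The approximation error decays like $O(1/\sqrt{m})$, so $m$ trades off accuracy against the cost of the linear solves below.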
We run the value iteration in the following way:
Parameterize the $Q$ function as $Q(s,a) = w^\top \phi(s,a)$ or, with the random feature approximation, $Q(s,a) = w^\top \hat{\phi}(s,a)$.
Find a batch of state-action pairs (we can also use samples in the replay buffer; if the reward is not known, we can only use samples in the replay buffer).
Perform the following optimization:
$$\min_w \sum_{(s,a)} \left( w^\top \phi(s,a) - r(s,a) - \mathbb{E}_{s' \sim f(\cdot \mid s,a)}\left[\max_{a'} Q(s', a')\right] \right)^2,$$
until convergence, where the expectation can be approximated with Monte Carlo sampling.
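The value-iteration loop above can be sketched as follows. Everything concrete here — the toy dynamics, the quadratic cost, the two-action set, the discount, and the ridge-regression solver — is an assumption of ours for illustration:

```python
import numpy as np

# Fitted value iteration with Q(s,a) = w^T phi(s,a) on random Fourier
# features, using a known model f as simulator and a known reward.
rng = np.random.default_rng(5)
ds, actions, m, sigma, gamma = 2, [-1.0, 1.0], 100, 1.0, 0.9

W = rng.normal(0.0, 1.0 / sigma, size=(m, ds + 1))
b = rng.uniform(0.0, 2 * np.pi, size=m)

def phi(s, a):
    return np.sqrt(2.0 / m) * np.cos(W @ np.append(s, a) + b)

def f(s, a):                 # model mean, treated as a simulator (toy dynamics)
    return 0.9 * s + 0.1 * a

def reward(s, a):            # assumed-known reward (toy quadratic cost)
    return -np.sum(s ** 2)

w = np.zeros(m)
for it in range(20):                         # iterate until (approximate) convergence
    states = rng.normal(size=(100, ds))      # batch of states
    X, y = [], []
    for s in states:
        for a in actions:
            # Monte Carlo estimate of E_{s'} max_{a'} Q(s', a')
            nxt = [f(s, a) + 0.01 * rng.normal(size=ds) for _ in range(5)]
            v_next = np.mean([max(w @ phi(sp, ap) for ap in actions) for sp in nxt])
            X.append(phi(s, a))
            y.append(reward(s, a) + gamma * v_next)
    X, y = np.array(X), np.array(y)
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(m), X.T @ y)  # ridge regression

def greedy(s):
    return max(actions, key=lambda a: w @ phi(s, a))
```

Each outer iteration is one regression onto the Bellman targets; the greedy policy at the end is what gets executed in the environment.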
3.3 Interaction with the Environment
After we finish the value iteration, we just execute the greedy policy induced by the learned $Q$ function (maybe for a fixed number of steps), put the collected data in the replay buffer, and then infer the posterior again.
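The overall interaction loop can be sketched end to end. To keep the posterior in closed form, the toy below replaces the BNN posterior with Bayesian linear regression over one-dimensional dynamics; all specifics (dynamics, action grid, noise level, episode lengths) are made up:

```python
import numpy as np

# Thompson-sampling loop: sample a model from the posterior, act greedily
# under the sampled model for one episode, store data, update the posterior.
rng = np.random.default_rng(6)
theta_true, noise = 0.8, 0.1                   # true dynamics: s' = theta*(s+a) + eps
actions = np.linspace(-1.0, 1.0, 5)
A, b_ = 1.0, 0.0                               # posterior precision and precision*mean
buffer = []
for episode in range(50):
    theta_s = rng.normal(b_ / A, 1.0 / np.sqrt(A))    # Thompson sample of the model
    s = 1.0
    for h in range(10):
        # greedy planning under the sampled model: drive the predicted state to 0
        a = min(actions, key=lambda a: (theta_s * (s + a)) ** 2)
        s_next = theta_true * (s + a) + noise * rng.normal()
        buffer.append((s, a, s_next))                 # replay buffer
        x = s + a                                     # regression input
        A, b_ = A + x * x / noise ** 2, b_ + x * s_next / noise ** 2  # conjugate update
        s = s_next
post_mean = b_ / A
```

In the practical algorithm, the conjugate update is replaced by one of the approximate inference methods of Section 3.1 and greedy planning by the value iteration of Section 3.2.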
- Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. (2020). FLAMBE: structural complexity and representation learning of low rank MDPs. arXiv preprint arXiv:2006.10814.
- Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272.
- Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.
- Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4868–4878.
- Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143.
- Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020). Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466.
- Modi, A., Chen, J., Krishnamurthy, A., Jiang, N., and Agarwal, A. (2021). Model-free representation learning and exploration in low-rank MDPs. arXiv preprint arXiv:2102.07035.
- Osband, I. and Van Roy, B. (2014). Model-based reinforcement learning and the eluder dimension. Advances in Neural Information Processing Systems 27, pp. 1466–1474.
- Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems.
- Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research 39 (4), pp. 1221–1243.
- Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779.
- Wainwright, M. J. (2019). High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48, Cambridge University Press.
- Wang, Z., Ren, T., Zhu, J., and Zhang, B. (2019). Function space particle optimization for Bayesian neural networks. arXiv preprint arXiv:1902.09754.
- Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
- Zhang, Z., Ji, X., and Du, S. (2021). Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pp. 4528–4531.