## 1 Introduction

Reinforcement learning methods with deep neural networks as function approximators have recently demonstrated prominent success in solving complex and observation-rich tasks like games (Mnih et al., 2015; Silver et al., 2016), simulated control problems (Todorov et al., 2012; Lillicrap et al., 2015; Mordatch et al., 2016) and a range of robotics tasks (Christiano et al., 2016; Tobin et al., 2017). A common aspect in most of these success stories is the use of simulation. Arguably, given a simulator of the real environment, it is possible to use RL to learn a near-optimal policy from (usually a large amount of) simulation data. If the simulator is highly accurate, the learned policy should also perform well in the real environment.

Apart from some cases where the true environment and the simulator coincide (e.g., in game playing) or a nearly perfect simulator can be created from the laws of physics (e.g., in simple control problems), in general we will need to construct the simulator using data from the real environment, making the overall approach an instance of *model-based RL*. As the algorithms for learning from simulated experience mature (which is what the RL community has mostly focused on), the bottleneck has shifted to the creation of a good simulator. *How can we learn a good model of the world from interaction experience?*

A popular approach for meeting this challenge is to learn using a wide variety of simulators, which imparts robustness and adaptivity to the learned policies. Recent works have demonstrated the benefits of using such an ensemble of models, which can be used to either transfer policies from simulated to real-world domains, or to simply learn robust policies (Andrychowicz et al., 2018; Tobin et al., 2017; Rajeswaran et al., 2017). Borrowing the motivation from these empirical works, we notice that the process of learning a simulator inherently includes various choices like inductive biases, data collection policy, and other design aspects. As such, instead of relying on a sole approximate model for learning in simulation, interpolating between models obtained from different sources can provide a better approximation of the real environment. Previous works like Buckman et al. (2018); Lee et al. (2019); Kurutach et al. (2018) have also demonstrated the effectiveness of using an ensemble of models for decreasing modelling error or its effect during learning.

In this paper, we consider building an approximate model of the real environment from interaction data using a set (or *ensemble*) of possibly inaccurate models, which we will refer to as the *base models*. The simplest way to combine the base models is to take a weighted combination, but such an approach is rather limited. For instance, each base model might be accurate in certain regions of the state space, in which case it is natural to consider a state-dependent mixture. We consider the problem of learning in such a setting, where one has to identify an appropriate combination of the base models through real-world interactions, so that the induced policy performs well in the real environment. The data collected through interaction with the real world can be a precious resource and, therefore, we need the learning procedure to be sample-efficient. Our main result is an algorithm that enjoys polynomial sample complexity guarantees, where the polynomial has no dependence on the size of the state and action spaces. We also study a more challenging setting where the featurization of states for learning the combination is unknown and has to be discovered from a large set of candidate features.


## 2 Setting and Notation

We consider episodic Markov decision processes (MDPs). An MDP is specified by a tuple , where is the state space and is the action space. denotes the transition kernel describing the system dynamics and is the per-timestep reward function . The agent interacts with the environment for a fixed number of timesteps, , which determines the horizon of the problem. The initial state distribution is . The agent’s goal is to find a policy which maximizes the value of the policy: where the value function at step is defined as:

Here we use “” to imply that the sequence of states are generated according to the dynamics of . A policy is said to be optimal for if it maximizes the value . We denote such a policy as and its value as . We use to denote the model of the true environment, and use and as shorthand for and , respectively.

In our setting, the agent is given access to a set of base MDPs . They share the same , and only differ in and . In addition, a feature map is given which maps state-action pairs to -dimensional real vectors. Given these two objects, we consider the class of all models which can be obtained from the following state-dependent linear combination of the base models:

###### Definition 2.1 (Linear Combination of Base Models).

For given model ensemble and the feature map , we consider models parametrized by with the following transition and reward functions:

We will use to denote such a model for any parameter with .

For now, let’s assume that there exists some such that , i.e., the true environment can be captured by our model class; we will relax this assumption shortly.

To develop intuition, consider a simplified scenario where and . In this case, the matrix becomes a stochastic vector, and the true environment is approximated by a linear combination of the base models.

###### Example 2.2 (Global convex combination of models).

If the base models are combined using a set of constant weights , then this is a special case of Definition 2.1 where and each state’s feature vector is .

In the more general case of , we allow the combination weights to be a linear transformation of the features, which are , and hence obtain more flexibility in choosing different combination weights in different regions of the state-action space. A special case of this more general setting is when corresponds to a partition of the state-action space into multiple groups, and the linear combination coefficients are constant within each group.

###### Example 2.3 (State space partition).

Let be a partition (i.e., are disjoint). Let for all where is the indicator function. This satisfies the condition that , and when combined with a set of base models, forms a special case of Definition 2.1.
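The two examples can be made concrete with a small tabular sketch. The notation here is an assumption, since the symbols are not legible in this copy: `W` is a matrix with one row per base model and one stochastic column per feature, `phi[s, a]` is a probability vector, and the mixture weights at a state-action pair are `W @ phi[s, a]`. The array layout and the helper name `mix_transitions` are hypothetical. Rewards would be combined with the same weights.

```python
import numpy as np

def mix_transitions(base_P, W, phi):
    """Combine K base transition kernels with state-action-dependent weights.

    base_P: (K, S, A, S) array; base_P[k, s, a] is a next-state distribution.
    W:      (K, d) array; each column is a probability vector over base models.
    phi:    (S, A, d) array; each phi[s, a] is a probability vector.
    Returns the mixed kernel of shape (S, A, S).
    """
    weights = np.einsum("kd,sad->sak", W, phi)        # weights[s, a] = W @ phi[s, a]
    return np.einsum("sak,ksat->sat", weights, base_P)

rng = np.random.default_rng(0)
K, S, A = 2, 4, 2
base_P = rng.dirichlet(np.ones(S), size=(K, S, A))    # shape (K, S, A, S)

# Global convex combination: a single constant feature, W is one stochastic column.
phi_global = np.ones((S, A, 1))
W_global = np.array([[0.3], [0.7]])
P_global = mix_transitions(base_P, W_global, phi_global)

# Partition: indicator features of two groups of states, one W column per group.
group = np.array([0, 0, 1, 1])                        # group index of each state
phi_part = np.zeros((S, A, 2))
phi_part[np.arange(S), :, group] = 1.0
W_part = np.array([[1.0, 0.2], [0.0, 0.8]])
P_part = mix_transitions(base_P, W_part, phi_part)
```

With indicator features, every state-action pair in a group simply reads off that group's column of `W`, which is why the partition case reduces to constant weights within each group.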

##### Goal

We consider the popular PAC learning objective: with probability at least , the algorithm should output a policy with value by collecting episodes of data. Importantly, here the sample complexity is not allowed to depend on or . However, the assumption that lies in the class of linear models can be limiting and, therefore, we will allow some approximation error in our setting as follows:

(1) | ||||

(2) |

We denote the optimal parameter attaining this value by . The case of represents the *realizable* setting where for some . When , we cannot guarantee returning a policy with the value close to , and will have to pay an additional penalty term proportional to the approximation error , as is standard in RL theory.

##### Further Notations

Let be a shorthand for , the optimal policy in . When referring to value functions and state-action distributions, we will use the superscript to specify the policy and use the subscript to specify the MDP in which the policy is evaluated. For example, we will use to denote the value of (the optimal policy for model ) when evaluated in model starting from timestep . The term denotes the state-action distribution induced by policy at timestep in the MDP . Furthermore, we will write and when the evaluation environment is . For conciseness, and will denote the optimal (state- and Q-) value functions in model at step (e.g., ). The expected return of a policy in model is defined as:

(3) |

We assume that the total reward lies in almost surely in all MDPs of interest and under all policies. Further, whenever used, any value function at step (e.g., ) evaluates to for any policy and any model.
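For a tabular model, the optimal value functions and the expected return described above can be computed by backward induction. The sketch below is illustrative only: it assumes timesteps indexed 1 through `H`, a value function that is identically zero at step `H + 1`, and per-step rewards scaled so that the total reward stays in the unit interval. The function names are hypothetical.

```python
import numpy as np

def optimal_values(P, R, H):
    """Backward induction for a tabular episodic MDP.

    P: (S, A, S) transition kernel, R: (S, A) expected rewards, H: horizon.
    Returns V of shape (H + 2, S) and Q of shape (H + 1, S, A), indexed by
    timestep h = 1..H; V[H + 1] is identically zero.
    """
    S, A, _ = P.shape
    V = np.zeros((H + 2, S))
    Q = np.zeros((H + 1, S, A))
    for h in range(H, 0, -1):
        Q[h] = R + P @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return V, Q

def expected_return(P, R, H, policy, init_dist):
    """Expected return of a deterministic non-stationary policy.

    policy[h][s] is the action taken in state s at timestep h (h = 1..H).
    """
    S = P.shape[0]
    states = np.arange(S)
    V = np.zeros((H + 2, S))
    for h in range(H, 0, -1):
        a = policy[h]
        V[h] = R[states, a] + P[states, a] @ V[h + 1]
    return init_dist @ V[1]

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A)) / H          # keeps the total reward in [0, 1]
mu = np.full(S, 1.0 / S)

V, Q = optimal_values(P, R, H)
greedy = Q.argmax(axis=2)                 # optimal policy of this model
v_opt = mu @ V[1]
v_greedy = expected_return(P, R, H, greedy, mu)
```

Evaluating the greedy policy of one model inside a different model only requires passing that other model's `P` and `R` to `expected_return`, which mirrors the superscript/subscript convention in the text.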

## 3 Related Work

##### MDPs with low-rank transition matrices

Yang and Wang (2019, 2019); Jin et al. (2019) have recently considered structured MDPs whose transition matrices admit a low-rank factorization, where the left matrix in the factorization is known to the learner as state-action features (corresponding to our ). Their environmental assumption is a special case of ours, where the transition dynamics of each base model is *independent* of and , i.e., each base MDP can be fully specified by a single density distribution over . This special case enjoys many nice properties: for example, the value function of any policy is also linear in state-action features, and the linear value-function class is closed under the Bellman update operators. These properties are heavily exploited in their algorithms and analyses. In contrast, none of these properties hold under our more general setup, yet we are still able to provide sample efficiency guarantees. That said, we do note that the special case allows these recent works to obtain stronger results: their algorithms are both statistically and computationally efficient (ours is only statistically efficient), and some of these algorithms work without knowing the base distributions.^1

Footnote 1: In our setting, not knowing the base models immediately leads to hardness of learning, as it is equivalent to learning a general MDP without any prior knowledge even when . This requires sample complexity (Azar et al., 2012), which is vacuous as we are interested in solving problems with arbitrarily large state and action spaces.

##### Contextual MDPs

Abbasi-Yadkori and Neu (2014); Modi et al. (2018); Modi and Tewari (2019) consider a setting similar to our Example 2.2, except that the linear combination coefficients are visible to the learner and the base models are unknown. Therefore, despite the similarity in environmental assumptions, the learning objectives and the resulting sample complexities are significantly different (e.g., their guarantees depend on and ).

##### Bellman rank

Jiang et al. (2017) have identified a structural parameter called Bellman rank for exploration under general value-function approximation, and devised an algorithm called OLIVE whose sample complexity is polynomial in the Bellman rank. A related notion is the witness rank (the model-based analog of Bellman rank) proposed by Sun et al. (2019). While our algorithm and analyses draw inspiration from these works, our setting does not obviously yield low Bellman rank or witness rank.^2 We will also provide a more detailed comparison to Sun et al. (2019), whose algorithm is most similar to ours among the existing works, in Section 4.

Footnote 2: In contrast, the low-rank MDPs considered by Yang and Wang (2019, 2019); Jin et al. (2019) do admit low Bellman rank and low witness rank.

##### Mixtures/ensembles of models

The closest work to our setting is the multiple model-based RL (MMRL) architecture proposed by Doya et al. (2002), where they also decompose a given domain as a convex combination of multiple models. However, instead of learning the combination coefficients for a given ensemble, their method trains the model ensemble and simultaneously learns a mixture weight for each *base model* as a function of state features. Their experiments demonstrate that each model specializes in a different region of the state space where the environment dynamics are predictable, thereby providing a justification for using a convex combination of models for simulation. Further, the idea of combining different models is inherently present in Bayesian learning methods where a posterior approximation of the real environment is iteratively refined using interaction data. For instance, Rajeswaran et al. (2017) introduce the EPOpt algorithm which uses an ensemble of simulated domains to learn robust and generalizable policies. During learning, they adapt the ensemble distribution (convex combination) over source domains using data from the target domain to progressively make it a better approximation. Similarly, Lee et al. (2019) combine a set of parameterized models by adaptively refining the mixture distribution over the latent parameter space. Here, we study a relatively simpler setting where a finite number of such base models are combined and give a frequentist sample complexity analysis for our method.

## 4 Algorithm and Main Results

In this section we introduce the main algorithm that learns a near-optimal policy in the aforementioned setup with a sample complexity. We will first give the intuition behind the algorithm, and then present the formal sample complexity guarantees. Due to space constraints, we present the complete proof in the appendix. For simplicity, we will describe the intuition for the realizable case with (). The pseudocode (Algorithm 1) and the results are, however, stated for the general case of .

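**Algorithm 1** (pseudocode not recoverable from the source). Per the surrounding text, each iteration computes the *optimistic model* over the current version space (Line 3), collects trajectories with the exploration policy (Line 8), and updates the version space using the estimates and constraint in Eqs.(5)-(7).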

At a high level, our algorithm proceeds in iterations , and gradually refines a *version space* of plausible parameters. Our algorithm follows an *explore-or-terminate* template and in each iteration, either chooses to explore with a carefully chosen policy or terminates with a near-optimal policy. For exploration in the -th iteration, we collect trajectories following some exploration policy (Line 8). A key component of the algorithm is to extract knowledge about from these trajectories. In particular, for every , the bag of samples may be viewed as an unbiased draw from the following distribution

(8) |

The situation for rewards is similar and will be omitted in the discussion. So in principle we could substitute in Eq.(8) with any candidate , and if the resulting distribution differs significantly from the real samples , we can assert that and eliminate from the version space. However, the state space can be arbitrarily large in our setting, and comparing state distributions directly can be intractable. Instead, we project the state distribution in Eq.(8) using a (non-stationary) discriminator function (which will be chosen later) and consider the following scalar property

(9) |

which can be effectively estimated by

(10) |

Since we have projected states onto , Eq.(10) is the average of scalar random variables and enjoys state-space-independent concentration. Now, in order to test the validity of a parameter in a given version space, we compare the estimate in Eq.(10) with the prediction given by , which is:

(11) |

As we consider a linear model class, by using linearity of expectations, Eq.(11) may also be written as:

(12) | ||||

(13) |

where denotes for any two matrices and . In Eq.(13), is a function that maps to a dimensional vector with each entry being

(14) |

The intuition behind Eq.(12) is that for each fixed state-action pair , the expectation in Eq.(9) can be computed by first taking expectation of over the reward and transition distributions of each of the base models (which gives ) and then aggregating the results using the combination coefficients. Rewriting Eq.(12) as Eq.(13), we see that Eq.(9) can also be viewed as a linear measure of , where the measurement matrix is again . Therefore, by estimating this measurement matrix and the outcome (Eq.(10)), we obtain an approximate linear equality constraint over and can eliminate all candidates that violate such constraints. By using a finite sample concentration bound over the inner product, we get a linear inequality constraint to update the version space (Eq.(7)).
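The rewriting step described above can be checked numerically on a toy instance. In this sketch all inputs are synthetic and all names are hypothetical: `f_bar[i, k]` stands for the expectation of the discriminator under base model `k` at the `i`-th state-action pair, `phi[i]` for that pair's feature vector, and the matrix inner product is taken to be the trace form `trace(W.T @ M)`, which is an assumption about the elided notation.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, n_pairs = 3, 4, 5      # base models, feature dimension, state-action pairs

sa_dist = rng.dirichlet(np.ones(n_pairs))        # state-action distribution
phi = rng.dirichlet(np.ones(d), size=n_pairs)    # (n_pairs, d), simplex-valued
f_bar = rng.uniform(size=(n_pairs, K))           # per-base-model expectations
W = rng.dirichlet(np.ones(K), size=d).T          # (K, d), stochastic columns

# Direct form: mix the K expectations with weights W @ phi(s, a),
# then average over state-action pairs.
direct = sum(p * (W @ phi[i]) @ f_bar[i] for i, p in enumerate(sa_dist))

# Linear-measurement form: one K x d matrix, then an inner product with W.
measurement = np.einsum("i,ik,id->kd", sa_dist, f_bar, phi)

def predict(W_candidate):
    """Scalar predicted by a candidate parameter for this measurement."""
    return np.trace(W_candidate.T @ measurement)

def consistent(W_candidate, observed, tol):
    """Keep a candidate only if its prediction is within tol of the observation."""
    return abs(predict(W_candidate) - observed) <= tol

linear = predict(W)
```

Because the prediction is linear in the candidate parameter, each exploration round contributes one linear constraint, and `consistent` is the corresponding membership test for a single candidate.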

The remaining concern is to choose the exploration policy and the discriminator function to ensure that the linear constraint induced in each iteration is significantly different from the previous ones and induces deep cuts in the version space. We guarantee this by choosing and ,^3 where is the *optimistic model* as computed in Line 3. That is, predicts the highest optimal value among all candidate models in . Following a terminate-or-explore argument, we show that as long as is suboptimal, the linear constraint induced by our choice of and will significantly reduce the volume of the version space, and the iteration complexity can be bounded as poly by an ellipsoid argument similar to that of Jiang et al. (2017). Similarly, the sample size needed in each iteration only depends polynomially on and and incurs no dependence on or , as we have summarized high-dimensional objects such as (function over states) using low-dimensional quantities such as (vector of length ).

Footnote 3: We use the simplified notation for .
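To illustrate why averaging a bounded scalar gives a sample size that is independent of the number of states and actions, the sketch below uses Hoeffding's inequality. This is a stand-in chosen for illustration, not the concentration bound used in the paper's analysis, and `value_range` (the width of the interval containing the scalar) is an assumed parameter.

```python
import math

def hoeffding_sample_size(eps, delta, value_range=1.0):
    """Samples sufficient for the empirical mean of i.i.d. variables in an
    interval of width value_range to be within eps of its expectation with
    probability at least 1 - delta (two-sided Hoeffding bound)."""
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

n = hoeffding_sample_size(eps=0.1, delta=0.05)   # 185
```

The returned count depends only on the accuracy, the confidence level, and the range of the scalar; no quantity describing the size of the state or action space enters the formula.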

The bound on the number of iterations and the number of samples needed per iteration leads to the following sample complexity result:

###### Theorem 4.1 (PAC bound for Alg. 1).

In Algorithm 1, if and where , with probability at least , the algorithm terminates after using at most

(15) |

trajectories and returns a policy with a value .

By setting and to appropriate values, we obtain the following sample complexity bounds as corollaries:

###### Corollary 4.2 (Sample complexity for partitions).

Since the state-action partitioning setting (Example 2.3) is subsumed by the general setup, the sample complexity is again:

(16) |

###### Corollary 4.3 (Sample complexity for global convex combination).

When base models are combined without any dependence on state-action features (Example 2.2), the setting is a special case of the general setup with . Thus, the sample complexity is:

(17) |

Our algorithm, therefore, satisfies the requirement of learning a near-optimal policy without any dependence on the or . Moreover, we can also account for the approximation error but also incur a cost of in the performance guarantee of the final policy. As we use the projection of value functions through the linear model class, we do not model the complete dynamics of the environment. This leads to an additive loss of in value in addition to the best achievable value loss of (see Corollary A.4 in the appendix).

##### Comparison to OLIME (Sun et al., 2019)

Our Algorithm 1 shares some structural similarity with the OLIME algorithm proposed by Sun et al. (2019), but there are also several important differences. First of all, OLIME in each iteration will pick a time step and take uniformly random actions during data collection, and consequently incur polynomial dependence on in its sample complexity. In comparison, our main data collection step (Line 8) never takes a random deviation, and we do not pay any dependence on the cardinality of the action space. Secondly, similar to how we project the transition distributions onto a discriminator function (Eq.(8) and (9)), OLIME projects the distributions onto a *static discriminator class* and uses the corresponding integral probability metric (IPM) as a measure of model misfit. In our setting, however, we find that the most efficient and elegant way to extract knowledge from data is to use a *dynamic* discriminator function, , which changes from iteration to iteration and depends on the previously collected data. Such a choice of discriminator function allows us to make direct cuts on the parameter space , whereas OLIME can only make cuts in the value prediction space.

##### Computational Characteristics

In each iteration, our algorithm computes the optimistic policy within the version space. Therefore, we rely on access to the following *optimistic planning oracle*:

###### Assumption 4.4 (Optimistic planning oracle).

We assume that when given a version space , we can obtain the optimistic model through a single oracle call for .

It is important to note that any version space that we deal with is always an intersection of half-spaces induced by the linear inequality constraints. Therefore, one would hope to solve the optimistic planning problem in a computationally efficient manner given the nice geometrical form of the version space. However, even for a finite state-action space, we are not aware of any efficient solutions as the planning problem induces bilinear and non-convex constraints despite the linearity assumption. Many recently proposed algorithms also suffer from such a computational difficulty (Jiang et al., 2017; Dann et al., 2018; Sun et al., 2019).
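As a point of reference for what the oracle is asked to do, here is a brute-force sketch over a finite list of candidate parameters. This is not the paper's method: the actual version space is a continuous intersection of half-spaces, for which, as noted above, no efficient procedure is known. The tabular layout, the finite candidate list, and the function names are all assumptions.

```python
import numpy as np

def optimal_value(P, R, H, init_dist):
    """Optimal expected return of a tabular episodic MDP via backward induction."""
    V = np.zeros(P.shape[0])
    for _ in range(H):
        V = (R + P @ V).max(axis=1)
    return init_dist @ V

def optimistic_candidate(candidates, base_P, base_R, phi, H, init_dist):
    """Return the candidate whose induced model has the highest optimal value."""
    best_W, best_val = None, -np.inf
    for W in candidates:
        weights = np.einsum("kd,sad->sak", W, phi)          # W @ phi[s, a]
        P = np.einsum("sak,ksat->sat", weights, base_P)
        R = np.einsum("sak,ksa->sa", weights, base_R)
        val = optimal_value(P, R, H, init_dist)
        if val > best_val:
            best_W, best_val = W, val
    return best_W, best_val

rng = np.random.default_rng(0)
K, S, A, H = 2, 3, 2, 3
base_P = rng.dirichlet(np.ones(S), size=(K, S, A))
# Base model 0 pays 1/H per step everywhere, base model 1 pays nothing.
base_R = np.stack([np.full((S, A), 1.0 / H), np.zeros((S, A))])
phi = np.ones((S, A, 1))                                     # single constant feature
candidates = [np.array([[w], [1.0 - w]]) for w in (0.0, 0.5, 1.0)]
init_dist = np.full(S, 1.0 / S)

best_W, best_val = optimistic_candidate(candidates, base_P, base_R, phi, H, init_dist)
```

Each candidate requires one full planning call, so the cost of this enumeration grows with the number of candidates, which is what makes a fine discretization of the parameter space impractical.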

Further, we also assume that for any given , we can compute the optimal policy and its value function: our elimination criterion in Eq.(7) uses estimates and which in turn depend on the value function. This requirement corresponds to a standard planning oracle, and aligns with the motivation of our setting, as we can delegate these computations to any learning algorithm operating in the simulated environment with the given combination coefficient. Our algorithm, instead, focuses on careful and systematic exploration to minimize the sample complexity in the real world.

## 5 Model Selection With Candidate Partitions

In the previous section we showed that a near-optimal policy can be PAC-learned under our modeling assumptions, where the feature map is given along with the approximation error . In this section, we explore the more interesting and challenging setting where a realizable feature map is unknown, but we know that the realizable belongs to a candidate set , i.e., the true environment satisfies our modeling assumption in Definition 2.1 under for some with . Note that Definition 2.1 may be satisfied by multiple ’s; for example, adding redundant features to an already realizable still yields a realizable feature map. In such cases, we consider to be the most succinct feature map among all realizable ones, i.e., the one with the lowest dimensionality. Let denote the dimensionality of , and .

One obvious baseline in this setup is to run Algorithm 1 with each and select the best policy among the returned ones. This leads to a sample complexity of roughly (only the dependence on is considered), which can be very inefficient: when there exists such that , we pay for which is much greater than the sample complexity of ; when are relatively uniform, we pay a linear dependence on , preventing us from competing with a large set of candidate feature maps.

So the key result we want to obtain is a sample complexity that scales as , possibly with a mild multiplicative overhead dependence on and/or (e.g., and ).

##### Hardness Result for Unstructured

Unfortunately, we show that this is impossible when is unstructured via a lower bound. In the lower bound construction, we have an exponentially large set of candidate feature maps, all of which are state space partitions. Each of the partitions has trivial dimensionalities (, ), but the sample complexity of learning is exponential, which can only be explained away as .

###### Proposition 5.1.

For the aforementioned problem of learning an -optimal policy using a candidate feature set of size, no algorithm can achieve sample complexity for any constant .

On a separate note, besides providing formal justification for the structural assumption we will introduce later, this proposition is of independent interest as it also sheds light on the hardness of model selection with state abstractions. We discuss the further implications in Appendix B.

###### Proof of Proposition 5.1.

We construct a linear class of MDPs with two base models and in the following way: Consider a complete tree of depth with a branching factor of . The vertices form the state space of and , and the two outgoing edges at each state are the available actions. Both MDPs share the same deterministic transitions and each non-leaf node yields reward. Every leaf node yields reward in and in . Now we construct a candidate partition set of size : for , the -th leaf node belongs to one equivalence class while all other leaf nodes belong to the other. (Non-leaf nodes can belong to either class as and agree on their transitions and rewards.)

Observe that the above model class contains a finite family of MDPs, each of which only has 1 rewarding leaf node. Concretely, the MDP whose -th leaf is rewarding is exactly realized under the feature map , whose corresponding is the identity matrix: the -th leaf yields reward as in , and all other leaves yield reward as in . Learning in this family of MDPs is provably hard (Krishnamurthy et al., 2016), as when the rewarding leaf is chosen adversarially, the learner has no choice but to visit almost all leaf nodes to identify the rewarding leaf as long as is below a constant threshold. The proposition follows from the fact that in this setting , , is a constant, , but the sample complexity is . ∎

This lower bound shows the necessity of introducing structural assumptions in . Below, we consider a particular structure of *nested partitions* that is natural and enables sample-efficient learning. Similar assumptions have also been considered in the state abstraction literature (e.g., Jiang et al., 2015).
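The tree family used in the proof of Proposition 5.1 can be simulated to see how the number of episodes scales. The sketch assumes a complete binary tree of depth `H` in which exactly one leaf pays reward 1 and everything else pays 0; these specific values are assumptions, as the constants are not legible in this copy. The learner shown is a simple fixed-order sweep, so the code illustrates the counting argument and is not itself a lower-bound proof.

```python
from itertools import product

def leaf_index(actions):
    """Map a length-H sequence of binary actions to a leaf index in [0, 2**H)."""
    idx = 0
    for a in actions:
        idx = 2 * idx + a
    return idx

def episode_return(actions, rewarding_leaf):
    """Deterministic tree MDP: only the rewarding leaf pays 1."""
    return 1.0 if leaf_index(actions) == rewarding_leaf else 0.0

def episodes_to_find(H, rewarding_leaf):
    """Episodes used by a learner that sweeps the leaves in a fixed order."""
    for n, actions in enumerate(product((0, 1), repeat=H), start=1):
        if episode_return(actions, rewarding_leaf) == 1.0:
            return n
    raise ValueError("rewarding_leaf out of range")

H = 4
# An adversary places the reward at the leaf this learner tries last.
worst = max(episodes_to_find(H, leaf) for leaf in range(2 ** H))
```

Every unsuccessful episode rules out only the single leaf that was visited, so the worst case for this sweep equals the number of leaves, which doubles with each additional level of depth.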

##### Nested Partitions as a Structural Assumption

Consider the case where every is a partition. W.l.o.g. let . We assume is nested, meaning that ,

While this structural assumption almost allows us to develop sample-efficient algorithms, it is still insufficient as demonstrated by the following hardness result.

###### Proposition 5.2.

Fixing , there exist base models and and *nested* state space partitions and , such that it is information-theoretically impossible for any algorithm to obtain poly sample complexity when an adversary chooses an MDP that satisfies our environmental assumption (Definition 2.1) under either or .

###### Proof.

We will again use an exponential tree style construction to prove the lower bound. Specifically, we construct two MDPs and which are obtained by combining two base MDPs and using two different partitions and . The specification of and is exactly the same as in the proof of Proposition 5.1. We choose to be a partition of size , where all nodes are grouped together. has size , where each leaf node belongs to a separate group. (As before, which group the inner nodes belong to does not matter.) and are obviously nested. We construct that is realizable under by randomly choosing a leaf and setting the weights for the convex combination as for that leaf; for all other leaves, the weights are . This is equivalent to randomly choosing from a set of MDPs, each of which has only one *good* leaf node yielding a random reward drawn from instead of . In contrast, is such that all leaf nodes yield reward, which is realizable under with weights .

Observe that and are exactly the same as the constructions in the proof of the multi-armed bandit lower bound by Auer et al. (2002) (the number of arms is ), where it has been shown that distinguishing between and takes samples. Now assume towards contradiction that there exists an algorithm that achieves poly complexity; let be the specific polynomial in its guarantee. After trajectories are collected, the algorithm must stop if the true environment is to honor the sample complexity guarantee (since , ), and proceed to collect more trajectories if is the true environment (since ). Making this decision essentially requires distinguishing between and using trajectories, which contradicts the known hardness result from Auer et al. (2002). This proves the statement. ∎

Essentially, the lower bound creates a situation where , and nature may adversarially choose a model such that either or is realizable. If is realizable, the learner is only allowed a small sample budget and cannot fully explore with , and if is not realizable the learner must do the opposite. The information-theoretic lower bound shows that it is fundamentally hard to distinguish between the two situations: once the learner explores with , she cannot decide whether she should stop or move on to without collecting a large amount of data.

This hardness result motivates our last assumption in this section, that the learner knows the value of (a scalar) as side information. This way, the learner can compare the value of the returned policy in each round to and effectively decide when to stop. This naturally leads to our Algorithm 2 that uses a doubling scheme over , with the following sample complexity guarantee.
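The control flow of such a doubling scheme can be sketched as follows. Algorithm 2's pseudocode is not reproduced in this text, so everything concrete here is an assumption: the callables `run_base_learner` (standing in for a budgeted call to Algorithm 1), `estimate_value` (a Monte-Carlo evaluation), and `budget_for`, as well as the stopping threshold `v_star - tolerance`.

```python
def select_partition(partition_sizes, run_base_learner, estimate_value,
                     budget_for, v_star, tolerance):
    """Try nested partitions in doubling order of size.

    partition_sizes: sizes of the candidate partitions, coarsest first.
    run_base_learner(i, budget): returns a policy, or None if the budget runs out.
    estimate_value(policy): estimated value of the policy in the real environment.
    Returns (policy, index) on success, or (None, -1) if no candidate passes.
    """
    tried = set()
    target, max_size = 1, max(partition_sizes)
    while True:
        eligible = [i for i, size in enumerate(partition_sizes) if size <= target]
        if eligible:
            # Largest partition that fits under the current size target.
            i = max(eligible, key=lambda j: partition_sizes[j])
            if i not in tried:
                tried.add(i)
                policy = run_base_learner(i, budget_for(partition_sizes[i]))
                # Known optimal value lets the learner decide when to stop.
                if policy is not None and estimate_value(policy) >= v_star - tolerance:
                    return policy, i
        if target >= max_size:
            return None, -1
        target *= 2

# Stub usage with made-up values: only the finest partition is good enough.
sizes = [1, 3, 8]
calls = []

def run_stub(i, budget):
    calls.append((i, budget))
    return f"policy_{i}"

values = {"policy_0": 0.5, "policy_1": 0.5, "policy_2": 0.95}
policy, idx = select_partition(sizes, run_stub, values.get,
                               lambda size: 100 * size, v_star=1.0, tolerance=0.1)
```

Because the size target doubles each round, the budget spent on partitions that are too coarse forms a geometric series dominated by the last round, which is the usual reason for preferring doubling over trying every candidate in turn.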

###### Theorem 5.3.

When Algorithm 2 is run with the input , with probability at least , it returns a near-optimal policy with using at most samples.

###### Proof.

In Algorithm 2, for each partition , we run Algorithm 1 until termination or until the sample budget is exhausted. By union bound it is easy to verify that with probability at least , all calls to Algorithm 1 will succeed and the Monte-Carlo estimation of the returned policies will be -accurate, and we will only consider this success event in the rest of the proof. When the partition under consideration is realizable, we get , therefore

so the algorithm will terminate after considering a realizable . Similarly, whenever the algorithm terminates, we have . This is because

where the last inequality holds thanks to the termination condition of Algorithm 2, which relies on knowledge of . The total number of iterations of the algorithm is at most . Therefore, by taking a union bound over all possible iterations, the sample complexity is

##### Discussion.

Model selection in online learning, especially in the context of sequential decision making, is generally considered very challenging. Work on the generic setting has been relatively limited, with recent progress only in some special cases. For instance, Foster et al. (2019) consider the model selection problem in linear contextual bandits with a sequence of nested policy classes with dimensions . They consider a similar goal of achieving sub-linear regret bounds which only scale with the optimal dimension . In contrast to our result, they do not need to know the achievable value in the environment and give no-regret learning methods in the *knowledge-free* setting. However, this is not contradictory to our lower bound: due to the extremely delayed reward signal, our construction is equivalent to a multi-armed bandit problem with arms. Our negative result (Proposition 5.2) shows a lower bound on sample complexity which is exponential in the horizon, therefore eliminating the possibility of sample-efficient and knowledge-free model selection in MDPs.

## 6 Conclusion

In this paper, we proposed a sample-efficient model-based algorithm which learns a near-optimal policy by approximating the true environment via a feature-dependent convex combination of a given ensemble. Our algorithm offers a sample complexity bound which is independent of the size of the environment and only depends on the number of parameters being learnt. In addition, we also consider a model selection problem, show exponential lower bounds and then give sample-efficient methods under natural assumptions. The proposed algorithm and its analysis rely on a linearity assumption and share this aspect with existing exploration methods for rich observation MDPs. We leave the possibility of considering a richer class of convex combinations to future work. Lastly, our work also revisits the open problem of coming up with a computationally and sample-efficient model-based learning algorithm.

#### Acknowledgements

This work was supported in part by a grant from the Open Philanthropy Project to the Center for Human-Compatible AI, and in part by NSF grant CAREER IIS-1452099. AT would also like to acknowledge the support of a Sloan Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.

## References

- Online learning in MDPs with side information. arXiv preprint arXiv:1406.6812.
- Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley.
- Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
- Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
- On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning, pp. 1707–1714.
- Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234.
- Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.
- On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pp. 1422–1432.
- Multiple model-based reinforcement learning. Neural Computation 14 (6), pp. 1347–1369.
- Model selection for contextual bandits. arXiv preprint arXiv:1906.00531.
- Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147 (1-2), pp. 163–223.
- Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1704–1713.
- Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179–188.
- Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
- Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232.
- PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems 29, pp. 1840–1848.
- Model-ensemble trust-region policy optimization. In International Conference on Learning Representations.
- Bayesian policy optimization for model uncertainty. In International Conference on Learning Representations.
- Towards a unified theory of state abstraction for MDPs. In ISAIM.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- Markov decision processes with continuous side information. In Algorithmic Learning Theory, pp. 597–618.
- Contextual Markov decision processes using generalized linear models. arXiv preprint arXiv:1903.06187.
- Combining model-based policy search with online model learning for control of physical humanoids. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 242–248.
- EPOpt: learning robust neural network policies using model ensembles. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
- Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA, pp. 2898–2933.
- Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.
- MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- Approximations of dynamic programs, I. Mathematics of Operations Research 3 (3), pp. 231–243.
- Reinforcement leaning in feature space: matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
- Sample-optimal parametric Q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6995–7004.

## Appendix A Proofs from the main text

In this section, we provide a detailed proof as well as the key ideas used in the analysis. The proof follows an optimism-based template which guarantees that the algorithm either terminates with a near-optimal policy or explores appropriately in the environment. This yields a polynomial sample complexity bound, because the algorithm explores for a bounded number of iterations and the number of samples required in each iteration is polynomial in the desired parameters. Section A.1 presents the key lemmas used in the analysis, and Section A.2 gives the final proof of the main theorem.

##### Notation.

As in the main text, we use for . The notation denotes the Frobenius norm . For any matrix in with columns , we will use as the group norm: .

### A.1 Key lemmas used in the analysis

For our analysis, we first define a term for any parameter which intuitively quantifies the model error at step :

(18) |

We start with the following lemma which allows us to express the value loss by using a model in terms of these per-step quantities.

###### Lemma A.1 (Value decomposition).

For any , we can write the difference in two values:

(19) |

###### Proof.

We start with the value difference on the lhs:

Unrolling the second expected value similarly till leads to the desired result. ∎
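The telescoping argument behind a decomposition of this kind can be checked numerically. The sketch below is not the paper's eq. 19 (whose per-step term from eq. 18 is not reproduced here); it verifies the standard identity under illustrative assumptions: a small tabular MDP, a reward function shared by the true MDP and the model, and a deterministic step-dependent policy. All names (`policy_values`, `per_step_model_errors`, `P_true`, `P_model`) are hypothetical.

```python
import numpy as np

def policy_values(P, r, pi, H):
    """Backward induction: V[h, s] is the value of policy pi from state s at step h."""
    S = r.shape[0]
    idx = np.arange(S)
    V = np.zeros((H + 1, S))          # V[H] = 0 by convention
    for h in range(H - 1, -1, -1):
        a = pi[h]                     # action taken in each state at step h, shape (S,)
        V[h] = r[idx, a] + P[idx, a] @ V[h + 1]
    return V

def per_step_model_errors(P_true, P_model, r, pi, H, s0):
    """Expected one-step model error at each step, with states drawn from the TRUE MDP."""
    S = r.shape[0]
    idx = np.arange(S)
    V_model = policy_values(P_model, r, pi, H)
    d = np.zeros(S)
    d[s0] = 1.0                       # state distribution at step h under P_true and pi
    errors = []
    for h in range(H):
        a = pi[h]
        # Difference of next-state expectations of the model's value function.
        gap = (P_model[idx, a] - P_true[idx, a]) @ V_model[h + 1]
        errors.append(d @ gap)
        d = d @ P_true[idx, a]        # roll the state distribution forward in the true MDP
    return np.array(errors)

rng = np.random.default_rng(0)
S, A, H, s0 = 5, 3, 6, 0
P_true = rng.dirichlet(np.ones(S), size=(S, A))    # shape (S, A, S)
P_model = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))                       # shared reward (assumption)
pi = rng.integers(A, size=(H, S))                  # deterministic, step-dependent policy

lhs = policy_values(P_model, r, pi, H)[0, s0] - policy_values(P_true, r, pi, H)[0, s0]
rhs = per_step_model_errors(P_true, P_model, r, pi, H, s0).sum()
print(abs(lhs - rhs))
```

The value gap between model and true MDP equals the sum of per-step errors only when the expectation at each step is taken over states visited in the true MDP, which is why `d` is propagated with `P_true` rather than `P_model`.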

At various places in our analysis, we will use the well-known simulation lemma to compare the value of a policy across two MDPs:

###### Lemma A.2 (Simulation Lemma (Kearns and Singh, 2002; Modi et al., 2018)).

Let and be two MDPs with the same state-action space. If the transition dynamics and reward functions of the two MDPs are such that:

then, for every policy , we have:

(20) |
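To make the simulation lemma concrete, the following sketch checks one standard finite-horizon form of it on random tabular MDPs. It assumes rewards in $[0,1]$, a horizon $H$, a per-pair reward gap of at most `eps_r`, and an $\ell_1$ transition gap of at most `eps_p`; under these assumptions the value gap of any policy is at most $H\,\epsilon_r + \epsilon_p\,H(H-1)/2$. The constants in eq. 20 may differ from this form, and all names in the code are hypothetical.

```python
import numpy as np

def policy_value(P, r, pi, H):
    """Finite-horizon value of a deterministic step-dependent policy, for every start state."""
    S = r.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)
    for h in range(H - 1, -1, -1):
        a = pi[h]
        V = r[idx, a] + P[idx, a] @ V
    return V

def simulation_bound(eps_p, eps_r, H):
    """One standard bound: each step adds eps_r plus eps_p times the remaining value range."""
    return H * eps_r + eps_p * H * (H - 1) / 2

rng = np.random.default_rng(1)
S, A, H = 6, 3, 8

# First MDP: random transitions, rewards in [0, 1).
P1 = rng.dirichlet(np.ones(S), size=(S, A))
r1 = rng.uniform(size=(S, A))

# Second MDP: a small perturbation of the first (rewards clipped back into [0, 1]).
P2 = 0.95 * P1 + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
r2 = np.clip(r1 + rng.uniform(-0.02, 0.02, size=(S, A)), 0.0, 1.0)

# Worst-case gaps over all state-action pairs.
eps_p = np.abs(P1 - P2).sum(axis=-1).max()   # l1 distance between next-state distributions
eps_r = np.abs(r1 - r2).max()

pi = rng.integers(A, size=(H, S))
gap = np.abs(policy_value(P1, r1, pi, H) - policy_value(P2, r2, pi, H)).max()
bound = simulation_bound(eps_p, eps_r, H)
print(gap, bound)
```

The bound is typically loose on random instances, since it charges the worst-case transition gap at every step against the largest possible remaining value.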

Now, we will first use the assumption about linearity to prove the following key lemma of our analysis:

###### Lemma A.3 (Decomposition of ).

If is the approximation error defined in eq. 2, then the quantity can be bounded as follows:

(21) |

where is a vector with the entry as .

###### Proof.

Using the definition of from eq. 18, we rewrite the term as:
