# Model-Based Imitation Learning with Accelerated Convergence

Sample efficiency is critical in solving real-world reinforcement learning problems, where agent-environment interactions can be costly. Imitation learning from expert advice has proved to be an effective strategy for reducing the number of interactions required to train a policy. Online imitation learning, a specific type of imitation learning that interleaves policy evaluation and policy optimization, is a particularly effective framework for training policies with provable performance guarantees. In this work, we seek to further accelerate the convergence rate of online imitation learning, making it more sample efficient. We propose two model-based algorithms inspired by Follow-the-Leader (FTL) with prediction: MoBIL-VI based on solving variational inequalities and MoBIL-Prox based on stochastic first-order updates. When a dynamics model is learned online, these algorithms can provably accelerate the best known convergence rate up to an order. Our algorithms can be viewed as a generalization of stochastic Mirror-Prox by Juditsky et al. (2011), and admit a simple constructive FTL-style analysis of performance. The algorithms are also empirically validated in simulation.

## 1 Introduction

Imitation learning (IL) has recently received attention for its ability to speed up policy learning in reinforcement learning (RL) problems [1, 2, 3, 4, 5, 6]. Unlike pure RL techniques, which rely on uninformed random exploration to locally improve a policy, IL leverages prior knowledge about a problem in terms of expert demonstrations. At a high level, this additional information provides policy learning with an informed search direction toward the expert policy.

The goal of IL is to quickly learn a policy that can perform at least as well as the expert policy. Because the expert policy may be suboptimal with respect to the RL problem of interest, IL is often used to provide a good warm start to the RL problem, so that the number of interactions with the environment can be minimized. Sample efficiency is especially critical when learning is deployed in applications like robotics, where every interaction incurs real-world costs.

By reducing IL to an online learning problem, online IL [2] provides a framework for convergence analysis and mitigates the covariate shift problem encountered in batch IL [7, 8]. In particular, under proper assumptions, the performance of a policy sequence updated by Follow-the-Leader (FTL) can converge on average to the performance of the expert policy [2]. Recently, it was shown that this rate is sufficient to make IL more efficient than solving an RL problem from scratch [9].

In this work, we further accelerate the convergence rate of online IL. Inspired by the observation of Cheng and Boots [10] that the online learning problem of IL is not truly adversarial, we propose two MOdel-Based IL (MoBIL) algorithms, MoBIL-VI and MoBIL-Prox, that can achieve a fast rate of convergence. Under the same assumptions as Ross et al. [2], these algorithms improve on-average convergence to $O(1/N^2)$, e.g., when a dynamics model is learned online, where $N$ is the number of iterations of policy update.

The improved speed of our algorithms is attributed to using a model oracle to predict the gradient of the next per-round cost in online learning. This model can be realized, e.g., using a simulator based on a (learned) dynamics model, or using past demonstrations. We first conceptually show that this idea can be realized as a variational inequality problem in MoBIL-VI. Next, we propose a practical first-order stochastic algorithm, MoBIL-Prox, which alternates between the steps of taking the true gradient and of taking the model gradient. MoBIL-Prox is a generalization of stochastic Mirror-Prox proposed by Juditsky et al. [11] to the case where the problem is weighted and the vector field is unknown but learned online. In theory, we show that a non-uniform weighting scheme is pivotal to speeding up convergence, and this generalization is made possible by a new constructive FTL-style regret analysis, which greatly simplifies the original algebraic proof [11]. The performance of MoBIL-Prox is also empirically validated in simulation.

## 2 Preliminaries

### 2.1 Problem Setup: RL and IL

Let $\mathcal{S}$ and $\mathcal{A}$ be the state and action spaces, respectively. The objective of RL is to search for a stationary policy $\pi$ inside a policy class $\Pi$ with good performance. This can be characterized by the stochastic optimization problem with expected cost (our definition of $J$ corresponds to the average accumulated cost in the RL literature) defined below:

 $$\min_{\pi\in\Pi} J(\pi), \qquad J(\pi) \coloneqq \mathbb{E}_{(s,t)\sim d_\pi}\,\mathbb{E}_{a\sim\pi_s}\left[c_t(s,a)\right], \tag{1}$$

in which $s \in \mathcal{S}$, $a \in \mathcal{A}$, $c_t$ is the instantaneous cost at time $t$, $d_\pi$ is a generalized stationary distribution induced by executing policy $\pi$, and $\pi_s$ is the distribution of action $a$ given state $s$ of $\pi$. The policies here are assumed to be parametric. To make the writing compact, we will abuse the notation $\pi$ to also denote its parameter, and assume $\Pi$ is a compact convex subset of parameters in some normed space with norm $\|\cdot\|$.

Based on the abstracted distribution $d_\pi$, the formulation in (1) subsumes multiple discrete-time RL problems. For example, a $\gamma$-discounted infinite-horizon problem can be considered by setting $c_t = c$ as a time-invariant cost and defining the joint distribution $d_\pi(s,t) = (1-\gamma)\gamma^t d_{\pi,t}(s)$, in which $d_{\pi,t}$ denotes the probability (density) of state $s$ at time $t$ under policy $\pi$. Similarly, a $T$-horizon RL problem can be considered by setting $d_\pi(s,t) = \frac{1}{T}d_{\pi,t}(s)$. Note that while we use the notation $a \sim \pi_s$, the policy is allowed to be deterministic; in this case, the notation means evaluation. For notational compactness, we will often omit the random variable inside the expectation (e.g. we shorten (1) to $\mathbb{E}_{d_\pi}\mathbb{E}_{\pi}[c_t]$). In addition, we denote $Q_{\pi,t}$ as the Q-function at time $t$ with respect to $\pi$ (the expected future accumulated cost of executing $\pi$ after taking action $a$ at state $s$ at time $t$).

In this paper, we consider IL, which is an indirect approach to solving the RL problem. We assume there is a black-box oracle $\pi^*$, called the expert policy, from which a demonstration $a^* \sim \pi^*_s$ can be queried for any state $s \in \mathcal{S}$. To satisfy the querying requirement, usually the expert policy is an algorithm; for example, it can represent a planning algorithm which solves a simplified version of (1), or some engineered, hard-coded policy (see e.g. [12]).

The purpose of incorporating the expert policy into solving (1) is to quickly obtain a policy that has reasonable performance. Toward this end, we consider solving a surrogate problem of (1),

 $$\min_{\pi\in\Pi}\; \mathbb{E}_{(s,t)\sim d_\pi}\left[D(\pi^*_s\,\|\,\pi_s)\right], \tag{2}$$

where $D(\cdot\|\cdot)$ is a function that measures the difference between two distributions over actions (e.g. KL divergence; see Appendix B). Importantly, the objective in (2) has the property that $D(\pi^*_s\|\pi^*_s) = 0$ for all $s$, and there is a constant $C_{\pi^*}$ such that for all $\pi \in \Pi$, $s \in \mathcal{S}$, and $t \in \mathbb{N}$, it satisfies $\mathbb{E}_{a\sim\pi_s}[Q_{\pi^*,t}(s,a)] - \mathbb{E}_{a\sim\pi^*_s}[Q_{\pi^*,t}(s,a)] \le C_{\pi^*} D(\pi^*_s\|\pi_s)$, in which $\mathbb{N}$ denotes the set of natural numbers. By the Performance Difference Lemma [13], it can be shown that the inequality above implies [10],

 $$J(\pi) - J(\pi^*) \le C_{\pi^*}\,\mathbb{E}_{d_\pi}\left[D(\pi^*\|\pi)\right]. \tag{3}$$

Therefore, solving (2) can lead to a policy that performs similarly to the expert policy $\pi^*$.

### 2.2 Imitation Learning as Online Learning

The surrogate problem in (2) is more structured than the original RL problem in (1). In particular, the distance-like function $D$ is given, and we know that $D(\pi^*_s\|\pi_s)$ is close to zero when $\pi$ is close to $\pi^*$. On the contrary, the cost in (1) generally can still be large, even if $\pi$ is a good policy (since it also depends on the state). This normalization property is crucial for the reduction from IL to online learning [10].

The reduction is based on observing that, with the normalization property, the expressiveness of the policy class $\Pi$ can be described with a constant $\epsilon_\Pi$ defined as,

 $$\epsilon_\Pi \ge \max_{\{\pi_n\in\Pi\}}\,\min_{\pi\in\Pi}\,\frac{1}{N}\sum_{n=1}^N \mathbb{E}_{d_{\pi_n}}\left[D(\pi^*\|\pi)\right], \tag{4}$$

for all $N \in \mathbb{N}_+$, which measures the average difference between $\Pi$ and $\pi^*$ with respect to $D$ and the state distributions visited by a worst possible policy sequence. Ross et al. [2] make use of this property and reduce (2) into an online learning problem by distinguishing the influence of $\pi$ on the state distribution $d_\pi$ from its influence on $D$ in (2). To make this transparent, we define a bivariate function

 $$F(\pi',\pi) \coloneqq \mathbb{E}_{d_{\pi'}}\left[D(\pi^*\|\pi)\right]. \tag{5}$$

Using this bivariate function $F$, the online learning setup can be described as follows: in round $n$, the learner applies a policy $\pi_n$ and then the environment reveals a per-round cost

 $$f_n(\pi) \coloneqq F(\pi_n,\pi) = \mathbb{E}_{d_{\pi_n}}\left[D(\pi^*\|\pi)\right]. \tag{6}$$

Ross et al. [2] show that if the sequence $\{\pi_n\}$ is selected by a no-regret algorithm, then it will have good performance in terms of (2). For example, DAgger updates the policy by FTL, and has the following guarantee (cf. [10]).

###### Theorem 2.1.

Let $\|\cdot\|_*$ denote the dual norm of $\|\cdot\|$. If each $f_n$ is $\mu_f$-strongly convex and $\|\nabla f_n(\pi)\|_* \le G$ for all $\pi \in \Pi$, then DAgger has performance on average satisfying

 $$\frac{1}{N}\sum_{n=1}^N J(\pi_n) \le J(\pi^*) + C_{\pi^*}\left(\frac{G^2}{2\mu_f}\,\frac{\ln N + 1}{N} + \epsilon_\Pi\right). \tag{7}$$

First-order variants of DAgger based on Follow-the-Regularized-Leader (FTRL) have also been proposed by Sun et al. [5] and Cheng et al. [9], which have the same performance but only require taking a stochastic gradient step in each iteration without keeping all the previous cost functions (i.e. data) as in the original FTL formulation. The bound in Theorem 2.1 also applies to the expected performance of a policy randomly picked out of the sequence $\{\pi_n\}$, although it does not necessarily translate into the performance of the last policy $\pi_N$ [10].
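To build intuition for the guarantee in Theorem 2.1, the following toy sketch runs FTL on scalar, 1-strongly convex quadratic per-round costs (an illustrative assumption, not the IL objective) and measures the average regret, which decays roughly like $\ln N / N$:

```python
import numpy as np

def ftl_average_regret(centers):
    """FTL on per-round costs f_n(x) = 0.5*(x - c_n)^2 (1-strongly convex).

    The FTL iterate minimizes the sum of past costs, i.e. it is the running
    mean of c_1..c_{n-1}. Returns the average regret against the best fixed
    decision in hindsight (the mean of all centers)."""
    centers = np.asarray(centers, dtype=float)
    N = len(centers)
    xs = np.zeros(N)                       # x_1 is arbitrary (0 here)
    for n in range(1, N):
        xs[n] = centers[:n].mean()         # FTL step
    regret = 0.5 * np.sum((xs - centers) ** 2) \
           - 0.5 * np.sum((centers.mean() - centers) ** 2)
    return regret / N

rng = np.random.default_rng(0)
c = rng.normal(size=2000)
# average regret shrinks as N grows, roughly like ln(N)/N
r200, r2000 = ftl_average_regret(c[:200]), ftl_average_regret(c)
```

With strongly convex per-round costs, this $O(\ln N/N)$ average-regret behavior is exactly the quantity that drives the DAgger bound (7).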

## 3 Accelerating IL with Predictive Models

The reduction-based approach to solving IL has demonstrated success in speeding up policy learning. However, because interactions with the environment are necessary to approximately evaluate the per-round cost, it is interesting to determine whether the convergence rate of IL can be further improved. A faster convergence rate will be valuable in real-world applications where data collection is expensive.

We answer this question affirmatively. We show that, by modeling the vector field $h(\pi) \coloneqq \nabla_2 F(\pi,\pi)$, where $\nabla_2$ denotes the derivative with respect to the second argument, the convergence rate of IL can potentially be improved by up to an order. The improvement comes through leveraging the fact that the per-round cost defined in (6) is not completely unknown or adversarial, as is assumed in the most general online learning setting. Because the same function $F$ is used in (6) over different rounds, the online component actually comes from the reduction made by Ross et al. [2], which ignores information about how the per-round cost changes with the first argument of $F$; in other words, it omits the variations of $d_\pi$ when $\pi$ changes [10]. Therefore, we argue that the original reduction proposed by Ross et al. [2], while allowing the use of (4) to characterize the performance, loses one critical piece of information present in the original RL problem: both the system dynamics and the expert are the same across different rounds of online learning.

We propose two model-based algorithms (MoBIL-VI and MoBIL-Prox) to accelerate IL. The first algorithm, MoBIL-VI, is conceptual in nature and updates policies by solving variational inequality (VI) problems [14]. This algorithm is used to illustrate how modeling $\nabla_2 F$ through a predictive model $\hat{F}$ can help to speed up IL, where $\hat{F}$ is a model bivariate function. (While we are only concerned with predicting the vector field $\nabla_2 F(\pi,\pi)$, we adopt the notation $\hat{F}$ to better build up the intuition, especially of MoBIL-VI; we discuss approximations that are not based on bivariate functions in Section 3.3.) The second algorithm, MoBIL-Prox, is a first-order method. It alternates between taking stochastic gradients by interacting with the environment and querying the model $\hat{F}$. We will prove that this simple yet practical approach has the same performance as the conceptual one: when the model is learned online and the true vector field is realizable, both algorithms can converge in $O(1/N^2)$, in contrast to DAgger's $O(\ln N/N)$ convergence. In addition, we show the convergence results of MoBIL under relaxed assumptions, e.g. allowing stochasticity, and provide several examples of constructing predictive models. (See Appendix A for a summary of notation.)

### 3.1 Performance and Average Regret

Before presenting the two algorithms, we first summarize the core idea of the reduction from IL to online learning in a simple lemma, which builds the foundation of our algorithms (proved in Appendix C.1).

###### Lemma 3.1.

For arbitrary sequences $\{\pi_n \in \Pi\}$ and $\{w_n > 0\}$, it holds that

 $$\mathbb{E}\left[\frac{\sum_{n=1}^N w_n J(\pi_n)}{w_{1:N}}\right] \le J(\pi^*) + C_{\pi^*}\left(\epsilon^w_\Pi + \mathbb{E}\left[\frac{\mathrm{regret}^w(\Pi)}{w_{1:N}}\right]\right),$$

where $\tilde{f}_n$ is an unbiased estimate of $f_n$, $\mathrm{regret}^w(\Pi) \coloneqq \sum_{n=1}^N w_n \tilde{f}_n(\pi_n) - \min_{\pi\in\Pi}\sum_{n=1}^N w_n \tilde{f}_n(\pi)$, $w_{1:n} \coloneqq \sum_{m=1}^n w_m$, $\epsilon^w_\Pi$ is given in Definition 4.1, and the expectation is due to sampling $\tilde{f}_n$.

In other words, the on-average performance convergence of an online IL algorithm is determined by the rate of the expected weighted average regret $\mathbb{E}[\mathrm{regret}^w(\Pi)/w_{1:N}]$. For example, in DAgger, the weighting is uniform and the expected average regret is in $O(\ln N/N)$; by Lemma 3.1 this rate directly proves Theorem 2.1.

### 3.2 Algorithms

From Lemma 3.1, we know that improving the regret bound implies a faster convergence of IL. This leads to the main idea of MoBIL-VI and MoBIL-Prox: to use model information to approximately play Be-the-Leader (BTL) [15], i.e. $\pi_{n+1} \in \arg\min_{\pi\in\Pi}\sum_{m=1}^{n+1} w_m \tilde{f}_m(\pi)$. To understand why playing BTL can minimize the regret, we recall a classical regret bound of online learning. (We use the notation $x_n$ and $\ell_n$ to distinguish general online learning problems from online IL problems.)

###### Lemma 3.2 (Strong FTL Lemma [16]).

For any sequence of decisions $\{x_n\}$ and losses $\{\ell_n\}$, $\mathrm{regret}_N(\mathcal{X}) \le \sum_{n=1}^N \left(\ell_{1:n}(x_n) - \ell_{1:n}(x_n^\star)\right)$, where $\ell_{1:n} \coloneqq \sum_{m=1}^n \ell_m$, $x_n^\star \in \arg\min_{x\in\mathcal{X}} \ell_{1:n}(x)$, and $\mathcal{X}$ is the decision set.

Namely, if the decision $x_n$ made in round $n$ in IL is close to the best decision $x_n^\star$ in round $n$ after the new per-round cost is revealed (which depends on $x_n$), then the regret will be small.
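By contrast, playing BTL exactly (i.e. the minimizer that already includes the current round's cost) yields non-positive regret. A quick numerical check on toy quadratic per-round costs (an illustrative assumption, not the IL objective):

```python
import numpy as np

def btl_regret(centers):
    """Be-the-Leader on f_n(x) = 0.5*(x - c_n)^2: in round n, play the
    minimizer of f_1 + ... + f_n, i.e. the running mean *including* c_n."""
    centers = np.asarray(centers, dtype=float)
    N = len(centers)
    xs = np.array([centers[:n].mean() for n in range(1, N + 1)])
    return 0.5 * np.sum((xs - centers) ** 2) \
         - 0.5 * np.sum((centers.mean() - centers) ** 2)

rng = np.random.default_rng(0)
# BTL "peeks" at the current cost, so its regret is never positive
reg = btl_regret(rng.normal(size=500))
```

Of course BTL is not implementable as-is, since $f_{n+1}$ is unknown in round $n$; the point of MoBIL is to approximate this peek with a predictive model.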

The two algorithms are summarized in Algorithm 1, and mainly differ in the policy update rule (line 5). Like DAgger, they both learn the policy in an interactive manner. In round $n$, both algorithms execute the current policy $\pi_n$ in the real environment to collect data to define the per-round cost functions (line 3): $\tilde{f}_n$ is an unbiased estimate of $f_n$ in (6) for policy learning, and $\tilde{h}_n$ is an unbiased estimate of the per-round cost $h_n$ for model learning. Given the current per-round costs, the two algorithms then update the model (line 4) and the policy (line 5) using the respective rules. Here we use the set $\mathcal{F}$, abstractly, to denote the family of predictive models, and $h_n$ is defined as an upper bound of the model's prediction error (cf. Section 4.1). For example, $\mathcal{F}$ can be a family of dynamics models that are used to simulate the predicted gradients, and $\tilde{h}_n$ is the empirical loss function used to train the dynamics models (e.g. the KL divergence of prediction).

#### 3.2.1 A Conceptual Algorithm: MoBIL-VI

We first present our conceptual algorithm MoBIL-VI, which is simpler to explain. We assume that $f_n$ and $h_n$ are given, as in Theorem 2.1. This assumption will be removed in MoBIL-Prox later. To realize the idea of BTL, in round $n$, MoBIL-VI uses a newly learned predictive model $\hat{F}_{n+1}$ to estimate $\nabla_2 F$ in (5) and then updates the policy by solving the VI problem below: finding $\pi_{n+1} \in \Pi$ such that $\forall \pi' \in \Pi$,

 $$\left\langle \Phi_n(\pi_{n+1}),\, \pi' - \pi_{n+1} \right\rangle \ge 0, \tag{8}$$

where the vector field $\Phi_n$ is defined as

 $$\Phi_n(\pi) = \sum_{m=1}^n w_m \nabla f_m(\pi) + w_{n+1} \nabla_2 \hat{F}_{n+1}(\pi,\pi).$$

The VI problem in (8) finds a fixed point $\pi_{n+1}$ satisfying $\pi_{n+1} \in \arg\min_{\pi\in\Pi}\, \sum_{m=1}^n w_m f_m(\pi) + w_{n+1}\hat{F}_{n+1}(\pi_{n+1},\pi)$. (Because $\Pi$ is compact, the VI problem in (8) has at least one solution [14]. If $f_n$ is strongly convex, the VI problem in line 6 of Algorithm 1 is strongly monotone for $n$ large enough and can be solved e.g. by the basic projection method [14]. Therefore, for demonstration purposes, we assume the VI problem of MoBIL-VI can be exactly solved.) That is, if $\hat{F}_{n+1} = F$ exactly, then $\pi_{n+1}$ plays exactly BTL and by Lemma 3.2 the regret is non-positive. In general, we can show that, even with modeling errors, MoBIL-VI can still reach a faster convergence rate such as $O(1/N^2)$, if a non-uniform weighting scheme is used, the model is updated online, and the true vector field is realizable within $\mathcal{F}$. The details will be presented in Section 4.2.

#### 3.2.2 A Practical Algorithm: MoBIL-Prox

While the previous conceptual algorithm achieves a faster convergence, it requires solving a nontrivial VI problem in each iteration. In addition, it assumes the per-round cost $f_n$ is given as a function and requires keeping all the past data to define $\Phi_n$. Here we relax these unrealistic assumptions and propose MoBIL-Prox. In round $n$ of MoBIL-Prox, the policy is updated from $\pi_n$ to $\pi_{n+1}$ by taking two gradient steps:

 $$\begin{aligned} \hat{\pi}_{n+1} &= \arg\min_{\pi\in\Pi}\; \sum_{m=1}^{n} w_m\left(\langle g_m, \pi\rangle + r_m(\pi)\right), \\ \pi_{n+1} &= \arg\min_{\pi\in\Pi}\; w_{n+1}\langle \hat{g}_{n+1}, \pi\rangle + \sum_{m=1}^{n} w_m\left(\langle g_m, \pi\rangle + r_m(\pi)\right). \end{aligned} \tag{9}$$

We define $r_n$ as an $\alpha\mu_f$-strongly convex function (with $\alpha > 0$; we recall $\mu_f$ is the strong convexity modulus of $f_n$) such that $\pi_n$ is its global minimum and $r_n(\pi_n) = 0$ (e.g. a Bregman divergence). And we define $g_n$ and $\hat{g}_{n+1}$ as estimates of $\nabla f_n(\pi_n)$ and $\nabla_2\hat{F}_{n+1}(\hat{\pi}_{n+1},\hat{\pi}_{n+1})$, respectively. Here we only require $g_n$ to be unbiased, whereas $\hat{g}_{n+1}$ could be a biased estimate of $\nabla_2\hat{F}_{n+1}(\hat{\pi}_{n+1},\hat{\pi}_{n+1})$.

MoBIL-Prox treats $\hat{\pi}_{n+1}$, which plays FTL with the gradients observed from the real environment, as a rough estimate of the next policy $\pi_{n+1}$ and uses it to query a gradient estimate $\hat{g}_{n+1}$ from the model $\hat{F}_{n+1}$. Therefore, the learner's decision $\pi_{n+1}$ can approximately play BTL. If we compare the update rule of $\pi_{n+1}$ and the VI problem in (8), we can see that MoBIL-Prox linearizes the problem and attempts to approximate $\nabla_2\hat{F}_{n+1}(\pi_{n+1},\pi_{n+1})$ by $\hat{g}_{n+1}$. While this approximation is crude, interestingly it is sufficient to speed up the convergence rate to be as fast as MoBIL-VI under mild assumptions, as shown later in Section 4.3.
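The two-step update can be sketched in a minimal 1D toy (every specific here — the quadratic per-round cost, the contraction $0.5\pi$, the exact model — is a hypothetical choice for illustration, not the paper's experiments). Executing a policy $\pi_n$ induces the cost $f_n(x) = \tfrac{1}{2}(x - 0.5\pi_n)^2$, so the expert fixed point is $0$; the first step is FTL on observed gradients plus proximal terms, and the second step adds the model-predicted next gradient:

```python
import numpy as np

def mobil_prox_1d(pi1=1.0, N=200, p=2, alpha_mu=1.0):
    """Toy MoBIL-Prox: f_n(x) = 0.5*(x - 0.5*pi_n)^2 with an exact model.

    Both argmins are available in closed form because every term is quadratic:
    the FTL objective is sum_m w_m * (g_m*x + 0.5*alpha_mu*(x - pi_m)^2).
    """
    w = np.arange(1, N + 1, dtype=float) ** p        # weights w_n = n^p
    pis = [pi1]
    sw = swg = swp = 0.0        # running sums of w, w*g, w*pi
    for n in range(1, N):
        pi_n = pis[-1]
        g = pi_n - 0.5 * pi_n                        # gradient of f_n at pi_n
        sw += w[n - 1]; swg += w[n - 1] * g; swp += w[n - 1] * pi_n
        pi_hat = (alpha_mu * swp - swg) / (alpha_mu * sw)    # FTL step
        g_hat = pi_hat - 0.5 * pi_hat                # exact model's prediction
        pis.append(pi_hat - w[n] * g_hat / (alpha_mu * sw))  # BTL-like step
    return np.array(pis)

pis = mobil_prox_1d()
# the iterates approach the expert fixed point 0
```

Each round touches the "environment" once (to get $g_n$) and the model once (to get $\hat{g}_{n+1}$), mirroring the alternation in (9).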

### 3.3 Predictive Models

MoBIL uses the predictive model $\hat{F}_{n+1}$ in the update rules (8) and (9) at round $n$ to predict the unseen gradient $\nabla f_{n+1}$ at round $n+1$ for speeding up policy learning. Ideally $\hat{F}_{n+1}$ should approximate the unknown bivariate function $F$ so that $\nabla_2\hat{F}_{n+1}(\pi,\pi)$ and $\nabla_2 F(\pi,\pi)$ are close. This condition can be seen from (8) and (9), in which MoBIL concerns only $\nabla_2\hat{F}_{n+1}$ instead of $\hat{F}_{n+1}$ directly. In other words, $\hat{F}_{n+1}$ is used in MoBIL as a first-order oracle, which leverages all the past information (up to the learner playing $\pi_n$ in the environment at round $n$) to predict the future gradient $\nabla f_{n+1}(\pi_{n+1})$, which depends on the decision the learner is about to make. Hence, we call it a predictive model.

To make the idea concrete, we provide a few examples of these models. By the definition of $F$ in (5), one way to construct the predictive model is through a simulator with an (online learned) dynamics model, and define $\nabla_2\hat{F}_{n+1}(\pi,\pi)$ as the simulated gradient (computed by querying the expert along the simulated trajectories visited by the learner). If the dynamics model is exact, then $\hat{F}_{n+1} = F$. Note that a stochastic/biased estimate of the simulated gradient suffices to update the policies in MoBIL-Prox.

Another idea is to construct the predictive model through $\tilde{f}_n$ (the stochastic estimate of $f_n$) and indirectly define $\hat{F}_{n+1}$ such that $\nabla_2\hat{F}_{n+1}(\pi,\pi) = \nabla\tilde{f}_n(\pi)$. This choice is possible, because the learner in IL collects samples from the environment, as opposed to, literally, gradients. Specifically, we can define $\hat{g}_{n+1} = \nabla\tilde{f}_n(\hat{\pi}_{n+1})$ in (9). The approximation error of this choice is determined by the convergence and the stability of the learner's policy. If $\pi_{n+1}$ visits similar states as $\pi_n$, then $\nabla\tilde{f}_n$ can approximate $\nabla f_{n+1}$ well at $\hat{\pi}_{n+1}$. Note that this choice is different from using the previous gradient (i.e. $\hat{g}_{n+1} = g_n$) in optimistic mirror descent/FTL [17], which would have a larger approximation error due to additional linearization.
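A small numerical contrast of the two choices (the 1D quadratic costs and all numbers below are hypothetical, for illustration only): reusing the previous cost *function* and differentiating it at the upcoming decision tracks the true next gradient much better than reusing the previous gradient *vector* when the policy moves but the state distribution barely does:

```python
# per-round costs f_n(x) = 0.5*(x - c_n)^2, so grad f_n(x) = x - c_n
def last_cost_gradient(c_prev, pi_hat_next):
    return pi_hat_next - c_prev      # previous cost, differentiated at the new point

def optimistic_gradient(pi_n, c_prev):
    return pi_n - c_prev             # previous gradient vector, as in optimistic FTL

c_prev, c_next = 0.50, 0.52          # state distribution barely moves
pi_n, pi_hat_next = 1.0, 0.8         # but the policy moves a lot
true_next_grad = pi_hat_next - c_next
err_last = abs(last_cost_gradient(c_prev, pi_hat_next) - true_next_grad)
err_optimistic = abs(optimistic_gradient(pi_n, c_prev) - true_next_grad)
# err_last only pays the cost shift; err_optimistic also pays |pi_n - pi_hat_next|
```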

Finally, we note that while the concept of predictive models originates from estimating the partial derivative $\nabla_2 F(\pi,\pi)$, a predictive model does not necessarily have to be in the same form. A parameterized vector-valued function can also be directly learned to approximate $\nabla_2 F(\pi,\pi)$, e.g., using a neural network and the sampled gradients $\{g_n\}$ in a supervised learning fashion.

## 4 Theoretical Analysis

Now we prove that using predictive models in MoBIL can accelerate convergence, when proper conditions are met. Intuitively, MoBIL converges faster than the usual adversarial approach to IL (like DAgger) when the predictive models have smaller errors than not predicting anything at all (i.e. setting $\hat{g}_{n+1} = 0$). In the following analyses, we will focus on bounding the expected weighted average regret, as it directly translates into the average performance bound by Lemma 3.1. We define, for weights $w_n = n^p$ with $p \ge 0$,

 $$R(p) \coloneqq \mathbb{E}\left[\mathrm{regret}^w(\Pi)/w_{1:N}\right]. \tag{10}$$

Note that the results below assume that the predictive models are updated using FTL as outlined in Algorithm 1. This assumption applies, e.g., when a dynamics model is learned online in a simulator-oracle as discussed above. We provide full proofs in Appendix C and a summary of notation in Appendix A.

### 4.1 Assumptions

We first introduce several assumptions to more precisely characterize the online IL problem.

##### Predictive models

Let be the class of predictive models. We assume these models are Lipschitz continuous in the following sense.

###### Assumption 4.1.

There is $L < \infty$ such that, for all $\hat{F} \in \mathcal{F}$ and $\pi, \pi' \in \Pi$, $\|\nabla_2\hat{F}(\pi,\pi) - \nabla_2\hat{F}(\pi',\pi')\|_* \le L\|\pi - \pi'\|$.

##### Per-round costs

The per-round cost $f_n$ for policy learning is given in (6), and we define the per-round cost $h_n$ for model learning as an upper bound of the squared prediction error $\|\nabla_2\hat{F}(\pi_n,\pi_n) - \nabla f_n(\pi_n)\|_*^2$ (see e.g. Appendix D). We make structural assumptions on $\tilde{f}_n$ and $\tilde{h}_n$, similar to the ones made by Ross et al. [2] (cf. Theorem 2.1).

###### Assumption 4.2.

With probability 1, $\tilde{f}_n$ is $\mu_f$-strongly convex and $\|\nabla\tilde{f}_n(\pi)\|_* \le G_f$, $\forall \pi \in \Pi$; $\tilde{h}_n$ is $\mu_h$-strongly convex and $\|\nabla\tilde{h}_n(\hat{F})\|_* \le G_h$, $\forall \hat{F} \in \mathcal{F}$.

By definition, these properties extend to $f_n$ and $h_n$. We note they can be relaxed to solely convexity and our algorithms still improve the best known convergence rate (see Table 1 and Appendix E).

##### Expressiveness of hypothesis classes

We introduce two constants, $\epsilon^w_\Pi$ and $\epsilon^w_{\mathcal{F}}$, to characterize the policy class $\Pi$ and the model class $\mathcal{F}$, which generalize the idea of (4) to stochastic and general weighting settings. When $\tilde{f}_n = f_n$ and the weighting is constant, Definition 4.1 agrees with (4). Similarly, we see that if $\pi^* \in \Pi$ and $F \in \mathcal{F}$, then $\epsilon^w_\Pi$ and $\epsilon^w_{\mathcal{F}}$ are zero.

###### Definition 4.1.

A policy class $\Pi$ is $\epsilon^w_\Pi$-close to $\pi^*$ if, for all $N \in \mathbb{N}_+$ and weight sequences $\{w_n\}$ with $w_n > 0$, $\mathbb{E}\left[\min_{\pi\in\Pi}\sum_{n=1}^N w_n\tilde{f}_n(\pi)/w_{1:N}\right] \le \epsilon^w_\Pi$. Similarly, a model class $\mathcal{F}$ is $\epsilon^w_{\mathcal{F}}$-close to $F$ if $\mathbb{E}\left[\min_{\hat{F}\in\mathcal{F}}\sum_{n=1}^N w_n\tilde{h}_n(\hat{F})/w_{1:N}\right] \le \epsilon^w_{\mathcal{F}}$. The expectations above are due to sampling $\tilde{f}_n$ and $\tilde{h}_n$.

### 4.2 Performance of MoBIL-VI

Here we show the performance of MoBIL-VI when there is prediction error in $\hat{F}_{n+1}$. The main idea is to treat MoBIL-VI as online learning with prediction [17] and take $\nabla_2\hat{F}_{n+1}(\pi_{n+1},\pi_{n+1})$, obtained after solving the VI problem (8), as an estimate of the gradient of the next per-round cost.

###### Proposition 4.1.

For MoBIL-VI with uniform weights ($p = 0$), $R(0) = O(1/N)$ when $\epsilon^w_{\mathcal{F}} = 0$.

By Lemma 3.1, this means that if the model class is expressive enough (i.e. $\epsilon^w_{\mathcal{F}} = 0$), then by adapting the model online with FTL, we can improve the original convergence rate in $O(\ln N/N)$ of Ross et al. [2] to $O(1/N)$. While removing the $\ln N$ factor does not seem like much, we will show that running MoBIL-VI with a non-uniform weighting can improve the convergence rate to $O(1/N^2)$.

###### Theorem 4.1.

For MoBIL-VI with $w_n = n^p$ and $p > 1$, $R(p) = O(1/N^2 + \epsilon^w_{\mathcal{F}}/N)$, where the constant depends on $p$.

The key is that $R(p)$ can be upper bounded by the regret of the online learning problem for the models, whose per-round cost is defined through $\tilde{h}_n$. Therefore, if $\epsilon^w_{\mathcal{F}} = 0$, randomly picking a policy out of $\{\pi_n\}$ proportionally to the weights $\{w_n\}$ has expected convergence in $O(1/N^2)$ if $p > 1$. (If $p = 1$, it converges in $O(\ln N/N^2)$; if $p > 1$, it converges in $O(1/N^2)$. See Appendix C.2.)

### 4.3 Performance of MoBIL-Prox

As MoBIL-Prox uses gradient estimates, we additionally define two constants $\sigma_g$ and $\sigma_{\hat{g}}$ to characterize the estimation error, where $\sigma_{\hat{g}}$ also entails potential bias.

###### Assumption 4.3.

$\mathbb{E}[\|g_n - \nabla f_n(\pi_n)\|_*^2] \le \sigma_g^2$ and $\mathbb{E}[\|\hat{g}_{n+1} - \nabla_2\hat{F}_{n+1}(\hat{\pi}_{n+1},\hat{\pi}_{n+1})\|_*^2] \le \sigma_{\hat{g}}^2$.

We show this simple first-order algorithm achieves similar performance to MoBIL-VI. Toward this end, we introduce a stronger lemma than Lemma 3.2.

###### Lemma 4.1 (Stronger FTL Lemma).

Let $\ell_{1:n} \coloneqq \sum_{m=1}^n \ell_m$ and $x_n^\star \in \arg\min_{x\in\mathcal{X}} \ell_{1:n}(x)$. For any sequence of decisions $\{x_n\}$ and losses $\{\ell_n\}$, $\mathrm{regret}_N(\mathcal{X}) = \sum_{n=1}^N \left(\ell_{1:n}(x_n) - \ell_{1:n}(x_n^\star) - \Delta_n\right)$, where $\Delta_n \coloneqq \ell_{1:n-1}(x_n) - \ell_{1:n-1}(x_{n-1}^\star) \ge 0$ and $\Delta_1 \coloneqq 0$.

The additional nonnegative terms $\Delta_n$, subtracted in Lemma 4.1, are pivotal to proving the performance of MoBIL-Prox.
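Lemma 4.1 can be read as an exact identity, $\mathrm{regret}_N = \sum_{n=1}^N\left(\ell_{1:n}(x_n) - \ell_{1:n}(x_n^\star) - \Delta_n\right)$ with $\Delta_n = \ell_{1:n-1}(x_n) - \ell_{1:n-1}(x_{n-1}^\star) \ge 0$, which follows by telescoping. A quick numerical check under this reading, on weighted 1D quadratic losses (the specific losses, weights, and decision sequence are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
c = rng.normal(size=N)                       # loss ell_n(x) = w_n*0.5*(x - c_n)^2
w = np.arange(1, N + 1, dtype=float) ** 2    # non-uniform weights w_n = n^2
xs = rng.normal(size=N)                      # an arbitrary decision sequence

def L(x, n):        # cumulative loss ell_{1:n}(x)
    return 0.5 * np.sum(w[:n] * (x - c[:n]) ** 2)

def xstar(n):       # argmin of ell_{1:n} (weighted mean, unconstrained)
    return np.sum(w[:n] * c[:n]) / np.sum(w[:n])

regret = sum(w[n] * 0.5 * (xs[n] - c[n]) ** 2 for n in range(N)) - L(xstar(N), N)

rhs = 0.0
for n in range(1, N + 1):
    delta = L(xs[n - 1], n - 1) - L(xstar(n - 1), n - 1) if n > 1 else 0.0
    rhs += L(xs[n - 1], n) - L(xstar(n), n) - delta

# the two sides agree up to floating-point error
```

Note the check holds for an *arbitrary* decision sequence, which is what lets the analysis cover the approximate-BTL iterates of MoBIL-Prox.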

###### Theorem 4.2.

For MoBIL-Prox with $w_n = n^p$ and $p > 1$, it satisfies

 $$R(p) \le \frac{(p+1)^2 e_p}{N\alpha\mu_f}\left(\frac{G_h^2}{\mu_h}\,\frac{p}{p-1}\,\frac{1}{N} + 2p\sigma_g^2 + \sigma_{\hat{g}}^2 + \epsilon^w_{\mathcal{F}}\right) + \frac{(p+1)\,\nu_p}{N^{p+1}},$$

where $e_p$ and $\nu_p$ are constants independent of $N$ (defined in the proof in Appendix C.3).

###### Proof sketch.

Here we give a proof sketch in big-O notation (see Appendix C.3 for the details). To bound $R(p)$, recall the definition of $\mathrm{regret}^w(\Pi)$ in Lemma 3.1, and define the proximal per-round loss $\ell_n(\pi) \coloneqq \langle g_n, \pi\rangle + r_n(\pi)$. Since $\tilde{f}_n$ is $\mu_f$-strongly convex, $\ell_n$ is $\alpha\mu_f$-strongly convex, and $g_n$ is unbiased, one can show that the expected weighted regret with respect to $\{\ell_n\}$ upper bounds that with respect to $\{\tilde{f}_n\}$; hence it suffices to bound $\mathrm{regret}^w_\ell(\Pi) \coloneqq \sum_{n=1}^N w_n\ell_n(\pi_n) - \min_{\pi\in\Pi}\sum_{n=1}^N w_n\ell_n(\pi)$.

The following lemma upper bounds $\mathrm{regret}^w_\ell(\Pi)$ by using the Stronger FTL Lemma (Lemma 4.1).

###### Lemma 4.2.

$\mathrm{regret}^w_\ell(\Pi) \le \sum_{n=1}^N \left(\frac{w_n^2}{2\alpha\mu_f\, w_{1:n-1}}\|g_n - \hat{g}_n\|_*^2 - \frac{\alpha\mu_f\, w_{1:n-1}}{2}\|\pi_n - \hat{\pi}_n\|^2\right)$.

Since the second term in Lemma 4.2 is negative, we just need to upper bound the expectation of the first term. Using the triangle inequality, we bound it by the model's prediction error of the next per-round cost.

###### Lemma 4.3.

$\mathbb{E}[\|g_n - \hat{g}_n\|_*^2] \le O\!\left(\sigma_g^2 + \sigma_{\hat{g}}^2 + L^2\,\mathbb{E}[\|\pi_n - \hat{\pi}_n\|^2] + \mathbb{E}[\|\nabla_2\hat{F}_n(\pi_n,\pi_n) - \nabla f_n(\pi_n)\|_*^2]\right)$.

With Lemma 4.3 and Lemma 4.2, it is now clear that $R(p) \le O\!\left(\frac{1}{N}\left(\frac{1}{N} + \sigma_g^2 + \sigma_{\hat{g}}^2 + \bar{\epsilon}_N\right)\right)$, where $\bar{\epsilon}_N$ denotes the weighted average model prediction error. When $N$ is large enough, the $L^2$ term in Lemma 4.3 is canceled by the negative term in Lemma 4.2, and hence the first term is $O(1/N^2)$. For the third term, because the model is learned online using, e.g., FTL with strongly convex costs, we can show that $\bar{\epsilon}_N \le O(1/N) + \epsilon^w_{\mathcal{F}}$. Thus, $R(p) \le O\!\left(1/N^2 + (\sigma_g^2 + \sigma_{\hat{g}}^2 + \epsilon^w_{\mathcal{F}})/N\right)$. Substituting this bound and using the fact that $w_{1:N} = \Omega(N^{p+1})$ proves the theorem. ∎

The main assumption in Theorem 4.2 is that $\nabla_2\hat{F}(\pi,\pi)$ is $L$-Lipschitz continuous (Assumption 4.1). It does not depend on the continuity of $\nabla_2 F(\pi,\pi)$. Therefore, this condition is practical, as we are free to choose the model class $\mathcal{F}$. Compared with Theorem 4.1, Theorem 4.2 considers the inexactness of the gradient estimates explicitly; hence the additional terms due to $\sigma_g$ and $\sigma_{\hat{g}}$. Under the same assumption of MoBIL-VI that $f_n$ and $h_n$ are directly available, we can actually show that the simple MoBIL-Prox has the same performance as MoBIL-VI, which is a corollary of Theorem 4.2.

###### Corollary 4.1.

If $\sigma_g = 0$ and $\sigma_{\hat{g}} = 0$, then for MoBIL-Prox with $w_n = n^p$ and $p > 1$, $R(p) = O(1/N^2 + \epsilon^w_{\mathcal{F}}/N)$.

The proofs of Theorems 4.1 and 4.2 are based on assuming the predictive models are updated by FTL (see Appendix D for a specific bound when online learned dynamics models are used as a simulator). However, we note that these results are essentially based on the property that model learning also has no regret; therefore, the FTL update rule (line 4) can be replaced by a no-regret first-order method without changing the result. This would make the algorithm even simpler to implement. The convergence of other types of predictive models (like using the previous cost function discussed in Section 3.3) can also be analyzed following the major steps in the proof of Theorem 4.2, leading to a performance bound in terms of prediction errors. Finally, it is interesting to note that the accelerated convergence is made possible when model learning puts more weight on costs in later rounds (because the weights $w_n = n^p$ are increasing in $n$).

### 4.4 Comparison

We compare the performance of MoBIL in Theorem 4.2 with that of DAgger in Theorem 2.1 in terms of the constant on the $1/N$ factor. MoBIL has a constant in $O(\sigma_g^2 + \sigma_{\hat{g}}^2 + \epsilon^w_{\mathcal{F}})$, whereas DAgger has a constant in $O(G_f^2)$, where we recall $\sigma_g^2$ and $G_f$ are, respectively, the variance of the sampled gradient and the upper bound of its norm. (Theorem 2.1 was stated assuming exact per-round costs; in the stochastic setup here, DAgger has a similar convergence rate in expectation but with the bound stated in terms of the sampled gradients.) Therefore, in general, MoBIL-Prox has a better upper bound than DAgger when the model class is expressive (i.e. $\epsilon^w_{\mathcal{F}} \approx 0$), because $\sigma_{\hat{g}}^2$ (the variance of the sampled model gradients) can be made small as we are free to design the model. Note, however, that the improvement of MoBIL may be smaller when the problem is noisy, such that the large $\sigma_g^2$ becomes the dominant term.

An interesting property that arises from Theorems 4.1 and 4.2 is that the convergence of MoBIL is not biased by using an imperfect model (i.e. $\epsilon^w_{\mathcal{F}} > 0$): the model error enters the bound only through a term that vanishes with $N$. In other words, in the worst case of using an extremely wrong predictive model, MoBIL would just converge more slowly but still to the performance of the expert policy.

MoBIL-Prox is closely related to stochastic Mirror-Prox [18, 11]. In particular, when the exact model is known (i.e. $\hat{F}_{n+1} = F$) and MoBIL-Prox is set to the convex mode (i.e. uniform weights and constant regularization; see Appendix E), then MoBIL-Prox gives the same update rule as stochastic Mirror-Prox (see Appendix F for a thorough discussion). Therefore, MoBIL-Prox can be viewed as a generalization of Mirror-Prox: 1) it allows non-uniform weights; and 2) it allows the vector field to be estimated online by alternately taking stochastic gradients and predicted gradients. The design of MoBIL-Prox is made possible by our Stronger FTL Lemma (Lemma 4.1), which greatly simplifies the original algebraic proof in [18, 11]. Using Lemma 4.1 reveals more closely the interactions between model updates and policy updates. In addition, it more clearly shows the effect of non-uniform weighting, which is essential to achieving the $O(1/N^2)$ convergence. To the best of our knowledge, even the analysis of the original (stochastic) Mirror-Prox from the FTL perspective is new.

## 5 Experiments

We experimented with MoBIL-Prox in simulation to study how the weights and the choice of model oracles affect learning. We used two weight schedules: $w_n = 1$ as a baseline, and $w_n = n^p$ with $p > 1$ as suggested by Theorem 4.2. And we considered several predictive models: (a) a simulator with the true dynamics; (b) a simulator with online-learned dynamics; (c) the last cost function (i.e. $\hat{g}_{n+1} = \nabla\tilde{f}_n(\hat{\pi}_{n+1})$); (d) no model (i.e. $\hat{g}_{n+1} = 0$; in this case MoBIL-Prox reduces to the first-order version of DAgger [9], which is considered as a baseline here).

### 5.1 Setup and Results

Two robot control tasks (CartPole and Reacher3D) powered by the DART physics engine [19] were used as the task environments. The learner was either a linear policy or a small neural network. For each IL problem, an expert policy that shares the same architecture as the learner was used, which was trained using policy gradients. While sharing the same architecture is not required in IL, here we adopted this constraint to remove the bias due to the mismatch between the policy class and the expert policy, clarifying the experimental results. For MoBIL-Prox, we set $w_n = n^p$ and chose the regularization $r_n$ such that the effective learning rate was adaptive to the norm of the prediction error, which is optimal in the convex setting (cf. Table 1). For the dynamics model, we used a neural network and trained it using FTL. The results reported are averaged over 24 (CartPole) and 12 (Reacher3D) random seeds. Figure 1 shows the results of MoBIL-Prox. While the use of neural network policies violates the convexity assumptions in the analysis, it is interesting to see how MoBIL-Prox performs in this more practical setting. We include the experimental details in Appendix G for completeness.

### 5.2 Discussions

We observe that, when $w_n = 1$, having model information does not improve the performance much over standard online IL (i.e. no model), as suggested by Proposition 4.1. By contrast, when $w_n = n^p$ with $p > 1$ (as suggested by Theorem 4.2), MoBIL-Prox improves the convergence and performs better than not using models. (We note that the curves for the two weight settings are not directly comparable; we should only compare methods within the same setting, as the optimal step size varies with the weights. The multiplier on the step size was chosen such that MoBIL-Prox performs similarly in both settings.) It is interesting to see that this trend also applies to neural network policies.

From Figure 1, we can also study how the choice of predictive models affects the convergence. As suggested by Theorem 4.2, MoBIL-Prox improves the convergence only when the model makes non-trivial predictions. If the model is very inaccurate, then MoBIL-Prox can be slower. This can be seen from the performance of MoBIL-Prox with online-learned dynamics models. In the low-dimensional case of CartPole, the simple neural network predicts the dynamics well, and MoBIL-Prox with the learned dynamics performs similarly to MoBIL-Prox with the true dynamics. However, in the high-dimensional Reacher3D problem, the learned dynamics model generalizes less well, creating a performance gap between MoBIL-Prox using the true dynamics and that using the learned dynamics. We note that MoBIL-Prox still converges in the end despite the model error. Finally, we find that the performance of MoBIL-Prox with the last-cost predictive model is often similar to that of MoBIL-Prox with the simulated gradients computed through the true dynamics.

## 6 Conclusion

We propose two novel model-based IL algorithms, MoBIL-Prox and MoBIL-VI, with strong theoretical properties: they are provably faster, by up to an order, than the state-of-the-art IL algorithms, and their performance is unbiased even when using imperfect predictive models. Although we prove the performance under convexity assumptions, we empirically find that MoBIL-Prox improves the performance even when using neural networks. In general, MoBIL accelerates policy learning when it has access to a predictive model that can predict future gradients non-trivially. While the focus of the current paper is theoretical in nature, the design of MoBIL leads to several interesting questions that are important to the reliable application of MoBIL-Prox in practice, such as end-to-end learning of predictive models and designing adaptive regularizations for MoBIL-Prox.

#### Acknowledgements

This work was supported in part by NSF NRI Award 1637758 and NSF CAREER Award 1750483.