1 Introduction
Imitation learning (IL) has recently received attention for its ability to speed up policy learning in reinforcement learning (RL) problems [1, 2, 3, 4, 5, 6]. Unlike pure RL techniques, which rely on uninformed random exploration to locally improve a policy, IL leverages prior knowledge about a problem in terms of expert demonstrations. At a high level, this additional information provides policy learning with an informed search direction toward the expert policy.
The goal of IL is to quickly learn a policy that can perform at least as well as the expert policy. Because the expert policy may be suboptimal with respect to the RL problem of interest, IL is often used to provide a good warm start to the RL problem, so that the number of interactions with the environment can be minimized. Sample efficiency is especially critical when learning is deployed in applications like robotics, where every interaction incurs real-world costs.
By reducing IL to an online learning problem, online IL [2] provides a framework for convergence analysis and mitigates the covariate shift problem encountered in batch IL [7, 8]. In particular, under proper assumptions, the performance of a policy sequence updated by Follow-the-Leader (FTL) can converge on average to the performance of the expert policy [2]. Recently, it was shown that this rate is sufficient to make IL more efficient than solving an RL problem from scratch [9].
In this work, we further accelerate the convergence rate of online IL. Inspired by the observation of Cheng and Boots [10] that the online learning problem of IL is not truly adversarial, we propose two MOdel-Based IL (MoBIL) algorithms, MoBIL-VI and MoBIL-Prox, that can achieve a fast rate of convergence. Under the same assumptions as Ross et al. [2], these algorithms improve on-average convergence to $O(1/N^2)$, e.g., when a dynamics model is learned online, where $N$ is the number of iterations of policy updates.
The improved speed of our algorithms is attributed to using a model oracle to predict the gradient of the next per-round cost in online learning. This model can be realized, e.g., using a simulator based on a (learned) dynamics model, or using past demonstrations. We first conceptually show that this idea can be realized as a variational inequality problem in MoBIL-VI. Next, we propose a practical first-order stochastic algorithm, MoBIL-Prox, which alternates between the steps of taking the true gradient and of taking the model gradient. MoBIL-Prox is a generalization of the stochastic Mirror-Prox algorithm proposed by Juditsky et al. [11] to the case where the problem is weighted and the vector field is unknown but learned online. In theory, we show that having a weighting scheme is pivotal to speeding up convergence, and this generalization is made possible by a new constructive FTL-style regret analysis, which greatly simplifies the original algebraic proof [11]. The performance of MoBIL-Prox is also empirically validated in simulation.

2 Preliminaries
2.1 Problem Setup: RL and IL
Let $\mathcal{S}$ and $\mathcal{A}$ be the state and the action spaces, respectively. The objective of RL is to search for a stationary policy $\pi$ inside a policy class $\Pi$ with good performance. This can be characterized by the stochastic optimization problem with expected cost $J(\pi)$^{1}^{1}1Our definition of $J$ corresponds to the average accumulated cost in the RL literature. defined below:

$$\min_{\pi \in \Pi} J(\pi), \qquad J(\pi) := \mathbb{E}_{(s, t) \sim d_\pi} \mathbb{E}_{a \sim \pi | s}\left[ c_t(s, a) \right], \tag{1}$$

in which $c_t$ is the instantaneous cost at time $t$, $d_\pi$ is a generalized stationary distribution induced by executing policy $\pi$, and $\pi(\cdot | s)$ is the distribution of the action $a$ given state $s$ under $\pi$. The policies here are assumed to be parametric. To make the writing compact, we will abuse the notation $\pi$ to also denote its parameter, and assume $\Pi$ is a compact convex subset of parameters in some normed space with norm $\| \cdot \|$.
Based on the abstracted distribution $d_\pi$, the formulation in (1) subsumes multiple discrete-time RL problems. For example, a $\gamma$-discounted infinite-horizon problem can be considered by setting $c_t = c$ as a time-invariant cost and defining the joint distribution $d_\pi(s, t) = (1 - \gamma)\, \gamma^t\, d_\pi^t(s)$, in which $d_\pi^t$ denotes the probability (density) of state $s$ at time $t$ under policy $\pi$. Similarly, a $T$-horizon RL problem can be considered by setting $d_\pi(s, t) = \frac{1}{T} d_\pi^t(s)$. Note that while we use the notation $a \sim \pi | s$, the policy is allowed to be deterministic; in this case, the notation simply means evaluation. For notational compactness, we will often omit the random variable inside the expectation (e.g. we shorten (1) to $\mathbb{E}_{d_\pi} \mathbb{E}_{\pi}[c_t]$). In addition, we denote $Q_t^\pi(s, a)$ as the Q-function^{2}^{2}2For example, in a $T$-horizon problem, $Q_t^\pi(s, a) = \mathbb{E}_{\rho_\pi}\left[ \sum_{\tau = t}^{T-1} c_\tau \,\middle|\, s_t = s, a_t = a \right]$, where $\rho_\pi$ denotes the distribution of future trajectories conditioned on $s_t$ and $a_t$. at time $t$ with respect to $\pi$.

In this paper, we consider IL, which is an indirect approach to solving the RL problem. We assume there is a black-box oracle $\pi^\star$, called the expert policy, from which a demonstration $a^\star \sim \pi^\star | s$ can be queried for any state $s$. To satisfy this querying requirement, the expert policy is usually an algorithm; for example, it can represent a planning algorithm that solves a simplified version of (1), or some engineered, hard-coded policy (see e.g. [12]).
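To make the horizon-averaged objective in (1) concrete, $J(\pi)$ can be approximated by Monte-Carlo rollouts. The sketch below is a toy instance on synthetic scalar dynamics; all names, dynamics, and costs are illustrative and not from the paper.

```python
import numpy as np

def estimate_J(policy, step, cost, T=20, rollouts=200, s0=0.0, seed=0):
    """Monte-Carlo estimate of the T-horizon average accumulated cost
    J(pi) = E[(1/T) sum_t c(s_t, a_t)] on a user-supplied toy system."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(rollouts):
        s = s0
        for _t in range(T):
            a = policy(s)            # query the policy (deterministic here)
            total += cost(s, a)      # accumulate the instantaneous cost
            s = step(s, a, rng)      # advance the (toy) dynamics
    return total / (rollouts * T)    # average over time and rollouts
```

For instance, with toy dynamics $s' = 0.9 s + 0.1 a + \text{noise}$ and cost $s^2 + a^2$, the estimate grows with the initial state's distance from the origin, as expected.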
The purpose of incorporating the expert policy into solving (1) is to quickly obtain a policy that has reasonable performance. Toward this end, we consider solving a surrogate problem of (1),

$$\min_{\pi \in \Pi} \; \mathbb{E}_{d_\pi}\left[ D(\pi^\star, \pi) \right], \tag{2}$$

where $D$ is a function that measures the difference between two distributions over actions (e.g. KL divergence; see Appendix B). Importantly, the objective in (2) has the property that $D(\pi^\star, \pi) \ge 0$ for all $\pi \in \Pi$, and there is a constant $C_{\pi^\star}$ such that, for all $s \in \mathcal{S}$ and $t \in \mathbb{N}$, it satisfies $\mathbb{E}_{a \sim \pi | s}\left[ Q_t^{\pi^\star}(s, a) \right] - \mathbb{E}_{a \sim \pi^\star | s}\left[ Q_t^{\pi^\star}(s, a) \right] \le C_{\pi^\star}\, D(\pi^\star, \pi | s)$, in which $\mathbb{N}$ denotes the set of natural numbers. By the Performance Difference Lemma [13], it can be shown that the inequality above implies [10]

$$J(\pi) \le J(\pi^\star) + C_{\pi^\star}\, \mathbb{E}_{d_\pi}\left[ D(\pi^\star, \pi) \right]. \tag{3}$$

Therefore, solving (2) can lead to a policy that performs similarly to the expert policy $\pi^\star$.
2.2 Imitation Learning as Online Learning
The surrogate problem in (2) is more structured than the original RL problem in (1). In particular, the distance-like function $D$ is given, and we know that its value is close to zero when $\pi$ is close to $\pi^\star$. On the contrary, the cost $c$ in (1) can generally still be large even if $\pi$ is a good policy (since it also depends on the state). This normalization property is crucial for the reduction from IL to online learning [10].
The reduction is based on observing that, with the normalization property, the expressiveness of the policy class $\Pi$ can be described with a constant $\epsilon_\Pi$ satisfying, for all $N \in \mathbb{N}$ and all sequences $\{\pi_n\} \subset \Pi$,

$$\min_{\pi \in \Pi} \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{d_{\pi_n}}\left[ D(\pi^\star, \pi) \right] \le \epsilon_\Pi, \tag{4}$$

which measures the average difference between $\Pi$ and $\pi^\star$ with respect to $D$ and the state distributions visited by a worst possible policy sequence. Ross et al. [2] make use of this property and reduce (2) into an online learning problem by distinguishing the influence of $\pi$ on $d_\pi$ and on $D(\pi^\star, \pi)$ in (2). To make this transparent, we define a bivariate function

$$F(\pi', \pi) := \mathbb{E}_{d_{\pi'}}\left[ D(\pi^\star, \pi) \right]. \tag{5}$$
Using this bivariate function $F$, the online learning setup can be described as follows: in round $n$, the learner applies a policy $\pi_n$, and then the environment reveals a per-round cost

$$\ell_n(\pi) := F(\pi_n, \pi) = \mathbb{E}_{d_{\pi_n}}\left[ D(\pi^\star, \pi) \right]. \tag{6}$$
Ross et al. [2] show that if the sequence $\{\pi_n\}$ is selected by a no-regret algorithm, then it will have good performance in terms of (2). For example, DAgger updates the policy by FTL, $\pi_{n+1} = \arg\min_{\pi \in \Pi} \sum_{m=1}^{n} \ell_m(\pi)$, and has the following guarantee (cf. [10]).
Theorem 2.1.
Let $\ell_n$ be defined as in (6). If each $\ell_n$ is $\alpha$-strongly convex with gradients bounded by $G$, then DAgger has performance on average satisfying

$$\mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} J(\pi_n) \right] \le J(\pi^\star) + C_{\pi^\star}\left( \epsilon_\Pi + O\!\left( \frac{G^2 \ln N}{\alpha N} \right) \right). \tag{7}$$
First-order variants of DAgger based on Follow-the-Regularized-Leader (FTRL) have also been proposed by Sun et al. [5] and Cheng et al. [9], which have the same performance but only require taking a stochastic gradient step in each iteration, without keeping all the previous cost functions (i.e. data) as in the original FTL formulation. The bound in Theorem 2.1 also applies to the expected performance of a policy randomly picked out of the sequence, although it does not necessarily translate into the performance of the last policy [10].
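The FTL-based reduction can be illustrated with a minimal DAgger-style loop: roll out the current policy, label the visited states with expert actions, and set the next policy to the minimizer of the aggregated loss. The sketch below is a toy instance (linear policy, squared-error imitation loss, synthetic scalar dynamics); all names are illustrative, not from the paper.

```python
import numpy as np

def rollout(policy_w, T=50, rng=None):
    """Roll out a linear policy a = w*s on toy scalar dynamics; return visited states."""
    if rng is None:
        rng = np.random.default_rng(0)
    s, states = 1.0, []
    for _ in range(T):
        states.append(s)
        s = 0.9 * s + 0.1 * (policy_w * s) + 0.01 * rng.standard_normal()
    return np.array(states)

def dagger_ftl(expert_w=1.5, rounds=10):
    """DAgger-style online IL on a toy problem: the n-th per-round cost is the
    squared imitation error on the states visited by the n-th policy, and FTL
    (minimizing the sum of all costs so far) is a closed-form least squares."""
    rng = np.random.default_rng(0)
    w, num, den, avg_costs = 0.0, 0.0, 0.0, []
    for _ in range(rounds):
        states = rollout(w, rng=rng)           # execute the current policy
        expert_actions = expert_w * states     # query the expert on visited states
        avg_costs.append(float(np.mean((w * states - expert_actions) ** 2)))
        num += float(states @ expert_actions)  # aggregate sufficient statistics
        den += float(states @ states)
        w = num / den                          # FTL: argmin of the aggregated loss
    return w, avg_costs
```

On this toy problem the expert labels are exactly realizable by the policy class, so FTL recovers the expert in one round; in general, the per-round costs decrease over the rounds.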
3 Accelerating IL with Predictive Models
The reduction-based approach to solving IL has demonstrated success in speeding up policy learning. However, because interactions with the environment are necessary to approximately evaluate the per-round cost, it is interesting to ask whether the convergence rate of IL can be further improved. A faster convergence rate would be valuable in real-world applications where data collection is expensive.
We answer this question affirmatively. We show that, by modeling $\nabla_2 F$,^{3}^{3}3We define $\nabla_2 F$ as a vector field. where $\nabla_2$ denotes the derivative with respect to the second argument, the convergence rate of IL can potentially be improved by up to an order. The improvement comes through leveraging the fact that the per-round cost defined in (6) is not completely unknown or adversarial, as is assumed in the most general online learning setting. Because the same bivariate function $F$ is used in (6) over different rounds, the online component actually comes from the reduction made by Ross et al. [2], which ignores information about how $F$ changes with its first argument; in other words, it omits the variations of $d_\pi$ when $\pi$ changes [10]. Therefore, we argue that the original reduction proposed by Ross et al. [2], while allowing the use of (4) to characterize the performance, loses one critical piece of information present in the original RL problem: both the system dynamics and the expert are the same across different rounds of online learning.
We propose two model-based algorithms (MoBIL-VI and MoBIL-Prox) to accelerate IL. The first algorithm, MoBIL-VI, is conceptual in nature and updates policies by solving variational inequality (VI) problems [14]. This algorithm is used to illustrate how modeling through a predictive model $\tilde{F}$ can help to speed up IL, where $\tilde{F}$ is a bivariate function modeling $F$.^{4}^{4}4While we are only concerned with predicting the vector field $\nabla_2 F$, we adopt the bivariate-function notation to better build intuition, especially for MoBIL-VI; we will discuss other approximations that are not based on bivariate functions in Section 3.3. The second algorithm, MoBIL-Prox, is a first-order method. It alternates between taking stochastic gradients by interacting with the environment and querying the model $\tilde{F}$. We will prove that this simple yet practical approach has the same performance as the conceptual one: when $\tilde{F}$ is learned online and $F$ is realizable, both algorithms can converge in $O(1/N^2)$, in contrast to DAgger's $O(\ln N / N)$ convergence. In addition, we show convergence results for MoBIL under relaxed assumptions, e.g. allowing stochasticity, and provide several examples of constructing predictive models. (See Appendix A for a summary of notation.)
3.1 Performance and Average Regret
Before presenting the two algorithms, we first summarize the core idea of the reduction from IL to online learning in a simple lemma, which builds the foundation of our algorithms (proved in Appendix C.1).
Lemma 3.1.
For arbitrary sequences $\{w_n\}$ and $\{\pi_n\}$, it holds that

$$\mathbb{E}\left[ \frac{\sum_{n=1}^{N} w_n J(\pi_n)}{w_{1:N}} \right] \le J(\pi^\star) + C_{\pi^\star}\left( \epsilon_\Pi + \mathbb{E}\left[ \frac{\mathrm{Regret}_N^w}{w_{1:N}} \right] \right),$$

where $\mathrm{Regret}_N^w := \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi_n) - \min_{\pi \in \Pi} \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi)$, $\tilde{\ell}_n$ is an unbiased estimate of $\ell_n$, $w_{1:N} := \sum_{n=1}^{N} w_n$, $\epsilon_\Pi$ is given in Definition 4.1, and the expectation is due to sampling $\tilde{\ell}_n$.

3.2 Algorithms
From Lemma 3.1, we know that improving the regret bound implies a faster convergence of IL. This leads to the main idea of MoBIL-VI and MoBIL-Prox: to use model information to approximately play Be-the-Leader (BTL) [15], i.e. $\pi_n \in \arg\min_{\pi \in \Pi} \sum_{m=1}^{n} w_m \ell_m(\pi)$. To understand why playing BTL can minimize the regret, we recall a classical regret bound of online learning.^{5}^{5}5We use the notation $x_n$ and $\ell_n$ to distinguish general online learning problems from online IL problems.
Lemma 3.2 (Strong FTL Lemma [16]). For any sequence of decisions $\{x_n\}$ and losses $\{\ell_n\}$, $\mathrm{Regret}_N \le \sum_{n=1}^{N} \left( \ell_{1:n}(x_n) - \ell_{1:n}(x_n^\star) \right)$, where $\ell_{1:n} := \sum_{m=1}^{n} \ell_m$ and $x_n^\star \in \arg\min_{x} \ell_{1:n}(x)$.

Namely, if the decision made in round $n$ in IL is close to the best decision in round $n$ after the new per-round cost is revealed (which depends on the decision itself), then the regret will be small.
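The Strong FTL Lemma can be checked numerically on a toy problem. The sketch below uses 1-D quadratic losses, for which FTL plays the running mean of past centers and every argmin is closed-form; all names and numbers are illustrative.

```python
import numpy as np

def check_strong_ftl(N=20, seed=1):
    """Numerically verify the Strong FTL Lemma on 1-D quadratic losses
    ell_n(x) = (x - c_n)^2, where FTL plays the running mean of past centers."""
    c = np.random.default_rng(seed).uniform(-1.0, 1.0, size=N)
    cum = lambda x, n: float(np.sum((x - c[:n]) ** 2))  # ell_{1:n}(x)
    # x_1 is arbitrary; for n >= 2, FTL plays the minimizer of ell_{1:n-1}.
    plays = [0.0] + [float(c[:n].mean()) for n in range(1, N)]
    # Regret against the best fixed decision in hindsight (the overall mean).
    regret = sum((plays[n] - c[n]) ** 2 for n in range(N)) - cum(c.mean(), N)
    # Bound: sum_n [ell_{1:n}(x_n) - ell_{1:n}(x_n^*)], x_n^* = argmin ell_{1:n}.
    bound = sum(cum(plays[n], n + 1) - cum(float(c[:n + 1].mean()), n + 1)
                for n in range(N))
    return regret, bound
```

Each summand in the bound is non-negative by construction, and the lemma holds for an arbitrary decision sequence, not just FTL plays.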
The two algorithms are summarized in Algorithm 1; they differ mainly in the policy update rule (line 5). Like DAgger, they both learn the policy in an interactive manner. In round $n$, both algorithms execute the current policy $\pi_n$ in the real environment to collect data to define the per-round cost functions (line 3): one is an unbiased estimate of $\ell_n$ in (6) for policy learning, and the other is an unbiased estimate of the per-round cost for model learning. Given the current per-round costs, the two algorithms then update the model (line 4) and the policy (line 5) using their respective rules. Here we use the set $\tilde{\mathcal{F}}$, abstractly, to denote the family of predictive models used to estimate $\nabla_2 F$, and the model-learning cost is defined as an upper bound of the prediction error. For example, $\tilde{\mathcal{F}}$ can be a family of dynamics models that are used to simulate the predicted gradients, and the model-learning cost is the empirical loss function used to train the dynamics models (e.g. the KL divergence of the prediction).
3.2.1 A Conceptual Algorithm: MoBILVI
We first present our conceptual algorithm MoBIL-VI, which is simpler to explain. We assume that the per-round costs $\ell_n$ are given as functions, as in Theorem 2.1. This assumption will be removed in MoBIL-Prox later. To realize the idea of BTL, in round $n$, MoBIL-VI uses a newly learned predictive model $\tilde{F}_{n+1}$ to estimate $\nabla_2 F$ in (5) and then updates the policy by solving the VI problem below: finding $\pi_{n+1} \in \Pi$ such that, $\forall \pi \in \Pi$,

$$\langle \Phi_{n+1}(\pi_{n+1}), \pi - \pi_{n+1} \rangle \ge 0, \tag{8}$$

where the vector field $\Phi_{n+1}$ aggregates the weighted gradients of the past per-round costs together with the model's predicted gradient for round $n+1$. If $\tilde{F}_{n+1} = F$, then the VI problem^{6}^{6}6Because $\Pi$ is compact, the VI problem in (8) has at least one solution [14]. If the per-round cost is strongly convex, the VI problem in line 6 of Algorithm 1 is strongly monotone for $n$ large enough and can be solved e.g. by a basic projection method [14]. Therefore, for demonstration purposes, we assume the VI problem of MoBIL-VI can be solved exactly. in (8) finds a fixed point that minimizes the weighted sum of all per-round costs up to round $n+1$. That is, if $\tilde{F}_{n+1} = F$ exactly, then $\pi_{n+1}$ plays exactly BTL, and by Lemma 3.2 the regret is non-positive. In general, we can show that, even with modeling errors, MoBIL-VI can still reach a faster convergence rate such as $O(1/N^2)$, if a non-uniform weighting scheme is used, the model is updated online, and $F$ is realizable within the model class. The details will be presented in Section 4.2.
3.2.2 A Practical Algorithm: MoBILProx
While the previous conceptual algorithm achieves a faster convergence, it requires solving a nontrivial VI problem in each iteration. In addition, it assumes $\ell_n$ is given as a function and requires keeping all the past data to define it. Here we relax these unrealistic assumptions and propose MoBIL-Prox. In round $n$ of MoBIL-Prox, the policy is updated from $\pi_n$ to $\pi_{n+1}$ by taking two gradient steps:

$$\hat{\pi}_{n+1} = \arg\min_{\pi \in \Pi} \sum_{m=1}^{n} w_m \left( \langle g_m, \pi \rangle + r_m(\pi) \right), \qquad \pi_{n+1} = \arg\min_{\pi \in \Pi} \; w_{n+1} \langle \hat{g}_{n+1}, \pi \rangle + \sum_{m=1}^{n} w_m \left( \langle g_m, \pi \rangle + r_m(\pi) \right). \tag{9}$$

We define $r_n$ as an $\alpha_n$-strongly convex function (we recall $\alpha$ is the strong convexity modulus of $\ell_n$) such that $\pi_n$ is its global minimum and $r_n(\pi_n) = 0$ (e.g. a Bregman divergence). And we define $g_n$ and $\hat{g}_{n+1}$ as estimates of $\nabla_2 \ell_n(\pi_n)$ and $\nabla_2 \tilde{F}_{n+1}(\hat{\pi}_{n+1}, \hat{\pi}_{n+1})$, respectively. Here we only require $g_n$ to be unbiased, whereas $\hat{g}_{n+1}$ could be a biased estimate.
MoBIL-Prox treats $\hat{\pi}_{n+1}$, which plays FTL with the gradients from the real environment, as a rough estimate of the next policy and uses it to query a gradient estimate $\hat{g}_{n+1}$ from the model $\tilde{F}_{n+1}$. In this way, the learner's decision $\pi_{n+1}$ can approximately play BTL. If we compare the update rule of $\pi_{n+1}$ with the VI problem in (8), we can see that MoBIL-Prox linearizes the problem and approximates the fixed point $\pi_{n+1}$ by $\hat{\pi}_{n+1}$ when querying the model gradient. While this approximation is crude, interestingly it is sufficient to speed up the convergence rate to be as fast as that of MoBIL-VI under mild assumptions, as shown later in Section 4.3.
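The two-step update can be illustrated on a toy quadratic objective. The sketch below is in the spirit of (9), not the paper's exact rule: it assumes quadratic regularizers $r_m(x) = (\alpha w_m / 2)\|x - x_m\|^2$ so that both argmins are closed-form, and the choice of weights $w_n = n$ and all names are illustrative.

```python
import numpy as np

def mobil_prox_sketch(grad_env, grad_model, x0, rounds=50, alpha=1.0):
    """Alternating update sketch on R^d: x_hat is the FTL step on accumulated
    environment gradients plus quadratic regularizers; the next iterate adds
    the model's predicted gradient queried at x_hat."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    g_sum = np.zeros(d)       # sum_m w_m * g_m
    anchor_sum = np.zeros(d)  # sum_m alpha * w_m * x_m (regularizer anchors)
    W = 0.0                   # sum_m alpha * w_m
    for n in range(1, rounds + 1):
        w = float(n)                     # illustrative nonuniform weights w_n = n
        g = grad_env(x)                  # (stochastic) gradient from the environment
        g_sum += w * g
        anchor_sum += alpha * w * x
        W += alpha * w
        x_hat = (anchor_sum - g_sum) / W             # closed-form FTL argmin
        g_pred = grad_model(x_hat)                   # predicted next gradient
        x = (anchor_sum - g_sum - (n + 1) * g_pred) / W  # second step with model term
    return x
```

With an exact model on a quadratic objective, the predicted gradient cancels the residual error and the iterate lands on the optimum immediately, mirroring how playing BTL yields non-positive regret.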
3.3 Predictive Models
MoBIL uses $\tilde{F}_{n+1}$ in the update rules (8) and (9) at round $n$ to predict the unseen gradient at round $n+1$, speeding up policy learning. Ideally, $\tilde{F}_{n+1}$ should approximate the unknown bivariate function $F$ so that their partial derivatives $\nabla_2 \tilde{F}_{n+1}$ and $\nabla_2 F$ are close. This condition can be seen from (8) and (9), in which MoBIL concerns only $\nabla_2 \tilde{F}_{n+1}$ instead of $\tilde{F}_{n+1}$ directly. In other words, $\tilde{F}_{n+1}$ is used in MoBIL as a first-order oracle, which leverages all the past information (up to the learner playing $\pi_n$ in the environment at round $n$) to predict the future gradient, which depends on the decision the learner is about to make. Hence, we call it a predictive model.
To make the idea concrete, we provide a few examples of these models. By the definition of $F$ in (5), one way to construct the predictive model is through a simulator with an (online-learned) dynamics model, defining $\hat{g}_{n+1}$ as the simulated gradient (computed by querying the expert along the simulated trajectories visited by the learner). If the dynamics model is exact, this recovers the true gradient. Note that a stochastic/biased estimate suffices to update the policies in MoBIL-Prox.
Another idea is to construct the predictive model through the latest sampled per-round cost, indirectly defining $\tilde{F}_{n+1}$ so that its predicted gradient is the gradient of that cost evaluated at the tentative next iterate. This choice is possible because the learner in IL collects samples from the environment, as opposed to, literally, gradients. The approximation error of this choice is determined by the convergence and the stability of the learner's policy: if $\hat{\pi}_{n+1}$ visits states similar to those visited by the next policy, then the last cost's gradient approximates the next one well. Note that this choice is different from using the previous gradient vector in optimistic mirror descent/FTL [17], which would have a larger approximation error due to the additional linearization.
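The distinction between the last-cost model and reusing the previous gradient can be seen in a one-line contrast: the former re-evaluates the old cost's gradient function at the new tentative iterate, while the latter reuses the stale gradient vector. The numbers below are purely illustrative.

```python
# Toy contrast for the two choices, with last round's cost ell_prev(x) = (x - 2)^2:
c_prev = 2.0
grad_prev_fn = lambda x: 2.0 * (x - c_prev)  # gradient *function* of the last cost

x_old, x_hat = 0.0, 1.5                      # old iterate and tentative next iterate
reused_vector = grad_prev_fn(x_old)          # optimistic choice: stale gradient at x_old
last_cost_pred = grad_prev_fn(x_hat)         # last-cost model: re-evaluated at x_hat
# If the next cost resembles the last one, last_cost_pred is the better predictor,
# since it avoids the extra linearization error at the stale point x_old.
```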
Finally, we note that while the concept of predictive models originates from estimating the partial derivative $\nabla_2 F$, a predictive model does not necessarily have to be in the same form. A parameterized vector-valued function can also be learned directly to approximate the vector field, e.g., using a neural network and the sampled gradients in a supervised learning fashion.
4 Theoretical Analysis
Now we prove that using predictive models in MoBIL can accelerate convergence when proper conditions are met. Intuitively, MoBIL converges faster than the usual adversarial approach to IL (like DAgger) when the predictive models have smaller errors than not predicting anything at all (i.e. setting $\tilde{F} = 0$). In the following analyses, we will focus on bounding the expected weighted average regret, as it directly translates into the average performance bound by Lemma 3.1. We define, for a weight sequence $\{w_n\}$,

$$\mathrm{Regret}_N^w := \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi_n) - \min_{\pi \in \Pi} \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi). \tag{10}$$

Note that the results below assume that the predictive models are updated using FTL, as outlined in Algorithm 1. This assumption applies, e.g., when a dynamics model is learned online for a simulator oracle as discussed above. We provide full proofs in Appendix C and a summary of notation in Appendix A.
4.1 Assumptions
We first introduce several assumptions to more precisely characterize the online IL problem.
Predictive models
Let $\tilde{\mathcal{F}}$ be the class of predictive models. We assume these models are Lipschitz continuous in the following sense.
Assumption 4.1.
There is $L < \infty$ such that, for all $\tilde{F} \in \tilde{\mathcal{F}}$ and $\pi, \pi', \pi'' \in \Pi$, $\| \nabla_2 \tilde{F}(\pi', \pi) - \nabla_2 \tilde{F}(\pi'', \pi) \|_* \le L \| \pi' - \pi'' \|$.
Per-round costs
The per-round cost for policy learning is given in (6), and we define an upper bound on the magnitude of its sampled gradients (see e.g. Appendix D). We make structural assumptions on the per-round costs for policy and model learning, similar to the ones made by Ross et al. [2] (cf. Theorem 2.1).
Assumption 4.2.
With probability one, the sampled per-round cost for policy learning is $\alpha$-strongly convex with gradients bounded in norm, and the per-round cost for model learning is likewise strongly convex with bounded gradients.
By definition, these properties extend to the expected per-round costs. We note that they can be relaxed to convexity alone, and our algorithms still improve the best known convergence rate (see Table 1 and Appendix E).
Table 1: Convergence rates of policy learning under convex vs. strongly convex per-round costs, with and without a predictive model.
Expressiveness of hypothesis classes
We introduce two constants, $\epsilon_\Pi$ and $\epsilon_{\tilde{\mathcal{F}}}$, to characterize the policy class $\Pi$ and the model class $\tilde{\mathcal{F}}$, which generalize the idea of (4) to stochastic and general weighting settings. When the weights are uniform and the per-round cost is deterministic, Definition 4.1 agrees with (4). Similarly, we see that if $\pi^\star \in \Pi$ and $F \in \tilde{\mathcal{F}}$, then $\epsilon_\Pi$ and $\epsilon_{\tilde{\mathcal{F}}}$ are zero.
Definition 4.1.
A policy class $\Pi$ is $\epsilon_\Pi$-close to $\pi^\star$ if, for all $N \in \mathbb{N}$ and all weight sequences $\{w_n\}$ with $w_n \ge 0$, $\mathbb{E}\left[ \min_{\pi \in \Pi} \frac{1}{w_{1:N}} \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi) \right] \le \epsilon_\Pi$. Similarly, a model class $\tilde{\mathcal{F}}$ is $\epsilon_{\tilde{\mathcal{F}}}$-close to $F$ if the analogous weighted average of the model per-round costs is bounded by $\epsilon_{\tilde{\mathcal{F}}}$. The expectations above are due to sampling the per-round costs.
4.2 Performance of MoBILVI
Here we show the performance of MoBIL-VI when there is prediction error in the model. The main idea is to treat MoBIL-VI as online learning with prediction [17] and to take the model-predicted cost obtained after solving the VI problem (8) as an estimate of the next per-round cost.
Proposition 4.1.
For MoBIL-VI with uniform weights $w_n = 1$, the expected weighted average regret satisfies $\mathbb{E}[\mathrm{Regret}_N^w] / w_{1:N} \le O\!\left( \epsilon_{\tilde{\mathcal{F}}} + 1/N \right)$.
By Lemma 3.1, this means that if the model class is expressive enough (i.e. $\epsilon_{\tilde{\mathcal{F}}} \approx 0$), then by adapting the model online with FTL, we can improve the original convergence rate in $O(\ln N / N)$ of Ross et al. [2] to $O(1/N)$. While removing the $\ln N$ factor does not seem like much, we will show that running MoBIL-VI can improve the convergence rate to $O(1/N^2)$ when a non-uniform weighting is adopted.
Theorem 4.1.
For MoBIL-VI with weights $w_n = n$, the expected weighted average regret satisfies $\mathbb{E}[\mathrm{Regret}_N^w] / w_{1:N} \le O\!\left( \epsilon_{\tilde{\mathcal{F}}} + 1/N^2 \right)$, up to logarithmic factors.
The key is that $\mathbb{E}[\mathrm{Regret}_N^w]$ can be upper bounded by the regret of the online learning problem for models, whose per-round costs are the weighted model losses. Therefore, if $\epsilon_{\tilde{\mathcal{F}}} = 0$, randomly picking a policy out of $\{\pi_n\}$ with probability proportional to the weights $\{w_n\}$ has expected convergence in $O(1/N^2)$.^{9}^{9}9Rates for other weight choices are given in Appendix C.2.
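The weighted random selection of the final policy amounts to sampling an index with probability proportional to the weights; a minimal sketch (the weight schedule and all names are illustrative):

```python
import numpy as np

def pick_policy(N, w, seed=0):
    """Draw an index n with probability w_n / sum(w), as in the weighted
    random selection of the final policy from the learned sequence."""
    w = np.asarray(w, dtype=float)
    p = w / w.sum()                                   # normalize to a distribution
    idx = int(np.random.default_rng(seed).choice(N, p=p))
    return idx, p
```

With increasing weights such as $w_n = n$, later (better-trained) policies are more likely to be selected, which is what lets the weighted regret bound translate into an expected performance guarantee.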
4.3 Performance of MoBILProx
As MoBIL-Prox uses gradient estimates, we additionally define two constants, $\sigma_g$ and $\sigma_{\hat{g}}$, to characterize the estimation error, where $\sigma_{\hat{g}}$ also accounts for potential bias.
Assumption 4.3.
$\mathbb{E}\left[ \| g_n - \nabla_2 \ell_n(\pi_n) \|_*^2 \right] \le \sigma_g^2$ and $\mathbb{E}\left[ \| \hat{g}_n - \nabla_2 \tilde{F}_n(\hat{\pi}_n, \hat{\pi}_n) \|_*^2 \right] \le \sigma_{\hat{g}}^2$.
We show that this simple first-order algorithm achieves performance similar to MoBIL-VI. Toward this end, we introduce a lemma stronger than Lemma 3.2.
Lemma 4.1 (Stronger FTL Lemma).
Let $x_n^\star \in \arg\min_x \ell_{1:n}(x)$, where $\ell_{1:n} := \sum_{m=1}^{n} \ell_m$. For any sequence of decisions $\{x_n\}$ and losses $\{\ell_n\}$, $\mathrm{Regret}_N \le \sum_{n=1}^{N} \left( \ell_{1:n}(x_n) - \ell_{1:n}(x_n^\star) \right) - \Delta_N$, where $\Delta_N := \sum_{n=1}^{N} \left( \ell_{1:n-1}(x_n) - \ell_{1:n-1}(x_{n-1}^\star) \right) \ge 0$.
The additional subtracted term in Lemma 4.1 is pivotal to proving the performance of MoBIL-Prox.
Theorem 4.2.
For MoBIL-Prox with weights $w_n = n$ and suitably chosen regularization $\alpha_n$, the expected weighted average regret satisfies

$$\frac{\mathbb{E}\left[ \mathrm{Regret}_N^w \right]}{w_{1:N}} \le O\!\left( \frac{\sigma_g^2 + \sigma_{\hat{g}}^2 + \epsilon_{\tilde{\mathcal{F}}}}{N} + \frac{1}{N^2} \right).$$
Proof sketch.
Here we give a proof sketch in big-O notation (see Appendix C.3 for the details). The weighted regret of MoBIL-Prox is controlled by the strong convexity of the regularized cumulative costs that define the updates in (9): their minimizers satisfy a first-order optimality condition that bounds the weighted regret by a sum of per-round gaps. The Stronger FTL Lemma (Lemma 4.1) then upper bounds the regret by this sum minus a non-negative correction term. Since the subtracted term is non-negative, we only need to upper bound the expectation of the first sum. Using the triangle inequality, the model's prediction error of the next per-round cost decomposes into the gradient-estimation errors (measured by $\sigma_g$ and $\sigma_{\hat{g}}$), the model class error $\epsilon_{\tilde{\mathcal{F}}}$, and the regret of online model learning. Because the model is learned online using, e.g., FTL with strongly convex costs, its regret grows only logarithmically in $N$. Summing the weighted terms and dividing by $w_{1:N} = \Theta(N^2)$ proves the theorem. ∎
The main assumption in Theorem 4.2 is that $\nabla_2 \tilde{F}$ is Lipschitz continuous (Assumption 4.1); it does not depend on the continuity of $\nabla_2 F$. Therefore, this condition is practical, as we are free to choose the model class. Compared with Theorem 4.1, Theorem 4.2 explicitly accounts for the inexactness of the gradient estimates; hence the additional term due to $\sigma_g$ and $\sigma_{\hat{g}}$. Under the same assumption as MoBIL-VI, i.e. that exact per-round costs and gradients are available, we can show that the simple MoBIL-Prox has the same performance as MoBIL-VI, which is a corollary of Theorem 4.2.
Corollary 4.1.
If $\sigma_g = 0$ and $\sigma_{\hat{g}} = 0$, then MoBIL-Prox with $w_n = n$ achieves the same expected weighted average regret bound as MoBIL-VI in Theorem 4.1.
The proofs of Theorems 4.1 and 4.2 assume that the predictive models are updated by FTL (see Appendix D for a specific bound when online-learned dynamics models are used as a simulator). However, we note that these results essentially rely only on the property that model learning also has no regret; therefore, the FTL update rule (line 4) can be replaced by a no-regret first-order method without changing the results. This would make the algorithm even simpler to implement. The convergence of other types of predictive models (like the last-cost model discussed in Section 3.3) can also be analyzed following the major steps in the proof of Theorem 4.2, leading to a performance bound in terms of prediction errors. Finally, it is interesting to note that the accelerated convergence is made possible when model learning puts more weight on costs in later rounds (because the weights $w_n$ increase with $n$).
4.4 Comparison
We compare the performance of MoBIL in Theorem 4.2 with that of DAgger in Theorem 2.1 in terms of the constant on the $1/N$ factor. The constant for MoBIL is governed by the magnitude of the model's sampled gradients, whereas the constant for DAgger is governed by the magnitude of the gradients sampled from the real environment.^{10}^{10}10Theorem 2.1 was stated assuming exact per-round costs; in the stochastic setup here, DAgger has a similar convergence rate in expectation, with the corresponding stochastic gradient bound. Therefore, in general, MoBIL-Prox has a better upper bound than DAgger when the model class is expressive, because the variance of the model's sampled gradients can be made small, as we are free to design the model. Note, however, that the improvement of MoBIL may be smaller when the problem is noisy, such that the variance of the real-environment gradients becomes the dominant term.

An interesting property that arises from Theorems 4.1 and 4.2 is that the convergence of MoBIL is not biased by using an imperfect model. In other words, in the worst case of using an extremely wrong predictive model, MoBIL would just converge more slowly, but still to the performance of the expert policy.

MoBIL-Prox is closely related to stochastic Mirror-Prox [18, 11]. In particular, when the exact model is known (i.e. $\tilde{F} = F$) and MoBIL-Prox is set to its convex mode (i.e. with the settings for merely convex per-round costs; see Appendix E), MoBIL-Prox gives the same update rule as stochastic Mirror-Prox with a matching step size (see Appendix F for a thorough discussion). Therefore, MoBIL-Prox can be viewed as a generalization of Mirror-Prox: 1) it allows non-uniform weights; and 2) it allows the vector field to be estimated online by alternately taking stochastic gradients and predicted gradients. The design of MoBIL-Prox is made possible by our Stronger FTL Lemma (Lemma 4.1), which greatly simplifies the original algebraic proof in [18, 11]. Using Lemma 4.1 reveals more closely the interactions between model updates and policy updates. In addition, it more clearly shows the effect of non-uniform weighting, which is essential to achieving the faster convergence. To the best of our knowledge, even the analysis of the original (stochastic) Mirror-Prox from the FTL perspective is new.
5 Experiments
We experimented with MoBIL-Prox in simulation to study how the weights and the choice of model oracle affect learning. We used two weight schedules: $w_n = 1$ as a baseline, and $w_n = n$ as suggested by Theorem 4.2. And we considered several predictive models: (a) a simulator with the true dynamics; (b) a simulator with online-learned dynamics; (c) the last cost function (i.e. the last-cost model of Section 3.3); and (d) no model (i.e. $\tilde{F} = 0$; in this case MoBIL-Prox reduces to the first-order version of DAgger [9], which is considered as a baseline here).
5.1 Setup and Results
Two robot control tasks (CartPole and Reacher3D) powered by the DART physics engine [19] were used as the task environments. The learner was either a linear policy or a small neural network. For each IL problem, an expert policy that shares the same architecture as the learner was used, trained using policy gradients. While sharing the same architecture is not required in IL, here we adopted this constraint to remove the bias due to mismatch between the policy class and the expert policy, clarifying the experimental results. For MoBIL-Prox, we set the regularization such that the effective learning rate is optimal in the convex setting (cf. Table 1), with a multiplier adaptive to the norm of the prediction error. For the dynamics model, we used a neural network trained with FTL. The reported results are averaged over 24 (CartPole) and 12 (Reacher3D) seeds. Figure 1 shows the results for MoBIL-Prox. While the use of neural network policies violates the convexity assumptions in the analysis, it is interesting to see how MoBIL-Prox performs in this more practical setting. We include the experimental details in Appendix G for completeness.
5.2 Discussions
We observe that, when $w_n = 1$, having model information does not improve the performance much over standard online IL (i.e. no model), as suggested by Proposition 4.1. By contrast, when $w_n = n$ (as suggested by Theorem 4.2), MoBIL-Prox improves the convergence and performs better than not using models.^{11}^{11}11We note that the curves for $w_n = 1$ and $w_n = n$ are not directly comparable; we should only compare methods within the same setting, as the optimal step size varies with the weights. The multiplier on the step size was chosen such that MoBIL-Prox performs similarly in both settings. It is interesting to see that this trend also applies to neural network policies.
From Figure 1, we can also study how the choice of predictive model affects convergence. As suggested by Theorem 4.2, MoBIL-Prox improves the convergence only when the model makes nontrivial predictions. If the model is very inaccurate, then MoBIL-Prox can be slower. This can be seen from the performance of MoBIL-Prox with online-learned dynamics models. In the low-dimensional CartPole task, the simple neural network predicts the dynamics well, and MoBIL-Prox with the learned dynamics performs similarly to MoBIL-Prox with the true dynamics. However, in the high-dimensional Reacher3D problem, the learned dynamics model generalizes less well, creating a performance gap between MoBIL-Prox using the true dynamics and the variant using the learned dynamics. We note that MoBIL-Prox still converges in the end despite the model error. Finally, we find that the performance of MoBIL-Prox with the last-cost predictive model is often similar to that of MoBIL-Prox with simulated gradients computed through the true dynamics.
6 Conclusion
We propose two novel model-based IL algorithms, MoBIL-Prox and MoBIL-VI, with strong theoretical properties: they are provably up to an order faster than state-of-the-art IL algorithms and have unbiased performance even when using imperfect predictive models. Although we prove the performance guarantees under convexity assumptions, we empirically find that MoBIL-Prox improves performance even when using neural networks. In general, MoBIL accelerates policy learning when it has access to a predictive model that can predict future gradients nontrivially. While the focus of the current paper is theoretical in nature, the design of MoBIL raises several interesting questions that are important for the reliable application of MoBIL-Prox in practice, such as end-to-end learning of predictive models and designing adaptive regularizations for MoBIL-Prox.
Acknowledgements
This work was supported in part by NSF NRI Award 1637758 and NSF CAREER Award 1750483.
References

 Abbeel and Ng [2005] Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In International Conference on Machine Learning, pages 1–8. ACM, 2005.
 Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
 Ross and Bagnell [2014] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
 Chang et al. [2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.
 Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.
 Le et al. [2018] Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
 Argall et al. [2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
 Bojarski et al. [2017] Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. Explaining how a deep neural network trained with endtoend learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
 Cheng et al. [2018] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning using imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018.
 Cheng and Boots [2018] Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 1801–1809, 2018.
 Juditsky et al. [2011] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
 Pan et al. [2017] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Agile off-road autonomous driving using end-to-end deep imitation learning. arXiv preprint arXiv:1709.07174, 2017.
 Kakade and Langford [2002] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
 Facchinei and Pang [2007] Francisco Facchinei and JongShi Pang. Finitedimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007.
 Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
 McMahan [2017] H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017.
 Rakhlin and Sridharan [2013] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. 2013.
 Nemirovski [2004] Arkadi Nemirovski. Proxmethod with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convexconcave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
 Lee et al. [2018] Jeongseok Lee, Michael X Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha S Srinivasa, Mike Stilman, and C Karen Liu. Dart: Dynamic animation and robotics toolkit. The Journal of Open Source Software, 3(22):500, 2018.
 Gibbs and Su [2002] Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
 Bianchi and Schaible [1996] M Bianchi and S Schaible. Generalized monotone bifunctions and equilibrium problems. Journal of Optimization Theory and Applications, 90(1):31–43, 1996.
 Konnov and Schaible [2000] IV Konnov and S Schaible. Duality for equilibrium problems under generalized monotonicity. Journal of Optimization Theory and Applications, 104(2):395–408, 2000.
 Komlósi [1999] SÁNDOR Komlósi. On the Stampacchia and Minty variational inequalities. Generalized Convexity and Optimization for Economic and Financial Decisions, pages 231–260, 1999.
 HoNguyen and KilincKarzan [2017] Nam HoNguyen and Fatma KilincKarzan. Exploiting problem structure in optimization under uncertainty via online convex optimization. arXiv preprint arXiv:1709.02490, 2017.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
References

Abbeel and Ng [2005] Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In International Conference on Machine Learning, pages 1–8. ACM, 2005.
Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
Ross and Bagnell [2014] Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
Chang et al. [2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.
Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.
Le et al. [2018] Hoang M. Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
Argall et al. [2009] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
Bojarski et al. [2017] Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
Cheng et al. [2018] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning using imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018.
Cheng and Boots [2018] Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 1801–1809, 2018.
Juditsky et al. [2011] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
Pan et al. [2017] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Agile off-road autonomous driving using end-to-end deep imitation learning. arXiv preprint arXiv:1709.07174, 2017.
Kakade and Langford [2002] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
Facchinei and Pang [2007] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.
Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
McMahan [2017] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017.
Rakhlin and Sridharan [2013] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, 2013.
Nemirovski [2004] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
Lee et al. [2018] Jeongseok Lee, Michael X. Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha S. Srinivasa, Mike Stilman, and C. Karen Liu. DART: Dynamic Animation and Robotics Toolkit. The Journal of Open Source Software, 3(22):500, 2018.
Gibbs and Su [2002] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
Bianchi and Schaible [1996] M. Bianchi and S. Schaible. Generalized monotone bifunctions and equilibrium problems. Journal of Optimization Theory and Applications, 90(1):31–43, 1996.
Konnov and Schaible [2000] I. V. Konnov and S. Schaible. Duality for equilibrium problems under generalized monotonicity. Journal of Optimization Theory and Applications, 104(2):395–408, 2000.
Komlósi [1999] Sándor Komlósi. On the Stampacchia and Minty variational inequalities. In Generalized Convexity and Optimization for Economic and Financial Decisions, pages 231–260, 1999.
Ho-Nguyen and Kilinc-Karzan [2017] Nam Ho-Nguyen and Fatma Kilinc-Karzan. Exploiting problem structure in optimization under uncertainty via online convex optimization. arXiv preprint arXiv:1709.02490, 2017.
Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Appendix A Notation
Symbol  Definition 

the total number of rounds in online learning  
the average accumulated cost of RL in (1)  