1 Introduction
In robotics, there is significant interest in using human or algorithmic supervisors to train policies via imitation learning [1, 2, 3, 4]. For example, a trained surgeon with experience teleoperating a surgical robot can provide successful demonstrations of surgical maneuvers [5]. Similarly, known dynamics models can be used by standard control techniques, such as model predictive control (MPC), to generate controls that optimize for task reward [6, 7]. However, there are many cases in which the supervisor is not fixed but instead converges to improved behavior over time, such as when a human is initially unfamiliar with a teleoperation interface or task, or when an algorithmic controller is trained while the dynamics of the system are initially unknown and estimated from experience in the environment. Furthermore, these supervisors are often
slow, as humans can struggle to execute stable, high-frequency actions on a robot [7], and model-based control techniques, such as MPC, typically require computationally expensive stochastic optimization to plan over complex dynamics models [8, 9, 10]. This motivates algorithms that can distill supervisors which are both converging and slow into policies that can be efficiently executed in practice. The idea of distilling improving algorithmic controllers into reactive policies has been explored in a class of reinforcement learning (RL) algorithms known as dual policy iteration (DPI) [11, 12, 13], which alternate between optimizing a reactive learner with imitation learning and a model-based supervisor with data from the learner. However, past methods have mostly been applied in discrete settings [11, 12] or make specific structural assumptions on the supervisor [13]. This paper analyzes learning from a converging supervisor in the context of on-policy imitation learning and shows that this setting is deeply connected to DPI. Prior analyses of on-policy imitation learning algorithms provide regret guarantees given a fixed supervisor [14, 15, 16, 17]. We consider a converging sequence of supervisors and show that similar guarantees hold for the regret against the best policy in hindsight with labels from the converged supervisor, even when only intermediate supervisors provide labels during learning. Since the analysis makes no structural assumptions on the supervisor, this flexibility makes it possible to use any off-policy method as the supervisor in the presented framework, such as an RL algorithm or a human, provided that it converges to a good policy on the learner's distribution. We implement an instantiation of this framework with the deep MPC algorithm PETS [8] as an improving supervisor and maintain the data efficiency of PETS while significantly reducing online computation time, accelerating both policy learning and evaluation.
The key contribution of this work is a new framework for on-policy imitation learning from a converging supervisor. We present a new notion of static and dynamic regret for this framework and obtain sublinear regret guarantees by showing a reduction from this new notion of regret to the standard notion for the fixed supervisor setting. The dynamic regret result is particularly unintuitive, as it indicates that it is possible to do well on each round of learning compared to a learner with labels from the converged supervisor, even though labels are only provided by intermediate supervisors during learning. We then show that the presented framework relaxes assumptions on the supervisor in DPI and perform simulated continuous control experiments suggesting that when a PETS supervisor [8] is used, we can outperform other deep RL baselines while achieving up to an 80-fold speedup in policy evaluation. Experiments on a physical surgical robot yield up to a 20-fold reduction in query time and a 53% reduction in policy evaluation time after accounting for hardware constraints.
2 Related Work
On-policy imitation learning algorithms that directly learn reactive policies from a supervisor were popularized with DAgger [18], which iteratively improves the learner by soliciting supervisor feedback on the learner's trajectory distribution. This yields significant performance gains over analogous off-policy methods [19, 20]. On-policy methods have been applied with both human [21] and algorithmic supervisors [7], but with a fixed supervisor as the guiding policy. We propose a setting where the supervisor improves over time, which is common when learning from a human or when distilling a computationally expensive, iteratively improving controller into a policy that can be efficiently executed in practice. Recently, convergence results and guarantees on regret metrics such as dynamic regret have been shown for the fixed supervisor setting [16, 17, 22]. We extend these results and present a static and dynamic regret analysis of on-policy imitation learning from a convergent sequence of supervisors. Recent work proposes using inverse RL to outperform an improving supervisor [23]. We instead study imitation learning in this context to use an evolving supervisor for policy learning.
Model-based planning has seen significant interest in RL due to the increased efficiency from leveraging structure in settings such as games and robotic control [11, 12, 13]. Furthermore, deep model-based reinforcement learning (MBRL) has demonstrated superior data efficiency compared to model-free methods and state-of-the-art performance on a variety of continuous control tasks [8, 9, 10]. However, these techniques are often too computationally expensive for high-frequency execution, significantly slowing down policy evaluation. To address the online burden of model-based algorithms, Sun et al. [13] define a novel class of algorithms, dual policy iteration (DPI), which alternate between optimizing a fast learner for policy evaluation using labels from a model-based supervisor and optimizing a slower model-based supervisor using trajectories from the learner. However, past work in DPI either involves planning in discrete state spaces [11, 12] or makes specific assumptions on the structure of the model-based controller [13]. We discuss how the converging supervisor framework is connected to DPI but enables a more flexible supervisor specification. We then provide a practical algorithm by using the deep MBRL algorithm PETS [8] as an improving supervisor to achieve fast policy evaluation while maintaining the sample efficiency of PETS.
3 Converging Supervisor Framework and Preliminaries
3.1 OnPolicy Imitation Learning
We consider continuous control problems in a finite-horizon Markov decision process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, P(\cdot \mid \cdot, \cdot), T, R)$,
where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The stochastic dynamics model $P$ maps a state and action to a probability distribution over states,
$T$ is the task horizon, and $R$ is the reward function. A deterministic control policy $\pi: \mathcal{S} \to \mathcal{A}$ maps an input state in $\mathcal{S}$ to an action in $\mathcal{A}$. The goal in RL is to learn a policy $\pi$ over the MDP which induces a trajectory distribution that maximizes the sum of rewards along the trajectory. In imitation learning, this objective is simplified by instead optimizing a surrogate loss function which measures the discrepancy between the actions chosen by the learned parameterized policy $\pi_\theta$
and the supervisor $\psi$. Rather than directly optimizing $\pi_\theta$ from experience, on-policy imitation learning involves executing a policy in the environment and then soliciting feedback from a supervisor on the visited states. This is in contrast to off-policy methods, such as behavior cloning, in which policy learning is performed entirely on states from the supervisor's trajectory distribution. The surrogate loss of a policy
along a trajectory is a supervised learning cost defined by the supervisor relabeling the trajectory's states with actions. The goal of on-policy imitation is to find the policy minimizing the corresponding surrogate risk on its own trajectory distribution. On-policy algorithms typically adhere to the following iterative procedure: (1) at iteration
$i$, execute the current policy $\pi_{\theta_i}$ by deploying the learner in the MDP, observing states and actions as trajectories; (2) receive labels for each state from the supervisor $\psi_i$; (3) update $\theta_i$ according to the supervised learning loss to generate $\theta_{i+1}$. On-policy imitation learning has often been viewed as an instance of online optimization or online learning [14, 16, 17]. Online optimization is posed as a game between an adversary that generates a loss function $\ell_i$ at iteration $i$ and an algorithm that plays a policy $\pi_{\theta_i}$ in an attempt to minimize the total incurred losses. After observing $\ell_i$, the algorithm updates its policy for the next iteration. In the context of imitation learning, the loss at iteration $i$ corresponds to the supervised learning loss function under the current policy. The loss function can then be used to update the policy for the next iteration. The benefit of reducing on-policy imitation learning to online optimization is that well-studied analyses and regret metrics from online optimization can be readily applied to understand and improve imitation learning algorithms. Next, we outline a theoretical framework in which to study on-policy imitation learning with a converging supervisor.
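The iterative procedure in steps (1)-(3) above can be sketched in a few lines. The one-dimensional linear dynamics, the fixed proportional-controller supervisor, and the ridge-regression policy fit below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def rollout(policy, dynamics, s0, horizon):
    """Step (1): execute the current learner policy, recording visited states."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = dynamics(s, policy(s))
    return states

def fit_ridge(X, Y, reg=1e-3):
    """Step (3): least-squares policy update on the aggregated, relabeled data."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

# Toy instantiation (assumed, for illustration): 1-D linear dynamics and a
# fixed proportional-controller supervisor a = -2s.
dynamics = lambda s, a: 0.9 * s + 0.1 * a
supervisor = lambda s: -2.0 * s

theta = np.zeros((1, 1))               # linear learner policy a = theta^T s
X, Y = [], []                          # aggregated (state, label) dataset
for i in range(20):                    # iterations of the on-policy procedure
    policy = lambda s, th=theta: (th.T @ np.atleast_1d(s)).item()
    states = rollout(policy, dynamics, s0=1.0, horizon=10)    # step (1)
    labels = [supervisor(s) for s in states]                  # step (2)
    X += [[s] for s in states]
    Y += [[a] for a in labels]
    theta = fit_ridge(np.array(X), np.array(Y))               # step (3)

print(round(theta[0, 0].item(), 2))    # -2.0: the learner recovers the supervisor
```

Because the supervisor here is fixed and realizable by the policy class, the regression recovers the supervisor's gain almost exactly; the analysis below concerns what happens when the labeling controller itself changes across iterations.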
3.2 Converging Supervisor Framework (CSF)
We begin by presenting a set of definitions for on-policy imitation learning with a converging supervisor in order to analyze the static regret (Section 4.1) and dynamic regret (Section 4.2) that can be achieved in this setting. In this paper, we assume that policies are parameterized by a parameter $\theta$ from a convex compact set $\Theta$ equipped with a norm, which we denote with $\|\cdot\|$
for simplicity for both vectors and operators.
Definition 3.1.
Supervisor: We can think of a converging supervisor as a sequence of supervisors (labelers), $(\psi_i)_{i=1}^{\infty}$, where each $\psi_i$ defines a deterministic controller which maps an input state in $\mathcal{S}$ to an action in $\mathcal{A}$. Supervisor $\psi_i$ provides labels for imitation learning policy updates at iteration $i$.
Definition 3.2.
Learner: The learner is represented at iteration $i$ by a parameterized policy $\pi_{\theta_i}$, where $\pi_\theta$ is a differentiable function in the policy parameter $\theta \in \Theta$.
We denote the state and action at timestep $t$ in the trajectory sampled at iteration $i$ by the learner with $s^i_t$ and $a^i_t$ respectively.
Definition 3.3.
Losses: We consider losses at each round of the form
\[
\ell_i(\theta, \psi) = \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T} \left\|\pi_\theta(s_t) - \psi(s_t)\right\|_2^2\right],
\]
where $p(\tau \mid \theta_i)$ defines the distribution of trajectories generated by $\pi_{\theta_i}$. Gradients of $\ell_i$ with respect to $\theta$ are defined as $\nabla_\theta \ell_i(\theta, \psi)$, with the gradient taken inside the expectation.
For analysis of the converging supervisor setting, we adopt the following standard assumptions. The assumptions in this section and the loss formulation are consistent with those in Hazan [24] and Ross et al. [14] for analysis of online optimization and imitation learning algorithms. The loss incurred by the agent is the population risk of the policy, and extension to empirical risk can be derived via concentration inequalities as in Ross et al. [14].
Assumption 3.1.
Strongly convex losses: $\ell_i(\cdot, \psi)$ is strongly convex with respect to $\theta$ with parameter $\alpha > 0$. Precisely, we assume that
\[
\ell_i(\theta_1, \psi) \;\geq\; \ell_i(\theta_2, \psi) + \nabla_\theta \ell_i(\theta_2, \psi)^\top(\theta_1 - \theta_2) + \frac{\alpha}{2}\left\|\theta_1 - \theta_2\right\|^2 \quad \forall\, \theta_1, \theta_2 \in \Theta.
\]
The expectation over trajectories preserves strong convexity of the squared loss for an individual sample, which is assumed to be convex in $\theta$.
Assumption 3.2.
Bounded operator norm of policy Jacobian: $\left\|\nabla_\theta \pi_\theta(s)\right\| \leq G$ for all $s \in \mathcal{S}$ and $\theta \in \Theta$, where $\nabla_\theta \pi_\theta(s)$ is the Jacobian of the policy with respect to the parameter $\theta$.
Assumption 3.3.
Bounded action space: The action space $\mathcal{A}$ is compact and has diameter $\delta_{\mathcal{A}}$. Equivalently stated: $\sup_{a_1, a_2 \in \mathcal{A}} \left\|a_1 - a_2\right\| = \delta_{\mathcal{A}}$.
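To make Definition 3.3 concrete, the following sketch computes the empirical surrogate loss and its gradient for a linear policy class $\pi_\theta(s) = \theta^\top s$. The linear policy and the exact least-squares check are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def surrogate_loss(theta, states, labels):
    """Empirical version of the surrogate risk in Definition 3.3: the mean
    squared discrepancy between the learner's actions pi_theta(s) = theta^T s
    (linear policy, for illustration) and the supervisor's relabeled actions."""
    S, A = np.asarray(states), np.asarray(labels)
    residual = S @ theta - A
    return float(np.mean(np.sum(residual ** 2, axis=1)))

def surrogate_grad(theta, states, labels):
    """Gradient of the empirical surrogate loss with respect to theta."""
    S, A = np.asarray(states), np.asarray(labels)
    return 2.0 / len(S) * S.T @ (S @ theta - A)

# Illustrative check: when the visited states are relabeled by a linear
# supervisor, the gradient vanishes at the least-squares fit, which recovers
# the supervisor exactly.
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 3))                 # visited states, shape (n, d)
A = S @ np.array([[1.0], [-0.5], [2.0]])     # supervisor labels, shape (n, k)
theta_star = np.linalg.lstsq(S, A, rcond=None)[0]
```

With a linear policy the per-sample squared loss is convex in $\theta$, consistent with Assumption 3.1 once a strongly convex regularizer or a full-rank state covariance is present.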
4 Regret Analysis
We analyze the performance of well-known algorithms in on-policy imitation learning and online optimization under the converging supervisor framework. In this setting, we emphasize that the goal is to achieve low loss with respect to labels from the last observed supervisor $\psi_N$. We achieve these results through regret analysis via reduction of on-policy imitation learning to online optimization, where regret is a standard notion for measuring the performance of algorithms. We consider two forms: static and dynamic regret [25], both of which have been utilized in previous on-policy imitation learning analyses [14, 16]. In this paper, regret is defined with respect to the expected losses under the trajectory distribution induced by the realized sequence of policies $(\pi_{\theta_i})_{i=1}^{N}$. Standard concentration inequalities can be used for finite sample analysis as in Ross et al. [14].
Using static regret, we can show a loose upper bound on average performance with respect to the last observed supervisor $\psi_N$ with minimal assumptions, similar to [14]. Using dynamic regret, we can tighten this upper bound, showing that $\pi_{\theta_i}$ is optimal in expectation on its own distribution with respect to $\psi_N$ for certain algorithms, similar to [16, 22]; however, to achieve this stronger result, we require an additional continuity assumption on the dynamics of the system, which was shown to be necessary by Cheng and Boots [17]. To harness regret analysis in imitation learning, we seek to show that algorithms achieve sublinear regret (whether static or dynamic), denoted by $o(N)$, where $N$ is the number of iterations. That is, the regret should grow at a slower rate than linear in the number of iterations. While existing algorithms can achieve sublinear regret in the fixed supervisor setting, we analyze regret with respect to the last observed supervisor $\psi_N$, even though the learner is only provided labels from the intermediate ones during learning. See the supplementary material for all proofs.
4.1 Static Regret
Here we show that as long as the supervisor labels are Cauchy in expectation, i.e. if the expected label distance $\mathbb{E}\left[\|\psi_j(s) - \psi_i(s)\|\right]$ vanishes as $i, j \to \infty$, it is possible to achieve sublinear static regret with respect to the best policy in hindsight with labels from $\psi_N$ for the whole dataset. This is a more difficult metric than is typically considered in regret analysis for on-policy imitation learning, since labels are provided by the converging supervisor $\psi_i$ at iteration $i$ but regret is evaluated with respect to the best policy given labels from $\psi_N$. Past work has shown that it is possible to obtain sublinear static regret in the fixed supervisor setting under strongly convex losses for standard on-policy imitation learning algorithms such as online gradient descent [24] and DAgger [14]; we extend this and show that the additional asymptotic regret in the converging supervisor setting depends only on the convergence rate of the supervisor. The standard notion of static regret is given in Definition 4.1.
Definition 4.1.
The static regret with respect to the sequence of supervisors $(\psi_i)_{i=1}^{N}$ is given by the difference in the performance of policy $\pi_{\theta_i}$ and that of the best policy in hindsight under the average trajectory distribution induced by the incurred losses, with labels from the current supervisor $\psi_i$:
\[
R_N = \sum_{i=1}^{N} \ell_i(\theta_i, \psi_i) - \min_{\theta \in \Theta} \sum_{i=1}^{N} \ell_i(\theta, \psi_i).
\]
However, we instead analyze the more difficult regret metric presented in Definition 4.2 below.
Definition 4.2.
The static regret with respect to the supervisor $\psi_N$ is given by the difference in the performance of policy $\pi_{\theta_i}$ and that of the best policy in hindsight under the average trajectory distribution induced by the incurred losses, with labels from the last observed supervisor $\psi_N$:
\[
R^S_N = \sum_{i=1}^{N} \ell_i(\theta_i, \psi_N) - \min_{\theta \in \Theta} \sum_{i=1}^{N} \ell_i(\theta, \psi_N).
\]
Theorem 4.1.
$R^S_N$ can be bounded above as follows:
\[
\mathbb{E}\left[R^S_N\right] \;\leq\; \mathbb{E}\left[R_N\right] \;+\; 4\delta_{\mathcal{A}} \sum_{i=1}^{N} \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T} \left\|\psi_N(s_t) - \psi_i(s_t)\right\|\right].
\]
Theorem 4.1 essentially states that the expected static regret in the converging supervisor setting can be decomposed into two terms: one that is the standard notion of static regret, and an additional term that scales with the rate at which the supervisor changes. Thus, as long as there exists an algorithm to achieve sublinear static regret on the standard problem, the only additional regret comes from the evolution of the supervisor. Prior work has shown that algorithms such as online gradient descent [24] and DAgger [14] achieve sublinear static regret under strongly convex losses. Given this reduction, we see that these algorithms can also be used to achieve sublinear static regret in the converging supervisor setup if the extra term is sublinear. Corollary 4.1 identifies when this is the case.
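One way to see where the supervisor-deviation term comes from is that the bounded action space (Assumption 3.3) makes each loss Lipschitz in the supervisor's labels. The following is a sketch of that reduction, writing $R_N$ and $R^S_N$ for the static regrets in Definitions 4.1 and 4.2, $\ell_i$ for the loss at round $i$, and $\delta_{\mathcal{A}}$ for the action-space diameter; the constants are what this particular argument yields, not necessarily the tightest available:

```latex
% Since all of pi_theta(s), psi_i(s), psi_N(s) lie in A with diameter delta_A,
%   ||a - b||^2 - ||a - c||^2
%     = (||a - b|| - ||a - c||)(||a - b|| + ||a - c||)
%    <= 2 delta_A ||b - c||.
% Per-round sensitivity of the loss to swapping labels psi_i -> psi_N:
\ell_i(\theta, \psi_N) - \ell_i(\theta, \psi_i)
  \;\le\; 2\delta_{\mathcal{A}}\,
  \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\Big[\tfrac{1}{T}
  \textstyle\sum_{t=1}^{T}\|\psi_N(s_t) - \psi_i(s_t)\|\Big].
% Applying this once to each learner term and once to the comparator term:
R^S_N \;\le\; R_N + 4\delta_{\mathcal{A}} \sum_{i=1}^{N}
  \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\Big[\tfrac{1}{T}
  \textstyle\sum_{t=1}^{T}\|\psi_N(s_t) - \psi_i(s_t)\|\Big].
```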
Corollary 4.1.
If $\sum_{i=1}^{N} \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T}\|\psi_N(s_t) - \psi_i(s_t)\|\right] = o(N)$, then $\mathbb{E}[R^S_N]$ can be decomposed as follows:
\[
\mathbb{E}\left[R^S_N\right] \;\leq\; \mathbb{E}\left[R_N\right] + o(N).
\]
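As a sanity check on Corollary 4.1, the following simulation runs online gradient descent with labels from the current supervisor $\psi_i$ and measures static regret against the best policy in hindsight with labels from the last supervisor $\psi_N$. It is an illustrative sketch, not the paper's experiment: a fixed state distribution stands in for the learner's trajectory distributions, and the supervisor's parameters converge at an assumed $1/i$ rate:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 2000
w_star = np.array([1.0, -0.5])                 # converged supervisor parameters

def supervisor_w(i):
    """Converging supervisor: parameter error decays like 1/i (assumed rate)."""
    return w_star + np.array([1.0, 1.0]) / i

theta = np.zeros(d)                            # linear learner policy a = theta^T s
thetas, batches = [], []
for i in range(1, N + 1):
    S = rng.normal(size=(10, d))               # states at iteration i (simplified)
    thetas.append(theta.copy())
    batches.append(S)
    labels = S @ supervisor_w(i)               # labels from the CURRENT supervisor
    grad = 2.0 / len(S) * S.T @ (S @ theta - labels)
    theta = theta - grad / (2.0 * i)           # OGD step ~ 1/(alpha*i), alpha = 2

# Static regret against the best policy in hindsight with labels from psi_N.
def loss(th, S):
    return np.mean((S @ th - S @ supervisor_w(N)) ** 2)

X = np.vstack(batches)
theta_best = np.linalg.lstsq(X, X @ supervisor_w(N), rcond=None)[0]
RS = (sum(loss(th, S) for th, S in zip(thetas, batches))
      - sum(loss(theta_best, S) for S in batches))
print(RS / N)  # average regret per round stays small even vs. psi_N labels
```

Even though every label came from an intermediate supervisor, the average regret against the $\psi_N$-labeled comparator is small, mirroring the corollary's conclusion that the extra regret is governed by the cumulative supervisor deviation.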
4.2 Dynamic Regret
Although the static regret analysis provides a bound on the average loss, the quality of that bound depends on the loss of the best policy in hindsight, $\min_{\theta \in \Theta} \sum_{i=1}^{N} \ell_i(\theta, \psi_N)$, which in practice is often very large due to approximation error between the policy class and the actual supervisor. Furthermore, it has been shown that despite sublinear static regret, policy learning may be unstable under certain dynamics [17, 21]. Recent analyses have turned to dynamic regret [16, 17], which measures the suboptimality of a policy on its own distribution: $\sum_{i=1}^{N} \left(\ell_i(\theta_i, \psi_i) - \min_{\theta \in \Theta} \ell_i(\theta, \psi_i)\right)$. Thus, low dynamic regret shows that a policy is on average performing optimally on its own distribution. This framework also helps determine if policy learning will be stable or if convergence is possible [16]. However, these notions require understanding the sensitivity of the MDP to changes in the policy. We quantify this with an additional Lipschitz assumption on the trajectory distributions induced by the policy, as in [16, 17, 22]. We show that even in the converging supervisor setting, it is possible to achieve sublinear dynamic regret given this additional assumption and a converging supervisor by reducing the problem to a predictable online learning problem [22]. Note that this yields the surprising result that it is possible to do well on each round even against a dynamic comparator which has labels from the last observed supervisor. The standard notion of dynamic regret is given in Definition 4.3 below.
Definition 4.3.
The dynamic regret with respect to the sequence of supervisors $(\psi_i)_{i=1}^{N}$ is given by the difference in the performance of policy $\pi_{\theta_i}$ and that of the best policy under the current round's loss, which compares the performance of the current policy $\pi_{\theta_i}$ and current supervisor $\psi_i$:
\[
D_N = \sum_{i=1}^{N} \left(\ell_i(\theta_i, \psi_i) - \min_{\theta \in \Theta} \ell_i(\theta, \psi_i)\right).
\]
However, similar to the static regret analysis in Section 4.1, we seek to analyze the dynamic regret with respect to labels from the last observed supervisor $\psi_N$, which is defined as follows:
Definition 4.4.
The dynamic regret with respect to supervisor $\psi_N$ is given by the difference in the performance of policy $\pi_{\theta_i}$ and that of the best policy under the current round's loss, which compares the performance of the current policy $\pi_{\theta_i}$ and last observed supervisor $\psi_N$:
\[
D^S_N = \sum_{i=1}^{N} \left(\ell_i(\theta_i, \psi_N) - \min_{\theta \in \Theta} \ell_i(\theta, \psi_N)\right).
\]
We first show that there is a reduction from $D^S_N$ to $D_N$.
Lemma 4.1.
$D^S_N$ can be bounded above as follows:
\[
\mathbb{E}\left[D^S_N\right] \;\leq\; \mathbb{E}\left[D_N\right] \;+\; 4\delta_{\mathcal{A}} \sum_{i=1}^{N} \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T} \left\|\psi_N(s_t) - \psi_i(s_t)\right\|\right].
\]
Given the notion of supervisor convergence discussed in Corollary 4.1, Corollary 4.2 shows that if we can achieve sublinear $D_N$, we can also achieve sublinear $D^S_N$.
Corollary 4.2.
If $\sum_{i=1}^{N} \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T}\|\psi_N(s_t) - \psi_i(s_t)\|\right] = o(N)$, then $\mathbb{E}[D^S_N]$ can be decomposed as follows:
\[
\mathbb{E}\left[D^S_N\right] \;\leq\; \mathbb{E}\left[D_N\right] + o(N).
\]
It is well known that $D_N$ cannot be sublinear in general [16]. However, as in [16, 17], we can obtain conditions for sublinear regret by leveraging the structure in the imitation learning problem with a Lipschitz continuity condition on the trajectory distribution. Let $\left\|p(\cdot \mid \theta_1) - p(\cdot \mid \theta_2)\right\|_{TV}$ denote the total variation distance between the two trajectory distributions induced by policies parameterized by $\theta_1$ and $\theta_2$.
Assumption 4.1.
There exists $L > 0$ such that the following holds on the trajectory distributions induced by policies parameterized by $\theta_1$ and $\theta_2$:
\[
\left\|p(\cdot \mid \theta_1) - p(\cdot \mid \theta_2)\right\|_{TV} \;\leq\; L \left\|\theta_1 - \theta_2\right\| \quad \forall\, \theta_1, \theta_2 \in \Theta.
\]
A similar assumption is made by popular RL algorithms [26, 27], and Lemma 4.2 shows that with it, sublinear $D_N$ can be achieved using results from predictable online learning [22].
Lemma 4.2.
If Assumption 4.1 holds and the per-round losses vary sufficiently slowly in the sense of predictable online learning [22], then there exists an algorithm for which $\mathbb{E}[D_N] = o(N)$. If the diameter of the parameter space $\Theta$ is bounded, the greedy algorithm achieves sublinear $D_N$. Furthermore, if the losses are smooth in $\theta$ and $\psi$, then online gradient descent achieves sublinear $D_N$.
Finally, we combine the results of Corollary 4.2 and Lemma 4.2 to conclude that since we can achieve sublinear $D_N$ and have found a reduction from $D^S_N$ to $D_N$, we can also achieve sublinear dynamic regret in the converging supervisor setting.
Theorem 4.2.
If $\sum_{i=1}^{N} \mathbb{E}_{\tau \sim p(\tau \mid \theta_i)}\left[\frac{1}{T}\sum_{t=1}^{T}\|\psi_N(s_t) - \psi_i(s_t)\|\right] = o(N)$ and the assumptions in Lemma 4.2 hold, there exists an algorithm for which $\mathbb{E}[D^S_N] = o(N)$. If the diameter of the parameter space $\Theta$ is bounded, the greedy algorithm achieves sublinear $D^S_N$. Furthermore, if the losses are smooth in $\theta$ and $\psi$, then online gradient descent achieves sublinear $D^S_N$.
5 Converging Supervisors for Deep Continuous Control
Sun et al. [13] apply DPI to continuous control tasks, but assume that both the learner and supervisor are of the same policy class and come from a class of distributions for which computing the KL-divergence is computationally tractable. This limited choice of supervisors makes it hard to achieve results comparable to state-of-the-art deep RL algorithms. In contrast, the converging supervisor framework does not constrain the structure of the supervisor, making it possible to use any converging, improving supervisor (algorithmic or human) with no additional engineering effort. Note that for this framework to be useful, we also implicitly assume that the supervisor's labels improve with respect to the MDP reward function $R$ when the supervisor is trained with data on the learner's distribution; this assumption is validated by experimental results in this paper and those in prior work [11, 12]. In practice, we can encourage the supervisor to improve on the learner's distribution with respect to $R$ by adding noise to the learner to cover the MDP enough for the supervisor to learn the dynamics sufficiently well.
We utilize the converging supervisor framework (CSF) to motivate an algorithm that uses the state-of-the-art deep MBRL algorithm, PETS, as an improving supervisor. Note that while for analysis we assume a deterministic supervisor, PETS produces stochastic supervision for the agent. We observe that this does not detrimentally impact performance of the policy in practice. PETS was chosen since it has demonstrated superior data efficiency compared to other deep RL algorithms [8]. We collect policy rollouts from a model-free learner policy and refit the policy on each episode using DAgger [14] with supervision from PETS, which maintains a trained dynamics model based on the transitions collected by the learner. Supervision is generated via MPC by using the cross-entropy method to plan over the learned dynamics for each state in the learner's rollout, but is collected after the rollout has completed rather than at each timestep of every policy rollout to reduce online computation time.
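The resulting procedure can be sketched as follows. This is a structural sketch only, with assumed toy dynamics, a known reward, and a random-shooting planner standing in for PETS's learned probabilistic dynamics ensemble and CEM optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy known dynamics for illustration; the paper's method instead has PETS
# learn a probabilistic dynamics ensemble from the learner's transitions.
def dynamics(s, a):
    return 0.9 * s + 0.1 * a

def reward(s, a):
    return -(s ** 2)                         # drive the state to the origin

def mpc_supervisor(s, horizon=5, n_samples=128):
    """Sampling-based MPC label (random-shooting stand-in for PETS's CEM):
    return the first action of the best sampled action sequence."""
    plans = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    returns = np.zeros(n_samples)
    for k, plan in enumerate(plans):
        sk = s
        for a in plan:
            returns[k] += reward(sk, a)
            sk = dynamics(sk, a)
    return plans[np.argmax(returns), 0]

theta = 0.0                                  # linear learner policy a = theta * s
X, Y = [], []
for _ in range(10):
    # (1) Roll out the fast model-free learner (online; no planner in the loop).
    s, states = 1.0, []
    for _ in range(15):
        states.append(s)
        s = dynamics(s, theta * s)
    # (2) After the rollout completes, relabel visited states with the slow
    #     MPC supervisor (offline, so it can be parallelized).
    X += states
    Y += [mpc_supervisor(s) for s in states]
    # (3) DAgger-style least-squares update on the aggregated dataset.
    X_arr, Y_arr = np.array(X), np.array(Y)
    theta = float(X_arr @ Y_arr / (X_arr @ X_arr + 1e-6))

print(theta)  # the learner distills the planner's negative feedback gain
```

Only step (1) runs online, so policy evaluation costs a single matrix-vector product per timestep, while the expensive planning in step (2) is deferred until after the rollout, mirroring the timing argument in the text.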
6 Experiments
The method presented in Section 5 uses the Converging Supervisor Framework (CSF) to train a learner policy to imitate a PETS supervisor trained on the learner's distribution. We expect the CSF learner to be less data efficient than PETS, but to have significantly faster policy evaluation time. To evaluate this expected loss in data efficiency, we measure the gap in data efficiency between the learner on its own distribution (CSF learner), the supervisor on the learner's distribution (CSF supervisor), and the supervisor on its own distribution (PETS). Returns for the CSF learner and CSF supervisor are computed by rolling out the model-free learner policy and model-based controller after each training episode. Because the CSF supervisor is trained entirely on off-policy data from the learner, the difference between the CSF learner and CSF supervisor performance measures how effectively the CSF learner is able to track the CSF supervisor's performance. The difference in performance between the CSF supervisor and PETS measures how important on-policy data is for PETS to generate good labels. All runs are repeated 3 times to control for stochasticity in training; see the supplementary material for further experimental details. The DPI algorithm in Sun et al. [13] did not perform well on the presented environments, so we do not report a comparison to it. However, we compare against the following set of 3 state-of-the-art model-free and model-based RL baselines and demonstrate that the CSF learner maintains the data efficiency of PETS while reducing online computation time significantly by only collecting policy rollouts from the fast model-free learner instead of from the PETS supervisor.

- Soft Actor-Critic (SAC): State-of-the-art maximum-entropy model-free RL algorithm [28].
- Twin Delayed Deep Deterministic Policy Gradient (TD3): State-of-the-art model-free actor-critic RL algorithm.
- Model-Ensemble Trust Region Policy Optimization (ME-TRPO): State-of-the-art model-free/model-based hybrid RL algorithm using an ensemble of learned dynamics models to update a closed-loop policy offline with model-free RL [27].
6.1 Simulation Experiments
We consider the PR2 Reacher and Pusher continuous control MuJoCo domains from Chua et al. [8] (Figure 1). For both tasks, the CSF learner outperforms other state-of-the-art deep RL algorithms, demonstrating that the CSF learner enables fast policy evaluation while maintaining sample-efficient learning. The CSF learner closely matches the performance of both the CSF supervisor and PETS, indicating that the CSF learner has similar data efficiency to PETS. Results using a neural network CSF learner suggest that strongly-convex losses may not be necessary in practice.
Figure 1: Training curves for the CSF learner, CSF supervisor, PETS, and baselines for the MuJoCo Reacher (top) and Pusher (bottom) tasks for a linear (left) and neural network (NN) policy (right). The linear policy is trained via ridge regression, satisfying the strongly-convex loss assumption in Section 3. To test more complex policy representations, we repeat experiments with a neural network (NN) learner with 2 hidden layers of 20 hidden units each. The CSF learner is able to successfully track the CSF supervisor on both domains, but also performs well compared to PETS and outperforms other baselines with both policy representations. The CSF learner is slightly less data efficient, but policy evaluation is up to 80x faster than PETS. SAC, TD3, and ME-TRPO use a neural network policy/dynamics class.

This result is promising because if the model-free learner policy is able to achieve similar performance to the supervisor on its own distribution, we can simultaneously achieve the data efficiency benefits of MBRL and the low online computation time of model-free methods. To quantify this speedup, we present timing results in Table 1, which demonstrate that a significant speedup (up to 80x in this case) in policy evaluation is possible. Note that although we still need to evaluate the model-based controller on each state visited by the learner to generate labels, since this only needs to be done offline, it can be parallelized to reduce offline computation time as well.
6.2 Physical Robot Experiments
We also test CSF with a neural network policy on a physical da Vinci Surgical Robot (dVRK) [31] to evaluate its performance on multi-goal tasks in which the end effector must be controlled to desired positions in the workspace. We evaluate the CSF learner, the CSF supervisor, and PETS on the physical robot for both single-arm and double-arm versions of this task, and find that the CSF learner is able to track the PETS supervisor effectively (Figure 2) and provides up to a 22x speedup in policy query time (Table 1). We expect the CSF learner to demonstrate significantly greater speedups relative to standard deep MBRL for higher-dimensional tasks and for systems where higher-frequency commands are possible.
Table 1: Policy query and evaluation times for the CSF learner and PETS on the PR2 Reacher (Sim), PR2 Pusher (Sim), dVRK Reacher, and dVRK Double-Arm Reacher tasks.
7 Conclusion
We formally introduce the converging supervisor framework for on-policy imitation learning and show that under standard assumptions, we can achieve sublinear static and dynamic regret against the best policy in hindsight with labels from the last observed supervisor, even when labels are only provided by the converging supervisor during learning. We then show that there is a connection between the converging supervisor framework and DPI, and use this to present an algorithm to accelerate policy evaluation for model-based RL without making any assumptions on the structure of the supervisor. We use the state-of-the-art deep MBRL algorithm, PETS, as an improving supervisor and maintain its sample efficiency while significantly accelerating policy evaluation. Finally, we evaluate the efficiency of the method by successfully training a policy on a multi-goal reacher task directly on a physical surgical robot. The provided analysis and framework suggest a number of interesting questions regarding the degree to which non-stationary supervisors affect policy learning. In future work, it would be interesting to derive specific convergence guarantees for the converging supervisor setting, consider different notions of supervisor convergence, and study the trade-offs between supervision quality and quantity.
References

Finn et al. [2017]
C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine.
Oneshot visual imitation learning via metalearning.
In S. Levine, V. Vanhoucke, and K. Goldberg, editors,
Proceedings of the 1st Annual Conference on Robot Learning, volume 78
of
Proceedings of Machine Learning Research
, pages 357–368. PMLR, 13–15 Nov 2017. URL http://proceedings.mlr.press/v78/finn17a.html.  Liu et al. [2018] Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In ICRA, pages 1118–1125. IEEE, 2018.
 Yu et al. [2018] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. Oneshot imitation from observing humans via domainadaptive metalearning. In ICLR (Workshop). OpenReview.net, 2018.
 Zhang et al. [2018] T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Y. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8, 2018.
 Gao et al. [2014] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al. Jhuisi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, page 3, 2014.
 Kahn et al. [2017] G. Kahn, T. Zhang, S. Levine, and P. Abbeel. Plato: Policy learning using adaptive trajectory optimization. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3342–3349, 2017.
 Pan et al. [2018] Y. Pan, C.A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots. Agile autonomous driving via endtoend deep imitation learning. In Proceedings of Robotics: Science and Systems (RSS), 2018.
 Chua et al. [2018] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NeurIPS, abs/1805.12114, 2018. URL http://arxiv.org/abs/1805.12114.
 Nagabandi et al. [2018] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. ICRA, 2018.
 Thananjeyan et al. [2019] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. E. Gonzalez, S. Levine, F. Borrelli, and K. Goldberg. Extending deep model predictive control with safety augmented value estimation from demonstrations. arXiv preprint arXiv:1905.13402, 2019.

Anthony et al. [2017]
T. Anthony, Z. Tian, and D. Barber.
Thinking fast and slow with deep learning and tree search.
In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.  Silver et al. [2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Sun et al. [2018] W. Sun, G. J. Gordon, B. Boots, and J. Bagnell. Dual policy iteration. In Advances in Neural Information Processing Systems, pages 7059–7069, 2018.
 Ross et al. [2011] S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to noregret online learning. In AISTATS, 2011.
 Sun et al. [2017] W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3309–3318, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Lee et al. [2019] J. Lee, M. Laskey, A. K. Tanwani, A. Aswani, and K. Y. Goldberg. A dynamic regret analysis and adaptive regularization algorithm for onpolicy robot imitation learning. WAFR, 2019.
 Cheng and Boots [2018] C. Cheng and B. Boots. Convergence of value aggregation for imitation learning. International Conference on Artificial Intelligence and Statistics, abs/1801.07292, 2018. URL http://arxiv.org/abs/1801.07292.

 Bagnell [2015] J. A. D. Bagnell. An invitation to imitation. Technical Report CMU-RI-TR-15-08, Carnegie Mellon University, Pittsburgh, PA, March 2015.
 Pomerleau [1989] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
 Laskey et al. [2017] M. Laskey, C. Chuck, J. Lee, J. Mahler, S. Krishnan, K. Jamieson, A. Dragan, and K. Goldberg. Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 358–365. IEEE, 2017.
 Cheng et al. [2019] C. Cheng, J. Lee, K. Goldberg, and B. Boots. Online learning with continuous variations: Dynamic regret and reductions. CoRR, abs/1902.07286, 2019.
 Jacq et al. [2019] A. Jacq, M. Geist, A. Paiva, and O. Pietquin. Learning from a learner. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2990–2999, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/jacq19a.html.
 Hazan [2016] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
 Zinkevich [2003] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
 Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Kurutach et al. [2018] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SJJinbWRZ.
 Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, 2018.
 Fujimoto et al. [2018] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In ICML, 2018.
 Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015. URL http://arxiv.org/abs/1509.02971.
 Kazanzides et al. [2014] P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio. An open-source research kit for the da Vinci surgical system. In IEEE Intl. Conf. on Robotics and Auto. (ICRA), pages 6434–6439, Hong Kong, China, 2014.
 Chua [2018] K. Chua. Experiment code for “Deep reinforcement learning in a handful of trials using probabilistic dynamics models”. https://github.com/kchua/handful-of-trials, 2018.
 Pong [2018–2019] V. Pong. rlkit. https://github.com/vitchyr/rlkit, 2018–2019.
 Kurutach [2019] T. Kurutach. Model-ensemble trust-region policy optimization (ME-TRPO). https://github.com/thanard/me-trpo, 2019.
 Seita et al. [2018] D. Seita, S. Krishnan, R. Fox, S. McKinley, J. Canny, and K. Goldberg. Fast and reliable autonomous surgical debridement with cable-driven robots using a two-phase calibration procedure. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6651–6658, May 2018. doi: 10.1109/ICRA.2018.8460583.
Appendix A Static Regret
A.1 Proof of Theorem 4.1
Recall the standard notion of static regret as defined in Definition 4.1:
(1) $\mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(\theta_t, \psi_t) - \min_{\theta \in \Theta} \sum_{t=1}^{T} \ell_t(\theta, \psi_t)$
We seek to bound
(2) $\mathrm{Regret}_T^{\psi_T} = \sum_{t=1}^{T} \ell_t(\theta_t, \psi_T) - \sum_{t=1}^{T} \ell_t(\tilde{\theta}_T, \psi_T), \quad \tilde{\theta}_T = \operatorname*{arg\,min}_{\theta \in \Theta} \sum_{t=1}^{T} \ell_t(\theta, \psi_T)$
Notice that this corresponds to the static regret of the agent with respect to the losses parameterized by the last observed supervisor $\psi_T$. We can bound it as follows:
(3) $\mathrm{Regret}_T^{\psi_T} = \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\tilde{\theta}_T, \psi_T) \right]$
(4) $= \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) + \ell_t(\theta_t, \psi_t) - \ell_t(\tilde{\theta}_T, \psi_t) + \ell_t(\tilde{\theta}_T, \psi_t) - \ell_t(\tilde{\theta}_T, \psi_T) \right]$
(5) $= \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_t) - \ell_t(\tilde{\theta}_T, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\tilde{\theta}_T, \psi_t) - \ell_t(\tilde{\theta}_T, \psi_T) \right]$
(6) $\leq \mathrm{Regret}_T + \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\tilde{\theta}_T, \psi_t) - \ell_t(\tilde{\theta}_T, \psi_T) \right]$
Here, inequality 6 follows from the fact that $\sum_{t=1}^{T} \ell_t(\theta_t, \psi_t) - \sum_{t=1}^{T} \ell_t(\tilde{\theta}_T, \psi_t) \leq \mathrm{Regret}_T$, since $\tilde{\theta}_T$ is a feasible comparator in (1). Now, we can focus on bounding the extra terms. Let $j_t = \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t)$.
(7) $j_t = \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t)$
(8) $= \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ \|\pi_{\theta_t}(s) - \pi_{\psi_T}(s)\|_2^2 - \|\pi_{\theta_t}(s) - \pi_{\psi_t}(s)\|_2^2 \right]$
(9) $\leq \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ \nabla_a \|\pi_{\theta_t}(s) - a\|_2^2 \,\big|_{a = \pi_{\psi_T}(s)} \cdot \left( \pi_{\psi_T}(s) - \pi_{\psi_t}(s) \right) \right]$
(10) $= \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ 2 \left( \pi_{\psi_T}(s) - \pi_{\theta_t}(s) \right)^\top \left( \pi_{\psi_T}(s) - \pi_{\psi_t}(s) \right) \right]$
(11) $\leq \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ 2 \, \|\pi_{\psi_T}(s) - \pi_{\theta_t}(s)\|_2 \, \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2 \right]$
(12) $\leq 2\delta \, \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2 \right]$
Equation 8 follows from applying the definition of the loss function. Inequality 9 follows from applying convexity of $\|\pi_{\theta_t}(s) - a\|_2^2$ in $a$. Equation 10 follows from evaluating the corresponding gradients. Inequality 11 follows from Cauchy–Schwarz, and inequality 12 follows from the action space bound $\|\pi_{\psi_T}(s) - \pi_{\theta_t}(s)\|_2 \leq \delta$, where $\delta$ is the diameter of the action space. The second extra term in (6) is bounded identically, with the roles of $\psi_t$ and $\psi_T$ exchanged. Thus, we have:
(13) $\mathrm{Regret}_T^{\psi_T} \leq \mathrm{Regret}_T + 4\delta \sum_{t=1}^{T} \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2 \right]$
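As a sanity check (not part of the paper), the chain convexity → Cauchy–Schwarz → action-space bound implies the pointwise inequality $\|a - c\|_2^2 - \|a - b\|_2^2 \leq 2\delta \|c - b\|_2$ whenever all actions lie in a ball of diameter $\delta$. The sketch below verifies this on random draws; numpy and the specific sampler are assumptions, not from the paper.

```python
import numpy as np

def count_violations(n_samples=10000, delta=2.0, dim=4, seed=0):
    """Empirically check the per-round bound used in the proof:
    ||a - c||^2 - ||a - b||^2 <= 2 * delta * ||c - b||, whenever all
    actions lie in a ball of diameter delta (so ||a - c|| <= delta)."""
    rng = np.random.default_rng(seed)

    def sample_action():
        # random direction scaled to radius <= delta / 2, so any two
        # sampled actions are at most delta apart (action-space bound)
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v) * rng.uniform(0.0, delta / 2.0)

    violations = 0
    for _ in range(n_samples):
        a = sample_action()  # learner action, pi_{theta_t}(s)
        b = sample_action()  # intermediate supervisor action, pi_{psi_t}(s)
        c = sample_action()  # final supervisor action, pi_{psi_T}(s)
        lhs = np.sum((a - c) ** 2) - np.sum((a - b) ** 2)
        rhs = 2.0 * delta * np.linalg.norm(c - b)
        if lhs > rhs + 1e-9:
            violations += 1
    return violations

print(count_violations())  # 0: the inequality holds on every draw
```

Since the inequality is implied by convexity and the action-space bound alone, no draw can violate it; the empirical check is only an illustration of the pointwise step, not of the full regret bound.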
A.2 Proof of Corollary 4.1
(14) $\lim_{t \to \infty} \; \sup_{t_2 > t_1 \geq t} \; \sup_{s} \|\pi_{\psi_{t_2}}(s) - \pi_{\psi_{t_1}}(s)\|_2 = 0$
implies that
(15) $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sup_{s} \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2 = 0$
This in turn implies that
(16) $\lim_{T \to \infty} \frac{1}{T} \, \mathrm{Regret}_T^{\psi_T} = 0$
Remark: For sublinearity, we really only need Inequality 15 to hold. Due to the dependence of $\sup_s \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2$ on the parameter of the policy at iteration $T$, we tighten this assumption with the stricter Cauchy condition 14 to remove the dependence of a component of the regret on the sequence of policies used.
The Additive Cesàro's Theorem states that if the sequence $(a_t)_{t \geq 1}$ has a limit $a$, then $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} a_t = a$.
Thus, we see that if $\lim_{t \to \infty} a_t = 0$, then it must be the case that $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} a_t = 0$. This shows that for some sequence $(\gamma_t)$ converging to 0 with $\sup_s \|\pi_{\psi_T}(s) - \pi_{\psi_t}(s)\|_2 \leq \gamma_t$, it must be the case that $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \gamma_t = 0$.
Thus, based on the regret bound in Theorem 4.1, we can achieve sublinear $\mathrm{Regret}_T^{\psi_T}$ for any sequence $(\gamma_t)$ converging to 0, given an algorithm that achieves sublinear $\mathrm{Regret}_T$.
∎
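The role of Cesàro averaging in the corollary can be illustrated numerically. A minimal sketch, assuming an online learner with $O(\sqrt{T})$ regret, a supervisor drift sequence $\gamma_t = 1/\sqrt{t} \to 0$, and a combined bound of the form $\mathrm{Regret}_T + c \sum_t \gamma_t$ (all three are illustrative assumptions, not values from the paper):

```python
import math

def per_round_bound(T, c=1.0):
    """Average of an assumed bound Regret_T + c * sum_t gamma_t over T
    rounds, with Regret_T = sqrt(T) (sublinear online learner) and
    gamma_t = 1 / sqrt(t) (supervisor drift converging to 0)."""
    regret = math.sqrt(T)
    drift = sum(1.0 / math.sqrt(t) for t in range(1, T + 1))
    return (regret + c * drift) / T

vals = [per_round_bound(T) for T in (10, 100, 1000, 10000)]
print(vals)  # the averaged bound shrinks toward 0 as T grows
```

Both terms grow like $\sqrt{T}$ here, so the average decays like $1/\sqrt{T}$, matching the corollary's conclusion that the regret against the converged supervisor is sublinear.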
Appendix B Dynamic Regret
B.1 Proof of Lemma 4.1
Recall the standard notion of dynamic regret as defined in Definition 4.3:
(17) $\text{D-Regret}_T = \sum_{t=1}^{T} \ell_t(\theta_t, \psi_t) - \sum_{t=1}^{T} \min_{\theta \in \Theta} \ell_t(\theta, \psi_t)$
We seek to bound
(18) $\text{D-Regret}_T^{\psi_T} = \sum_{t=1}^{T} \ell_t(\theta_t, \psi_T) - \sum_{t=1}^{T} \ell_t(\theta_t^*, \psi_T), \quad \theta_t^* = \operatorname*{arg\,min}_{\theta \in \Theta} \ell_t(\theta, \psi_T)$
Notice that this corresponds to the dynamic regret of the agent with respect to the losses parameterized by the most recent supervisor $\psi_T$. We can bound it as follows:
(19) $\text{D-Regret}_T^{\psi_T} = \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t^*, \psi_T) \right]$
(20) $= \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) + \ell_t(\theta_t, \psi_t) - \ell_t(\theta_t^*, \psi_t) + \ell_t(\theta_t^*, \psi_t) - \ell_t(\theta_t^*, \psi_T) \right]$
(21) $= \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_t) - \ell_t(\theta_t^*, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\theta_t^*, \psi_t) - \ell_t(\theta_t^*, \psi_T) \right]$
(22) $\leq \text{D-Regret}_T + \sum_{t=1}^{T} \left[ \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t) \right] + \sum_{t=1}^{T} \left[ \ell_t(\theta_t^*, \psi_t) - \ell_t(\theta_t^*, \psi_T) \right]$
Here, inequality 22 follows from the fact that $\ell_t(\theta_t^*, \psi_t) \geq \min_{\theta \in \Theta} \ell_t(\theta, \psi_t)$ for each $t$. Now, as before, we can focus on bounding the extra terms. Let $j_t = \ell_t(\theta_t, \psi_T) - \ell_t(\theta_t, \psi_t)$.
(23) $j_t = \mathbb{E}_{s \sim d^{\pi_{\theta_t}}} \left[ \|\pi_{\theta_t}(s) - \pi_{\psi_T}(s)\|_2^2 - \|\pi_{\theta_t}(s) - \pi_{\psi_t}(s)\|_2^2 \right]$