No Need for Interactions: Robust Model-Based Imitation Learning using Neural ODE

04/03/2021, by HaoChih Lin, et al.

Interactions with either environments or expert policies during training are needed for most current imitation learning (IL) algorithms. For IL problems with no interactions, a typical approach is Behavior Cloning (BC). However, BC-like methods tend to suffer from distribution shift. To mitigate this problem, we propose a Robust Model-Based Imitation Learning (RMBIL) framework that casts imitation learning as an end-to-end differentiable nonlinear closed-loop tracking problem. RMBIL applies Neural ODE to learn a precise multi-step dynamics and a robust tracking controller via the Nonlinear Dynamics Inversion (NDI) algorithm. The learned NDI controller is then combined with a trajectory generator, a conditional VAE, to imitate an expert's behavior. Theoretical derivation shows that the controller network can approximate an NDI controller when minimizing the training loss of the Neural ODE. Experiments on Mujoco tasks also demonstrate that RMBIL is competitive with the state-of-the-art generative adversarial method (GAIL) and achieves at least a 30% performance gain over BC on uneven surfaces.


I Introduction

The majority of recent Imitation Learning (IL) works assume interactions with environments, and several algorithms have been proposed to solve the IL problem in this context, such as Inverse Reinforcement Learning (IRL) [18, 1]. The state-of-the-art Generative Adversarial Imitation Learning (GAIL) [12] is also based on prior IRL works. The success of GAIL has popularized the adversarial IL (AIL) framework in the IL research field [16, 7, 35, 14]. However, the reinforcement learning loop inside AIL methods risks driving the learned policy into unsafe or undefined regions of the state space during training. As such, in this work, we address the scenario where the imitated policy is NOT allowed to interact with environments or access information from the expert policy during the training phase.

For scenarios where interactions are not available, a common approach is still Behavior Cloning [20, 24], which imitates the expert by approximating the conditional distribution of actions over states in a supervised learning fashion. With sufficient demonstrations collected from the expert, BC methods have found wide application in autonomous driving [4] and robot locomotion [32]. Nevertheless, the robustness of BC is not guaranteed because of compounding errors caused by the covariate shift issue [22].

Fig. 1: Concepts behind the proposed RMBIL framework. Green blocks represent the physical environment and yellow blocks represent neural networks. The red dashed block indicates the target of the imitation learning. (a) Typical framework for Model-Based Imitation Learning (MBIL). (b) Classical block diagram of the Nonlinear Dynamics Inversion (NDI) controller framework, where the Reference Model (RM) is equivalent to the Trajectory Generator block in MBIL. The linear controller and the NDI block together form the Tracking Controller. (c) & (d) Outline of the proposed RMBIL framework at the training and inference phases, respectively, where the odeint block is a third-party numerical ODE integrator.

Some efforts have been made to address the compounding-error issue within the BC framework. Ross et al. proposed DAgger [22], which enables the expert policy to correct the behaviors of the imitated policy during training. Laskey et al. [15] introduced Disturbances for Augmenting Robot Trajectory (DART), which applies the expert policy to generate sub-optimal demonstrations. Torabi et al. presented BCO [32], which uses an inverse dynamics model to infer actions from observations through environment exploration. Nevertheless, all these algorithms require interactions with either the environment or the expert policy during training, which conflicts with our problem setting.

In order to effectively utilize the physical information embedded in the finite expert demonstrations, we choose a model-based scheme rather than BC-like or model-free approaches. However, the performance of a model-based approach is directly influenced by the accuracy of the learned system dynamics, as well as the stability and robustness of the learned controller. In the deep learning field, such properties are hard to verify and guarantee. Therefore, we borrow concepts and definitions from nonlinear control theory in order to analyze the proposed framework.

As described by Osa et al. [19], the model-based imitation learning (MBIL) problem can be considered a typical closed-loop tracking control problem, composed of system dynamics, a tracking controller, and a trajectory generator (shown in Fig. 1-(a)). In the traditional nonlinear control field, such a tracking control problem can be transformed into an equivalent linear system through the Nonlinear Dynamics Inversion (NDI) algorithm [29, 8] (shown in Fig. 1-(b)). As a result, the transformed linear system can be flexibly controlled by a linear controller. By applying the NDI concept to the MBIL problem, we can adopt nonlinear control methodology, such as stability and robustness analysis [8]. However, the stability of NDI is guaranteed only if the dynamics model matches the physical system [26].

The recent Neural ODE work [6] presents a framework for learning a precise multi-step continuous system dynamics from irregular time-series data. Building on this seminal work, Rubanova et al. [23] and Zhong et al. [36] proposed further improvements. As one key part of this work, we advocate a purely data-driven, Neural ODE based approach for learning a multi-step continuous actuated dynamics by applying a zero-order hold to the control inputs. Based on the precisely learned dynamics, we propose the Robust Model-Based Imitation Learning (RMBIL) framework, which formulates MBIL as an end-to-end differentiable nonlinear tracking control problem via the NDI algorithm, as illustrated in Fig. 1-(c), where the system dynamics and the NDI control policy are trained with a closed-loop Neural ODE model.

On the other hand, the trajectory generator block in the MBIL methodology is equivalent to the Reference Model (RM) in the NDI framework. The RM module is usually a hand-designed kinodynamic trajectory approximator that mimics the expert's behaviors via a high-order polynomial function. Inspired by recent works on trajectory prediction [33, 3, 9], we instead adopt a Conditional Variational Autoencoder (CVAE) [30, 13], conditioned on the previous state, to replace the hand-designed RM block. At inference, only the learned NDI controller and the decoder of the trained CVAE are used for imitating the expert's trajectories, as shown in Fig. 1-(d). In addition, to increase the robustness of the proposed framework, we adopt the concept of the sliding surface from Sliding Mode Control (SMC), a classical nonlinear robust control algorithm [27]. Empirically, we find that the sliding surface can be approximated by injecting noise into the system state while training the NDI controller with the closed-loop Neural ODE model. As a result, RMBIL obtains approximately a 30% performance increase over BC and is competitive with the GAIL approach in disturbed environments.

II Preliminary

We are given expert demonstrations D = {τ}, where each τ = {(x_t, u_t)} is a finite sequence of state-action pairs over T samples, x_t is sampled from the state trajectory, u_t is sampled from the expert policy outputs, and a context vector c represents the initial state of the sequence. A common method to solve MBIL is to explicitly train a discrete forward dynamics x_{t+1} = f(x_t, u_t), as well as a tracking controller u_t = π(x_t, x_ref), where the reference state x_ref is provided by a trajectory generator conditioned on c, for all tasks. Since all models should be independent of c, without loss of generality we assume c equals the initial state x_0 in the following derivation and drop the context index for simplicity.
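As a rough sketch of these three learnable components, the PyTorch interfaces below mirror the discrete forward dynamics, tracking controller, and trajectory generator described above; all class names, layer sizes, and signatures are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Discrete forward dynamics: x_{t+1} = f(x_t, u_t)."""
    def __init__(self, state_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + act_dim, hidden),
                                 nn.ELU(), nn.Linear(hidden, state_dim))

    def forward(self, x, u):
        return self.net(torch.cat([x, u], dim=-1))

class TrackingController(nn.Module):
    """Tracking controller: u_t = pi(x_t, x_ref)."""
    def __init__(self, state_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * state_dim, hidden),
                                 nn.ELU(), nn.Linear(hidden, act_dim))

    def forward(self, x, x_ref):
        return self.net(torch.cat([x, x_ref], dim=-1))

class TrajectoryGenerator(nn.Module):
    """Trajectory generator: x_ref = g(x_{t-1})."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden),
                                 nn.ELU(), nn.Linear(hidden, state_dim))

    def forward(self, x_prev):
        return self.net(x_prev)
```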

II-A Nominal Control: Nonlinear Dynamics Inversion

The main concept behind NDI is to cancel nonlinearities of a system via input-output linearization [27]. To review the theory of NDI, consider a continuous input-affine system defined as:

(1)  ẋ = f(x) + G(x)u
(2)  y = h(x)

where y is the output vector, f(x) and h(x) are smooth vector fields, and G(x) is a matrix whose columns are smooth vector fields. To find the explicit relation between the outputs y and the control inputs u, Eq. (2) is differentiated repeatedly (for simplicity, we write f, G, and h for f(x), G(x), and h(x)).

(3)  ẏ = ∇h ẋ = ∇h (f + G u)

where ∇ is the gradient operator. If the term ∇h G in Eq. (3) is not zero, then the input-output relation is found. To achieve the desired outputs, Eq. (3) is reformulated as:

(4)  u = (∇h G)⁻¹ (ν − ∇h f)

where ν is a virtual input that is tracked by the output derivative ẏ. Hence, by controlling ν, the corresponding u in Eq. (4) drives the output y in Eq. (2) to the desired value through Eq. (1). A common approach to obtain a suitable ν is a linear proportional feedback controller:

(5)  ν = K (x_ref − x)

where K is a diagonal matrix with manually chosen gain values. Eq. (4) holds if ∇h G is invertible. If ∇h G is a non-square matrix, the pseudo-inverse method may be applied. In addition, we assume our system is fully observable with y = h(x) = x, which is a convention under the MDP assumption in the IL field [12, 34, 32]. Therefore, ∇h is an identity matrix, and Eq. (4) can be rewritten as follows:

(6)  u = G⁻¹ (ν − f)

where ν is a function of x and x_ref according to Eq. (5). By applying Eq. (6) to the physical platform, the whole system becomes linear and can be flexibly controlled by Eq. (5). However, the stability of an NDI controller is guaranteed only if the dynamics model (Eq. (1) and (2)) matches the physical platform [26].
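As a worked illustration of Eq. (5)-(6), the following NumPy sketch applies NDI to a toy input-affine pendulum; the model constants, the gain K, and the use of a pseudo-inverse for the non-square input matrix are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy NDI tracking step for a damped pendulum with state x = [theta, omega].
g, l, m, b = 9.81, 1.0, 1.0, 0.1
dt = 0.01
K = np.diag([2.0, 2.0])                      # feedback gain of Eq. (5)

def f(x):                                    # drift term f(x) of Eq. (1)
    theta, omega = x
    return np.array([omega, -(g / l) * np.sin(theta) - b * omega])

def G(x):                                    # input matrix G(x) of Eq. (1)
    return np.array([[0.0], [1.0 / (m * l ** 2)]])

x = np.array([0.5, 0.0])                     # current state
x_ref = np.array([0.0, 0.0])                 # reference from the trajectory generator

nu = K @ (x_ref - x)                         # Eq. (5): virtual input
u = np.linalg.pinv(G(x)) @ (nu - f(x))       # Eq. (6) with a pseudo-inverse
x = x + dt * (f(x) + G(x) @ u)               # forward-Euler step of Eq. (1)
print("control input:", u, "next state:", x)
```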

II-B Ancillary Control: Sliding Mode Control

The idea behind SMC is to design a nonlinear controller that forces the disturbed system to slide along the desired trajectory. As defined in [28], the SMC scheme consists of two steps: (1) find a sliding surface such that the system stays confined to it once the system trajectory reaches the surface in finite time; in our case the surface is defined by the tracking error x_ref − x for the fully-observable system; (2) design a switching function s, which represents the distance between the current state and the sliding surface, for the feedback control law to lead the system trajectory to intersect and stay on the surface. In this work, we define s = x_ref − x, which satisfies s = 0 when x = x_ref.

The sufficient condition [28] for the global stability of SMC is (1/2) d(s²)/dt ≤ −η|s| with η > 0. To satisfy this condition under the dynamics model (Eq. (1)), the resulting nonlinear robust control policy is designed as follows (denoting the nominal NDI input of Eq. (6) as u_n):

(7)  s = x_ref − x
(8)  u_s = G⁻¹ Λ sign(s)
(9)  u_r = u_n + u_s

where Λ is a positive definite matrix. Therefore, by adding an additional feedback term u_s, which depends on the switching function s, to the nominal NDI controller u_n, we obtain a robust NDI control law u_r. In addition, Eq. (9) implies that u_r is identical to u_n when s = 0.
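Building on the pendulum sketch above, the snippet below shows one common way to augment the nominal NDI input with a sliding-mode ancillary term; the sign-based switching term and the gain matrix are textbook choices and not necessarily the exact form used in the paper.

```python
import numpy as np

def robust_ndi(x, x_ref, f, G, K, Lam):
    """Nominal NDI input plus a sliding-mode ancillary term (illustrative form).

    f, G: callables returning the drift term and input matrix of Eq. (1);
    K: linear feedback gain of Eq. (5); Lam: positive-definite switching gain.
    """
    nu = K @ (x_ref - x)                              # Eq. (5): virtual input
    s = x_ref - x                                     # switching function (fully observable)
    u_n = np.linalg.pinv(G(x)) @ (nu - f(x))          # nominal NDI input, Eq. (6)
    u_s = np.linalg.pinv(G(x)) @ (Lam @ np.sign(s))   # ancillary sliding-mode term
    return u_n + u_s                                  # reduces to u_n when s == 0
```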

III Methodology

Fig. 2: Actuated Neural ODE model. The yellow block is a neural network that computes the time derivative of the actuated dynamics. The green block is a continuous input function approximated by the ZOH method from the discrete input sequence. Although the internal state is continuous, the predicted outputs are discrete with respect to the specified time sequence. Hence, the loss can be computed from the two discrete trajectories.

III-A Multi-Step Actuated Dynamics using Neural ODE

Instead of learning a discrete dynamics, we are interested in a continuous-time dynamics with control inputs, formulated as ẋ(t) = f(x(t), u(t)). To approximate such a differential dynamical system with a neural network f_θ, we adopt the Neural ODE model proposed by Chen et al. [6], which solves the initial-value problem (Eq. (10)) in a supervised learning fashion by backpropagating through a black-box ODE solver using the adjoint sensitivity method [21] for memory efficiency.

(10)  x̂(t_1) = x(t_0) + ∫_{t_0}^{t_1} f_θ(x(t)) dt = ODESolve(f_θ, x(t_0), t_0, t_1)

To update the dynamics f_θ with respect to its weights θ, the loss function can be constructed as the L2-norm between the predicted states x̂(t) and the true states x(t) over a certain time horizon T, that is, L_dyn = Σ_{i=1}^{T} ||x̂(t_i) − x(t_i)||². However, Eq. (10) can only handle an autonomous system ẋ = f(x). In order to apply the Neural ODE to an actuated dynamics, Rubanova et al. [23] proposed an RNN-based framework that maps the system into a latent space and solves the latent autonomous dynamics, while Zhong et al. [36] introduced an augmented dynamics by appending the control input to the system state, under the strong assumption that the control signal stays constant over the time horizon T.

In this work, we propose a more intuitive and general method to handle actuated dynamics. We construct a continuous input function u(t) from the sampled control signal by zero-order hold (ZOH). At each internal integration time inside the Neural ODE, the actuated dynamics can always obtain the corresponding control signal by evaluating the input function u(t), as illustrated in Fig. 2. As such, the Neural ODE (Eq. (10)) can be directly applied to an arbitrary dynamical system with external controls without any additional assumption or constraint. Once the training loss converges with a sufficiently small integration step of the ODE solver, we obtain a precise multi-step continuous actuated dynamics from the discrete expert demonstrations. Unfortunately, compared to baseline models such as a multilayer perceptron (MLP), the integration solver inside the Neural ODE becomes an inference-time bottleneck, which limits the learned dynamics from being integrated into classic model-based planning and control schemes such as Model Predictive Control (MPC) [17, 10, 2].
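A minimal sketch of this ZOH idea, assuming the torchdiffeq package and its odeint(func, y0, t) interface: a piecewise-constant input function is evaluated at every internal solver time, and the multi-step L2 loss is backpropagated through the solver. Shapes, network sizes, and the placeholder data below are illustrative.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint   # assumes the torchdiffeq package: odeint(func, y0, t)

class ZOHControl:
    """Continuous input function u(t): zero-order hold over a discrete action sequence."""
    def __init__(self, u_seq, t_seq):
        self.u_seq, self.t_seq = u_seq, t_seq          # (T, act_dim), (T,), uniform spacing
        self.dt = t_seq[1] - t_seq[0]

    def __call__(self, t):
        idx = int(((t - self.t_seq[0]) / self.dt).floor().clamp(0, len(self.u_seq) - 1))
        return self.u_seq[idx]                         # most recent sample, held constant

class ActuatedODE(nn.Module):
    """Neural network for the time derivative of the actuated dynamics, f_theta(x, u(t))."""
    def __init__(self, state_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + act_dim, hidden),
                                 nn.ELU(), nn.Linear(hidden, state_dim))
        self.u_fn = None                               # set to a ZOHControl before each rollout

    def forward(self, t, x):
        return self.net(torch.cat([x, self.u_fn(t)], dim=-1))

# One illustrative training step on placeholder data.
state_dim, act_dim, T = 11, 3, 16
x_demo, u_demo = torch.randn(T, state_dim), torch.randn(T, act_dim)
t_seq = torch.linspace(0.0, 0.15, T)

dyn = ActuatedODE(state_dim, act_dim)
opt = torch.optim.Adam(dyn.parameters(), lr=1e-3)
dyn.u_fn = ZOHControl(u_demo, t_seq)
x_pred = odeint(dyn, x_demo[0], t_seq, rtol=1e-3, atol=1e-4)   # (T, state_dim)
loss = ((x_pred - x_demo) ** 2).mean()                         # multi-step L2 loss
opt.zero_grad(); loss.backward(); opt.step()
```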

III-B Learning-Based NDI Controller via Neural ODE

To overcome the inference-time bottleneck caused by the ODE solver, we train a controller network through the learned dynamics with the Neural ODE (Eq. (10)). As a result, only the learned controller is used for controlling the physical system. Based on this concept, Proposition III.1 expresses how a control policy, parameterized by the controller network weights, approximates the NDI formulation (Eq. (6)).

Proposition III.1 (NDI Controller Training)

Assume the trained dynamics f_θ accurately matches the true dynamics and that the true dynamics is input-affine. Then, as the closed-loop Neural ODE training loss approaches zero along the demonstrated trajectories, the controller network approaches the NDI controller of Eq. (6).

Fig. 3: Performance comparison of the proposed MBIL and RMBIL versus baselines with respect to the number of demonstrations. The x-axis is the number of expert demonstrations used in training, and the y-axis is the normalized reward (expert as one, random as zero). The shaded area around each line represents its variance computed over 50 test trajectories. Note: GAIL needs environment interactions during training, while RMBIL does not.

Proposition III.1 implies that, by minimizing the training loss, the NDI controller is learned with the Neural ODE, provided the trained dynamics is accurate and can be expressed as an input-affine system. To prove this proposition, we start from the loss function with respect to the controller weights.

(11)

The first term on the RHS of Eq. (11) is zero since the initial state is given as an initial condition. We consider a single-step horizon without loss of generality; the results extend easily to the multi-step case.

(12)
(13)

The training goal of the Neural ODE, which belongs to the supervised learning family, is to minimize this loss. According to Eq. (13), the loss is minimized if the first term equals (or approximates) the second term as follows:

(14)

In order to obtain the NDI formulation from Eq. (14), we introduce the first assumption: the equality still holds after applying the time derivative to both sides of Eq. (14), provided the dynamics training loss approaches zero along the trajectories. In general, for two arbitrary functions whose values are identical at a certain point, the values of their time derivatives at that point are not guaranteed to be equal. However, in Eq. (14), the state-action sequence data used on both sides comes from the same demonstrations. Therefore, if the training loss of the learned dynamics approaches zero (in practice, we stop training once the loss falls below a small threshold), then both sides of Eq. (14) represent the same system trajectory. Hence, we can apply the time derivative operator to Eq. (14) and the equality still holds:

(15)
(16)

To further derive Eq. (16), we introduce the second assumption: the true dynamics, which the proposed RMBIL is trying to mimic, is an input-affine system. This assumption is roughly satisfied by most physically controllable platforms. Under this assumption, Eq. (16) can be reformulated as:

(17)
(18)

where the virtual input follows from Eq. (5). By substituting the controller network for the expected control inputs, the relation to the NDI formulation emerges from Eq. (18). Since the two assumptions discussed above may not hold perfectly in practice, we replace the equal sign with an approximation sign. In addition, we reformulate Eq. (18) to bring the controller network to the LHS:

(19)
(20)

where Eq. (20) is derived by replacing the RHS of Eq. (19) with Eq. (6). As a result, with Eq. (14), (15), and (20), Proposition III.1 is validated. We emphasize that the approximation only applies to the region of the state space observed in the demonstration dataset. Although the proposition indicates that a policy learned with the Neural ODE can be treated as an NDI controller, the robustness of the controller is not guaranteed [26].
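The training idea of this section can be sketched as a closed-loop ODE function in which the controller supplies the action and the frozen learned dynamics supplies the time derivative; the class and argument names below are illustrative and assume the torchdiffeq odeint interface.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint   # assumes the torchdiffeq package

class ClosedLoopODE(nn.Module):
    """Closed-loop ODE function: the controller supplies u, the frozen learned
    dynamics supplies the time derivative. Names and interfaces are illustrative."""
    def __init__(self, dynamics, controller, ref_fn):
        super().__init__()
        self.dynamics, self.controller, self.ref_fn = dynamics, controller, ref_fn

    def forward(self, t, x):
        x_ref = self.ref_fn(t)                              # reference state from the demonstration
        u = self.controller(torch.cat([x, x_ref], dim=-1))  # candidate NDI controller
        return self.dynamics(torch.cat([x, u], dim=-1))     # learned dynamics f_theta

def controller_step(closed_loop, optimizer, x_demo, t_seq):
    # Roll out the closed loop and match the expert trajectory; only the controller
    # parameters should be trainable (freeze the dynamics with requires_grad_(False)).
    x_pred = odeint(closed_loop, x_demo[0], t_seq, rtol=1e-3, atol=1e-4)
    loss = ((x_pred - x_demo) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```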

III-C Robustness Improvement through Noise Injection

To improve the robustness of the learned NDI controller, we borrow the SMC concept discussed in Section II-B. Rather than building a hierarchical ancillary controller, we design a robust NDI policy network that is end-to-end differentiable through the Neural ODE. Inspired by DART [15], we refine the trained NDI controller by adding zero-mean Gaussian noise to the internal state within the Neural ODE so as to construct the switching function (Eq. (7)). As stated in Proposition III.2, the robustness of the refined controller is improved because an ancillary SMC policy (Eq. (9)) is formulated automatically when the refined controller's training loss approaches zero. Due to space limits, please refer to the appendix at https://github.com/haochihlin/ICRA2021/blob/master/Appendix.pdf for the proof details.

0:  Input: dataset D, noise standard deviation for noise injection, solver time horizon, and convergence threshold. // Multi-step dynamics training
1:  Initialize the dynamics and controller parameters randomly
2:  Bypass the controller model
3:  while the dynamics loss exceeds the convergence threshold do
4:     Predict trajectories using Eq. (10)
5:     Compute the multi-step L2 loss
6:     Update the dynamics model via Neural ODE
7:  end while // NDI controller training
8:  Freeze the trained dynamics, enable the controller
9:  while the controller loss exceeds the convergence threshold do
10:     Predict closed-loop trajectories via Eq. (10) with the trained dynamics
11:     Update the controller based on the loss
12:  end while // Controller robustness enhancement
13:  while the refined controller loss exceeds the convergence threshold do
14:     Sample the noise-injected internal state
15:     Predict closed-loop trajectories using the trained dynamics and the noised state
16:     Refine the controller based on the loss
17:  end while
Algorithm 1 Dynamics and Controller Training
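For the robustness-enhancement loop of Algorithm 1, one plausible realization is to perturb only the state fed to the controller inside the closed-loop ODE function; the placement of the noise and all names below are assumptions, with sigma = 0.25 taken from Table I.

```python
import torch
import torch.nn as nn

class NoisyClosedLoopODE(nn.Module):
    """Robustness-refinement variant of the closed-loop ODE function: zero-mean
    Gaussian noise perturbs the state seen by the controller. The noise placement
    is an assumption; sigma = 0.25 follows Table I."""
    def __init__(self, dynamics, controller, ref_fn, sigma=0.25):
        super().__init__()
        self.dynamics, self.controller, self.ref_fn = dynamics, controller, ref_fn
        self.sigma = sigma

    def forward(self, t, x):
        x_noisy = x + self.sigma * torch.randn_like(x)       # noise-injected internal state
        u = self.controller(torch.cat([x_noisy, self.ref_fn(t)], dim=-1))
        return self.dynamics(torch.cat([x, u], dim=-1))      # dynamics integrates the clean state
```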
Proposition III.2 (Controller Robustness Improvement)

Refine the learned controller with a noise-injected state sampled from a zero-mean Gaussian distribution, under finite training epochs; then the refined controller approaches the robust NDI control law of Eq. (9) as its training loss approaches zero.

III-D Conditional VAE for Trajectory Generation

The training objective of the trajectory generator in the MBIL framework is to predict the reference state given the past states. To support multi-task scenarios, inspired by [3, 34], we use a CVAE [30] to predict the future trajectory, but without the need to embed multi-step dynamics information, because the learned dynamics and the robust controller already carry such information. At training time, the generative network of the proposed CVAE encodes the current state x_t into an embedding vector z conditioned on a context variable, which in our case is the previous state x_{t-1}. Given z, the inference network decodes the current state under the same condition variable. The parameters of both networks are updated by minimizing the following loss:

(21)  L_CVAE = E_{q(z | x_t, x_{t-1})}[ -log p(x_t | z, x_{t-1}) ] + D_KL( q(z | x_t, x_{t-1}) || p(z | x_{t-1}) )

where D_KL is the Kullback-Leibler divergence between the approximated posterior q(z | x_t, x_{t-1}) and the conditional prior p(z | x_{t-1}), which is assumed to be the standard Gaussian distribution N(0, I). In addition, the first term on the RHS of Eq. (21) represents the reconstruction loss.

For inference, by feeding the current system state as the condition variable, the trained inference network predicts the state at the next time step, which is treated as the reference for the tracking controller, where the embedding vector z is sampled from the assumed prior N(0, I).
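A compact sketch of such a CVAE, conditioned on the previous state, is shown below; the layer sizes, the L2 (Gaussian) reconstruction term, and the method names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """CVAE trajectory generator conditioned on the previous state (illustrative sizes)."""
    def __init__(self, state_dim, latent_dim=32, hidden=320):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 2 * latent_dim))      # -> (mu, log_var)
        self.dec = nn.Sequential(nn.Linear(latent_dim + state_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, state_dim))
        self.latent_dim = latent_dim

    def loss(self, x_t, x_prev):
        mu, log_var = self.enc(torch.cat([x_t, x_prev], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)         # reparameterization
        x_rec = self.dec(torch.cat([z, x_prev], dim=-1))
        rec = ((x_rec - x_t) ** 2).sum(dim=-1).mean()                    # reconstruction term
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
        return rec + kl                                                  # reconstruction + KL, cf. Eq. (21)

    @torch.no_grad()
    def predict_next(self, x_prev):
        """Inference network: sample z ~ N(0, I) and decode the next reference state."""
        z = torch.randn(x_prev.shape[0], self.latent_dim)
        return self.dec(torch.cat([z, x_prev], dim=-1))
```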

III-E Procedure for Training and Inference

The training pipeline is composed of two main modules: the Neural ODE based dynamics-controller module and the CVAE based generator module. As described in Algorithm 1, the dynamics-controller module is trained in three phases with the Neural ODE model: (1) start by training the dynamics model; (2) once the dynamics loss converges, enable the control policy network to learn an NDI controller; (3) when the controller loss converges, start adding noise to the internal state to obtain a robust controller. The method for applying the noise injection inside the Neural ODE is illustrated in Fig. 1-(c). In contrast, the training procedure of the generator module is straightforward: minimize the loss defined in Eq. (21) until both the reconstruction loss and the KL divergence converge. For inference, as illustrated in Fig. 1-(d), only the trained robust tracking controller and the trained inference network are used to drive the physical platform to mimic the expert's behavior given an initial state.
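The inference procedure can be summarized in a short loop; the sketch below assumes the classic Gym API (reset() returning an observation and step() returning a 4-tuple) and reuses the illustrative controller and CVAE interfaces from the earlier sketches.

```python
import torch

@torch.no_grad()
def run_episode(env, controller, cvae, horizon=1000):
    """RMBIL inference loop (cf. Fig. 1-(d)) under the classic Gym API."""
    obs, total_reward = env.reset(), 0.0
    for _ in range(horizon):
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        x_ref = cvae.predict_next(x)                      # trajectory generator proposes x_ref
        u = controller(torch.cat([x, x_ref], dim=-1))     # robust tracking controller
        obs, reward, done, _ = env.step(u.squeeze(0).numpy())
        total_reward += reward
        if done:
            break
    return total_reward
```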

Fig. 4: Robustness evaluation of RMBIL (with different feedback gains) versus baselines. The y-axis is the normalized reward. For each method, we plot the average (marker), standard deviation (solid error bar), and minimum-maximum range (dashed error bar) over 50 test episodes. The number after RMBIL is the value of the feedback gain used at inference. The number after GAIL is the number of environment interactions used during training (×1000). Note: GAIL needs continuous environment interactions during training, while RMBIL does not.

IV Experiments

IV-A Environment Setup

We choose Hopper, Walker2d, and HalfCheetah from OpenAI Gym [5] with the MuJoCo [31] physics engine as the simulation environments. To collect demonstrations, we use the TRPO [25] algorithm via stable-baselines [11] to train the expert policies. For each environment, we record state-action pair sequences of 1000 steps for 50 episodes under random initial states.
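Demonstration collection amounts to rolling out the trained expert and recording state-action pairs; the helper below is a sketch in which expert_policy stands in for, e.g., a stable-baselines model's prediction call, and the Gym step API is assumed to be the classic 4-tuple version.

```python
import numpy as np

def collect_demonstrations(env, expert_policy, n_episodes=50, horizon=1000):
    """Roll out a trained expert and record state-action pairs (classic Gym step API).
    `expert_policy(obs)` is a placeholder for the expert's action selection."""
    demos = []
    for _ in range(n_episodes):
        obs, states, actions = env.reset(), [], []
        for _ in range(horizon):
            act = expert_policy(obs)
            states.append(obs)
            actions.append(act)
            obs, _, done, _ = env.step(act)
            if done:
                break
        demos.append((np.asarray(states), np.asarray(actions)))
    return demos
```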

IV-B Baselines

We compare RMBIL against four baselines: BC, DART, GAIL, and MBIL, where MBIL is the non-robust version of RMBIL (without noise injection). For the BC method, a vanilla MLP is implemented to imitate the expert policy. For the DART and GAIL methods, we slightly modify the official implementations to fit our dataset format. For a fair comparison, the network size of the imitated policy is identical for all methods. The details of the hyperparameters and network structure are listed in Table I.

Hyperparameter                                  Value
No. of hidden neurons (dynamics network)        800
No. of hidden neurons (controller and CVAE)     320
No. of hidden layers (all networks)             2
Activation function (all networks)              ELU
CVAE latent dimension                           510
Learning rate (dynamics network)                0.01
Learning rate (controller and CVAE)             0.001
Learning rate decay                             0.5 per 100 epochs
Type of ODE solver                              adams
Absolute tolerance for ODE solver               1e-4
Relative tolerance for ODE solver               1e-4
Noise standard deviation                        0.25
Batch size                                      2048

TABLE I: Hyperparameters for RMBIL training

IV-C Performance Comparison

Given the expert demonstrations, BC, MBIL, and RMBIL can be trained directly. For DART, we allow the algorithm to access the expert policies to generate sub-optimal trajectories during training. For GAIL, training is unstable and requires interaction with the environment; therefore, we train GAIL for a fixed budget of environment interactions, save checkpoints at regular intervals, and choose the best one for comparison. At inference, for each method, we execute the trained policy for 50 episodes under random initial conditions defined by the environment, then compute the average and standard deviation. The normalized rewards against the number of demonstrations are shown in Fig. 3, treating the performance of the expert policy as one and the random policy as zero. We observe that RMBIL outperforms BC and DART when extremely few demonstrations are available. Higher performance in low-data regimes implies that our method has better sample efficiency. With a sufficient number of demonstrations, RMBIL achieves performance similar to the expert policy, as do DART and BC. In contrast, the performance of GAIL does not depend directly on the number of demonstrations. In addition, there is a performance gap between MBIL and RMBIL in high-data regimes (8% for Hopper, 10% for Walker2d). This phenomenon supports our discussion in Sec. III-C: the trained dynamics is not perfect, so the learned controller should account for robustness in order to handle model inaccuracies.

Case        UnevenEnv                                   SlopeEnv
Hopper      (v1) 1 m span boxes with random heights     (v2) 0.5 deg slope
            (0-20 mm)                                   (v3) 1.0 deg slope
Walker2d    (v1) 2 m span boxes with random heights     (v2) 0.5 deg slope
            (0-20 mm)                                   (v3) 1.0 deg slope

TABLE II: Robustness Environments Setting

IV-D Robustness Evaluation

In order to evaluate robustness across the different trained policies, we set up two types of environment disturbances for the Hopper and Walker2d cases, as described in Table II. Because we are interested in how the feedback gain of the linear controller inside RMBIL affects the robustness of the learned policy, we train RMBIL on 50 demonstrations and, at inference, compare the performance of the learned robust NDI controller under different gains (0.1, 1, and 10). The normalized average reward over 50 trajectories in Fig. 4 shows that RMBIL with a high gain obtains a better mean than RMBIL with a low gain in both cases. This observation agrees with linear control theory: the robustness of the transformed linear system can be enhanced by increasing the feedback gain [37]. In addition, compared to DART, RMBIL has a higher reward mean in the Hopper case and similar performance in the Walker2d case. In contrast, the comparison with GAIL is hard to analyze, since the performance of its imitated policy varies significantly with the number of environment interactions. However, we can still roughly observe that the mean and variance of RMBIL fall within the average performance range of GAIL in both the Walker2d and Hopper cases.

On the other hand, in the Walker2d cases in Fig. 4 we can observe the covariate shift issue in the BC method: the trained BC policy achieves the same reward as the expert in the default environment, but when encountering unknown disturbances its performance degrades dramatically. In comparison, since the proposed RMBIL is based on a precise multi-step dynamics and a nonlinear controller trained with noise injection, the learned robust controller avoids overfitting to the expert demonstrations and overcomes the environment uncertainties at test time.

V Conclusion

In this work, we presented RMBIL, a Neural ODE based approach for imitation learning that needs no access to the expert policy or environment interactions during training. To the best of our knowledge, we are the first to study the IL problem from the perspective of traditional nonlinear control theory with both theoretical and empirical support. Through theoretical analysis, we prove that the learnable control network inside the Neural ODE can approximate an NDI controller by minimizing the training loss. Experiments on complicated MuJoCo tasks show that RMBIL can achieve the same performance as the expert policy. In addition, for unstable systems with environmental disturbances, such as Hopper and Walker2d, the performance of RMBIL is competitive with the GAIL algorithm and outperforms the BC method. Future work may incorporate other classic nonlinear control techniques and explore multi-task applications.

References

  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: §I.
  • [2] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter (2018) Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289–8300. Cited by: §III-A.
  • [3] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252. Cited by: §I, §III-D.
  • [4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I.
  • [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §IV-A.
  • [6] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. Cited by: §I, §III-A.
  • [7] Y. Ding, C. Florensa, P. Abbeel, and M. Phielipp (2019) Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, pp. 15298–15309. Cited by: §I.
  • [8] D. Enns, D. Bugajski, R. Hendrick, and G. Stein (1994) Dynamic inversion: an evolving methodology for flight control design. International Journal of control 59 (1), pp. 71–91. Cited by: §I.
  • [9] P. Felsen, P. Lucey, and S. Ganguly (2018) Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 732–747. Cited by: §I.
  • [10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §III-A.
  • [11] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018) Stable baselines. GitHub. Note: https://github.com/hill-a/stable-baselines Cited by: §IV-A.
  • [12] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §I, §II-A.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §I.
  • [14] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925. Cited by: §I.
  • [15] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg (2017) Dart: noise injection for robust imitation learning. In Conference on robot learning, pp. 143–156. Cited by: §I, §III-C.
  • [16] J. Merel, Y. Tassa, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess (2017) Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201. Cited by: §I.
  • [17] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §III-A.
  • [18] A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 663–670. Cited by: §I.
  • [19] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. Cited by: §I.
  • [20] D. A. Pomerleau (1989) Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §I.
  • [21] L. S. Pontryagin, E. Mishchenko, V. Boltyanskii, and R. Gamkrelidze (1962) The mathematical theory of optimal processes. Cited by: §III-A.
  • [22] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §I, §I.
  • [23] Y. Rubanova, T. Q. Chen, and D. K. Duvenaud (2019) Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pp. 5321–5331. Cited by: §I, §III-A.
  • [24] S. Schaal (1999) Is imitation learning the route to humanoid robots?. Trends in cognitive sciences 3 (6), pp. 233–242. Cited by: §I.
  • [25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §IV-A.
  • [26] S. Sieberling, Q. Chu, and J. Mulder (2010) Robust flight control using incremental nonlinear dynamic inversion and angular acceleration prediction. Journal of guidance, control, and dynamics 33 (6), pp. 1732–1742. Cited by: §I, §II-A, §III-B.
  • [27] J. E. Slotine and W. Li (1991) Applied nonlinear control. Prentice Hall. Cited by: §I, §II-A.
  • [28] J. Slotine and S. S. Sastry (1983) Tracking control of non-linear systems using sliding surfaces, with application to robot manipulators. International journal of control 38 (2), pp. 465–492. Cited by: §II-B, §II-B.
  • [29] S. A. Snell, D. F. Enns, and W. L. Garrard (1992) Nonlinear inversion flight control for a supermaneuverable aircraft. Journal of Guidance, Control, and Dynamics 15 (4), pp. 976–984. Cited by: §I.
  • [30] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §I, §III-D.
  • [31] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §IV-A.
  • [32] F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §I, §I, §II-A.
  • [33] J. Walker, C. Doersch, A. Gupta, and M. Hebert (2016) An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851. Cited by: §I.
  • [34] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess (2017) Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329. Cited by: §II-A, §III-D.
  • [35] Y. Wu, N. Charoenphakdee, H. Bao, V. Tangkaratt, and M. Sugiyama (2019) Imitation learning from imperfect demonstration. arXiv preprint arXiv:1901.09387. Cited by: §I.
  • [36] Y. D. Zhong, B. Dey, and A. Chakraborty (2019) Symplectic ode-net: learning hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077. Cited by: §I, §III-A.
  • [37] K. Zhou and J. C. Doyle (1998) Essentials of robust control. Prentice Hall. Cited by: §IV-D.