I Introduction
Most recent Imitation Learning (IL) works consider interactions with environments, and several algorithms have been proposed to solve the IL problem in this context, such as Inverse Reinforcement Learning (IRL)
[18, 1]. The state-of-the-art Generative Adversarial Imitation Learning (GAIL) [12] also builds on prior IRL work. The success of GAIL has popularized the adversarial IL (AIL) framework in the IL research field [16, 7, 35, 14]. However, the reinforcement loop inside AIL methods risks driving the learned policy to visit unsafe or undefined state spaces during training. As such, in this work, we address the scenario where the imitated policy is NOT allowed to interact with the environment or access information from the expert policy during the training phase.

For scenarios where interactions are not accessible, a common approach is still Behavior Cloning (BC) [20, 24]
, which imitates the expert by approximating the conditional distribution of actions over states in a supervised learning fashion. With sufficient demonstrations collected from the expert, BC methods have found wide application in autonomous driving
[4] and robot locomotion [32]. Nevertheless, the robustness of BC is not guaranteed because of the compounding errors caused by the covariate shift issue [22].

Some efforts have been made to address the compounding-error issue within the BC framework. Ross et al. proposed DAgger [22], which enables the expert policy to correct the behaviors of the imitated policy during training. Mahler and Goldberg [15] introduced Disturbances for Augmenting Robot Trajectory (DART), which applies the expert policy to generate suboptimal demonstrations. Torabi et al. presented BCO [32], which uses an inverse dynamics model to infer actions from observations through environment exploration. Nevertheless, all these algorithms require interactions with either the environment or the expert policy during training, which is against our problem setting.
In order to effectively utilize the embedded physical information in the finite expert demonstrations, we choose a model-based scheme rather than BC-like or model-free approaches. However, the performance of a model-based approach is directly influenced by the accuracy of the learned system dynamics, as well as the stability and robustness of the learned controller. In the deep learning field, such properties are hard to verify and guarantee. Therefore, we borrow concepts and definitions from nonlinear control theory to analyze the proposed framework.
As depicted by Osa et al. [19], the model-based imitation learning (MBIL) problem can be considered a typical closed-loop tracking control problem, composed of system dynamics, a tracking controller, and a trajectory generator (shown in Fig. 1(a)). In the traditional nonlinear control field, such a tracking control problem can be transformed into an equivalent linear system through the Nonlinear Dynamics Inversion (NDI) [29, 8] algorithm (shown in Fig. 1(b)). As a result, the transformed linear system can be flexibly controlled by a linear controller. By applying the NDI concept to the MBIL problem, we can adopt nonlinear control methodology, such as stability and robustness analysis [8]. However, the stability of NDI is guaranteed only if the dynamics model matches the physical system [26].
The recent Neural ODE work [6] presents a new framework for learning a precise multi-step continuous system dynamics from irregularly-sampled time-series data. Building on this seminal theory, Rubanova et al. [23] and Zhong et al. [36] proposed further improvements. As one key part of this work, we advocate a Neural ODE based, purely data-driven approach for learning a multi-step continuous actuated dynamics by applying a zero-order hold to the control inputs. Based on the precisely learned dynamics, we propose the Robust Model-Based Imitation Learning (RMBIL) framework, which formulates MBIL as an end-to-end differentiable nonlinear tracking control problem via the NDI algorithm, as illustrated in Fig. 1(c), where the system dynamics and the NDI control policy are trained with a closed-loop Neural ODE model. On the other hand, the trajectory generator block in the MBIL methodology is equivalent to the Reference Model (RM) in the NDI framework. The RM module is usually a hand-designed kinodynamic trajectory approximator that mimics the expert's behaviors via a high-order polynomial function. Inspired by recent works in trajectory prediction [33, 3, 9], we further adopt the Conditional Variational Autoencoder (CVAE) [30, 13], conditioned on the previous state, to replace the hand-designed RM block. At inference, only the learned NDI controller and the decoder from the trained CVAE are used for imitating the expert's trajectories, as shown in Fig. 1(d). In addition, to increase the robustness of the proposed framework, we adopt the concept of the sliding surface from Sliding Mode Control (SMC), a traditional nonlinear robust control algorithm [27]. Empirically, we find that the sliding surface can be approximated by injecting noise into the system state during the training of the NDI controller with a closed-loop Neural ODE model. As a result, RMBIL obtains an approximately 30% performance increase over BC and is competitive with the GAIL approach in disturbed environments.

II Preliminary
Given expert demonstrations $D = \{\tau^i\}_{i=1}^{N}$, where $\tau^i = \{(s_t, a_t)\}_{t=0}^{T}$ is a finite sequence of state-action pairs for $T$ samples, $s_t$ is sampled from the state trajectory, $a_t$ is sampled from the policy outputs, and $c^i$
is a context vector that represents the initial state $s_0^i$
of sequence $\tau^i$. A common method to solve MBIL is to explicitly train a discrete forward dynamics $s_{t+1} = f(s_t, a_t)$, as well as a tracking controller $a_t = \pi(s_t, s_{t+1}^{ref})$, where the reference state $s_{t+1}^{ref}$ is provided by a trajectory generator $g(s_t, c^i)$, for all tasks. Since all models should be independent of $i$, without losing generality, we assume $N = 1$ for the following derivation in order to remove the superscript $i$ for simplicity.

II-A Nominal Control: Nonlinear Dynamics Inversion
The main concept behind NDI is to cancel the nonlinearities of a system via input-output linearization [27]. To review the theory of NDI, consider a continuous input-affine system defined as:
(1) $\dot{x} = f(x) + G(x)\,u$
(2) $y = h(x)$
where $y$ is the output vector, $f$ and $h$ are smooth vector fields, and $G$ is a matrix whose columns are smooth vector fields. To find the explicit relation between the outputs $y$ and the control inputs $u$, Eq. (2) is differentiated repeatedly (for simplicity, we denote $\nabla h \equiv \partial h / \partial x$).
(3) $\dot{y} = \nabla h(x)\,\dot{x} = \nabla h(x)\left(f(x) + G(x)\,u\right)$
where $\nabla$ is the gradient operator. If the term $\nabla h(x)\,G(x)$ in Eq. (3) is not zero, then the input-output relation is found. To achieve the desired outputs $y_{des}$, Eq. (3) is reformulated as:
(4) $u = \left(\nabla h(x)\,G(x)\right)^{-1}\left(\nu - \nabla h(x)\,f(x)\right)$
where $\nu$ is a virtual input that is tracked by the derivative of the output $\dot{y}$. Hence, by controlling $\nu$, the corresponding $u$ in Eq. (4) takes $y$ in Eq. (2) to the desired output through Eq. (1). A common approach to obtain a suitable $\nu$ is a linear proportional feedback controller:
(5) $\nu = K\left(y_{des} - y\right)$
where $K$ is a diagonal matrix with manually chosen gain values. Eq. (4) holds if $\nabla h(x)\,G(x)$ is invertible. If it is a non-square matrix, then the pseudo-inverse method may be applied. In addition, we assume our system is fully observable and $y = h(x) = x$, which are conventions under the MDP assumption in the IL field [12, 34, 32]. Therefore, $\nabla h(x)$ is an identity matrix, and Eq. (4) can be rewritten as follows:

(6) $u = G(x)^{-1}\left(\nu - f(x)\right)$
where $\nu$ is a function of $x$ and $x_{des}$ according to Eq. (5). By applying Eq. (6) to the physical platform, the whole system becomes linear and can be flexibly controlled by Eq. (5). However, the stability of an NDI controller is guaranteed only if the dynamics model (Eqs. (1) and (2)) matches the physical platform [26].
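To make Eqs. (5) and (6) concrete, the following is a minimal sketch of an NDI tracking controller on a toy fully-observable input-affine system; the drift `f`, input matrix `G`, gain `K`, and rollout constants are illustrative assumptions, not the paper's learned models.

```python
# Minimal NDI sketch for a toy input-affine system x_dot = f(x) + G(x) u
# with y = x; f, G, and the gain K are illustrative.
import numpy as np

def f(x):
    # Illustrative drift term (pendulum-like nonlinearity with damping).
    return np.array([x[1], -np.sin(x[0]) - 0.1 * x[1]])

def G(x):
    # Illustrative state-dependent input matrix, invertible on this domain.
    return np.array([[1.0, 0.0], [0.0, 1.0 + 0.5 * np.cos(x[0])]])

def ndi_control(x, x_des, K):
    nu = K @ (x_des - x)                      # Eq. (5): proportional virtual input
    return np.linalg.solve(G(x), nu - f(x))   # Eq. (6): dynamics inversion

# Closed-loop Euler rollout: the inversion cancels the nonlinearity, so the
# tracking error decays like the stable linear system x_dot = K (x_des - x).
x, x_des, K, dt = np.array([1.0, 0.0]), np.zeros(2), 5.0 * np.eye(2), 0.01
for _ in range(1000):
    u = ndi_control(x, x_des, K)
    x = x + dt * (f(x) + G(x) @ u)
```

After the rollout, the residual tracking error is negligible precisely because the inversion transforms the closed loop into the linear error dynamics controlled by Eq. (5).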
II-B Ancillary Control: Sliding Mode Control
The idea behind SMC is to design a nonlinear controller that forces the disturbed system to slide along the desired trajectory. As defined in [28], the SMC scheme consists of two steps: (1) Find a sliding surface $S$ such that the system stays confined to it once the system trajectory reaches the surface in finite time; in our case $S = \{x \mid x = x_{des}\}$ for the fully-observable system. (2) Design a switching function $s(x)$, which represents the distance between the current state and the sliding surface, for the feedback control law to lead the system trajectory to intersect and stay on $S$. In this work, we define $s(x) = x - x_{des}$ to satisfy the condition that $s(x) = 0$ when $x \in S$.
The sufficient condition [28] for the global stability of SMC is $s\,\dot{s} < 0$. To satisfy this condition together with the dynamics model (Eq. (1)), the resulting nonlinear robust control policy is designed as follows (denote the nominal NDI control law of Eq. (6) as $u_{ndi}$):
(7) $s(x) = x - x_{des}$
(8) $u_{anc} = -G(x)^{-1}\,Q\,\mathrm{sgn}(s)$
(9) $u_{rob} = u_{ndi} + u_{anc}$
where $Q$ is a positive definite matrix. Therefore, by adding an additional feedback term $u_{anc}$, which depends on the switching function $s$, to the nominal NDI controller $u_{ndi}$, we obtain a robust NDI control law $u_{rob}$. In addition, Eq. (9) also implies that $u_{rob}$ is identical to $u_{ndi}$ when $s = 0$.
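A hedged scalar sketch of how the ancillary term in Eqs. (7)-(9) rejects a bounded disturbance; the toy dynamics, gains, and disturbance signal are illustrative assumptions, not taken from the paper.

```python
# Scalar SMC sketch: x_dot = u + d with unknown bounded disturbance |d| <= 0.5.
# The switching function is s = x - x_des (Eq. (7)); the ancillary term
# -q * sgn(s) (Eq. (8)) enforces s * s_dot < 0 whenever q exceeds the bound.
import numpy as np

def smc_control(x, x_des, k=2.0, q=1.0):
    s = x - x_des                  # distance to the sliding surface
    u_nom = -k * s                 # nominal feedback (NDI-like term)
    u_anc = -q * np.sign(s)        # ancillary switching term, Eq. (8)
    return u_nom + u_anc           # robust law u_rob, Eq. (9)

# Euler rollout under a persistent disturbance the controller never observes.
x, x_des, dt = 1.0, 0.0, 0.001
for t in range(5000):
    d = 0.5 * np.sin(0.01 * t)     # unknown disturbance, |d| <= 0.5 < q
    x = x + dt * (smc_control(x, x_des) + d)
```

Despite the disturbance, the state is driven to, and chatters tightly around, the sliding surface $s = 0$; at $s = 0$ the switching term vanishes and the law reduces to the nominal controller, matching the remark after Eq. (9).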
III Methodology
III-A Multi-step Actuated Dynamics using Neural ODE
Instead of learning a discrete dynamics, we are interested in a continuous-time dynamics with control inputs, formulated as $\dot{x}(t) = f(x(t), u(t))$. To approximate such differential dynamical systems with a neural network $f_\theta$, we adopt the Neural ODE model proposed by Chen et al. [6], which solves the initial-value problem (Eq. (10)) in a supervised learning fashion by backpropagating through a black-box ODE solver using the adjoint sensitivity method
[21] for memory efficiency.

(10) $\hat{x}(t_1) = x(t_0) + \int_{t_0}^{t_1} f_\theta(x(t))\,dt = \mathrm{ODESolve}\left(x(t_0), f_\theta, t_0, t_1\right)$
To update the dynamics $f_\theta$ with respect to the weights $\theta$, the loss function $L_f$ can be constructed as an L2 norm between the predicted state $\hat{x}$ and the true state $x$ over a certain time horizon, that is, $L_f = \|\hat{x}(t_1) - x(t_1)\|$. However, Eq. (10) can only handle an autonomous system $\dot{x} = f(x)$. In order to apply Neural ODE to an actuated dynamics, Rubanova et al. [23] proposed an RNN-based framework that maps the system into a latent space and solves the latent autonomous dynamics, while Zhong et al. [36] introduced an augmented dynamics that appends the control input to the system state under the strong assumption that the control signal must stay constant over the time horizon.

In this work, we propose a more intuitive and general method to handle actuated dynamics. We construct a continuous input function $u(t)$ from the sampled control signals by zero-order hold (ZOH). At each internal integration time inside Neural ODE, the actuated dynamics can always obtain the corresponding control signal by accessing the input function $u(t)$, as illustrated in Fig. 2. As such, Neural ODE (Eq. (10)) can be directly applied to an arbitrary dynamical system with external controls without any additional assumption or constraint. Once $L_f \to 0$ with $\Delta t \to 0$, where $\Delta t$ is the integration step of the ODE solver, we obtain a precise multi-step continuous actuated dynamics based on discrete expert demonstrations $D$.
Unfortunately, compared to baseline models such as the multi-layer perceptron (MLP), the integration solver inside Neural ODE becomes the inference-time bottleneck, which prevents the learned dynamics from being integrated into classic model-based planning and control schemes such as Model Predictive Control (MPC) [17, 10, 2].
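The ZOH input function described above can be sketched as follows; the toy linear dynamics stands in for the learned network $f_\theta$, and scipy's adaptive solver stands in for the Neural ODE integrator.

```python
# Sketch of feeding discrete recorded controls into a continuous ODE solver
# via a zero-order-hold input function u(t); any internal solver step can
# query u(t). The toy dynamics x_dot = -x + u(t) is an illustrative stand-in.
import numpy as np
from scipy.integrate import solve_ivp

t_grid = np.arange(0.0, 1.0, 0.1)       # demonstration sample times
u_samples = np.sin(t_grid)              # recorded control signal

def u_zoh(t):
    # Most recent recorded control at time t (piecewise constant).
    idx = np.searchsorted(t_grid, t, side="right") - 1
    return u_samples[np.clip(idx, 0, len(t_grid) - 1)]

def dynamics(t, x):
    return -x + u_zoh(t)                # actuated dynamics queries u(t)

sol = solve_ivp(dynamics, (0.0, 1.0), [0.0], rtol=1e-6, atol=1e-8)
```

Because `u_zoh` is a plain function of time, the solver is free to take internal steps that do not align with the demonstration's sampling grid, which is the point of the ZOH construction.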
III-B Learning-based NDI Controller via Neural ODE
To overcome the above inference-time bottleneck caused by the ODE solver, we train a controller network via the learned dynamics with Neural ODE (Eq. (10)). As a result, only the learned controller is adopted for controlling the physical system. Based on this concept, Proposition III.1 expresses how a control policy $\pi_\phi$, parameterized by $\phi$, approximates the NDI formulation (Eq. (6)).
Proposition III.1 (NDI Controller Training)

Assume the learned dynamics satisfies $L_f \to 0$, and suppose the true dynamics is input-affine; then $\pi_\phi \to u_{ndi}$ if $L_\pi \to 0$ along trajectories, where $L_\pi = \|\hat{x}(t_1) - x(t_1)\|$ is the controller training loss.
Proposition III.1 implies that by minimizing the loss $L_\pi$, an NDI controller is learned with Neural ODE, provided the trained dynamics is accurate and can be expressed as an affine system. To prove this proposition, we start from the loss function with respect to the controller weights $\phi$.
(11) $L_\pi = \|\hat{x}(t_1) - x(t_1)\| = \left\|\left(\hat{x}(t_0) - x(t_0)\right) + \int_{t_0}^{t_1} f_\theta\left(x(t), \pi_\phi(x(t))\right)dt - \int_{t_0}^{t_1} f\left(x(t), u(t)\right)dt\right\|$

The first term on the RHS of Eq. (11) is zero since $\hat{x}(t_0) = x(t_0)$ is given as an initial condition. We let $t_1 = t_0 + \Delta t$ without loss of generality, and the results can easily be extended to the multi-step case.
(12) $\hat{x}(t_1) = x(t_0) + \int_{t_0}^{t_1} f_\theta\left(x(t), \pi_\phi(x(t))\right)dt$
(13) $L_\pi = \left\|\int_{t_0}^{t_1} f_\theta\left(x(t), \pi_\phi(x(t))\right)dt - \int_{t_0}^{t_1} f\left(x(t), u(t)\right)dt\right\|$
The training goal of Neural ODE, which belongs to the supervised learning family, is to minimize the loss $L_\pi$. According to Eq. (13), the loss reaches its minimum if the first term equals (or approximates) the second term as follows:
(14) $\int_{t_0}^{t_1} f_\theta\left(x(t), \pi_\phi(x(t))\right)dt = \int_{t_0}^{t_1} f\left(x(t), u(t)\right)dt$
In order to obtain the NDI formulation from Eq. (14), we introduce the first assumption: the equality still holds after applying the time derivative to both sides of Eq. (14) if $L_f \to 0$ along trajectories, where $L_f$ is the dynamics training loss. In general, for two arbitrary functions whose values are identical at a certain point, the values of their time derivatives at that point are not guaranteed to be equal. However, in Eq. (14), the state-action sequence data used on both sides comes from the same demonstrations $D$. Therefore, if the training loss of the learned dynamics approaches zero, $L_f \to 0$ (in practice, we stop the training once $L_f < \epsilon$, where $\epsilon$ is a small threshold), then both sides of Eq. (14) represent the same system trajectory. Hence, we can apply the time derivative operator to Eq. (14) and the equality still holds:
(15) $\frac{d}{dt}\int_{t_0}^{t_1} f_\theta\left(x(t), \pi_\phi(x(t))\right)dt = \frac{d}{dt}\int_{t_0}^{t_1} f\left(x(t), u(t)\right)dt$
(16) $f_\theta\left(x(t), \pi_\phi(x(t))\right) = f\left(x(t), u(t)\right)$
To further derive Eq. (16), we introduce the second assumption: the true dynamics, which the proposed RMBIL is trying to mimic, is an input-affine system. This assumption is roughly satisfied by most physically controllable platforms. Under this assumption, Eq. (16) can be reformulated as:
(17) $f_\theta\left(x, \pi_\phi(x)\right) = f(x) + G(x)\,u$
(18) $f(x) + G(x)\,\pi_\phi(x) \approx f(x) + G(x)\,u$
where $u = G(x)^{-1}\left(\nu - f(x)\right)$ with $\nu$ given by Eq. (5). By substituting the controller network $\pi_\phi$ for the expected control inputs $u$, the relation to the NDI formulation emerges from Eq. (18). Since the two assumptions discussed above may not be fulfilled perfectly in practice, we replace the equal sign with an approximation sign. In addition, we reformulate Eq. (18) to bring $\pi_\phi$ to the LHS:
(19) $\pi_\phi(x) \approx u$
(20) $\pi_\phi(x) \approx G(x)^{-1}\left(\nu - f(x)\right) = u_{ndi}$
where Eq. (20) is derived by replacing the RHS of Eq. (19) with Eq. (6). As a result, with Eqs. (14), (15), and (20), Proposition III.1 is validated. We must emphasize here that $\pi_\phi \approx u_{ndi}$ holds only on the state space observed in the dataset $D$. Although the proposition indicates that a policy learned with Neural ODE can be treated as an NDI controller, the robustness of the controller is not guaranteed [26].
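A toy numerical check of the intuition behind Proposition III.1: for a known scalar affine system, gradient descent on the one-step closed-loop prediction loss recovers the control that produced the demonstrated transition. The explicit Euler step and learning rate are illustrative stand-ins for backpropagation through the ODE solver.

```python
# Scalar toy: true dynamics x_dot = -x + u (affine, G = 1). A demonstrated
# transition is generated with u_expert; minimizing the one-step prediction
# loss over the "policy output" u drives it to the expert control.
dt, x0, u_expert = 0.1, 1.0, 0.7
x1 = x0 + dt * (-x0 + u_expert)        # demonstrated next state

u = 0.0                                # controller output to be learned
for _ in range(200):
    x1_hat = x0 + dt * (-x0 + u)       # closed-loop one-step prediction
    grad = 2.0 * (x1_hat - x1) * dt    # d/du of (x1_hat - x1)**2
    u -= 10.0 * grad                   # gradient step (illustrative rate)
```

The minimizer of this loss is exactly $u_{expert} = G^{-1}(\dot{x} - f(x))$ evaluated on the demonstration, mirroring the inversion in Eq. (20).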
III-C Robustness Improvement through Noise Injection
To improve the robustness of the learned NDI controller $\pi_\phi$, we borrow the concept of SMC discussed in Section II-B. Rather than building a hierarchical ancillary controller, we design a robust NDI policy network that is end-to-end differentiable through Neural ODE. Inspired by DART [15], we refine the trained NDI controller by adding zero-mean Gaussian noise to the internal state within Neural ODE so as to construct the switching function (Eq. (7)). As stated in Proposition III.2, the robustness of the refined controller is improved because an ancillary SMC policy (Eq. (9)) is formulated automatically when the training loss of the refined controller approaches zero. Due to space limits, please refer to the appendix^1 for the proof details.

^1 https://github.com/haochihlin/ICRA2021/blob/master/Appendix.pdf
Proposition III.2 (Controller Robustness Improvement)

Refine the learned controller $\pi_\phi$ with a noise-injected state sampled from a Gaussian distribution, $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, under finite training epochs; then $\pi_\phi(\tilde{x}) \to u_{rob}$ when the refined training loss approaches zero.

III-D Conditional VAE for Trajectory Generation
The training objective of the trajectory generator in the MBIL framework is to predict the reference state given the past states. To support multi-task scenarios, inspired by [3, 34], we use a CVAE [30] to predict the future trajectory, but without the need to embed multi-step dynamics information, because the learned dynamics and the robust controller already carry such information. At training, the generative model of the proposed CVAE encodes the current state $s_t$ into an embedding vector $z$ conditioned on the variable $c$, which in our case is the previous state $s_{t-1}$. Given $z$, the inference network decodes the current state under the same conditional variable. Both networks' parameters are updated by minimizing the following loss:
(21) $L_{cvae} = \left\|s_t - \hat{s}_t\right\|^2 + D_{KL}\left(q\left(z \mid s_t, s_{t-1}\right) \,\|\, p\left(z \mid s_{t-1}\right)\right)$
where $D_{KL}$ is the Kullback-Leibler divergence between the approximated posterior and the conditional prior, which is assumed to be the standard Gaussian distribution $\mathcal{N}(0, I)$. In addition, the first term on the RHS of Eq. (21) represents the reconstruction loss.

For inference, by feeding the current system state as the context vector, the trained inference network predicts the state at the next time step, which can be treated as the reference for the tracking controller, where the embedding vector $z$ is sampled from the assumed prior $\mathcal{N}(0, I)$.
III-E Procedure for Training and Inference
The training pipeline is composed of two main modules: the Neural ODE based dynamics-controller module and the CVAE based generator module. As described in Algorithm 1, the dynamics-controller module is trained in three phases with the Neural ODE model: (1) start by training the dynamics model $f_\theta$; (2) once the dynamics loss $L_f$ falls below a threshold, enable the control policy network to learn an NDI controller; (3) when the controller loss $L_\pi$ falls below a threshold, start adding noise to the internal state to obtain a robust controller. The method for applying noise injection inside Neural ODE is illustrated in Fig. 1(c). In contrast, the training procedure of the generator module is straightforward: minimize the loss defined in Eq. (21) until both the reconstruction loss and the KL divergence converge. For inference, as illustrated in Fig. 1(d), only the trained robust tracking controller and the trained inference network are used for driving the physical platform to mimic the expert's behavior given an initial state.
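The three-phase switching logic described above can be summarized with a small helper; the threshold values and function name are assumptions for illustration, not the paper's implementation.

```python
# Phase selector for the dynamics-controller training schedule: fit the
# dynamics until its loss is small, then the NDI controller, then switch
# on noise injection. Thresholds are illustrative.
EPS_F, EPS_PI = 1e-3, 1e-3

def training_phase(loss_f, loss_pi):
    if loss_f >= EPS_F:
        return "dynamics"      # phase 1: train f_theta
    if loss_pi >= EPS_PI:
        return "controller"    # phase 2: train the NDI policy
    return "robustify"         # phase 3: noise-injected refinement

# Example trace of three epochs with decreasing losses.
phases = [training_phase(lf, lp)
          for lf, lp in [(1e-1, 1.0), (1e-4, 1e-2), (1e-4, 1e-4)]]
```

The same two thresholds also serve as stopping criteria in the sense of Section III-B, where training stops once the loss falls below a small $\epsilon$.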
IV Experiments
IV-A Environment Setup
We choose Hopper, Walker2d, and HalfCheetah from OpenAI Gym [5] under the Mujoco [31] physics engine as the simulation environments. To collect demonstrations, we use the TRPO [25] algorithm via stable-baselines [11] to train the expert policies. For each environment, we record the state-action pair sequences with T = 1000 steps for N = 50 episodes under random initial states.
IV-B Baselines
We compare RMBIL against four baselines: BC, DART, GAIL, and MBIL, where MBIL is a non-robust version of RMBIL (without noise injection). For the BC method, a vanilla MLP is implemented to imitate the expert policy. For the DART and GAIL methods, we slightly modify the official implementations to fit our dataset format. For a fair comparison, the network size of the imitated policy is identical across all methods. The details of the hyperparameters and network structure are listed in Table I.

Hyperparameter | Value
No. of hidden neurons ($f_\theta$) | 800
No. of hidden neurons ($\pi_\phi$ and CVAE) | 320
No. of hidden layers (all networks) | 2
Activation function (all networks) | ELU
Latent dimension for CVAE | 510
Learning rate ($f_\theta$) | 0.01
Learning rate ($\pi_\phi$ and CVAE) | 0.001
Learning rate decay | 0.5 per 100 epochs
Type of ODE solver | adams
Absolute tolerance for ODE solver | 1e-4
Relative tolerance for ODE solver | 1e-4
Noise standard deviation | 0.25
Batch size | 2048
IV-C Performance Comparison
Given expert demonstrations, BC, MBIL, and RMBIL can be trained directly. For DART, we allow the algorithm to access the expert policies to generate suboptimal trajectories during training. For GAIL, training is unstable and requires interaction with the environment; therefore, we train GAIL with environment interactions while saving a checkpoint at a fixed interaction interval, then choose the best one for comparison. At inference, for each method, we execute the trained policy for 50 episodes under random initial conditions defined by the environment, then compute the average and standard deviation. The normalized rewards against the number of demonstrations are shown in Fig. 3, treating the performance of the expert policy as one and the random policy as zero. We observe that RMBIL outperforms BC and DART in the case of extremely few demonstrations. The higher performance in low-data regimes implies that our method has better sample efficiency. With a sufficient number of demonstrations, RMBIL can achieve performance similar to the expert policy, the same as DART and BC. In contrast, the performance of GAIL does not depend directly on the number of demonstrations. In addition, there is a performance gap between MBIL and RMBIL in high-data regimes (8% for Hopper, 10% for Walker2d). This phenomenon supports our discussion in Sec. III-C: the trained dynamics is not perfect, and the learned controller should therefore consider robustness in order to handle model inaccuracies.
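The normalization used for Fig. 3 (expert = 1, random = 0) amounts to a one-line helper; the sample returns below are made-up numbers.

```python
# Normalize an average return so the random policy maps to 0 and the
# expert policy maps to 1, as in Fig. 3.
def normalize_reward(r, r_random, r_expert):
    return (r - r_random) / (r_expert - r_random)

score = normalize_reward(900.0, r_random=100.0, r_expert=1100.0)
```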
Case | Uneven-Env | Slope-Env
Hopper | (v1) 1m-span boxes with random heights (0-20mm) | (v2) 0.5 deg slope; (v3) 1.0 deg slope
Walker2d | (v1) 2m-span boxes with random heights (0-20mm) | (v2) 0.5 deg slope; (v3) 1.0 deg slope
IV-D Robustness Evaluation
In order to evaluate robustness across the different trained policies, we set up two types of environment disturbances for the Hopper and Walker2d cases, as described in Table II. Because we are interested in how the feedback gain of the linear controller inside RMBIL affects the robustness of the learned policy, we train RMBIL based on 50 demonstrations. At inference, we compare the performance of the learned robust NDI controller under different gains (0.1, 1, and 10). The normalized average rewards over 50 trajectories in Fig. 4 show that RMBIL with a high gain obtains a better mean than with a low gain in both cases. This observation is consistent with linear control theory, namely that the robustness of the transformed linear system can be enhanced by increasing the feedback gain [37]. In addition, compared to DART, RMBIL has a higher mean reward in the Hopper case and similar performance in the Walker2d case. In contrast, the comparison with GAIL is hard to analyze since the performance of its imitated policy varies significantly with the number of environment interactions. However, we can still roughly observe that the mean and variance of RMBIL fall within the average performance range of GAIL in both the Walker2d and Hopper cases.
On the other hand, the Walker2d cases in Fig. 4 show that the covariate shift issue exists in the BC method: the trained BC policy achieves the same rewards as the expert in the default environment; however, when encountering unknown disturbances, its performance degrades dramatically. In comparison, since the proposed RMBIL is based on a precise multi-step dynamics and a nonlinear controller with noise injection, the learned robust controller avoids overfitting to the expert demonstrations and overcomes the environment uncertainties at test time.
V Conclusion
In this work, we presented RMBIL, a Neural ODE based approach for imitation learning without the need for access to the expert policy or environment interaction during training. To the best of our knowledge, we are the first to study the IL problem from the perspective of traditional nonlinear control theory with both theoretical and empirical support. With the theoretical analysis, we prove that the learnable control network inside Neural ODE can approximate an NDI controller by minimizing the training loss. Experiments on complicated Mujoco tasks show that RMBIL can achieve the same performance as the expert policy. In addition, for unstable systems with environmental disturbances, such as Hopper and Walker2d, the performance of RMBIL is competitive with the GAIL algorithm and outperforms the BC method. Future works may incorporate other existing classic nonlinear control theories and explore multi-task applications.
References

[1] (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: §I.
 [2] (2018) Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289–8300. Cited by: §III-A.
 [3] (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252. Cited by: §I, §III-D.
 [4] (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I.
 [5] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §IV-A.
 [6] (2018) Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. Cited by: §I, §III-A.
 [7] (2019) Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, pp. 15298–15309. Cited by: §I.
 [8] (1994) Dynamic inversion: an evolving methodology for flight control design. International Journal of Control 59 (1), pp. 71–91. Cited by: §I.

[9] (2018) Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 732–747. Cited by: §I.
 [10] (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §III-A.
 [11] (2018) Stable baselines. GitHub. Note: https://github.com/hill-a/stable-baselines Cited by: §IV-A.
 [12] (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §I, §II-A.
 [13] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §I.
 [14] (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925. Cited by: §I.
 [15] (2017) DART: noise injection for robust imitation learning. In Conference on robot learning, pp. 143–156. Cited by: §I, §III-C.
 [16] (2017) Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201. Cited by: §I.
 [17] (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §III-A.
 [18] (2000) Algorithms for inverse reinforcement learning. Cited by: §I.
 [19] (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. Cited by: §I.
 [20] (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §I.
 [21] (1962) The mathematical theory of optimal processes. Cited by: §III-A.

[22] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §I, §I.
 [23] (2019) Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pp. 5321–5331. Cited by: §I, §III-A.
 [24] (1999) Is imitation learning the route to humanoid robots?. Trends in cognitive sciences 3 (6), pp. 233–242. Cited by: §I.
 [25] (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §IV-A.
 [26] (2010) Robust flight control using incremental nonlinear dynamic inversion and angular acceleration prediction. Journal of guidance, control, and dynamics 33 (6), pp. 1732–1742. Cited by: §I, §II-A, §III-B.
 [27] (1991) Applied nonlinear control. Vol. 199. Cited by: §I, §II-A.
 [28] (1983) Tracking control of nonlinear systems using sliding surfaces, with application to robot manipulators. International journal of control 38 (2), pp. 465–492. Cited by: §II-B.
 [29] (1992) Nonlinear inversion flight control for a supermaneuverable aircraft. Journal of guidance, control, and dynamics 15 (4), pp. 976–984. Cited by: §I.
 [30] (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §I, §III-D.
 [31] (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §IV-A.
 [32] (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §I, §I, §II-A.
 [33] (2016) An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851. Cited by: §I.
 [34] (2017) Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329. Cited by: §II-A, §III-D.
 [35] (2019) Imitation learning from imperfect demonstration. arXiv preprint arXiv:1901.09387. Cited by: §I.
 [36] (2019) Symplectic ODE-Net: learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077. Cited by: §I, §III-A.
 [37] (1998) Essentials of robust control. Vol. 104. Cited by: §IV-D.