Demonstration-Efficient Guided Policy Search via Imitation of Robust Tube MPC

by Andrea Tagliabue, et al.

We propose a demonstration-efficient strategy to compress a computationally expensive Model Predictive Controller (MPC) into a more computationally efficient representation based on a deep neural network and Imitation Learning (IL). By generating a Robust Tube variant (RTMPC) of the MPC and leveraging properties of the tube, we introduce a data augmentation method that enables high demonstration efficiency and can compensate for the distribution shifts typically encountered in IL. Our approach opens the possibility of zero-shot transfer from a single demonstration collected in a nominal domain, such as a simulation or a robot in a lab/controlled environment, to a domain with bounded model errors/perturbations. Numerical and experimental evaluations performed on a trajectory tracking MPC for a quadrotor show that our method outperforms strategies commonly employed in IL, such as DAgger and Domain Randomization, in terms of demonstration efficiency and robustness to perturbations unseen during training.




I Introduction

Model Predictive Control (MPC) [borrelli2017predictive] enables impressive performance on complex, agile robots [lopez2019dynamic, lopez2019adaptive, li2004iterative, kamel2017linear, minniti2019whole, williams2016aggressive]. However, its computational cost often limits the opportunities for onboard, real-time deployment [lco2020170:online], or takes away critical computational resources needed by other components governing the autonomous system. Recent works have mitigated MPC's computational requirements by relying on computationally efficient deep neural networks (DNNs), which are leveraged to imitate task-relevant demonstrations generated by MPC. Such demonstrations are generally collected via Guided Policy Search (GPS) [levine2013guided, kahn2017plato, zhang2016learning, carius2020mpc] and Imitation Learning (IL) [kaufmann2020deep, ross2013learning, reske2021imitation] and are then used to train a DNN via supervised learning.

A common issue in existing IL methods (e.g., Behavior Cloning (BC) [pomerleau1989alvinn, osa2018algorithmic, bojarski2016end], Dataset-Aggregation (DAgger) [ross2011reduction]) is that they require collecting a relatively large number of MPC demonstrations, preventing training directly on a real robot and requiring a simulation environment that accurately represents the deployment domain. One of the causes of such demonstration inefficiency is the need to take into account and correct for the compounding of errors in the learned policy [ross2011reduction], which may otherwise create shifts (covariate shifts) from the training distribution, with catastrophic consequences [pomerleau1989alvinn]. These errors can be caused by: a) approximation errors in the learned policy; b) mismatches in the simulation dynamics due to modelling errors (i.e., sim2real gap); or c) model changes or disturbances that may not be present in a controlled training environment (e.g., simulation, or lab/factory when training on a real robot), but do appear during deployment in the real world (i.e., lab2real gap). Approaches employed to compensate for such gaps, such as Domain Randomization (DR) [peng2018sim, loquercio2019deep], introduce further challenges, for example the need to apply disturbances/model changes during training.

Fig. 1: Overview of the proposed approach. To compress a computationally expensive MPC controller into a DNN-based policy in a demonstration-efficient way, we generate a Robust Tube MPC controller using a model of the disturbances encountered in the deployment domain. We use properties of the tube to derive a data augmentation strategy that generates extra state-action pairs, which are used to obtain the DNN policy via Imitation Learning. Our approach enables zero-shot transfer from a single demonstration collected in simulation (sim2real) or a controlled environment (lab, factory, lab2real).

In this work, we address the problem of generating a compressed MPC policy in a demonstration-efficient manner by providing a data augmentation strategy that systematically compensates for the effects of covariate shifts that might be encountered during real-world deployment. Specifically, our approach relies on a model of the perturbations/uncertainties encountered in a deployment domain, which is used to generate a Robust Tube version of the given MPC (RTMPC) to guide the collection of demonstrations and the data augmentation. An overview of the strategy is provided in Figure 1. Our approach is related to the recent LAG-ROS framework [tsukamoto2021learning], which provides a learning-based method to compress a planner in a DNN by extracting relevant information from the robust tube. LAG-ROS emphasizes the importance of nonlinear contraction-based controllers (e.g., CV-STEM [tsukamoto2020neural]) to obtain robustness and stability guarantees. In a complementary way, our work emphasizes minimal requirements - namely a tube and a data augmentation strategy - to achieve demonstration efficiency and robustness to real-world conditions. By decoupling these aspects from the need for complex nonlinear models and control strategies, we greatly simplify the controller design and reduce the computational complexity, which enables lab2real transfers.

Contributions. Via numerical comparison with previous IL methods and experimental validations, we show that demonstration efficiency can be achieved in MPC-compression by generating a corresponding Robust Tube MPC (RTMPC) and using the tube to guide the data augmentation strategy for IL methods (DAgger, BC). To this end, we propose a data-sparse, computationally efficient (i.e., scales linearly in state size) adversarial Sampling Augmentation (SA) strategy for data augmentation. We highlight that the proposed approach, for example, can be used to train the robot in a low-fidelity simulation environment while achieving robustness to real-world perturbations unseen during the training phase. We validate the proposed approach by providing the first experimental (hardware) demonstration of zero-shot transfer of a DNN-based trajectory tracking controller for an aerial robot, learned from a single demonstration, in an environment (low-fidelity simulation) without disturbances, and transferred to an environment with wind-like disturbances.

II Related Work

MPC-like policy compression for mobile robots via IL and GPS. IL methods have found application in multiple robotics tasks. The works in [ross2013learning, pan2017agile] use DAgger [ross2011reduction] to control aerial and ground robots, while [kaufmann2020deep] uses a combination of DAgger and DR to learn to perform acrobatic maneuvers with a quadrotor. Similarly, GPS methods have been demonstrated in simulation for navigation of a multirotor [zhang2016learning, kahn2017plato], and to control a legged robot [carius2020mpc, reske2021imitation]. These methods achieve impressive performance, but at the cost of requiring multiple demonstrations to execute a single trajectory, and do not explicitly take into account the effects of disturbances encountered in the deployment domain.

Sample efficiency and robustness in IL. Robustness in IL, required to deal with distribution shifts caused by the sim2real (model errors) or lab2real transfer (model changes, external disturbances), has been achieved a) by modifying the training domain so that its dynamics match the deployment domain, as done in DR [peng2018sim, loquercio2019deep, farchy2013humanoid, chebotar2019closing], or b) by modifying the actions of the expert, so that the state distribution during training matches the one encountered at deployment, as proposed in [laskey2017dart, laskey2018and, hanna2017grounded, desai2020imitation]. Although effective, these approaches do not leverage extra information available in the RTMPC, thus requiring a larger number of demonstrations.

Data augmentation in IL. Data augmentation is a commonly employed robustification strategy in IL. Most approaches focus on reducing overfitting in the high-dimensional policy input space (e.g., images), by applying noise [florence2019self], transformations [hendrycks2019augmix] or adversarially-generated perturbations [shuadversarial, antotsiou2021adversarial], while maintaining the corresponding action label unchanged. Data augmentation is also employed to reduce covariate shift in self-driving by generating transformed observations [toromanoff2018end, amini2020learning, bojarski2016end] with the corresponding action label computed via a feedback controller. These approaches do not directly apply to our context, as they do not rely on RTMPC, and we assume that a state estimate is available. In line with our findings, [levine2013guided, carius2020mpc] observe that adding extra samples from the tube of an existing iterative Linear Quadratic Regulator (iLQR) can achieve increased demonstration efficiency in GPS. Compared to these, thanks to RTMPC, we can additionally consider the effects of disturbances encountered in the sim2real or lab2real transfer, providing additional robustness.

III Method

This section explains the given MPC expert and its Robust-Tube variant, which we leverage to design a data augmentation strategy. We additionally cast the demonstration-efficiency challenge in IL as a robust IL problem in the context of transferring a policy between two different domains, and we present the SA strategy to improve demonstration efficiency and robustness of MPC-guided policies learned via IL.

III-A MPC and Robust Tube MPC demonstrator

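Model predictive trajectory tracking controller. We assume a trajectory tracking linear MPC [borrelli2017predictive] is given that controls a system subject to bounded uncertainty. The linearized, discrete-time model of the system is:

$$x_{t+1} = A x_t + B u_t + w_t \qquad (1)$$

where $x \in \mathbb{X} \subset \mathbb{R}^{n_x}$ represents the state (size $n_x$), $u \in \mathbb{U} \subset \mathbb{R}^{n_u}$ are the commanded actions (size $n_u$), and $w \in \mathbb{W} \subset \mathbb{R}^{n_x}$ is an additive perturbation/uncertainty. $A$ and $B$ represent the nominal dynamics; $\bar{x}$ and $\bar{u}$ denote internal variables of the optimization. At every discrete timestep $t$, the controller is given an estimate of the state of the actual system $x_t$ and a reference trajectory $X^{\text{des}}_t = \{x^{\text{des}}_0, \dots, x^{\text{des}}_N\}$. The controller computes a sequence of $N$ actions $\bar{U}_t = \{\bar{u}_0, \dots, \bar{u}_{N-1}\}$, subject to state and input constraints $\mathbb{X}, \mathbb{U}$, and executes the first computed optimal action from the sequence. The optimal sequence of actions is computed by minimizing the value function (dropping $t$ in the notation):

$$V(\bar{x}_0, \bar{U}, X^{\text{des}}) = \sum_{i=0}^{N-1} \left( \|e_i\|^2_Q + \|\bar{u}_i\|^2_R \right) + \|e_N\|^2_P \qquad (2)$$

where $e_i = \bar{x}_i - x^{\text{des}}_i$. Matrices $Q$ (size $n_x \times n_x$) and $R$ (size $n_u \times n_u$) are user-selected, positive definite weights that define the stage cost, while $P$ (size $n_x \times n_x$, positive definite) represents the terminal cost. The prediction horizon is an integer $N \geq 1$. The system is additionally subject to the constraint $\bar{x}_0 = x_t$; the predicted states $\bar{x}_i$ are obtained from the model in Eq. 1 assuming the disturbance $w = 0$. The optimization problem is solved again at every timestep, executing the newly recomputed optimal action.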

Robust Tube MPC. Given the MPC expert, we generate a Robust Tube variant using [mayne2005robust]. At every discrete timestep $t$, the RTMPC operates in a similar way as MPC, but it additionally accounts for the effects of $w$ by introducing a feedback policy, called ancillary controller:

$$u_t = \bar{u}^*_t + K (x_t - \bar{x}^*_t) \qquad (3)$$

where $u_t$ represents the executed action. The quantity $\bar{u}^*_t$ represents an optimal, feedforward action, and $\bar{x}^*_t$ is an optimal reference; both are computed by the RTMPC given the current state estimate $x_t$. The ancillary controller ensures that the state $x$ of the controlled system remains inside a set (tube) $\mathbb{Z}$, centered around $\bar{x}^*_t$, for every possible realization of $w \in \mathbb{W}$. The quantities $\bar{u}^*_t$ and $\bar{x}^*_t$ are obtained by solving (dropping the dependence on $t$)

$$\bar{U}^*, \bar{x}^*_0 = \arg\min_{\bar{U}, \bar{x}_0} V(\bar{x}_0, \bar{U}, X^{\text{des}}) \qquad (4)$$

under the constraint that $x_t \in \bar{x}_0 \oplus \mathbb{Z}$, where $\bar{u}^*_t$ is the first action in $\bar{U}^*$ and $\bar{x}^*_t = \bar{x}^*_0$. As in the original MPC formulation, the optimization problem is additionally subject to the given state and actuation constraints, tightened by an amount that takes into account the effects of the disturbances. The gain matrix $K$ in Eq. 3 is computed such that $A_K := A + BK$ is stable, for example by solving the infinite-horizon, discrete-time LQR problem using ($A$, $B$, $Q$, $R$). The set $\mathbb{Z}$ has constant size, and it determines the shape/width of the tube. It is defined as a disturbance invariant set for the closed-loop system $A_K$, and satisfies the property that $\forall x_j \in \mathbb{Z}$, $\forall w_j \in \mathbb{W}$, $\forall j \in \mathbb{N}^+$, $x_{j+1} = A_K x_j + w_j \in \mathbb{Z}$. In practice, $\mathbb{Z}$ can be computed offline using $A_K$ and the model of the disturbance $\mathbb{W}$ via ad-hoc algorithms [borrelli2017predictive, mayne2005robust], or can be learned from data [fan2020deep]. The set $\mathbb{Z}$ and the ancillary controller in Eq. 3 ensure (see [mayne2005robust]) that, given a state $x_0 \in \bar{x}^*_0 \oplus \mathbb{Z}$, the perturbed system in Eq. 1 will remain in the tube centered around the trajectory of $\bar{x}^*$, no matter the disturbance realization $w \in \mathbb{W}$, as shown in Figure 2. This additionally implies that the tube represents a model of the states that the system may visit when subject to the disturbances in $\mathbb{W}$. The ancillary controller provides a computationally efficient way to generate a control action to counteract such perturbations.
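As a minimal sketch of the ancillary controller in Eq. 3, the snippet below computes a stabilizing gain by iterating the discrete-time Riccati recursion and applies the feedback law around a fixed reference. The double-integrator model, the weights, the disturbance bound and the helper names (`lqr_gain`, `ancillary_action`) are illustrative assumptions, not the quadrotor model or code used in this work; the full RTMPC would additionally solve Eq. 4 at every step.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati recursion; return K with A + B K stable."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + BtP @ B, BtP @ A)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def ancillary_action(u_bar, x_bar, x, K):
    """Eq. 3: feedforward action plus feedback on the deviation from the reference."""
    return u_bar + K @ (x - x_bar)

# Hypothetical double integrator (position, velocity) with 0.1 s sampling time.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q = np.diag([10.0, 1.0])
R = np.array([[0.1]])
K = lqr_gain(A, B, Q, R)
A_K = A + B @ K
spectral_radius = max(abs(np.linalg.eigvals(A_K)))

# Roll out the perturbed system (Eq. 1) around a reference held at the origin.
rng = np.random.default_rng(0)
x = np.array([0.5, 0.0])
x_bar, u_bar = np.zeros(2), np.zeros(1)
for _ in range(200):
    w = rng.uniform(-0.01, 0.01, size=2)  # bounded disturbance, assumed box-shaped W
    u = ancillary_action(u_bar, x_bar, x, K)
    x = A @ x + B @ u + w
```

With a stable `A_K`, the deviation from the reference stays bounded for any bounded disturbance sequence, which is the property the tube formalizes.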

Fig. 2: Illustration of the robust control invariant tube $\mathbb{Z}$ centered around the optimal reference $\bar{x}^*_t$ computed by RTMPC for each state $x_t$ of the system.

III-B Covariate shift in sim2real and lab2real transfer

This part describes the demonstration-efficiency issue in IL as the ability to efficiently predict and compensate for the effects of covariate shifts during real-world deployment. We assume that the causes of such distribution shifts can be modeled as additive state perturbations/uncertainties encountered in the deployment domains.

Policies and state densities. We model the dynamics of the real system as Markovian and stochastic [sutton2018reinforcement]. The stochasticity with respect to state transitions is introduced by unknown perturbations, assumed to be additive (as in Eq. 1) and belonging to the bounded set $\mathbb{W}$, sampled under a (possibly unknown) probability distribution. These perturbations capture the effects of noise, approximation errors in the learned policy, model changes and other disturbances acting on the system. Two different domains $\mathcal{D}$ are considered: a training domain $\mathcal{S}$ (source) and a deployment domain $\mathcal{T}$ (target). The two domains differ in their transition probabilities, effectively representing the sim2real or lab2real settings. We additionally assume that the considered system is controlled by a deterministic policy $\pi_\theta(x_t, X^{\text{des}}_t)$, where $X^{\text{des}}_t$ represents the reference trajectory. Given $\pi_\theta$, the resulting transition probability is $p(x_{t+1} | x_t, \pi_\theta(x_t, X^{\text{des}}_t), \mathcal{D})$, denoted $p_{\pi_\theta}(x_{t+1} | x_t, \mathcal{D})$ to simplify the notation. The probability of collecting a $T$-step trajectory $\xi = (x_0, u_0, \dots, x_T)$ given a generic policy $\pi_\theta$ in $\mathcal{D}$ is $p(\xi | \pi_\theta, \mathcal{D}) = p(x_0) \prod_{t=0}^{T-1} p_{\pi_\theta}(x_{t+1} | x_t, \mathcal{D})$, where $p(x_0)$ represents the initial state distribution.

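Robust IL objective. Following [laskey2017dart], given an expert RTMPC policy $\pi_{\theta^*}$, the objective of IL is to find parameters $\theta$ of $\pi_\theta$ that minimize a distance metric $J(\theta, \theta^* | \xi)$. This metric captures the differences between the actions generated by the expert $\pi_{\theta^*}$ and the actions produced by the learner $\pi_\theta$ across the distribution of trajectories induced by the learned policy $\pi_\theta$, in the perturbed target domain $\mathcal{T}$:

$$\hat{\theta}^* = \arg\min_\theta \mathbb{E}_{p(\xi | \pi_\theta, \mathcal{T})} J(\theta, \theta^* | \xi) \qquad (5)$$

A choice of distance metric that we consider in this paper is the MSE loss: $J(\theta, \theta^* | \xi) = \frac{1}{T} \sum_{t=0}^{T-1} \|\pi_\theta(x_t) - \pi_{\theta^*}(x_t)\|^2_2$.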
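The MSE distance metric can be sketched as follows for a single trajectory. The linear expert and the deliberately mis-scaled learner are hypothetical stand-ins for the RTMPC and the DNN policy, chosen only so the value can be checked by hand.

```python
import numpy as np

def mse_imitation_loss(policy, expert, states):
    """Average squared action error between learner and expert over visited states."""
    errors = [policy(x) - expert(x) for x in states]
    return float(np.mean([e @ e for e in errors]))

# Hypothetical linear expert and a learner that under-scales its actions by 10%.
K = np.array([[-1.0, -0.5]])
expert = lambda x: K @ x
learner = lambda x: 0.9 * (K @ x)
states = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
loss = mse_imitation_loss(learner, expert, states)
```

Here both states produce an expert action of -1, so each action error is 0.1 and the loss evaluates to 0.01.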

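Covariate shift due to sim2real and lab2real transfer. Since in practice we do not have access to the target environment, our goal is to try to solve Eq. 5 by finding an approximation of the optimal policy parameters $\hat{\theta}^*$ in the source environment:

$$\hat{\theta}^* = \arg\min_\theta \mathbb{E}_{p(\xi | \pi_\theta, \mathcal{S})} J(\theta, \theta^* | \xi) \qquad (6)$$

The way this minimization is solved depends on the chosen IL algorithm. The performance of the learned policy in target and source domain can be related via:

$$\mathbb{E}_{p(\xi | \pi_\theta, \mathcal{T})} J(\theta, \theta^* | \xi) = \underbrace{\mathbb{E}_{p(\xi | \pi_\theta, \mathcal{T})} J(\theta, \theta^* | \xi) - \mathbb{E}_{p(\xi | \pi_\theta, \mathcal{S})} J(\theta, \theta^* | \xi)}_{\text{covariate shift due to transfer}} + \underbrace{\mathbb{E}_{p(\xi | \pi_\theta, \mathcal{S})} J(\theta, \theta^* | \xi)}_{\text{source domain IL objective}} \qquad (7)$$

which clearly shows the presence of a covariate shift induced by the transfer. The last term corresponds to the objective minimized by performing IL in $\mathcal{S}$. Attempting to solve Eq. 5 by directly optimizing Eq. 6 (e.g., via BC [pomerleau1989alvinn]) provides no guarantees of finding a policy with good performance in $\mathcal{T}$.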

Compensating transfer covariate shift via Domain Randomization. A well-known strategy to compensate for the effects of covariate shifts between source and target domain is Domain Randomization (DR) [peng2018sim], which modifies the transition probabilities of the source $\mathcal{S}$ by trying to ensure that the trajectory distribution in the modified training domain $\mathcal{S}_{\text{DR}}$ matches the one encountered in the target domain: $p(\xi | \pi_\theta, \mathcal{S}_{\text{DR}}) \approx p(\xi | \pi_\theta, \mathcal{T})$. This is in practice done by sampling perturbations $w \in \mathbb{W}$ according to some knowledge/hypotheses on their distribution $p(w)$ in the target domain [peng2018sim], obtaining the perturbed trajectory distribution $p(\xi | \pi_\theta, \mathcal{S}, w)$. The minimization of Eq. 5 can then be approximately performed by minimizing instead:

$$\mathbb{E}_{p(w)} \left[ \mathbb{E}_{p(\xi | \pi_\theta, \mathcal{S}, w)} J(\theta, \theta^* | \xi) \right] \qquad (8)$$

This approach, however, requires the ability to apply disturbances/model changes to the system, which may be impractical, e.g., in the lab2real setting, and may require a large number of demonstrations due to the need to sample enough state perturbations $w$.

III-C Covariate shift compensation via Sampling Augmentation

We propose to mitigate the covariate shift introduced by the compression procedure not only by collecting demonstrations from the RTMPC, but also by using additional information computed in the controller. Unlike DR, the proposed approach does not require explicitly applying disturbances in the training phase. During the collection of a trajectory $\xi$ in the source domain $\mathcal{S}$, we instead use the tube computed by the RTMPC demonstrator to obtain knowledge of the states that the system may visit when subjected to perturbations. Given this information, we propose a state sampling strategy, called Sampling Augmentation (SA), to extract relevant states from the tube. The corresponding actions are provided at low computational cost by the demonstrator. The collected state-action pairs are then included in the set of demonstrations used to learn a policy via IL. The following paragraphs frame the tube sampling problem in the context of covariate shift reduction in IL, and present two tube sampling strategies.

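RTMPC tube as model of state distribution under perturbations. The key intuition of the proposed approach is the following. We observe that, although the density $p(\xi | \pi_\theta, \mathcal{T})$ is unknown, an approximation of its support $\mathcal{R}(\xi^+ | \pi_{\theta^*}, \xi)$, given a demonstration $\xi$ collected in the source domain $\mathcal{S}$, is known. Such support corresponds to the tube computed by the RTMPC when collecting $\xi$:

$$\mathcal{R}(\xi^+ | \pi_{\theta^*}, \xi) = \bigcup_{t=0}^{T-1} \{ \bar{x}^*_t \oplus \mathbb{Z} \} \qquad (9)$$

where $\xi^+$ is a trajectory in the tube of $\xi$. This is true thanks to the ancillary controller in Eq. 3, which guarantees that the system remains inside Eq. 9 for every possible realization of $w \in \mathbb{W}$. The ancillary controller additionally provides a way to easily compute the actions to apply for every state inside the tube. Let $x^+$ be a state inside the tube computed when the system is at $x_t$ (formally $x^+ \in \bar{x}^*_t \oplus \mathbb{Z}$), then the corresponding robust control action $u^+$ is simply:

$$u^+ = \bar{u}^*_t + K (x^+ - \bar{x}^*_t) \qquad (10)$$

For every timestep $t$ in $\xi$, extra state-action samples $(x^+, u^+)$ collected from within the tube can be used to augment the dataset employed for empirical risk minimization, obtaining a way to approximate the expected risk in the domain $\mathcal{T}$ by only having access to demonstrations collected in $\mathcal{S}$:

$$\mathbb{E}_{p(\xi | \pi_\theta, \mathcal{T})} J(\theta, \theta^* | \xi) \approx \mathbb{E}_{p(\xi | \pi_\theta, \mathcal{S})} \left[ J(\theta, \theta^* | \xi) + \mathbb{E}_{p(\xi^+ | \pi_{\theta^*}, \xi)} J(\theta, \theta^* | \xi^+) \right] \qquad (11)$$

The demonstrations in the source domain can be collected using existing techniques, such as BC and DAgger.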

Tube approximation and sampling strategies.

Fig. 3: The two alternative strategies evaluated to sample extra state/actions pairs from an approximation of the tube of the RTMPC expert: dense (left) and sparse (right).

In practice, the set $\mathbb{Z}$ may have arbitrary shape (not necessarily polytopic), and the density of the states within it may not be available, making it difficult to establish which states to sample in order to derive a data augmentation strategy. We proceed by approximating $\mathbb{Z}$ as a hyper-rectangle $\hat{\mathbb{Z}}$, an outer approximation of the tube. We take an adversarial approach to the problem by sampling from the states visited under worst-case perturbations. We investigate two strategies, shown in Figure 3, to obtain state samples at every state in the demonstration $\xi$: i) dense sampling: sample extra states from the vertices of the hyper-rectangle $\hat{\mathbb{Z}}$ centered at the reference state. The approach produces $2^{n_x}$ extra state-action samples per timestep. It is more conservative, as it produces more samples, but more computationally expensive. ii) sparse sampling: sample one extra state from the center of each facet of the hyper-rectangle, producing $2 n_x$ additional state-action pairs per timestep. It is less conservative and more computationally efficient.
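The two sampling strategies can be sketched as follows for a single timestep, assuming the tube has already been approximated by an axis-aligned box with per-axis half-widths. The function names and the numerical values are hypothetical; the action labels follow the ancillary-controller law applied to each sampled state.

```python
import itertools
import numpy as np

def dense_samples(x_bar, half_widths):
    """Vertices of the hyper-rectangle centered at x_bar: 2**n_x states."""
    signs = itertools.product((-1.0, 1.0), repeat=len(x_bar))
    return np.array([x_bar + np.array(s) * half_widths for s in signs])

def sparse_samples(x_bar, half_widths):
    """Centers of the facets of the hyper-rectangle: 2 * n_x states."""
    offsets = np.diag(half_widths)
    return np.vstack([x_bar + offsets, x_bar - offsets])

def augment(x_bar, u_bar, K, half_widths, sampler):
    """Label each sampled state with the ancillary-controller action."""
    xs = sampler(x_bar, half_widths)
    us = u_bar + (xs - x_bar) @ K.T
    return xs, us

# Hypothetical 3-state, 1-input example.
x_bar = np.array([0.0, 1.0, 2.0])
u_bar = np.array([0.5])
K = np.array([[-1.0, -2.0, -3.0]])
half_widths = np.array([0.1, 0.2, 0.3])
xs_dense, us_dense = augment(x_bar, u_bar, K, half_widths, dense_samples)
xs_sparse, us_sparse = augment(x_bar, u_bar, K, half_widths, sparse_samples)
```

For this 3-dimensional state the dense strategy yields 8 extra pairs and the sparse strategy 6, which illustrates why the sparse variant scales linearly rather than exponentially with the state size.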

IV Results

IV-A Evaluation approach

MPC for trajectory tracking on a multirotor. We evaluate the proposed approach in the context of trajectory tracking control for a multirotor, using the controller proposed in [kamel2017linear], modified to obtain an RTMPC. We model $\mathbb{W}$ under the assumption that the system is subject to force-like perturbations up to of the weight of the robot (approximately the safe physical limit of the robot). The tube $\mathbb{Z}$ is approximated via Monte-Carlo sampling of the disturbances in $\mathbb{W}$, evaluating the state deviations of the closed-loop system $A_K$. The derived controller generates tilt (roll, pitch) and thrust commands ($n_u = 3$) given the state of the robot ($n_x = 8$, consisting of position, velocity and tilt) and the reference trajectory. The reference is a sequence of desired positions and velocities for the next s, discretized with sampling time of s (corresponding to a planning horizon of , and -dim. vector). The controller takes into account position constraints (e.g., available 3D flight space), actuation limits, and velocity/tilt limits.
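A Monte-Carlo tube approximation of this kind might look like the sketch below, which propagates sampled disturbances through the closed-loop deviation dynamics and records the largest per-axis deviation. The uniform box-shaped disturbance set, the rollout counts and the function name are assumptions made for illustration; a sampled estimate of this kind can only under-approximate the true invariant set, so in practice a safety margin may be needed.

```python
import numpy as np

def approximate_tube(A_K, w_max, n_rollouts=200, n_steps=100, seed=0):
    """Estimate per-axis half-widths of a box containing the deviations
    e_{j+1} = A_K e_j + w_j, with w_j sampled uniformly in [-w_max, w_max]."""
    rng = np.random.default_rng(seed)
    n_x = A_K.shape[0]
    e = np.zeros((n_rollouts, n_x))
    half_widths = np.zeros(n_x)
    for _ in range(n_steps):
        w = rng.uniform(-w_max, w_max, size=(n_rollouts, n_x))
        e = e @ A_K.T + w  # propagate all rollouts in parallel
        half_widths = np.maximum(half_widths, np.abs(e).max(axis=0))
    return half_widths

# Hypothetical stable, decoupled closed loop, where the exact bound is known.
A_K = np.diag([0.5, 0.8])
w_max = np.array([0.1, 0.1])
half_widths = approximate_tube(A_K, w_max)
```

For this decoupled example the worst-case deviation on each axis is `w_max / (1 - a)`, that is 0.2 and 0.5, so the sampled half-widths can be checked against those bounds.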

Policy architecture. The compressed policy is a -hidden layers, fully connected DNN, with neurons per layer, and ReLU activation function. The total input dimension of the DNN is (position, velocity, current tilt expressed in an inertial frame, and the desired reference trajectory). The output dimension is 3 (desired thrust and tilt expressed in an inertial frame). We rotate the tilt output of the DNN into the body frame to avoid taking yaw into account, since yaw is not part of the optimization problem [kamel2017linear]; this rotation adds no relevant computational cost. We additionally apply the non-linear attitude compensation scheme of [kamel2017linear].

Training environment and training details. Training is performed in a custom-built non-linear quadrotor simulation environment, where the robot follows desired trajectories, starting from randomly generated initial states, centered around the origin. Demonstrations are collected with a sampling time of s and training is performed for epochs via the ADAM [kingma2014adam] optimizer, with a learning rate of .

Evaluation details and metrics. We apply the proposed SA strategies to DAgger and BC, and compare their performance against the two without SA, and the two combined with DR. Target and source domains differ due to perturbations sampled from $\mathbb{W}$ in the target. During training with DR we sample disturbances from the entire $\mathbb{W}$. In all the comparisons, we set the probability $\beta$ of using actions of the expert, a hyperparameter of DAgger [ross2011reduction], to be at the first demonstration, and otherwise (as this was found to be the best performing setup). We monitor: i) robustness (Success Rate), as the percentage of episodes where the robot never violates any state constraint; ii) performance (MPC Stage Cost), as $\sum_t \left( \|e_t\|^2_Q + \|u_t\|^2_R \right)$ along the trajectory.
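The two monitored metrics can be sketched as follows, assuming box-shaped state constraints and trajectories stored as arrays with one row per timestep. The function names and the toy data are hypothetical and only illustrate the definitions given above.

```python
import numpy as np

def success_rate(episodes, x_min, x_max):
    """Percentage of episodes whose states all stay within the box constraints."""
    ok = [np.all((ep >= x_min) & (ep <= x_max)) for ep in episodes]
    return 100.0 * np.mean(ok)

def stage_cost(errors, actions, Q, R):
    """Sum over time of e_t^T Q e_t + u_t^T R u_t along one trajectory."""
    return float(np.einsum("ti,ij,tj->", errors, Q, errors)
                 + np.einsum("ti,ij,tj->", actions, R, actions))

# Toy data: the second episode leaves the box [-1, 1]^2 at its last step.
episodes = [np.array([[0.0, 0.0], [0.5, 0.5]]),
            np.array([[0.0, 0.0], [1.5, 0.0]])]
rate = success_rate(episodes, x_min=-1.0, x_max=1.0)

errors = np.array([[1.0, 0.0], [0.0, 2.0]])
actions = np.array([[1.0], [1.0]])
cost = stage_cost(errors, actions, Q=np.diag([1.0, 2.0]), R=np.array([[0.5]]))
```

On this toy data one of two episodes succeeds, and the stage cost is 1 + 8 from the tracking errors plus 0.5 + 0.5 from the actions.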

IV-B Numerical evaluation of demonstration-efficiency, robustness and performance for tracking a single trajectory

Fig. 4: Robustness (Success Rate) in the task of flying along an eight-shaped, s long-trajectory, subject to wind-like disturbances (left) and without (right), starting from different initial states (Task T1). Evaluation is repeated across random seeds, times per demonstration per seed. We additionally show the confidence interval.
Robustif. | Imitation | Easy | Safe | Succ. rate T1 (%) | Succ. rate T2 (%) | Expert gap T1 (%) | Expert gap T2 (%) | Demo. eff. T1 | Demo. eff. T2
- | BC | Yes | Yes | < 1 | 100 | 24.15 | 29.47 | - | 6
- | DAgger | Yes | No | 98 | 100 | 15.79 | 1.34 | 7 | 6
DR | BC | No | Yes | 95 | 100 | 10.04 | 1.27 | 14 | 9
DR | DAgger | No | No | 100 | 100 | 4.09 | 1.45 | 10 | 6
SA-Dense | BC | Yes | Yes | 100 | 100 | 25.64 | 1.34 | 1 | 1
SA-Dense | DAgger | Yes | Yes | 100 | 100 | 10.21 | 1.66 | 1 | 1
SA-Sparse | BC | Yes | Yes | 100 | 100 | 4.23 | 1.13 | 1 | 1
SA-Sparse | DAgger | Yes | Yes | 100 | 100 | 3.75 | 1.07 | 1 | 1
TABLE I: Comparison of the IL methods considered for RTMPC compression. The Method is given by the robustification strategy (Robustif.) and the imitation algorithm; Easy and Safe describe the training procedure. T1 refers to the trajectory tracking task under wind-like disturbances, T2 under model errors (drag coefficient mismatch). At convergence (iteration 20-30) we evaluate, in the target domain, robustness (success rate) and performance (expert gap, i.e., relative percent error between actions of the expert and of the compressed policy). Demonstration efficiency (Demo. eff.) represents the number of demonstrations required to achieve full success rate for the first time. An approach is considered easy if it does not require applying disturbances/perturbations during training (e.g., in lab2real transfer); an approach is considered safe if it does not execute actions that may cause state constraint violations (crashes) during training. *Safe in our numerical evaluation, but not guaranteed.

Task description. Our objective is to compress an RTMPC policy capable of tracking a s long, eight-shaped trajectory. We evaluate the considered approaches in two different target domains, with wind-like disturbances (T1) or model errors (T2). Disturbances in T1 are sampled adversarially from $\mathbb{W}$ (25–30% of the UAV weight), while model errors in T2 are applied via mismatches in the drag coefficients used between training and testing.

Results. We start by evaluating the robustness in T1 as a function of the number of demonstrations collected in the source domain. The results are shown in Fig. 4, highlighting that: i) while all the approaches achieve robustness (full success rate) in the source domain, SA achieves full success rate after only a single demonstration, being 5-6 times more sample efficient than the baseline methods; ii) SA is able to achieve full robustness in the target domain, while baseline methods do not fully succeed, and converge at a much lower rate. These results underline the presence of a distribution shift between the source and target, which is not fully compensated for by baseline methods such as BC, due to lack of exploration and robustness. The performance evaluation and additional results are summarized in Table I. We highlight that in the target domain, sparse SA combined with DAgger achieves the closest performance to the expert. Dense SA suffers from performance drops, potentially due to the limited capacity of the considered DNN or challenges in training introduced by this data augmentation. Because of its effectiveness and greater computational efficiency, we use sparse SA for the rest of the work. Table I additionally presents the results for task T2. Although this task is less challenging (i.e., all the approaches achieve full robustness), the proposed method (sparse SA) achieves the highest demonstration efficiency and lowest expert gap, with similar trends as T1.

Computation. The average latency (on an i7-8750H laptop with an NVIDIA GTX1060 GPU) for the expert (MATLAB) is ms, while for the compressed policy (PyTorch) it is ms, achieving a two-orders-of-magnitude improvement. The average latency for the compressed policy on an Nvidia TX2 CPU (PyTorch) is ms.

Fig. 5: Experimental evaluation performed by hovering with and without wind disturbances produced by a leaf blower. The employed compressed RTMPC policy is trained in simulation from a single demonstration of the desired trajectory. The wind-like disturbances produce a large position error, but do not destabilize the system. The thrust decreases due to the robot being pushed up by the disturbances. The state estimate (shown in the plot) is provided by onboard VIO. The -axis points approximately in the same direction as the wind.

IV-C Hardware evaluation for tracking a single trajectory

(a) Reference and actual trajectory
(b) Experimental setup showing the trajectory executed by the robot, and the leaf blowers used to generate disturbances
(c) Effects of wind
Fig. 6: Experimental evaluation of the proposed approach (Figure 6(b)) under wind-like disturbances. The robot learns to track an eight-shaped reference trajectory (Figure 6(a)) from a single demonstration collected in simulation, achieving zero-shot transfer. It is additionally able to withstand disturbances produced by an array of leaf blowers, unseen during the training phase, whose effects are clearly visible in the altitude errors (and change in commanded thrust) in Figure 6(c).

We validate the demonstration efficiency, robustness and performance of the proposed approach by experimentally testing policies trained after a single demonstration collected in simulation using DAgger/BC (which operate identically, since DAgger relies only on the actions of the expert for the first demonstration). The data augmentation strategy is based on the sparse SA. We use the MIT/ACL open-source snap-stack [acl_snap_stack] for controlling the attitude of the MAV, while the compressed RTMPC runs at Hz on the onboard Nvidia TX2 (on its CPU), with the reference trajectory provided at Hz. State estimation is obtained via a motion capture system or onboard VIO. The first task considered is to hover under wind disturbances produced by a leaf blower. The results are shown in Figure 5, and highlight the ability of the system to remain stable despite the large position error caused by the wind. The second task is to track an eight-shaped trajectory, with velocities up to m/s. We evaluate the robustness of the system by applying a wind-like disturbance produced by an array of leaf blowers (Figure 6). The given position reference and the corresponding trajectory are shown in Figure 6(a). The effects of the wind disturbances are clearly visible in the altitude errors and changes in commanded thrust in Figure 6(c) (at s and s). These experiments show that the controller can robustly track the desired reference, withstanding challenging perturbations unseen during the training phase. The video submission presents more experiments, including lab2real transfer.

Fig. 7: Robustness (Success Rate, top row) and performance (MPC Stage Cost, bottom row) of the proposed approach (with confidence interval), as a function of the number of demonstrations used for training, for the task of learning to track previously unseen circular, eight-shaped and constant position reference trajectories, sampled from the same training distribution. The left column presents the results in the training domain (no disturbance), while the right column presents the results under wind-like perturbations (with disturbance). The proposed RTMPC-driven sparse SA strategy learns to track multiple trajectories and generalizes to unseen ones while requiring fewer demonstrations. At convergence (from demonstration to ), DAgger+SA achieves the closest performance to the expert ( expert gap), followed by BC+SA ( expert gap). Evaluation is performed using randomly sampled trajectories per demonstration, repeated across 6 random seeds, with a prediction horizon of to speed up demonstration collection; the DNN input size is adjusted accordingly.
Fig. 8: Examples of different, arbitrarily chosen trajectories from the training distribution, tested in hardware experiments with and without strong wind-like disturbances produced by leaf blowers. The employed policy is trained with demonstrations (when other baseline methods have not fully converged yet, see Figure 7) using DAgger+SA (sparse). This highlights that sparse SA can learn multiple trajectories in a more sample-efficient way than other IL methods, retaining RTMPC's robustness and performance. The prediction horizon used is , and the DNN input size is adjusted accordingly.

IV-D Numerical and hardware evaluation for learning and generalizing to multiple trajectories

We evaluate the ability of the proposed approach to track multiple trajectories while generalizing to unseen ones. To do so, we define a training distribution of reference trajectories (circle, position step, eight-shape) and a distribution for these trajectory parameters (radius, velocity, position). During training, we sample at random a desired, s long ( steps) reference with randomly sampled parameters, generating a demonstration and updating the proposed policy, while testing on a set of , s long trajectories randomly sampled from the defined distributions. We monitor the robustness and performance of the different methods, with force disturbances applied in the target domain. The results of the numerical evaluation, shown in Figure 7, confirm that sparse SA i) achieves robustness and performance comparable to those of the expert in a sample-efficient way, requiring fewer than half the demonstrations needed by baseline approaches; ii) simultaneously learns to generalize to multiple trajectories randomly sampled from the training distribution. The hardware evaluation, performed with DAgger augmented via sparse SA, is shown in Figure 8. It confirms that the proposed approach is experimentally capable of tracking multiple trajectories under real-world disturbances/model errors.

V Conclusion and Future Work

This work has presented a demonstration-efficient strategy to compress an MPC into a computationally efficient representation, based on a DNN, via IL. We showed that greater sample efficiency and robustness than existing IL methods (DAgger, BC and their combination with DR) can be achieved by designing a Robust Tube variant of the given MPC and using properties of the tube to guide a sparse data augmentation strategy. Experimental results, showing trajectory tracking control for a multirotor after a single demonstration under wind-like disturbances, confirmed our numerical findings. Future work will focus on designing an adaptation strategy.


This work was funded by the Air Force Office of Scientific Research MURI FA9550-19-1-0386.