## I Introduction

Model Predictive Control (MPC) [borrelli2017predictive] enables impressive performance on complex, agile robots [lopez2019dynamic, lopez2019adaptive, li2004iterative, kamel2017linear, minniti2019whole, williams2016aggressive]. However, its computational cost often limits the opportunities for onboard, real-time deployment [lco2020170:online], or takes away critical computational resources needed by other components governing the autonomous system. Recent works have mitigated MPC’s computational requirements by relying on computationally efficient deep neural networks, which are leveraged to imitate task-relevant demonstrations generated by MPC. Such demonstrations are generally collected via Guided Policy Search (GPS) [levine2013guided, kahn2017plato, zhang2016learning, carius2020mpc] and Imitation Learning (IL) [kaufmann2020deep, ross2013learning, reske2021imitation] and are then used to train a DNN via supervised learning.

A common issue in existing IL methods (e.g., Behavior Cloning (BC) [pomerleau1989alvinn, osa2018algorithmic, bojarski2016end], Dataset-Aggregation (DAgger) [ross2011reduction]) is that they require collecting a relatively large number of MPC demonstrations, preventing training directly on a real robot and requiring a simulation environment that accurately represents the deployment domain. One of the causes of this demonstration inefficiency is the need to take into account and correct for the compounding of errors in the learned policy [ross2011reduction], which may otherwise create shifts (covariate shifts) from the training distribution, with catastrophic consequences [pomerleau1989alvinn]. These errors can be caused by: a) approximation errors in the learned policy; b) mismatches in the simulation dynamics due to modelling errors (i.e., sim2real gap); or c) model changes or disturbances that may not be present in a controlled training environment (e.g., simulation, or lab/factory when training on a real robot), but do appear during deployment in the real world (i.e., lab2real gap). Approaches employed to compensate for such gaps, such as Domain Randomization (DR) [peng2018sim, loquercio2019deep], introduce further challenges; for example, they require applying disturbances/model changes during training.

In this work, we address the problem of generating a compressed MPC policy in a demonstration-efficient manner by providing a data augmentation strategy that systematically compensates for the effects of covariate shifts that might be encountered during real-world deployment. Specifically, our approach relies on a model of the perturbations/uncertainties encountered in a deployment domain, which is used to generate a Robust Tube version of the given MPC (RTMPC) to guide the collection of demonstrations and the data augmentation. An overview of the strategy is provided in Figure 1. Our approach is related to the recent LAG-ROS framework [tsukamoto2021learning], which provides a learning-based method to compress a planner in a DNN by extracting relevant information from the robust tube. LAG-ROS emphasizes the importance of nonlinear contraction-based controllers (e.g., CV-STEM [tsukamoto2020neural]) to obtain robustness and stability guarantees. In a complementary way, our work emphasizes minimal requirements - namely a tube and a data augmentation strategy - to achieve demonstration efficiency and robustness to real-world conditions. By decoupling these aspects from the need for complex nonlinear models and control strategies, we greatly simplify the controller design and reduce the computational complexity, which enables lab2real transfers.

Contributions. Via numerical comparison with previous IL methods and experimental validations, we show that demonstration efficiency can be achieved in MPC-compression by generating a corresponding Robust Tube MPC (RTMPC) and using the tube to guide the data augmentation strategy for IL methods (DAgger, BC). To this end, we propose a data-sparse, computationally efficient (i.e., scales linearly in state size) adversarial Sampling Augmentation (SA) strategy for data augmentation. We highlight that the proposed approach, for example, can be used to train the robot in a low-fidelity simulation environment while achieving robustness to real-world perturbations unseen during the training phase. We validate the proposed approach by providing the first experimental (hardware) demonstration of zero-shot transfer of a DNN-based trajectory tracking controller for an aerial robot, learned from a single demonstration, in an environment (low-fidelity simulation) without disturbances, and transferred to an environment with wind-like disturbances.

## II Related Work

MPC-like policy compression for mobile robots via IL and GPS. IL methods have found application in multiple robotics tasks. The works in [ross2013learning, pan2017agile] use DAgger [ross2011reduction] to control aerial and ground robots, while [kaufmann2020deep] uses a combination of DAgger and DR to learn to perform acrobatic maneuvers with a quadrotor. Similarly, GPS methods have been demonstrated in simulation for navigation of a multirotor [zhang2016learning, kahn2017plato], and to control a legged robot [carius2020mpc, reske2021imitation]. These methods achieve impressive performance, but at the cost of requiring multiple demonstrations to execute a single trajectory, and do not explicitly take into account the effects of disturbances encountered in the deployment domain.

Sample efficiency and robustness in IL. Robustness in IL, required to deal with distribution shifts caused by the sim2real (model errors) or lab2real transfer (model changes, external disturbances), has been achieved a) by modifying the training domain so that its dynamics match the deployment domain, as done in DR [peng2018sim, loquercio2019deep, farchy2013humanoid, chebotar2019closing], or b) by modifying the actions of the expert, so that the state distribution during training matches the one encountered at deployment, as proposed in [laskey2017dart, laskey2018and, hanna2017grounded, desai2020imitation]. Although effective, these approaches do not leverage extra information available in the RTMPC, thus requiring a larger number of demonstrations.

Data augmentation in IL. Data augmentation is a commonly employed robustification strategy in IL. Most approaches focus on reducing overfitting in the high-dimensional policy input space (e.g., images), by applying noise [florence2019self], transformations [hendrycks2019augmix] or adversarially-generated perturbations [shuadversarial, antotsiou2021adversarial], while maintaining the corresponding action label unchanged. Data augmentation is also employed to reduce covariate shift in self-driving by generating transformed observations [toromanoff2018end, amini2020learning, bojarski2016end] with the corresponding action label computed via a feedback controller. These approaches do not directly apply to our context, as they do not rely on RTMPC and we assume that a state estimate is available. In line with our findings, [levine2013guided, carius2020mpc] observe that adding extra samples from the tube of an existing iterative Linear Quadratic Regulator (iLQR) can achieve increased demonstration efficiency in GPS. Compared to these, thanks to RTMPC, we can additionally consider the effects of disturbances encountered in the sim2real or lab2real transfer, providing additional robustness.

## III Method

This section explains the given MPC expert and its Robust-Tube variant, which we leverage to design a data augmentation strategy. We additionally cast the demonstration-efficiency challenge in IL as a robust IL problem in the context of transferring a policy between two different domains, and we present the SA strategy to improve demonstration efficiency and robustness of MPC-guided policies learned via IL.

### III-A MPC and Robust Tube MPC demonstrator

Model predictive trajectory tracking controller. We assume a trajectory tracking linear MPC [borrelli2017predictive] is given that controls a system subject to bounded uncertainty. The linearized, discrete-time model of the system is:

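$$x_{t+1} = A x_t + B u_t + w_t \tag{1}$$

where $x_t \in \mathbb{X} \subset \mathbb{R}^{n_x}$ represents the state (size $n_x$), $u_t \in \mathbb{U} \subset \mathbb{R}^{n_u}$ are the commanded actions (size $n_u$), and $w_t \in \mathbb{W} \subset \mathbb{R}^{n_x}$ is an additive perturbation/uncertainty. $A$ and $B$ represent the nominal dynamics; the bar $\bar{\cdot}$ denotes internal variables of the optimization. At every discrete timestep $t$, an estimate of the state of the actual system $x_t$ and a reference trajectory $X^{\text{des}}_t = \{x^{\text{des}}_{0|t}, \dots, x^{\text{des}}_{N|t}\}$ are given. The controller computes a sequence of $N$ actions $\bar{U}_t = \{\bar{u}_{0|t}, \dots, \bar{u}_{N-1|t}\}$, subject to state and input constraints $\mathbb{X}$, $\mathbb{U}$, and executes the first computed optimal action $u_t = \bar{u}^*_{0|t}$ from the sequence. The optimal sequence of actions is computed by minimizing the value function (dropping $t$ in the notation):

$$V(x_t, X^{\text{des}}_t, \bar{U}_t) = \sum_{i=0}^{N-1} \left( \|\bar{e}_i\|_Q^2 + \|\bar{u}_i\|_R^2 \right) + \|\bar{e}_N\|_P^2 \tag{2}$$

where $\bar{e}_i = \bar{x}_i - x^{\text{des}}_i$ and $\|e\|_Q^2 = e^\top Q e$. Matrices $Q$ (size $n_x \times n_x$) and $R$ (size $n_u \times n_u$) are user-selected, positive definite weights that define the stage cost, while $P$ (size $n_x \times n_x$, positive definite) represents the terminal cost. The prediction horizon is an integer $N \geq 1$. The system is additionally subject to constraints $\bar{x}_i \in \mathbb{X}$, $\bar{u}_i \in \mathbb{U}$; the predicted states $\bar{x}_i$ are obtained from the model in Eq. 1 assuming the disturbance $w = 0$. The optimization problem is solved again at every timestep, executing the newly recomputed optimal action.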
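As a rough illustration of the cost described above, the following Python sketch evaluates a quadratic tracking cost for a candidate action sequence by rolling out the nominal model with the disturbance set to zero. The function name `value_function`, the symbols `A`, `B`, `Q`, `R`, `P`, and the double-integrator values are assumptions chosen for illustration; they are not the multirotor model used in this work, and state/input constraints are omitted.

```python
import numpy as np


def value_function(x0, X_des, U, A, B, Q, R, P):
    """Quadratic tracking cost of an action sequence under the nominal model.

    x0    : (n_x,) current state estimate
    X_des : (N + 1, n_x) reference states over the horizon
    U     : (N, n_u) candidate action sequence
    """
    x = np.asarray(x0, dtype=float)
    cost = 0.0
    for i, u in enumerate(U):
        e = x - X_des[i]                 # tracking error at prediction step i
        cost += e @ Q @ e + u @ R @ u    # stage cost
        x = A @ x + B @ u                # nominal prediction, disturbance assumed zero
    e_N = x - X_des[len(U)]
    return cost + e_N @ P @ e_N          # terminal cost


# Hypothetical double integrator (position, velocity), sampling time 0.1 s.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q, R, P = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)

N = 5
X_des = np.zeros((N + 1, 2))             # reference: stay at the origin
U_zero = np.zeros((N, 1))                # candidate sequence: apply no action
cost_offset = value_function(np.array([1.0, 0.0]), X_des, U_zero, A, B, Q, R, P)
cost_origin = value_function(np.zeros(2), X_des, U_zero, A, B, Q, R, P)
print(cost_offset, cost_origin)
```

An MPC would minimize this quantity over the action sequence (subject to constraints) at every timestep and apply only the first action; the sketch only evaluates the cost for a given sequence.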

Robust Tube MPC. Given the MPC expert, we generate a Robust Tube variant using [mayne2005robust]. At every discrete timestep $t$, the RTMPC operates in a similar way as MPC, but it additionally accounts for the effects of $w_t$ by introducing a feedback policy, called ancillary controller:

$$u_t = \bar{u}^*_t + K (x_t - \bar{x}^*_t) \tag{3}$$

where $u_t$ represents the executed action. The quantity $\bar{u}^*_t$ represents an optimal, feedforward action, and $\bar{x}^*_t$ is an optimal reference; both are computed by the RTMPC given the current state estimate $x_t$. The ancillary controller ensures that the state of the controlled system remains inside a set (tube) $\mathbb{Z}$, centered around $\bar{x}^*_t$, for every possible realization of $w_t \in \mathbb{W}$. The quantities $\bar{u}^*_t$ and $\bar{x}^*_t$ are the first elements of the sequences $\bar{U}^*$ and $\bar{X}^*$ obtained by solving (dropping the dependence on $t$)

$$\bar{U}^*, \bar{X}^* = \arg\min_{\bar{U}, \bar{X}} \; \|\bar{e}_N\|_P^2 + \sum_{i=0}^{N-1} \left( \|\bar{e}_i\|_Q^2 + \|\bar{u}_i\|_R^2 \right) \quad \text{subject to} \quad \bar{x}_{i+1} = A \bar{x}_i + B \bar{u}_i \tag{4}$$

under the constraint that $x_t \in \bar{x}_0 \oplus \mathbb{Z}$, where $\oplus$ denotes the Minkowski sum. As in the original MPC formulation, the optimization problem is additionally subject to the given state and actuation constraints, tightened by an amount that takes into account the effects of the disturbances. The gain matrix $K$ in Eq. 3 is computed such that $A_K := A + BK$ is stable, for example by solving the infinite-horizon, discrete-time LQR problem using ($A$, $B$, $Q$, $R$). The set $\mathbb{Z}$ has constant size, and it determines the shape/width of the tube. It is defined as a disturbance invariant set for the closed-loop system $A_K$, and satisfies the property that $\forall x_j \in \mathbb{Z}$, $\forall w_j \in \mathbb{W}$, $\forall j \in \mathbb{N}^+$, $x_{j+1} = A_K x_j + w_j \in \mathbb{Z}$. In practice, $\mathbb{Z}$ can be computed offline using $A_K$ and the model of the disturbance $\mathbb{W}$ via ad-hoc algorithms [borrelli2017predictive, mayne2005robust], or can be learned from data [fan2020deep]. The set $\mathbb{Z}$ and the ancillary controller in Eq. 3 ensure (see [mayne2005robust]) that, given a state $x_0 \in \bar{x}^*_0 \oplus \mathbb{Z}$, the perturbed system in Eq. 1 will remain in the tube centered around the trajectory of $\bar{x}^*_t$, no matter the disturbance realization $w \in \mathbb{W}$, as shown in Figure 2. This additionally implies that the tube represents a model of the states that the system may visit when subject to the disturbances in $\mathbb{W}$. The ancillary controller provides a computationally efficient way to generate a control action to counteract such perturbations.
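The gain computation and the ancillary feedback law described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the gain is obtained by plain Riccati value iteration (one of several ways to solve the discrete-time LQR problem), the sign convention is `u = K x` so that the closed loop is `A + B K`, and the double-integrator model, weights, and disturbance bound are hypothetical placeholders rather than the values used in this work.

```python
import numpy as np


def dlqr_gain(A, B, Q, R, iters=500):
    """Infinite-horizon discrete-time LQR gain via Riccati value iteration.

    Returns K with the convention u = K x, i.e. the closed loop is A + B K.
    """
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)    # Riccati recursion written with K
    return K


def ancillary_action(x, x_bar, u_bar, K):
    """Feedforward action plus feedback on the deviation from the reference."""
    return u_bar + K @ (x - x_bar)


# Hypothetical double integrator (position, velocity), sampling time 0.1 s.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
K = dlqr_gain(A, B, Q=np.eye(2), R=0.1 * np.eye(1))
rho = max(abs(np.linalg.eigvals(A + B @ K)))   # spectral radius of the closed loop

# Simulate the perturbed system around a fixed nominal point (x_bar, u_bar).
rng = np.random.default_rng(0)
x, x_bar, u_bar = np.array([0.2, 0.0]), np.zeros(2), np.zeros(1)
for _ in range(200):
    u = ancillary_action(x, x_bar, u_bar, K)
    w = rng.uniform(-0.01, 0.01, size=2)       # bounded additive disturbance
    x = A @ x + B @ u + w
print(rho, np.abs(x - x_bar))
```

A spectral radius below one indicates a stable closed loop, which is the property that keeps the deviation from the nominal state bounded under bounded disturbances.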

### III-B Covariate shift in sim2real and lab2real transfer

This part describes the demonstration-efficiency issue in IL as the ability to efficiently predict and compensate for the effects of covariate shifts during real-world deployment. We assume that the causes of such distribution shifts can be modeled as additive state perturbations/uncertainties encountered in the deployment domains.

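Policies and state densities. We model the dynamics of the real system as Markovian and stochastic [sutton2018reinforcement]. The stochasticity with respect to state transitions is introduced by unknown perturbations, assumed to be additive (as in Eq. 1) and belonging to the bounded set $\mathbb{W}$, sampled under a (possibly unknown) probability distribution. These perturbations capture the effects of noise, approximation errors in the learned policy, model changes and other disturbances acting on the system. Two different domains $\mathcal{D}$ are considered: a training domain $\mathcal{S}$ (source) and a deployment domain $\mathcal{T}$ (target). The two domains differ in their transition probabilities, effectively representing the sim2real or lab2real settings. We additionally assume that the considered system is controlled by a deterministic policy $\pi_\theta(x_t, X^{\text{des}}_t)$, where $X^{\text{des}}_t$ represents the reference trajectory. Given $\pi_\theta$, the resulting transition probability is $p(x_{t+1} \mid x_t, \pi_\theta(x_t, X^{\text{des}}_t), \mathcal{D})$, denoted $p_{\pi_\theta, \mathcal{D}}(x_{t+1} \mid x_t)$ to simplify the notation. The probability of collecting a $T$-step trajectory $\xi = (x_0, u_0, \dots, x_{T-1}, u_{T-1}, x_T)$ given a generic policy $\pi_\theta$ in $\mathcal{D}$ is $p(\xi \mid \pi_\theta, \mathcal{D}) = p(x_0) \prod_{t=0}^{T-1} p_{\pi_\theta, \mathcal{D}}(x_{t+1} \mid x_t)$, where $p(x_0)$ represents the initial state distribution.

Robust IL objective. Following [laskey2017dart], given an expert RTMPC policy $\pi_{\theta^*}$, the objective of IL is to find parameters $\theta$ of $\pi_\theta$ that minimize a distance metric $J(\theta, \theta^* \mid \xi)$. This metric captures the differences between the actions generated by the expert $\pi_{\theta^*}$ and the action produced by the learner $\pi_\theta$ across the distribution of trajectories induced by the learned policy $\pi_\theta$, in the perturbed target domain $\mathcal{T}$:

$$\hat{\theta}^* = \arg\min_{\theta} \; \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{T})} \, J(\theta, \theta^* \mid \xi) \tag{5}$$

A choice of distance metric that we consider in this paper is the MSE loss: $J(\theta, \theta^* \mid \xi) = \frac{1}{T} \sum_{t=0}^{T-1} \| \pi_\theta(x_t, X^{\text{des}}_t) - \pi_{\theta^*}(x_t, X^{\text{des}}_t) \|_2^2$.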
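For concreteness, an MSE imitation loss over one trajectory can be computed as in the sketch below. The exact normalization (squared Euclidean norm of the action difference, averaged over timesteps) and the function name `mse_imitation_loss` are assumptions made for illustration.

```python
import numpy as np


def mse_imitation_loss(expert_actions, learner_actions):
    """Mean over timesteps of the squared action difference.

    Both inputs have shape (T, n_u): one action per visited state.
    """
    expert_actions = np.asarray(expert_actions, dtype=float)
    learner_actions = np.asarray(learner_actions, dtype=float)
    per_step = np.sum((learner_actions - expert_actions) ** 2, axis=1)
    return float(np.mean(per_step))


# Two timesteps, two action dimensions (hypothetical values).
expert = [[0.0, 0.0], [1.0, 1.0]]
learner = [[0.0, 0.0], [0.0, 0.0]]
loss = mse_imitation_loss(expert, learner)
print(loss)
```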

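Covariate shift due to sim2real and lab2real transfer. Since in practice we do not have access to the target environment, our goal is to try to solve Eq. 5 by finding an approximation of the optimal policy parameters $\hat{\theta}^*$ in the source environment $\mathcal{S}$:

$$\hat{\theta}^* \approx \arg\min_{\theta} \; \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{S})} \, J(\theta, \theta^* \mid \xi) \tag{6}$$

The way this minimization is solved depends on the chosen IL algorithm. The performance of the learned policy in target and source domain can be related via:

$$\mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{T})} J(\theta, \theta^* \mid \xi) = \underbrace{\mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{T})} J(\theta, \theta^* \mid \xi) - \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{S})} J(\theta, \theta^* \mid \xi)}_{\text{covariate shift due to transfer}} + \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{S})} J(\theta, \theta^* \mid \xi) \tag{7}$$

which clearly shows the presence of a covariate shift induced by the transfer. The last term corresponds to the objective minimized by performing IL in $\mathcal{S}$. Attempting to solve Eq. 5 by directly optimizing Eq. 6 (e.g., via BC [pomerleau1989alvinn]) provides no guarantees of finding a policy with good performance in $\mathcal{T}$.

Compensating transfer covariate shift via Domain Randomization. A well-known strategy to compensate for the effects of covariate shifts between source and target domain is Domain Randomization (DR) [peng2018sim], which modifies the transition probabilities of the source by trying to ensure that the trajectory distribution in the modified training domain $\mathcal{S}_{\text{DR}}$ matches the one encountered in the target domain: $p(\xi \mid \pi_\theta, \mathcal{T}) \approx p(\xi \mid \pi_\theta, \mathcal{S}_{\text{DR}})$. This is in practice done by sampling perturbations $w \in \mathbb{W}$ according to some knowledge/hypotheses on their distribution $p_{\mathcal{T}}(w)$ in the target domain [peng2018sim], obtaining the perturbed trajectory distribution $p(\xi \mid \pi_\theta, \mathcal{S}, w)$. The minimization of Eq. 5 can then be approximately performed by minimizing instead:

$$\mathbb{E}_{p_{\mathcal{T}}(w)} \left[ \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{S}, w)} \, J(\theta, \theta^* \mid \xi) \right] \tag{8}$$

This approach, however, requires the ability to apply disturbances/model changes to the system, which may be impractical, e.g., in the lab2real setting, and may require a large number of demonstrations due to the need to sample enough state perturbations $w$.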

### III-C Covariate shift compensation via Sampling Augmentation

We propose to mitigate the covariate shift introduced by the compression procedure not only by collecting demonstrations from the RTMPC, but also by using additional information computed in the controller. Unlike DR, the proposed approach does not require disturbances to be applied explicitly during the training phase. During the collection of a trajectory $\xi$ in the source domain $\mathcal{S}$, we instead utilize the tube computed by the RTMPC demonstrator to obtain knowledge of the states that the system may visit when subjected to perturbations. Given this information, we propose a state sampling strategy, called Sampling Augmentation (SA), to extract relevant states from the tube. The corresponding actions are provided at low computational cost by the demonstrator. The collected state-action pairs are then included in the set of demonstrations used to learn a policy via IL. The following paragraphs frame the tube sampling problem in the context of covariate shift reduction in IL, and present two tube sampling strategies.

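RTMPC tube as model of state distribution under perturbations. The key intuition of the proposed approach is the following. We observe that, although the density $p(\xi \mid \pi_\theta, \mathcal{T})$ is unknown, an approximation of its support $\mathcal{R}(\xi^+ \mid \pi_{\theta^*}, \xi)$, given a demonstration $\xi$ collected in the source domain $\mathcal{S}$, is known. Such support corresponds to the tube computed by the RTMPC when collecting $\xi$:

$$\mathcal{R}(\xi^+ \mid \pi_{\theta^*}, \xi) = \bigcup_{t=0}^{T-1} \left\{ \bar{x}^*_t \oplus \mathbb{Z} \right\} \tag{9}$$

where $\xi^+$ is a trajectory in the tube of $\xi$. This is true thanks to the ancillary controller in Eq. 3, which guarantees that the system remains inside Eq. 9 for every possible realization of $w \in \mathbb{W}$. The ancillary controller additionally provides a way to easily compute the actions to apply for every state inside the tube. Let $x^+$ be a state inside the tube computed when the system is at $x_t$ (formally $x^+ \in \bar{x}^*_t \oplus \mathbb{Z}$), then the corresponding robust control action $u^+$ is simply:

$$u^+ = \bar{u}^*_t + K (x^+ - \bar{x}^*_t) \tag{10}$$

For every timestep $t$ in $\xi$, extra state-action samples $(x^+, u^+)$ collected from within the tube can be used to augment the dataset employed for empirical risk minimization, obtaining a way to approximate the expected risk in the domain $\mathcal{T}$ by only having access to demonstrations collected in $\mathcal{S}$:

$$\mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{T})} J(\theta, \theta^* \mid \xi) \approx \mathbb{E}_{p(\xi \mid \pi_\theta, \mathcal{S})} \left[ J(\theta, \theta^* \mid \xi) + \mathbb{E}_{p(\xi^+ \mid \pi_{\theta^*}, \xi)} J(\theta, \theta^* \mid \xi^+) \right] \tag{11}$$

The demonstrations in the source domain can be collected using existing techniques, such as BC and DAgger.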

Tube approximation and sampling strategies. In practice, the tube cross-section $\mathbb{Z}$ may have arbitrary shape (not necessarily polytopic), and the density $p(\xi^+ \mid \pi_{\theta^*}, \xi)$ of states inside the tube may not be available, making it difficult to establish which states to sample in order to derive a data augmentation strategy. We proceed by approximating $\mathbb{Z}$ as a hyper-rectangle $\hat{\mathbb{Z}}$, an outer approximation of the tube. We consider an adversarial approach to the problem by sampling from the states visited under worst-case perturbations. We investigate two strategies, shown in Figure 3, to obtain state samples at every state in the demonstration $\xi$: i) dense sampling: sample extra states from the vertices of $\bar{x}^*_t \oplus \hat{\mathbb{Z}}$. The approach produces $2^{n_x}$ extra state-action samples, where $n_x$ is the state size. It is more conservative, as it produces more samples, but it is more computationally expensive. ii) sparse sampling: sample one extra state from the center of each facet of $\bar{x}^*_t \oplus \hat{\mathbb{Z}}$, producing $2 n_x$ additional state-action pairs. It is less conservative and more computationally efficient.
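The two sampling strategies can be sketched as follows, assuming the hyper-rectangular tube approximation is described by per-axis half-widths around the nominal state. All names (`dense_samples`, `sparse_samples`, `augment`) and all numeric values are hypothetical; the action attached to each sampled state is the ancillary-controller action, i.e. the feedforward action plus a gain applied to the deviation from the nominal state.

```python
import itertools

import numpy as np


def dense_samples(center, half_widths):
    """All 2**n vertices of the axis-aligned box center +/- half_widths."""
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=len(center))))
    return center + signs * half_widths


def sparse_samples(center, half_widths):
    """The 2*n facet centers: offset along one axis at a time, both directions."""
    offsets = np.diag(half_widths)
    return np.vstack([center + offsets, center - offsets])


def augment(x_bar, u_bar, K, half_widths, sampler):
    """Pair each sampled state with its ancillary-controller action."""
    states = sampler(x_bar, half_widths)
    actions = u_bar + (states - x_bar) @ K.T
    return states, actions


n_x = 3                                    # hypothetical state size
x_bar, u_bar = np.zeros(n_x), np.array([0.5])
K = np.array([[-1.0, -0.5, -0.2]])         # hypothetical ancillary gain
half_widths = np.array([0.3, 0.2, 0.1])    # hypothetical tube half-widths

dense_x, dense_u = augment(x_bar, u_bar, K, half_widths, dense_samples)
sparse_x, sparse_u = augment(x_bar, u_bar, K, half_widths, sparse_samples)
print(dense_x.shape, sparse_x.shape)       # (8, 3) (6, 3)
```

The vertex count doubles with every added state dimension, while the facet-center count grows by two, which is why the sparse variant scales linearly in state size.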

## IV Results

### IV-A Evaluation approach

MPC for trajectory tracking on a multirotor. We evaluate the proposed approach in the context of trajectory tracking control for a multirotor, using the controller proposed in [kamel2017linear], modified to obtain an RTMPC. We model the disturbance set $\mathbb{W}$ under the assumption that the system is subject to force-like perturbations up to of the weight of the robot (approximately the safe physical limit of the robot). The tube is approximated via Monte-Carlo sampling of the disturbances in $\mathbb{W}$, evaluating the state deviations of the closed-loop system $A_K = A + BK$. The derived controller generates tilt (roll, pitch) and thrust commands () given the state of the robot ( consisting of position, velocity and tilt) and the reference trajectory. The reference is a sequence of desired positions and velocities for the next s, discretized with sampling time of s (corresponding to a planning horizon of , and -dim. vector). The controller takes into account position constraints (e.g., available 3D flight space), actuation limits, and velocity/tilt limits.
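A Monte-Carlo approximation of the tube, as mentioned above, can be sketched by propagating sampled disturbances through the closed-loop error dynamics and recording the largest deviation seen along each state axis. The sketch below assumes a box-shaped disturbance set sampled uniformly and a hypothetical stable closed-loop matrix; the function name `monte_carlo_tube` and all values are illustrative only.

```python
import numpy as np


def monte_carlo_tube(A_K, w_max, n_rollouts=100, horizon=100, seed=0):
    """Estimate per-axis half-widths of a box containing the deviation
    e_{k+1} = A_K e_k + w_k, with w_k sampled uniformly in [-w_max, w_max]."""
    rng = np.random.default_rng(seed)
    n = A_K.shape[0]
    half_widths = np.zeros(n)
    for _ in range(n_rollouts):
        e = np.zeros(n)                          # start on the nominal trajectory
        for _ in range(horizon):
            w = rng.uniform(-w_max, w_max)       # one disturbance sample per axis
            e = A_K @ e + w
            half_widths = np.maximum(half_widths, np.abs(e))
    return half_widths


A_K = np.diag([0.5, 0.8])                        # hypothetical stable closed loop
w_max = np.array([0.1, 0.1])                     # hypothetical disturbance bound
half_widths = monte_carlo_tube(A_K, w_max)
print(half_widths)
```

Because only sampled disturbances are propagated, the estimate cannot exceed the true worst-case deviation and may fall short of it; for this diagonal example the worst case is `w_max / (1 - diag(A_K))`.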

Policy architecture. The compressed policy is a -hidden layers, fully connected DNN, with neurons per layer, and ReLU activation function. The total input dimension of the DNN is (position, velocity, current tilt expressed in an inertial frame, and the desired reference trajectory). The output dimension is (desired thrust and tilt expressed in an inertial frame). We rotate the tilt output of the DNN into the body frame to avoid taking yaw into account, since yaw is not part of the optimization problem [kamel2017linear]; this rotation adds no relevant computational cost. We additionally apply the non-linear attitude compensation scheme as in [kamel2017linear].

Training environment and training details. Training is performed in a custom-built non-linear quadrotor simulation environment, where the robot follows desired trajectories, starting from randomly generated initial states, centered around the origin. Demonstrations are collected with a sampling time of s and training is performed for epochs via the ADAM [kingma2014adam] optimizer, with a learning rate of .

Evaluation details and metrics. We apply the proposed SA strategies to DAgger and BC, and compare their performance against the two without SA, and the two combined with DR. The target and source domains differ due to perturbations sampled from the disturbance set $\mathbb{W}$ in the target. During training with DR we sample disturbances from the entire $\mathbb{W}$. In all the comparisons, we set the probability of using actions of the expert, a hyperparameter of DAgger [ross2011reduction], to be at the first demonstration, and otherwise (as this was found to be the best performing setup). We monitor: i) robustness (Success Rate), as the percentage of episodes where the robot never violates any state constraint; ii) performance (MPC Stage Cost), as along the trajectory.

### IV-B Numerical evaluation of demonstration-efficiency, robustness and performance for tracking a single trajectory

| Robustification | Imitation | Training: easy | Training: safe | Success rate T1 (%) | Success rate T2 (%) | Expert gap T1 (%) | Expert gap T2 (%) | Demonstration efficiency T1 | Demonstration efficiency T2 |
|---|---|---|---|---|---|---|---|---|---|
| - | BC | Yes | Yes | < 1 | 100 | 24.15 | 29.47 | - | 6 |
| - | DAgger | Yes | No | 98 | 100 | 15.79 | 1.34 | 7 | 6 |
| DR | BC | No | Yes | 95 | 100 | 10.04 | 1.27 | 14 | 9 |
| DR | DAgger | No | No | 100 | 100 | 4.09 | 1.45 | 10 | 6 |
| SA-Dense | BC | Yes | Yes | 100 | 100 | 25.64 | 1.34 | 1 | 1 |
| SA-Dense | DAgger | Yes | Yes | 100 | 100 | 10.21 | 1.66 | 1 | 1 |
| SA-Sparse | BC | Yes | Yes | 100 | 100 | 4.23 | 1.13 | 1 | 1 |
| SA-Sparse | DAgger | Yes | Yes | 100 | 100 | 3.75 | 1.07 | 1 | 1 |

Task description. Our objective is to compress an RTMPC policy capable of tracking a s long, eight-shaped trajectory. We evaluate the considered approaches in two different target domains, with wind-like disturbances (T1) or model errors (T2). Disturbances in T1 are sampled adversarially from the disturbance set $\mathbb{W}$ (25–30% of the UAV weight), while model errors in T2 are applied via mismatches in the drag coefficients used between training and testing.

Results. We start by evaluating the robustness in T1 as a function of the number of demonstrations collected in the source domain. The results are shown in Figure 4, highlighting that: i) while all the approaches achieve robustness (full success rate) in the source domain, SA achieves full success rate after only a single demonstration, being 5-6 times more sample efficient than the baseline methods; ii) SA is able to achieve full robustness in the target domain, while baseline methods do not fully succeed, and converge at a much lower rate. These results underline the presence of a distribution shift between the source and target, which is not fully compensated by baseline methods such as BC, due to lack of exploration and robustness. The performance evaluation and additional results are summarized in Table I. We highlight that in the target domain, sparse SA combined with DAgger achieves the closest performance to the expert. Dense SA suffers from performance drops, potentially due to the limited capacity of the considered DNN or challenges in training introduced by this data augmentation. Because of its effectiveness and greater computational efficiency, we use sparse SA for the rest of the work. Table I additionally presents the results for task T2. Although this task is less challenging (i.e., all the approaches achieve full robustness), the proposed method (sparse SA) achieves the highest demonstration efficiency and lowest expert gap, with similar trends as T1.

Computation. The average latency (on i7-8750H laptop with NVIDIA GTX1060 GPU) for the expert (MATLAB) is ms, while for the compressed policy (PyTorch) is ms, achieving a two-orders-of-magnitude improvement. The average latency for the compressed policy on an Nvidia TX2 CPU (PyTorch) is ms.

### IV-C Hardware evaluation for tracking a single trajectory

We validate the demonstration efficiency, robustness and performance of the proposed approach by experimentally testing policies trained after a single demonstration collected in simulation using DAgger/BC (which operate identically since we use DAgger with for the first demonstration). The data augmentation strategy is based on the sparse SA. We use the MIT/ACL open-source snap-stack [acl_snap_stack] for controlling the attitude of the MAV, while the compressed RTMPC runs at Hz on the onboard Nvidia TX2 (on its CPU), with the reference trajectory provided at Hz. State estimation is obtained via a motion capture system or onboard VIO. The first task considered is to hover under wind disturbances produced by a leaf blower. The results are shown in Figure 5, and highlight the ability of the system to remain stable despite the large position error caused by the wind. The second task is to track an eight-shaped trajectory, with velocities up to m/s. We evaluate the robustness of the system by applying a wind-like disturbance produced by an array of leaf blowers (Figure 6). The given position reference and the corresponding trajectory are shown in Figure 5(a). The effects of the wind disturbances are clearly visible in the altitude errors and changes in commanded thrust in Figure 5(a) (at s and s). These experiments show that the controller can robustly track the desired reference, withstanding challenging perturbations unseen during the training phase. The video submission presents more experiments, including lab2real transfer.

### IV-D Numerical and hardware evaluation for learning and generalizing to multiple trajectories

We evaluate the ability of the proposed approach to track multiple trajectories while generalizing to unseen ones. To do so, we define a training distribution of reference trajectories (circle, position step, eight-shape) and a distribution for these trajectory parameters (radius, velocity, position). During training, we sample at random a desired, s long ( steps) reference with randomly sampled parameters, generating a demonstration and updating the proposed policy, while testing on a set of , s long trajectories randomly sampled from the defined distributions. We monitor the robustness and performance of the different methods, with force disturbances applied in the target domain. The results of the numerical evaluation, shown in Figure 7, confirm that sparse SA i) achieves robustness and performance comparable to those of the expert in a sample-efficient way, requiring fewer than half the demonstrations needed by baseline approaches; ii) simultaneously learns to generalize to multiple trajectories randomly sampled from the training distribution. The hardware evaluation, performed with DAgger augmented via sparse SA, is shown in Figure 8. It confirms that the proposed approach is experimentally capable of tracking multiple trajectories under real-world disturbances/model errors.

## V Conclusion and Future Work

This work has presented a demonstration-efficient strategy to compress an MPC into a computationally efficient representation, based on a DNN, via IL. We showed that greater sample efficiency and robustness than existing IL methods (DAgger, BC and their combination with DR) can be achieved by designing a Robust Tube variant of the given MPC, using properties of the tube to guide a sparse data augmentation strategy. Experimental results, showing trajectory tracking control for a multirotor after a single demonstration under wind-like disturbances, confirmed our numerical findings. Future work will focus on designing an adaptation strategy.

## Acknowledgment

This work was funded by the Air Force Office of Scientific Research MURI FA9550-19-1-0386.
