Quad2Plane: An Intermediate Training Procedure for Online Exploration in Aerial Robotics via Receding Horizon Control

by Alexander Quessy, et al.
University of Bristol

Data-driven robotics relies upon accurate real-world representations to learn useful policies. Despite our best efforts, zero-shot sim-to-real transfer remains an unsolved problem, and we often need to allow our agents to explore online to learn useful policies for a given task. For many applications of field robotics, online exploration is prohibitively expensive and dangerous; this is especially true in fixed-wing aerial robotics. To address these challenges we offer an intermediary solution for learning in field robotics. We investigate the use of a dissimilar platform vehicle for learning and offer a procedure to mimic the behavior of one vehicle with another. We specifically consider the problem of training fixed-wing aircraft, an expensive and dangerous vehicle type, using a multi-rotor host platform. Using a Model Predictive Control approach, we design a controller capable of mimicking another vehicle's behavior in both simulation and the real world.



I Introduction

In the past decade Deep Reinforcement Learning (DRL) has proven to be an effective tool to solve a variety of simulated [31][36] and real-world [33][26] control tasks. Real-world robotic learning is plagued by a host of practical issues [48] such as resets, state-estimation and platform integrity. These practical limitations are easily solved by learning in simulation, but this often leads to a reality gap, where the agent learns to exploit attributes of the simulation that are not present in the real world, leading to poor performance. A common robotic learning pipeline is therefore to complete the majority of training in simulation, where demonstrations are cheap, and then fine-tune the learned policy on a real-world robot.

This approach works well in robotic domains where the vehicle can safely explore the environment without significant constraints, such as gripper robots [25]. However, in many field-robotic settings it is not safe to allow the robot to explore and fine-tune a policy in the real world, as the vehicle can damage itself and third parties. This challenge is particularly constraining in fixed-wing aerial robotics, where the platform is often large and financially expensive. Further, we are often most interested in fine-tuning policies when the aircraft is slow and close to the ground [4][11], where it is most likely to cause damage. This makes real-world fixed-wing robotic learning difficult, constraining most research to simulation [1].

In comparison, multi-rotor drones are ubiquitous in aerial robotics research [6], and have been used for online robotic learning in a variety of safe [27] and constrained real-world environments [23]. Fixed-wing aircraft typically have superior range, endurance and load-carrying capabilities compared to rotary aircraft. However, this matters little for the majority of learning tasks: most of the stochastic control problems where a DRL-based controller would be useful occur at the beginning or end of a flight, and carrying a payload does not affect the learning process. Along with being safer, multi-rotors offer several distinct advantages over fixed-wing aircraft for online learning:

  • Multi-rotors can hover and translate linearly along all three flight axes, making it easier to reset the learning process online [10].

  • Multi-rotors are not as geometrically constrained as fixed-wing aircraft, allowing for easier installation of compute devices and sensors. This makes it easier to get accurate state-estimation and to perform online policy optimization [19].

  • Multi-rotors have a larger low-speed flight envelope, easily being able to hover and navigate tight, constrained environments [12]. This allows us to train the robot in scenarios closer to the platform's limits without the risk of crashing, opening the door for safe RL [38] and lifelong learning [42].

Given how much easier it is to learn control policies on multi-rotors than on fixed-wing aircraft: can a multi-rotor mimic the mechanics of a fixed-wing aircraft? This would allow us to learn useful policies in the real world, without the burden of reset-ability and safety imposed by a fixed-wing aircraft, whilst providing access to the underlying real-world state-observation distribution, including wind and imaging data, which is difficult to capture entirely in simulation [5].

In this paper we present a general procedure to mimic the dynamics of a target vehicle on a dissimilar platform vehicle. Specifically we consider the problem of replicating the dynamics of a fixed-wing target vehicle on a multi-rotor platform. We make the following contributions:

  • We pose the general platform-to-target control problem as a discrete time dynamic problem in section III. We then design an optimal model based controller to solve this problem in section IV.

  • In section V we provide a description of the implementation of our controller for aerial robotics, along with the rationale behind the design of our vehicles state based cost function. This offers useful insights into how we need to design vehicles that are used for robotic learning.

  • In section VI we use a combination of simulated and real-world data to validate our approach and describe the key limitations of our approach.

II Related Work

The problem of transferring learned skills from simulation to the real world is well documented within the robotic learning community [41][34][40]. This reality gap is caused by a distributional shift from simulated to real-world observations, due to un-modelled physical effects such as noise in a collected image or friction in an actuator. One approach to this problem is to vary the state distribution of the simulation, either by randomizing the domain [43][35] or the dynamics [32]. However, Domain Randomization (DR) has some key limitations: the creation of suitable simulated training environments is time consuming, and deciding what to randomize requires care to ensure the final distributions match. Additionally, there is no guarantee that DR can achieve zero-shot sim-to-real transfer [44].

Alternatively, demonstrations can be used to help fine-tune the policy [18][13][28], but in many settings it is difficult to obtain expert demonstrations to learn from. Learning in the real world is perhaps the simplest way to ensure we don't have a reality gap, either by fine-tuning [22], or without simulation altogether [39]. Unfortunately this requires our robotic platform to be capable of safe exploration, something that cannot be assured for our target vehicle. We offer an intermediary solution to this: by learning on a surrogate vehicle, we contribute a procedure to train inherently dangerous vehicles safely in the real world.

Mimicking the dynamics of one vehicle with another has been used by the aerospace community for flight crew training [2] and aircraft flying quality evaluation [17] since the late 1950s [37]. These In Flight Trainer (IFT) aircraft [8][24] typically use linear state-space model reference adaptive controllers, such as those based on LQR, to mimic another aircraft's flight mechanics. This makes sense for IFT aircraft, where the objective is typically to provide proprioceptive information to the flight crew. However, it is desirable for our system to also be capable of replicating the vehicle's non-linear dynamics; linear state-space control is therefore inappropriate.

Model Predictive Control (MPC) is an effective method to control non-linear systems [30], especially when constraints are imposed on exploration [46]. Classical MPC provides a complete trajectory to solve an optimal control problem; for complex optimization functions, as encountered in non-linear control, this is often computationally intractable. Receding-horizon differential dynamic programming [9] helps to address this by optimizing over a shorter (receding) horizon, rather than the full control time of the process. In this research we design a non-linear receding-horizon model predictive controller [14], and find this procedure is capable of mimicking the dynamics of our target fixed-wing vehicle, provided sufficient compute, time-horizon and model accuracy.

III Problem Setting

Whilst the primary focus of our work is in matching the dynamics of dissimilar aircraft types, the procedure we develop is applicable to many other robotic learning tasks. Consider a discrete time dynamic system as in

  x_{t+1} = f(x_t, u_t)     (1)

where x_t is the state of a vehicle at time t, u_t is the input to the vehicle at time t, and f is the vehicle's state-transition function. Under this framework we consider target and platform vehicles with transition functions f_T and f_P respectively. Our objective is therefore to design a controller π, as in (2), so that when f_P receives a control input π(x_t, u_t^T) it undergoes the same state-transition as f_T receiving an input u_t^T, as shown in figure 1.

  f_P(x_t, π(x_t, u_t^T)) = f_T(x_t, u_t^T)     (2)

Fig. 1: Problem Formulation Block Diagram

We assess the performance of π using a loss function between the target and platform trajectory terms

  L = Σ_t ||x_t^T − x_t^P||²     (3)
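The setting above can be sketched in a few lines of code. Here a toy double-integrator transition stands in for the real f_T (fixed-wing) and f_P (multi-rotor) models; the function names, state layout and step size are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

def step(x, u, dt=0.02):
    """Toy transition f: x = [position(3), velocity(3)], u = acceleration command."""
    pos, vel = x[:3], x[3:]
    return np.concatenate([pos + vel * dt, vel + u * dt])

def rollout(f, x0, controls):
    """Roll a transition function forward over a control sequence."""
    xs, x = [x0], x0
    for u in controls:
        x = f(x, u)
        xs.append(x)
    return xs

def trajectory_loss(xs_target, xs_platform):
    """Sum of squared state errors between target and platform trajectories."""
    return float(sum(np.sum((xt - xp) ** 2)
                     for xt, xp in zip(xs_target, xs_platform)))

x0 = np.zeros(6)
controls = [np.array([1.0, 0.0, 0.0])] * 10
# Identical dynamics driven by identical inputs give zero mimicry loss.
loss = trajectory_loss(rollout(step, x0, controls), rollout(step, x0, controls))
```

When f_T and f_P differ, this loss is what the controller π must drive down by choosing a different input sequence for the platform.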
IV Model Predictive Control

We can frame the control design problem for π as an optimal control problem, for which receding horizon MPC offers a useful framework. We assume that the vehicle's state-transition is deterministic, and that the model representation entirely captures the vehicle's dynamics.

Consider the cost function J, composed of state- and control-dependent costs, as in

  J = Σ_{k=0}^{H} [ c_x(x_k) + c_u(u_k) ]     (4)

The objective is then to find the optimal control sequence u*_{0:H} that minimizes J over a finite time-horizon H, subject to the linear bounding constraint u_min ≤ u_k ≤ u_max. To reduce the cost of the online optimization, the problem at each time step is warm-started with the previous control sequence's solution.

The control sequence u_{0:H}^T, with finite length H, is the control input sequence received by f_T. This is the command sequence produced by the outer controller we are aiming to train. In a typical model-free RL setting only a single next time-step would be provided, and the MPC time-horizon would effectively be one step. We found that this often causes instability in our platform vehicle, as the MPC controller is unable to solve the non-linear control problem, effectively getting stuck in a local optimum. To learn an RL policy it is therefore necessary to roll out a control trajectory H steps ahead. When the policy is deployed online only the first action from the policy's action trajectory is selected.

Algorithm 1 is a general MPC-based procedure to mimic the dynamics of a target vehicle. We denote the target vehicle model and platform model as f_T and f_P respectively.

Require: u_{0:H}^T: target control input sequence, U: last control prediction sequence (up to H)
Ensure: u_0^P: next control input to platform
1:  X^T ← ∅
2:  for k = 0 to H do
3:     x_{k+1}^T ← f_T(x_k^T, u_k^T); add x_{k+1}^T to X^T
4:  end for
5:  warm-start U with the previous solution
6:  while not converged do
7:     for k = 0 to H do
8:        x_{k+1}^P ← f_P(x_k^P, U_k)
9:        J ← J + c_x(x_{k+1}^T − x_{k+1}^P) + c_u(U_k)
10:    end for
11:    update U to reduce J
12: end while
13: return u_0^P ← U_0
Algorithm 1 t2p-MPC
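As a concrete illustration, the t2p-MPC loop can be sketched with SciPy's SLSQP routine (the solver used in Sec. V-B). The 1-D double-integrator used for both f_T and f_P here is a toy stand-in for the paper's JSBSim x8 and quad-rotor models; the horizon, step size and weights are assumed values.

```python
import numpy as np
from scipy.optimize import minimize

DT, H = 0.1, 10  # step size and receding-horizon length (assumed)

def f_target(x, u):
    """Toy 1-D double integrator: x = [position, velocity]."""
    return np.array([x[0] + x[1] * DT, x[1] + u * DT])

f_platform = f_target  # identical toy dynamics, purely for the sketch

def rollout(f, x0, us):
    xs, x = [], x0
    for u in us:
        x = f(x, u)
        xs.append(x)
    return np.array(xs)

def mimic_cost(us, x0_p, xs_t):
    """Position error to the target rollout plus a small control cost."""
    xs_p = rollout(f_platform, x0_p, us)
    return np.sum((xs_p[:, 0] - xs_t[:, 0]) ** 2) + 1e-3 * np.sum(us ** 2)

def t2p_mpc(x0_t, x0_p, u_target, warm_start):
    xs_t = rollout(f_target, x0_t, u_target)        # lines 2-4: target rollout
    res = minimize(mimic_cost, warm_start,          # lines 6-12: optimize U
                   args=(x0_p, xs_t), method="SLSQP",
                   bounds=[(-1.0, 1.0)] * H)
    return res.x[0], res.x                          # first input + next warm start

u_t = np.full(H, 0.5)
u0, warm = t2p_mpc(np.zeros(2), np.zeros(2), u_t, np.zeros(H))
```

Because the two toy vehicles share dynamics, the optimizer recovers a control sequence close to the target's; with dissimilar models the same loop finds the platform inputs that best reproduce the target trajectory.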

V Implementation

V-A Simulation

For the target fixed-wing (f_T) we use the x8 Skywalker model from [15], and simulate the aircraft using JSBSim [29]. For the multi-rotor (f_P) we provide our own implementation contained within this paper's associated repository (https://github.com/AOS55/Drone2Plane). Both aircraft have the same 6 degree of freedom, 12 state representation: 6 translational components (positions and linear velocities) and 6 rotational components (attitude angles and body rates). Both aircraft have 4 controls with normalized inputs for control-deflection and thrust:

  • The multi-rotor has 4 thrust control commands, one for each prop, with each normalized command bounded between 0 and 1.

  • The fixed-wing airplane is modelled with 3 control surface commands and one throttle command, with the control surfaces bounded between −1 and 1 and the throttle between 0 and 1.

The actual x8 airplane flown, shown in figure 2, has 2 full-wing control surfaces, which are transformed into an equivalent elevator & aileron within the x8 wind tunnel model [15]. The x8's large control surfaces provide significant control authority in pitch and roll, but the lack of a vertical tailplane results in poor directional control and balance.

Fig. 2: Example platform multi-rotor vehicle and target fixed-wing x8 airplane. Aircraft body axis states are labelled on the multi-rotor.

Our multi-rotor simulator includes models for drag, motor-thrust and gravitational effects. The vehicle states are updated with a two-step forward Euler method using the linear and rotational accelerations generated by these forces. We specified the vehicle's performance to satisfy the x8 target platform's required flight envelope, principally by observing when the multi-rotor's motors became saturated when responding to a step-control disturbance on the airplane. This produced a thrust-to-weight requirement that is not unrealistic for many small to medium size off-the-shelf quad-copters [7].
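The simulator update above can be illustrated on a single (vertical) axis: sum motor thrust, drag and gravity, then integrate with a forward-Euler step. The mass, drag coefficient and maximum-thrust values below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

MASS, G, DRAG, T_MAX = 1.2, 9.81, 0.1, 30.0  # assumed parameter values

def quad_vertical_step(z, vz, throttle, dt=0.01):
    """Vertical-axis Euler update; throttle in [0, 1] scales total thrust."""
    thrust = np.clip(throttle, 0.0, 1.0) * T_MAX
    accel = (thrust - DRAG * vz * abs(vz)) / MASS - G  # thrust, drag, gravity
    return z + vz * dt, vz + accel * dt

# Hover check: the throttle that balances weight leaves the state unchanged.
hover = MASS * G / T_MAX
z, vz = quad_vertical_step(0.0, 0.0, hover)
```

Throttle saturating at 1.0 in this model is the analogue of the motor saturation used to size the thrust-to-weight requirement.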

V-B MPC

For our MPC control task we use a state-dependent cost function of the form

  c_x(x) = xᵀ Q x

where the state weight matrix Q is diagonal, with non-zero weights on the three position states and zero weight elsewhere. This corresponds to minimizing the position error between the fixed-wing and multi-rotor aircraft, which is suitable for higher order fixed-wing control tasks based on temporal position objectives [20]. If mimicking the rotational vehicle states is desirable, it would not be difficult to include a gimbal controller to directly pass through the fixed-wing rates from the model, allowing us to train a vision-encoded policy, as the attitude of the fixed-wing aircraft has a direct relation to the perspective of the policy. For the control-dependent cost function we apply a small constant cost weight to all thrust terms. This improves the convergence rate when close to an optimal solution and helps to reduce jittering. To minimize the cost function J, we use the Sequential Least Squares Programming (SLSQP) routine from the Python based SciPy optimization library [45], with box constraints on the thrust commands. We use a constant time-horizon of 1.0 s.
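The stage cost above can be written out directly: a quadratic form xᵀQx with Q non-zero only on the position states, plus a constant weight on the thrust commands. The specific weight values here are assumptions for illustration.

```python
import numpy as np

N_STATES = 12
Q = np.zeros((N_STATES, N_STATES))
Q[0, 0] = Q[1, 1] = Q[2, 2] = 1.0   # penalize the three position states only
R_THRUST = 1e-2                      # assumed constant thrust weight

def stage_cost(x_err, u):
    """x_err: target-minus-platform state error; u: the 4 thrust commands."""
    return float(x_err @ Q @ x_err + R_THRUST * np.sum(u ** 2))

x_err = np.zeros(N_STATES)
x_err[2] = 2.0                       # e.g. a 2 m altitude error
cost = stage_cost(x_err, np.full(4, 0.5))
```

Attitude and rate errors contribute nothing here, which is why the multi-rotor is free to adopt whatever orientation best holds position.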

VI Experiments

To investigate the limitations of our surrogate learning approach we consider the following procedures:

  • A disturbance in roll and pitch, to investigate the quad-copter's ability to mimic the dynamics of the fixed-wing at the limits of the flight envelope.

  • A comparison to real-world flight data, to investigate our procedure's ability to mimic the aeroplane.

VI-A Disturbance Tracking

Figures 3 & 4 plot the 3-dimensional position and linear velocity states for the two vehicles respectively. The orange dash-dot vertical line in both figures marks the point at which the disturbance was initiated. For the pitch and roll disturbances we apply an instantaneous up-elevator and left-aileron deflection, respectively, for 0.2 s. In both cases we use 50% of the maximum control deflection for the disturbance.

Fig. 3: The multi-rotor (solid line) tracks the pitch disturbance well, but becomes saturated when aiming to maintain level with the fixed-wing (dashed-diamond line) following the roll disturbance.
Fig. 4: The multi-rotor maintains a similar average speed throughout the manoeuvre in order to minimize the position cost. Note, the multi-rotor does not track linear velocity as part of its cost function.

Control saturation is shown following the roll disturbance at 5 seconds in figure 3, as the quad-copter encounters a slight (10 m) departure from tracking in altitude. The system effectively runs out of the power required to maintain the rate of climb whilst turning. The controller is, however, able to maintain higher power for longer, and recaptures control by effectively leading the velocity of the fixed-wing aircraft, shown from 9.0 s to 12.0 s in figure 4. We find this analysis useful to understand the limitations of our platform vehicle; if tracking closely in this domain were desirable, we could increase the performance of our vehicle to maintain position.

Fig. 5: The control input to each of the quad-copter’s 4 motors under the pitch and roll disturbances. The roll disturbance is saturated from 5.5s to 10.0s

Figure 5 shows the control input required to mimic the fixed-wing for each disturbance. The noise at the beginning of each control plot is caused by cold-starting the optimizer, as the control sequence starts with all values set to zero. The solution is then progressively improved by warm-starting the optimizer with the last best solution. Once converged to a local optimum, the variance in control input decays and the noise in the system reduces. Importantly, this does not cause the aircraft to diverge from the mimicked dynamics, despite following a sub-optimal solution.

Fig. 6: With a lag, the controller response is noisier, but tracking accuracy (not shown) is unchanged.

To simulate the effect of un-modelled dynamics we add a first-order lag filter to the multi-rotor's motors, given as

  u'_t = u'_{t−1} + (Δt/τ)(u_t − u'_{t−1})

where u_t is the command sent to the linear motor model at time t, Δt is the step-size of the simulation, and τ is the size of the lag. We do not, however, include the lag model within the MPC model. We found this did not significantly degrade the quad-copter's ability to track the commanded signal but, as expected, it slightly increases the noise of the motor input command, shown in figure 6.
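The lag update is a one-line filter that pulls the delivered command toward the raw command with time constant τ; the Δt and τ values below are illustrative, not the paper's.

```python
def lag_filter(u_lagged, u_cmd, dt=0.01, tau=0.05):
    """One step of a first-order lag: u' += (dt / tau) * (u - u')."""
    return u_lagged + (dt / tau) * (u_cmd - u_lagged)

# A step command converges toward 1.0 over successive updates.
u = 0.0
for _ in range(100):
    u = lag_filter(u, 1.0)
```

Since the MPC model omits this filter, its optimizer effectively fights a small un-modelled delay, which shows up as the extra command noise in figure 6.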

VI-B Real-World Comparison

To validate our approach on a real-world dataset, we use our algorithm to mimic the motion of an x8 on a gas-sensing mission to the summit of Volcán de Fuego in Guatemala, using the same platform drone. We consider a demanding 60-second climbing-turn section, shown in figure 7.

Fig. 7: Climbing Turn section starting at 15,000’, around Volcán de Fuego. Solid orange segment indicates flight track used over the 60 second period.

We find the multi-rotor platform is capable of tracking the fixed-wing demonstration with little to no error through the whole manoeuvre, with relatively small control noise, shown in figure 8. Interestingly, the controller approaches saturation during the level-out from the turn, where it needs to turn aggressively and continue the climb.

Fig. 8: Position during manoeuvre in figure 7, aircraft dashed, multi-rotor solid, showing little to no error. Lower plot shows the control commands required to track the manoeuvre.

We find that the trajectory tracking error on the real-world data in figure 8 is relatively small, with a small mean squared error in each position axis.

VII Future Work & Limitations

A key limitation of our approach is the time taken to calculate trajectories for our MPC loop. Even in the best case, the update frequency of our controller on modern hardware is 10 to 100 times slower than is required to run online. A large constraint is the need to use a sequential quadratic programming solver, as the core algorithm is fundamentally serial and the computational expense of calculating gradients and simulating model state roll-outs is significant. Information-theoretic MPC methods such as MPPI [47] offer a solution to this challenge, via a sampling-based MPC procedure based on KL divergence and free energy. This allows us to take full advantage of parallel compute and does not require gradients to be calculated for the optimization procedure. Initially, we found that running MPPI with the quad-copter simulator on parallel CPU compute provided insufficient samples to converge to a stable solution. We therefore aimed to use a neural network to represent our model dynamics, allowing us to take advantage of accelerated parallel GPU compute and collect many more samples faster. To train our network we investigated bootstrapping from demonstrations collected with both random actions and an MPC oracle, along with simply training within the MPPI framework. A unique challenge that stems from the quad-copter's fundamental instability is that random actions frequently lead to the aircraft rotating into an uncontrolled state. The original MPPI paper [47] solves this on the quad-copter task by including angular rates as part of the control input. We found that whilst position and linear/angular velocities can be predicted with small error, angles could only be learned with a substantial error, far too large to use with MPPI.
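A minimal sketch of the MPPI update discussed here [47]: sample control perturbations, roll out the model, and update the nominal sequence with exponentially weighted averages of trajectory cost. A toy 1-D integrator stands in for the quad-copter model; the sample count, noise scale and temperature are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, LAMBDA = 10, 256, 1.0  # horizon, samples, temperature (assumed)

def rollout_cost(us):
    """Toy cost: drive a 1-D integrator's position toward 1.0."""
    x, cost = 0.0, 0.0
    for u in us:
        x += 0.1 * u
        cost += (x - 1.0) ** 2
    return cost

def mppi_step(u_nominal):
    eps = rng.normal(0.0, 0.3, size=(K, H))        # sampled perturbations
    costs = np.array([rollout_cost(u_nominal + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / LAMBDA)    # importance weights
    w /= w.sum()
    return u_nominal + w @ eps                     # cost-weighted update

u = np.zeros(H)
for _ in range(20):
    u = mppi_step(u)
cost_after = rollout_cost(u)
```

The K rollouts are independent, which is what makes the method attractive for parallel GPU compute once the dynamics are represented by a learned model.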

More recent Model-Based RL algorithms such as MBPO [21] or PETS [3] may offer a better solution to our control problem, with their CMA-ES [16] and ensemble maximum-likelihood based model learning procedures. Alternatively, we may simply need to collect more demonstrations for supervised training, either from the MPC oracle or from human demonstrations.

Once this computational bottleneck is overcome, we aim to experiment with the controller in the real world, in settings such as online episodic reinforcement learning and safety-constrained RL tasks, for example flying at high speed around a car park or navigating to landing spots in built-up areas.


  • [1] E. Bohn, E. M. Coates, S. Moe, and T. A. Johansen (2019-06) Deep reinforcement learning attitude control of fixed-wing uavs using proximal policy optimization. 2019 International Conference on Unmanned Aircraft Systems (ICUAS). External Links: Link, Document Cited by: §I.
  • [2] Calspan (website). Accessed 2022-02-14. External Links: Link Cited by: §II.
  • [3] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §VII.
  • [4] R. J. Clarke, L. Fletcher, C. Greatwood, A. Waldock, and T. S. Richardson (2020-01) Closed-Loop Q-Learning Control of a Small Unmanned Aircraft. In AIAA Scitech 2020 Forum, Orlando, FL. External Links: Document, ISBN 978-1-62410-595-1 Cited by: §I.
  • [5] DeepMind (website). Accessed 2022-02-14. External Links: Link Cited by: §I.
  • [6] X. Ding, P. Guo, K. Xu, and Y. Yu (2019-01) A review of aerial manipulation of small-scale rotorcraft unmanned robotic systems. Chinese Journal of Aeronautics 32 (1), pp. 200–214. External Links: ISSN 10009361, Document Cited by: §I.
  • [7] DJI (website). Accessed 2022-02-24. External Links: Link Cited by: §V-A.
  • [8] R. S. Edmonson and J. Kemper (2019-01) Adaptive flight control systems on calspan learjet. In AIAA Scitech 2019 Forum, San Diego, California. External Links: Document, ISBN 978-1-62410-578-4 Cited by: §II.
  • [9] T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov (2013) An integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), Vol. , pp. 292–299. External Links: Document Cited by: §II.
  • [10] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine (2018) Leave no trace: learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: 1st item.
  • [11] L. J. Fletcher, R. J. Clarke, T. S. Richardson, and M. Hansen (2021-01-04) Reinforcement learning for a perched landing in the presence of wind. In AIAA Scitech 2021 Forum, (English). Note: 2021 AIAA SciTech Forum ; Conference date: 04-01-2021 Through 15-01-2021 External Links: Document, Link Cited by: §I.
  • [12] J. V. Foster and D. Hartman (2017-06) High-Fidelity Multi-Rotor Unmanned Aircraft System (UAS) Simulation Development for Trajectory Prediction Under Off-Nominal Flight Dynamics. In 17th AIAA Aviation Technology, Integration, and Operations Conference, Denver, Colorado. External Links: Document, ISBN 978-1-62410-508-1 Cited by: 3rd item.
  • [13] D. Gandhi, L. Pinto, and A. K. Gupta (2017) Learning to fly by crashing. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3948–3955. Cited by: §II.
  • [14] S. Gros, M. Zanon, R. Quirynen, A. Bemporad, and M. Diehl (2020) From linear to nonlinear mpc: bridging the gap via the real-time iteration. International Journal of Control 93, pp. 62 – 80. Cited by: §II.
  • [15] K. Gryte, R. Hann, M. Alam, J. Rohac, T. Johansen, and T. Fossen (2018-06) Aerodynamic modeling of the skywalker x8 fixed-wing unmanned aerial vehicle. pp. 826–835. External Links: Document Cited by: §V-A, §V-A.
  • [16] N. Hansen and A. Ostermeier (2001) Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9 (2), pp. 159–195. External Links: Document Cited by: §VII.
  • [17] R. P. Harper and G. E. Cooper (1986-09) Handling qualities and pilot evaluation. Journal of Guidance, Control, and Dynamics 9 (5), pp. 515–529. External Links: ISSN 0731-5090, 1533-3884, Document Cited by: §II.
  • [18] M. Hazara and V. Kyrki (2019) Transferring generalizable motor primitives from simulation to real world. IEEE Robotics and Automation Letters 4 (2), pp. 2172–2179. External Links: Document Cited by: §II.
  • [19] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine (2021-04) How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research 40 (4-5), pp. 698–721. External Links: ISSN 0278-3649, 1741-3176, Document Cited by: 2nd item.
  • [20] ICAO (2016) Procedures for air navigation services, air traffic management. In DOC 4444, pp. 142–164. Cited by: §V-B.
  • [21] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §VII.
  • [22] R. C. Julian, B. Swanson, G. S. Sukhatme, S. Levine, C. Finn, and K. Hausman (2020) Never stop learning: the effectiveness of fine-tuning in robotic reinforcement learning. In CoRL, Cited by: §II.
  • [23] E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza (2018) Deep drone racing: learning agile flight in dynamic environments. In CoRL, Cited by: §I.
  • [24] M. E. Knapp, T. Berger, M. Tischler, and M. C. Cotting (2018-01) Development of a Full Envelope Flight Identified F-16 Simulation Model. In 2018 AIAA Atmospheric Flight Mechanics Conference, Kissimmee, Florida. External Links: Document, ISBN 978-1-62410-525-8 Cited by: §II.
  • [25] A. X. Lee, C. M. Devin, Y. Zhou, T. Lampe, K. Bousmalis, J. T. Springenberg, A. Byravan, A. Abdolmaleki, N. Gileadi, D. Khosid, C. Fantacci, J. E. Chen, A. Raju, R. Jeong, M. Neunert, A. Laurens, S. Saliceti, F. Casarini, M. Riedmiller, raia hadsell, and F. Nori (2021) Beyond pick-and-place: tackling robotic stacking of diverse shapes. In 5th Annual Conference on Robot Learning, External Links: Link Cited by: §I.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016-01) End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17 (1), pp. 1334–1373. External Links: ISSN 1532-4435 Cited by: §I.
  • [27] A. Loquercio, A. I. Maqueda, C. R. del-Blanco, and D. Scaramuzza (2018) DroNet: learning to fly by driving. IEEE Robotics and Automation Letters 3 (2), pp. 1088–1095. External Links: Document Cited by: §I.
  • [28] Y. Lu, K. Hausman, Y. Chebotar, M. Yan, E. Jang, A. Herzog, T. Xiao, A. Irpan, M. Khansari, D. Kalashnikov, and S. Levine (2021) AW-opt: learning robotic skills with imitation andreinforcement at scale. In 5th Annual Conference on Robot Learning, External Links: Link Cited by: §II.
  • [29] M. Madden Architecting a simulation framework for model rehosting. In AIAA Modeling and Simulation Technologies Conference and Exhibit, pp. . External Links: Document, Link, https://arc.aiaa.org/doi/pdf/10.2514/6.2004-4924 Cited by: §V-A.
  • [30] D. Q. Mayne (2014) Model predictive control: recent developments and future promise. Automatica 50 (12), pp. 2967–2986. External Links: Document, ISSN 0005-1098, Link Cited by: §II.
  • [31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. External Links: 1312.5602 Cited by: §I.
  • [32] F. Muratore, C. Eilers, M. Gienger, and J. Peters (2021) Data-efficient domain randomization with bayesian optimization. IEEE Robotics and Automation Letters 6, pp. 911–918. Cited by: §II.
  • [33] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2019) Learning dexterous in-hand manipulation. External Links: 1808.00177 Cited by: §I.
  • [34] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. External Links: Document Cited by: §II.
  • [35] F. Sadeghi and S. Levine (2017-07) CAD2RL: real single-image flight without a single real image. In 13th Conference on Robotics: Science and Systems, pp. 34–44. External Links: Document Cited by: §II.
  • [36] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2017-04) Trust Region Policy Optimization. arXiv:1502.05477 [cs]. External Links: 1502.05477 Cited by: §I.
  • [37] M. Shafer (1992) In-flight simulation studies at the nasa dryden flight research facility. Scientific and Technical Information Program. Cited by: §II.
  • [38] K. P. Srinivasan, B. Eysenbach, S. Ha, J. Tan, and C. Finn (2020) Learning to be safe: deep rl with a safety critic. ArXiv abs/2010.14603. Cited by: 3rd item.
  • [39] C. Sun, J. Orbik, C. M. Devin, B. H. Yang, A. Gupta, G. Berseth, and S. Levine (2021) Fully autonomous real-world reinforcement learning with applications to mobile manipulation. In 5th Annual Conference on Robot Learning, External Links: Link Cited by: §II.
  • [40] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. Note: cite arxiv:1808.01974Comment: The 27th International Conference on Artificial Neural Networks (ICANN 2018) External Links: Link Cited by: §II.
  • [41] M. E. Taylor and P. Stone (2009-12) Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, pp. 1633–1685. External Links: ISSN 1532-4435 Cited by: §II.
  • [42] S. Thrun (1998) Lifelong learning algorithms. In Learning to Learn, pp. 181–209. External Links: ISBN 0792380479 Cited by: 3rd item.
  • [43] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §II.
  • [44] E. Valassakis, Z. Ding, and E. Johns (2020) Crossing the gap: a deep dive into zero-shot sim-to-real transfer for dynamics. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5372–5379. Cited by: §II.
  • [45] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, pp. 261–272. External Links: Document Cited by: §V-B.
  • [46] K. P. Wabersich and M. N. Zeilinger (2018) Safe exploration of nonlinear dynamical systems: a predictive safety filter for reinforcement learning. ArXiv abs/1812.05506. Cited by: §II.
  • [47] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou (2017) Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 1714–1721. External Links: Document Cited by: §VII.
  • [48] H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, and S. Levine (2020) The ingredients of real world robotic reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: §I.