Robots carry the promise of being the future backbone of transport, warehousing, and manufacturing. However, for ubiquitous robotics to materialize, we need to devise methods to develop robotic controllers faster and autonomously—leveraging machine learning and scaling up current design approaches. Top computing hardware and software companies (including Nvidia, Google, and Intrinsic) are now working towards fast physics-based simulations for robot learning. At the same time, because safety is a crucial component of cyber-physical systems operating in the real world, safe learning-based control and safe reinforcement learning (RL) have become bustling areas of academic research over the past few years.
Nonetheless, the fast-paced progress of the field risks exacerbating some of the open problems of safe learning control. The continuous influx of new contributions can hamper the ability to discern the more significant results. We need to establish ways to fairly compare results yielded by learning-based controllers that leverage very different methodologies (as well as shared tools for the development and debugging of these controllers). We also need shared definitions and—more importantly—quantitative benchmarks to assess these controllers’ safety and robustness.
Our work was motivated by the lack of open-source, reusable tools for the comparison of learning-based control and RL research, as observed in our recent survey. While we acknowledge the importance of eventual real-robot experiments, here we focus on simulation as a way to lower the barrier of entry and appeal to a larger fraction of the control and RL communities. To develop safe learning-based robot control, we need a simulation API that can (i) support model-based approaches, (ii) express safety constraints, and (iii) capture real-world non-idealities (such as uncertain physical properties and noisy state estimation). Our ambition is that our software can bring closer, support, and speed up the work of control and RL researchers, allowing them to easily compare results. We strive for simple, modular, and reusable code which leverages two open-source tools popular with each of the two communities: PyBullet’s physics engine and CasADi’s symbolic framework.
| Suite | Physics Engine | Rendering | Systems | Tasks | Randomizations | Constraints | Disturbances | Gym API | Symbolic Models |
|---|---|---|---|---|---|---|---|---|---|
| safe-control-gym | Bullet | TinyRenderer, OpenGL | Cart-pole, Quadrotor | Stabilization, Traj. Track. | Inertial Param., Initial State | State, Input | State, Input, Dynamics | Yes | Yes |
| ai-safety-gridworlds | n/a | Terminal | n/a | Grid Navigation | Initial State, Reward | State | Dynamics, Adversaries | No | No |
| safety-gym | MuJoCo | OpenGL | Point, Car, Quadruped | Navig., Push Buttons, Box | Initial State | State | Adversaries | Yes | No |
| realworldrl-suite | MuJoCo | OpenGL | Cart-pole to Humanoid | Stabilization, Locomotion | Inertial Param., Initial State | State | State, Input, Reward | No | No |
Our contributions are threefold:
- we provide open-source simulation environments with a novel, augmented Gym API—with symbolic dynamics, trajectory generation, and quadratic cost—designed to seamlessly interface with both RL and control approaches;
- safe-control-gym allows specifying constraints and randomizing a robot’s initial state and inertial properties through a portable configuration system—this is crucial to simplify the development and comparison of safe learning-based control approaches;
- finally, our codebase includes open-source implementations of several baselines from traditional control, RL, and learning-based control (which we use to demonstrate how safe-control-gym supports insightful quantitative comparisons across fields).
II Related Work
Simulation environments such as OpenAI’s Gym and DeepMind’s dm_control have been proposed as a way to standardize the development of RL algorithms. However, these often comprise toy or highly abstracted problems that do not necessarily support meaningful comparisons with traditional control approaches. Furthermore, recent work has highlighted that, even using these tools, RL research is often difficult to reproduce, as it might hinge on careful hyper-parameterization or random seeds.
Faster and more accurate physics-based simulators—such as Google’s Brax and Nvidia’s Isaac Gym—are becoming increasingly popular in robotics research. While MuJoCo has been the dominant force behind many physics-based RL environments, it is not an open-source project. In this work, we instead leverage the Python bindings of the open-source C++ Bullet Physics engine—which currently powers several re-implementations of MuJoCo’s original tasks as well as additional robotic simulations, including quadrotors and quadrupeds.
The aspect of safety has been touched upon by previous RL environment suites although, we believe, in ways not entirely satisfactory for the development of safe robot control. DeepMind’s ai-safety-gridworlds is a set of RL environments meant to assess the safety properties (including distributional shift, robustness to adversaries, and safe exploration) of intelligent agents. However, it is not specific to robotics, as these environments are purely grid worlds. OpenAI’s safety-gym and Google’s realworldrl_suite both augment typical RL environments with constraint evaluation. They include a handful of—albeit simplified—robotic platforms, such as 2-wheeled robots and quadrupeds. Similarly to our work, realworldrl_suite also includes perturbations in actions, observations, and physical quantities. However, unlike our work, [8, 9] leverage MuJoCo and lack support for a symbolic framework to express a priori knowledge of a system’s dynamics or its constraints.
While safe-control-gym also includes a Gym-style quadrotor environment, it is worth clarifying that this is especially intended for safe, low-level control rather than vision-based applications—like AirSim  or Flightmare —or multi-agent coordination—like gym-pybullet-drones .
Our work advances the state-of-the-art (summarized in Table I) by providing (i), for the first time, symbolic models of the dynamics, cost, and constraints of an RL environment (to support traditional control and model-based approaches); (ii) customizable, portable, and reusable constraints and physics disturbances (to facilitate comparisons and enhance repeatability); (iii) traditional control and learning-based control baselines (beyond just RL baselines).
III Bridging Reinforcement Learning and Learning-based Control Research
As pointed out in [16, 4], despite the undeniable similarities in their setup, there still exist terminology gaps and disconnects in how optimal control and reinforcement learning research address safe robot control. In our recent survey of the last half-decade of research in safe robot control, we observed significant differences in the use of, and reliance on, prior models and assumptions. We also found a distinct lack of open-source simulations and control implementations—both essential for repeatability and for comparisons across fields and methodologies. With this work, we intend to make it easier for both RL and control researchers to (i) publish results based on open-source simulations, (ii) easily compare against both RL and traditional control baselines, and (iii) quantify safety against shared sets of constraints or dynamics disturbances.
Our open-source suite safe-control-gym comprises three dynamical systems built on two platforms—the cart-pole and the quadrotor (in 1D and 2D configurations)—and two control tasks: stabilization and trajectory tracking. It can be downloaded and installed as:
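A typical installation (the repository location and editable-install workflow are assumptions based on the public project, not recovered from this text):

```shell
# Clone the suite and install it as an editable Python package
git clone https://github.com/utiasDSL/safe-control-gym.git
cd safe-control-gym
pip install -e .
```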
As advised in the control benchmarking literature, “benchmark problems should be complex enough to highlight issues in controller design […] but simple enough to provide easily understood comparisons.” We include the cart-pole as a dynamic system that has been widely adopted to showcase both traditional control and RL since the mid-80s. All three systems in safe-control-gym are unstable. The 1D quadrotor is linear, the 2D one is nonlinear, and the cart-pole is non-minimum phase. The 1D quadrotor is the simplest of the three and can also be used for didactic purposes.
IV-A Cart-pole System
A description of the cart-pole system is given in Figure 3: a cart with mass $m_c$ connects via a prismatic joint to a 1D track; a pole of mass $m_p$ and length $2\ell$ is hinged to the cart. The state vector for the cart-pole is $\mathbf{x} = [x, \dot{x}, \theta, \dot{\theta}]^{\top}$, where $x$ is the horizontal position of the cart, $\dot{x}$ is the velocity of the cart, $\theta$ is the angle of the pole with respect to the vertical, and $\dot{\theta}$ is the angular velocity of the pole. The input to the system is a force $F$, applied to the center of mass (COM) of the cart. In the frictionless case, the equations of motion for the cart-pole system are:
$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left( \frac{-F - m_p \ell \dot{\theta}^2 \sin\theta}{m_c + m_p} \right)}{\ell \left( \frac{4}{3} - \frac{m_p \cos^2\theta}{m_c + m_p} \right)}, \qquad \ddot{x} = \frac{F + m_p \ell \left( \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right)}{m_c + m_p},$$
where $g$ is the acceleration due to gravity.
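For reference, these frictionless dynamics can be written as a plain Python function (an illustrative sketch with made-up default parameter values; the suite itself builds the model symbolically in CasADi):

```python
import math

def cartpole_accelerations(theta, theta_dot, force,
                           m_c=1.0, m_p=0.1, ell=0.5, g=9.8):
    """Frictionless cart-pole accelerations.

    theta: pole angle from the vertical; ell: pole half-length.
    Returns (x_ddot, theta_ddot). Parameter defaults are illustrative only.
    """
    total_m = m_c + m_p
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (-force - m_p * ell * theta_dot**2 * sin_t) / total_m
    theta_ddot = (g * sin_t + cos_t * temp) / (
        ell * (4.0 / 3.0 - m_p * cos_t**2 / total_m))
    x_ddot = (force + m_p * ell * (theta_dot**2 * sin_t
                                   - theta_ddot * cos_t)) / total_m
    return x_ddot, theta_ddot
```

At the upright equilibrium with zero force both accelerations vanish; a small positive pole angle makes the pole fall further, as expected of an unstable system.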
IV-B 1D and 2D Quadrotor Systems
The second and third robotic systems in safe-control-gym are the 1D and the 2D quadrotor. These correspond to the cases in which the movement of a quadrotor is constrained to 1D motion along the $z$-axis and 2D motion in the $xz$-plane, respectively. For a physical quadrotor, these motions can be achieved by setting the four motor thrusts to balance out the forces and torques along the redundant dimensions (i.e., all four thrusts identical, for the 1D case, or identical with respect to the $xz$-plane symmetry, for the 2D case). Schematics of the 1D and 2D quadrotor environments are given in Figure 3.
In the 1D quadrotor case, the state of the system is $\mathbf{x} = [z, \dot{z}]^{\top}$, where $z$ and $\dot{z}$ are the vertical position and velocity of the COM of the quadrotor. The input to the system is the overall thrust $T$ generated by the motors. The equation of motion for the 1D quadrotor system is
$$\ddot{z} = \frac{T}{m} - g,$$
where $m$ is the mass of the quadrotor and $g$ is the acceleration due to gravity.
In the 2D quadrotor case, the state of the system is $\mathbf{x} = [x, \dot{x}, z, \dot{z}, \theta, \dot{\theta}]^{\top}$, where $x$, $z$ and $\dot{x}$, $\dot{z}$ are the translational position and velocity of the COM of the quadrotor in the $xz$-plane, and $\theta$ and $\dot{\theta}$ are the pitch angle and the pitch rate, respectively. The inputs of the system are the thrusts $T_1$, $T_2$ generated by two pairs of motors (one pair on each side of the body’s $y$-axis). The equations of motion for the 2D quadrotor system are as follows:
$$\ddot{x} = \frac{(T_1 + T_2)\sin\theta}{m}, \qquad \ddot{z} = \frac{(T_1 + T_2)\cos\theta}{m} - g, \qquad \ddot{\theta} = \frac{(T_2 - T_1)\, d}{I_{yy}},$$
where $m$ is the mass of the quadrotor, $g$ is the acceleration due to gravity, $d = \ell/\sqrt{2}$ is the effective moment arm ($\ell$ being the arm length of the quadrotor, i.e., the distance from each motor pair to the COM), and $I_{yy}$ is the moment of inertia about the $y$-axis.
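A minimal NumPy sketch of the 2D quadrotor accelerations (parameter defaults are ours—roughly Crazyflie-sized—not the suite's exact values):

```python
import numpy as np

def quad2d_accelerations(theta, T1, T2, m=0.027, arm=0.0397,
                         Iyy=1.4e-5, g=9.8):
    """2D quadrotor accelerations in the x-z plane.

    T1, T2: thrusts of the two motor pairs. Parameter defaults are
    illustrative only, not the suite's configuration.
    """
    d = arm / np.sqrt(2.0)               # effective moment arm
    thrust = T1 + T2
    x_ddot = thrust * np.sin(theta) / m
    z_ddot = thrust * np.cos(theta) / m - g
    theta_ddot = (T2 - T1) * d / Iyy
    return x_ddot, z_ddot, theta_ddot
```

With level attitude and each pair producing half the weight, the vehicle hovers (all accelerations zero); a thrust imbalance produces a pitch acceleration.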
IV-C Stabilization and Trajectory-tracking Tasks
All three systems in Sections IV-A and IV-B can be assigned one of two control tasks: (i) stabilization and (ii) trajectory tracking. In RL, an agent/controller’s performance is expressed by the total collected reward $\sum_t r_t$. The traditional reward function for cart-pole stabilization [18, 10] is simply a positive instantaneous reward for each time step in which the pole is upright (as episodes are terminated when $|\theta|$ exceeds a threshold):
$$r_t = 1, \quad \forall t.$$
For control-based approaches, safe-control-gym allows replacing the RL reward with the quadratic cost:
$$J = \sum_{i} \left( \mathbf{x}_i - \mathbf{x}^{\mathrm{goal}} \right)^{\top} Q \left( \mathbf{x}_i - \mathbf{x}^{\mathrm{goal}} \right) + \left( \mathbf{u}_i - \mathbf{u}^{\mathrm{goal}} \right)^{\top} R \left( \mathbf{u}_i - \mathbf{u}^{\mathrm{goal}} \right),$$
where $(\mathbf{x}^{\mathrm{goal}}, \mathbf{u}^{\mathrm{goal}})$ is an equilibrium pair for the system to which we want to stabilize and $Q$, $R$ are parameters of the cost function. The negated quadratic cost can also be used as the RL reward for the quadrotor stabilization task.
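A minimal NumPy version of this cost (the array-shape conventions are ours, for illustration):

```python
import numpy as np

def quadratic_cost(xs, us, x_goal, u_goal, Q, R):
    """Total quadratic cost over a trajectory.

    xs: (N, n) states, us: (N, m) inputs; Q, R are the weight matrices.
    """
    J = 0.0
    for x, u in zip(xs, us):
        ex, eu = x - x_goal, u - u_goal
        J += ex @ Q @ ex + eu @ R @ eu
    return J
```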
For trajectory tracking, safe-control-gym includes a trajectory generation module capable of generating circular, sinusoidal, lemniscate, or square trajectories for episodes with an arbitrary number of control steps. The module returns references $\mathbf{x}^{\mathrm{ref}}$, $\mathbf{u}^{\mathrm{ref}}$. A quadrotor example tracking these different trajectories is included among the suite's examples.
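For illustration, circular and lemniscate position references like those produced by the module can be generated as follows (a NumPy sketch; the function names and state layout are ours, not the suite's API):

```python
import numpy as np

def circle_reference(n_steps, dt, radius=1.0, period=6.0):
    """Position/velocity reference along a circle in the x-z plane.

    Returns an (n_steps, 4) array with columns [x, x_dot, z, z_dot].
    """
    t = np.arange(n_steps) * dt
    w = 2.0 * np.pi / period
    x, z = radius * np.cos(w * t), radius * np.sin(w * t)
    x_dot, z_dot = -radius * w * np.sin(w * t), radius * w * np.cos(w * t)
    return np.stack([x, x_dot, z, z_dot], axis=1)

def lemniscate_reference(n_steps, dt, scale=1.0, period=6.0):
    """Figure-eight (lemniscate of Gerono) position reference, (n_steps, 2)."""
    t = np.arange(n_steps) * dt
    w = 2.0 * np.pi / period
    return np.stack([scale * np.sin(w * t),
                     scale * np.sin(w * t) * np.cos(w * t)], axis=1)
```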
IV-D safe-control-gym Extended API
To provide native support to open-source RL libraries, safe-control-gym adopts OpenAI Gym’s interface. However, to the best of our knowledge, we are the first to extend this API with the ability to provide a learning agent/controller with a priori knowledge of the dynamical system. This is of fundamental importance to also support the development of—and comparison with—learning-based control approaches (which typically leverage insights about the physics of a robotic system). We believe this prior information should not be discarded but rather integrated into the learning process. An overview of safe-control-gym’s features—and how to interact with a learning-based controller—is presented in Figure 2. Our benchmark suite can be used, for example, to answer the question of how much data efficiency—which is crucial in robot learning—is forfeited by model-free RL approaches that do not exploit prior knowledge (see Section VI). Each of safe-control-gym’s environments can also be run in headless mode, producing printouts from both the original Gym API and our extended API.
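The shape of the extended interface can be illustrated with a self-contained mock (ours, not the suite's actual code): reset() returns both the observation and a reset_info dictionary carrying the a priori model and the task references, while step() reports constraint evaluations in its info dictionary.

```python
import numpy as np

class ExtendedEnvMock:
    """Illustrative mock of the augmented Gym API described in the text."""

    def reset(self):
        obs = np.zeros(2)                       # initial state x_0
        reset_info = {
            "symbolic_model": lambda x, u: x,   # placeholder for the a priori f(x, u)
            "x_reference": np.zeros((100, 2)),  # state reference trajectory
            "u_reference": np.zeros((100, 1)),  # input reference trajectory
        }
        return obs, reset_info

    def step(self, action):
        obs, reward, done = np.zeros(2), 0.0, False
        info = {"constraint_values": np.array([-1.0])}  # c(x, u) <= 0 satisfied
        return obs, reward, done, info
```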
IV-D1 Symbolic Models
We use CasADi, an open-source symbolic framework for nonlinear optimization and algorithmic differentiation, to include symbolic models of (i) our systems’ a priori dynamics—i.e., those in Sections IV-A and IV-B, not accounting for the disturbances in IV-D3—as well as (ii) the quadratic cost function from Section IV-C and (iii) optional constraints (see Section IV-D2). These models, together with the initial state of the system $\mathbf{x}_0$ and the task references $\mathbf{x}^{\mathrm{ref}}$, $\mathbf{u}^{\mathrm{ref}}$, are exposed by our API in a reset_info dictionary returned by each reset of an environment.
IV-D2 Constraints
The ability to specify, evaluate, and enforce one or more constraints on state $\mathbf{x}$ and input $\mathbf{u}$:
$$c(\mathbf{x}, \mathbf{u}) \le 0$$
is essential for safe robot control. While previous RL environments including state constraints exist [8, 9], our implementation is the first to also provide (i) their symbolic representation and (ii) the ability to create bespoke ones when instantiating an environment (see Section IV-D4). Our current implementation includes default constraints and supports user-specified ones in multiple forms (linear, bounded, quadratic) on either the system’s state, input, or both. Constraint evaluations are included in the info dictionary returned at each environment’s step.
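A sketch of how such a constraint can be represented and evaluated numerically (an illustrative NumPy class; the suite additionally keeps a CasADi symbolic form of $c$):

```python
import numpy as np

class LinearConstraint:
    """Linear constraint c(y) = A y - b <= 0 on a state or input vector y."""

    def __init__(self, A, b):
        self.A = np.atleast_2d(A)
        self.b = np.atleast_1d(b)

    def value(self, y):
        """Constraint values; non-positive entries are satisfied."""
        return self.A @ y - self.b

    def is_violated(self, y, tol=1e-8):
        return bool(np.any(self.value(y) > tol))
```

For example, a bound $|x_1| \le 1$ on the first state becomes two rows of $A$ with $b = [1, 1]$.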
IV-D3 Disturbances and Randomizations
In developing safe control approaches, we are often confronted with the fact that models like the ones in Sections IV-A and IV-B are not a complete or fully truthful representation of the system under test. safe-control-gym provides several ways to implement non-idealities that mimic real-life robots, including:
the randomization (from a given probability distribution) of the initial state of the system;
the randomization (from given probability distributions) of the inertial parameters—i.e., $m_c$, $m_p$, and $\ell$ for the cart-pole, and $m$ and $I_{yy}$ for the quadrotor;
disturbances (in the form of white noise, step, or impulse) applied to the action input sent from the controller to the robot;
disturbances (in the form of white noise, step, or impulse) applied to the observations of the state returned by an environment to the controller;
dynamics disturbances, including additional forces applied to a robot using PyBullet APIs; these can also be set deterministically from outside the environment, e.g., to implement adversarial training as in [20].
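The first three disturbance types above can be sketched as additive signals (illustrative; in the suite they are configured per environment rather than generated by a helper like this):

```python
import numpy as np

def make_disturbance(kind, magnitude, n_steps, step_at=0, rng=None):
    """Generate an additive disturbance signal to apply to actions,
    observations, or dynamics: white noise, step, or impulse."""
    rng = rng or np.random.default_rng(0)
    d = np.zeros(n_steps)
    if kind == "white_noise":
        d = rng.normal(0.0, magnitude, n_steps)
    elif kind == "step":
        d[step_at:] = magnitude         # constant offset from step_at onward
    elif kind == "impulse":
        d[step_at] = magnitude          # single-step spike
    else:
        raise ValueError(kind)
    return d
```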
IV-D4 Configuration System
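As an illustration of such a portable configuration file (the key names below are ours, not necessarily the suite's actual schema), a single YAML override can specify the task, its randomizations, and its constraints:

```yaml
# Illustrative configuration override; key names are hypothetical.
task_config:
  task: stabilization
  randomized_init: true
  init_state_randomization:
    init_theta: {distrib: uniform, low: -0.1, high: 0.1}
  randomized_inertial_prop: true
  inertial_prop_randomization:
    pole_length: {distrib: uniform, low: -0.1, high: 0.1}
  constraints:
    - constraint_form: bounded_constraint
      constrained_variable: state
      lower_bounds: [-2.0]
      upper_bounds: [2.0]
```

Keeping these choices in one file is what makes an experiment portable and repeatable across controllers.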
IV-E Computational Performance
Because deep learning methods can be especially data-hungry—and the ability to collect experimental datasets or generate simulated ones is one of the bottlenecks of learning-based robotics—we assessed the computational performance of safe-control-gym on a system with a 2.30GHz Quad-Core i7-1068NG7 CPU and 32GB of 3733MHz LPDDR4X memory, running Python 3.7 under macOS 11. Table II summarizes the obtained simulation speed-ups (with respect to the wall clock) for the cart-pole and 2D quadrotor environments, in headless mode or using the GUI, with or without constraint evaluation, and for different choices of control and physics integration frequencies. In headless mode, a single instance of safe-control-gym collects data 10 to 20 times faster than real time, with accurate physics stepped by PyBullet at 1000Hz.
V Control Algorithms
The codebase of safe-control-gym also comprises implementations of an array of control approaches, ranging from traditional control to safety-certified control, by way of learning-based control and safe reinforcement learning.
V-A Control and Safe Control Baselines
As baselines, our benchmark suite includes standard state-feedback control approaches such as the linear quadratic regulator (LQR) and iterative LQR (iLQR). The LQR controller deals with systems having linear dynamics (as in (2)) and quadratic cost (as in (5)). For nonlinear systems (e.g., the 2D quadrotor in (3) and cart-pole in (1)), the LQR controller uses local linear approximations of the nonlinear dynamics. The iLQR controller is similar to the LQR but iteratively improves performance by finding better local approximations of the cost function (5) and system dynamics using the state and input trajectories from the previous iteration. All the environments in safe-control-gym expose the symbolic model of the a priori dynamics, facilitating the computation of its Jacobians as well as the Jacobians and Hessians of the cost function. While we include LQR and iLQR to showcase the model-based aspect of our benchmark, the symbolic expressions of the first-order and second-order terms included in each environment can be equivalently leveraged by other model-based control approaches.
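For a discrete-time linear system, the LQR gain can be obtained by iterating the Riccati recursion; a self-contained NumPy sketch (the double-integrator system below is a generic example, not one of the suite's models):

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """LQR gain K via fixed-point iteration of the discrete Riccati equation.

    Returns K such that u = -K x minimizes the infinite-horizon quadratic cost.
    """
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Double integrator (e.g., cart position and velocity), dt = 0.1
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = lqr_gain(A, B, np.eye(2), np.eye(1))
```

The same gain can be obtained from scipy.linalg.solve_discrete_are; the iteration is shown here only to keep the sketch dependency-free.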
We also include two predictive control baselines: Linear Model Predictive Control (LMPC) and Nonlinear Model Predictive Control (NMPC) . At every control step, Model Predictive Control (MPC) solves a constrained optimization problem to find a control input sequence, over a finite horizon, that minimizes the cost of the system’s predicted dynamics—possibly subject to input and state constraints. Then, the first optimal control input from the sequence is applied. While NMPC uses the nonlinear system model, LMPC uses the linearized approximation to predict the evolution of the system, sacrificing prediction accuracy for computational efficiency. In our codebase, CasADi’s opti framework is used to formulate the optimization problem. As explained in Section IV-D, safe-control-gym provides all the system’s components required by MPC (a priori dynamics, constraints, cost function) as CasADi models.
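The receding-horizon idea can be sketched for a linear system with box input constraints; this illustrative NumPy implementation solves the finite-horizon quadratic program by projected gradient descent (the suite instead formulates the problem with CasADi's opti):

```python
import numpy as np

def mpc_control(x0, A, B, Q, R, horizon, u_min, u_max, iters=300):
    """One receding-horizon MPC step for x+ = A x + B u with box input bounds."""
    n, m = B.shape
    # Prediction matrices: stacked states x_1..x_H = Sx x0 + Su u
    Sx = np.zeros((horizon * n, n))
    Su = np.zeros((horizon * n, horizon * m))
    Ak = np.eye(n)
    for k in range(horizon):
        Ak = A @ Ak
        Sx[k * n:(k + 1) * n] = Ak
        for j in range(k + 1):
            Su[k * n:(k + 1) * n, j * m:(j + 1) * m] = \
                np.linalg.matrix_power(A, k - j) @ B
    Qbar = np.kron(np.eye(horizon), Q)
    Rbar = np.kron(np.eye(horizon), R)
    H = Su.T @ Qbar @ Su + Rbar          # quadratic term in u
    f = Su.T @ Qbar @ Sx @ x0            # linear term in u
    u = np.zeros(horizon * m)
    step = 1.0 / np.linalg.eigvalsh(H).max()
    for _ in range(iters):
        # Gradient step on the QP objective, then project onto the box
        u = np.clip(u - step * (H @ u + f), u_min, u_max)
    return u[:m]                          # apply only the first optimal input
```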
V-B Reinforcement Learning Baselines
As safe-control-gym extends the original Gym API, any compatible RL algorithm can be directly applied to our environments. In our codebase, we include two of the most well-known RL baselines: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). These are model-free approaches that map sensor/state measurements to control inputs (without leveraging a dynamics model) using neural network (NN)-based policies. Both PPO and SAC have been shown to work on a wide range of simulated robotics tasks, some of which involve complex dynamics. We adapt their implementations from stable-baselines3 and OpenAI’s Spinning Up, with a few modifications to also support our suite’s configuration system. PPO and SAC are not natively safety-aware approaches and guarantee neither constraint satisfaction nor robustness (beyond the generalization properties of NNs).
V-C Safe Learning-based Control
Safe learning-based control approaches use past data to improve the estimate of a system’s true dynamics while providing guarantees on stability and/or constraint satisfaction. One such approach, included in safe-control-gym, is GP-MPC. This method models uncertain dynamics using a Gaussian process (GP), which it uses both to better predict the future evolution of the system and to tighten constraints, based on the confidence of the dynamics along the prediction horizon. GP-MPC has been demonstrated for the control of ground-based mobile robots. Our implementation builds on the LMPC controller, uses the environments’ symbolic a priori model, and relies on gpytorch for GP modelling and optimization. GP-MPC can accommodate both environment- and controller-specific constraints.
V-D Safe and Robust Reinforcement Learning
Building upon the RL baselines, we implemented three safe RL approaches that address the problems of constraint satisfaction and robust generalization. The safety-layer approach in  pre-trains NN models to approximate linearized state constraints. These learned constraints are then used to filter potentially unsafe inputs from an RL controller via least-squares projection. We add such a safety layer to PPO and apply it to our benchmark tasks with simple bound constraints. Robust RL aims to learn policies that generalize across systems or tasks. We adapt two methods based on adversarial learning: RARL  and RAP . These model dynamics disturbances as a learning adversary and train the policy against increasingly stronger ones. The resulting controllers are shown, in simulation [20, 28], to be robust against parameter mismatch. These methods can be directly trained in safe-control-gym, thanks to its dynamics disturbances API (see Section IV-D3).
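For a single linearized constraint $g^{\top} u \le c$, the least-squares projection used by the safety layer has a closed form (an illustrative sketch; in the cited approach, $g$ and $c$ come from the pre-trained NN constraint models):

```python
import numpy as np

def safety_layer_project(u, g, c):
    """Project an action u onto the half-space g @ u <= c in the
    least-squares sense (single active constraint)."""
    violation = g @ u - c
    if violation <= 0.0:
        return u                           # already safe: action unchanged
    return u - (violation / (g @ g)) * g   # move to the constraint boundary
```

After projection, a violating action lands exactly on the boundary, i.e., $g^{\top} u = c$.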
V-E Safety Certification of Learned Controllers
Learned controllers lacking formal guarantees can be rendered safe by safety filters. These filters minimally modify unsafe control inputs so that the applied control input maintains the system’s state within a safe set. Model predictive safety certification (MPSC) uses a finite-horizon constrained optimization problem with a discrete-time predictive model to prevent a learning-based controller from violating constraints. In prior work, we presented an implementation of MPSC for PPO, simultaneously leveraging safe-control-gym’s CasADi a priori dynamics and constraints and its Gym RL interface.
Control barrier functions (CBF) are safety filters for continuous-time nonlinear control-affine systems using quadratic programming (QP) with a constraint on the CBF’s time derivative with respect to the system dynamics . In the case of model errors, the resulting errors in the CBF’s time derivative can be learned by an NN . Learning-based CBF filters have been applied to safely control a segway  and a quadrotor . Again, our CBF implementation relies on the a priori model and constraints exposed by the API of safe-control-gym. The CBF’s time derivative is also efficiently determined using CasADi. Constraints can be handled as long as the constraint set contains the CBF’s superlevel set.
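In the single-input case, the CBF quadratic program has a closed-form solution; the sketch below is ours, for illustration (the general multi-constraint QP is what such filters actually solve). With dynamics $\dot{x} = f(x) + g(x)u$ and barrier $h$, the filter enforces $\nabla h \cdot f + \nabla h \cdot g \, u \ge -\alpha h$ while staying as close as possible to the desired input:

```python
def cbf_filter(u_des, h, dhdx_f, dhdx_g, alpha=1.0):
    """Closed-form scalar CBF-QP: minimally modify u_des so that
    h_dot = dhdx_f + dhdx_g * u >= -alpha * h."""
    if dhdx_g == 0.0:
        return u_des                  # input does not affect h_dot
    bound = (-alpha * h - dhdx_f) / dhdx_g
    if dhdx_g > 0.0:
        return max(u_des, bound)      # constraint is u >= bound
    return min(u_des, bound)          # constraint is u <= bound
```

For example, with $\dot{x} = u$ and safe set $h(x) = 1 - x \ge 0$ ($\nabla h \cdot g = -1$), the filter caps any desired input that would push $x$ past the boundary.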
VI Experiments
To demonstrate how our work supports the development and testing of all the families of control algorithms discussed in Section V, we present their control performance (Figures 4 and 5), learning efficiency (Figure 6), and constraint satisfaction (Figure 7) across identical safe-control-gym task environments. Figure 8 also demonstrates how to use our suite to test a controller’s robustness to disturbances and parametric uncertainty.
As we did not focus on each approach’s parameter tuning, the goal here is not to claim the superiority of one approach over another but rather to show how safe-control-gym allows plotting RL and control results on a common set of axes.
VI-A Control Performance
In Figures 4 and 5, we show that the LQR (with both the true and 50%-overestimated parameters), GP-MPC, PPO, and SAC are able to stabilize the cart-pole and track the quadrotor trajectory reference. For the stabilization task, GP-MPC closely matches the closed-loop trajectory of the LQR with true parameters, even though its a priori model was the same one given to the LQR with overestimated parameters. This shows how GP-MPC can overcome imperfect initial knowledge through learning. Both PPO and SAC yield substantially different closed-loop trajectories when compared to LQR and GP-MPC. This is likely a result of the differences between the reward (4) and cost (5) functions, for the stabilization task, and in the RL observations, for trajectory tracking. Indeed, the choices made in the expression of the objective add a layer of complexity to the equitable comparison of RL and learning-based control. Tracking the sinusoidal trajectories (Figure 5) introduces low-frequency oscillations in $x$ and $z$ for PPO and SAC. GP-MPC’s planning horizon, on the other hand, effectively avoids these.
VI-B Learning Performance and Data Efficiency
Figure 6 shows how much data GP-MPC, PPO, and SAC require to achieve comparable performance on an identical evaluation cost. This plot showcases the type of interdisciplinary comparisons enabled by safe-control-gym. In both plots, the untrained GP-MPC displays a performance that the RL approaches only match after a substantially larger amount of simulated experience; GP-MPC converges to its optimal performance with roughly one tenth of the data. This highlights how learning-based control approaches can be orders of magnitude more data-efficient than model-free RL. However, this is largely the result of knowing a reasonable a priori model (whether accurate or not). The evaluation costs of PPO and SAC exhibit large oscillations and learning instability, not uncommon in deep RL. Once converged, SAC and PPO reach performance comparable to GP-MPC in the stabilization task. PPO also matches, though less consistently, GP-MPC on the tracking task.
VI-C Safety: Constraint Satisfaction
In Figure 7, we investigate the impact of learning and training data on the constraint violations of a learning-based controller or safe RL agent. The top plot summarizes the data efficiency of these approaches on the cart-pole stabilization task. Again, leveraging an a priori model, GP-MPC and the learning-based CBF require far fewer training examples than PPO with a safety layer to minimize the number of constraint violations. After training, GP-MPC, the learning-based CBF, and safety-layer PPO all achieve similar constraint satisfaction performance. Vanilla PPO also reduces the number of constraint violations but cannot match the performance of GP-MPC and the learning-based CBF.
The bottom plot of Figure 7 shows reduced constraint violations for GP-MPC and PPO with a safety layer for the 2D quadrotor tracking task. Compared to a linear MPC with overestimated parameters, the GP-MPC meets the constraints, finding a compromise between performance and constraint satisfaction. PPO with a safety layer, on the other hand, neither tracks the desired trajectory nor is it able to fully guarantee constraint satisfaction.
VI-D Safety: Robustness
Figure 8 shows how robust controllers and RL agents are with respect to parametric uncertainty (in the pole length) and white noise (on the input) for cart-pole stabilization. RARL is trained with pole-length randomization and improves the robustness of baseline PPO. RAP is trained against adversarial input disturbances and, as expected, shows robust performance to input noise but not to parameter mismatch. The model-based approaches, LQR and GP-MPC, appear less affected by parameter uncertainty than model-free RL but are equally or more hindered by input noise.
VII Conclusions and Future Work
In this letter, we introduced safe-control-gym, a suite of simulation and evaluation environments for safe learning-based control. We were motivated by the lack of an easy-to-use software benchmark exposing all the features required to support the development of approaches from both the RL and control theory communities. In safe-control-gym, we combine (i) a physics engine-based simulation with (ii) the description of the available prior knowledge and safety constraints using a symbolic framework. By doing so, we enable the development and testing of a wide range of approaches, from model-free RL to learning-based MPC. We believe that safe-control-gym will make it easier for researchers from the RL and control communities to compare their progress, especially in the quantification of safety and robustness. Our next steps include extending safe-control-gym to more robotic platforms and tasks, and adding implementations of further safe learning-based control approaches.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs Program, the CIFAR AI Chair, and Mitacs’s Elevate Fellowship program.
-  V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State, “Isaac gym: High performance gpu-based physics simulation for robot learning,” arXiv:2108.10470 [cs.RO], 2021.
-  C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax – a differentiable physics engine for large scale rigid body simulation,” arXiv:2106.13281 [cs.RO], 2021.
-  W. Tan-White, “Introducing intrinsic,” Jul 2021. [Online]. Available: https://blog.x.company/introducing-intrinsic-1cf35b87651
-  L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. to appear, 2021.
-  E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
-  J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi – A software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
-  J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “Ai safety gridworlds,” arXiv:1711.09883 [cs.LG], 2017.
-  A. Ray, J. Achiam, and D. Amodei, “Benchmarking Safe Exploration in Deep Reinforcement Learning,” https://cdn.openai.com/safexp-short.pdf, 2019.
-  G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, “An empirical investigation of the challenges of real-world reinforcement learning,” arXiv:2003.11881 [cs.LG], 2021.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv:1606.01540 [cs.LG], 2016.
-  P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32(1). Palo Alto, CA: AAAI Press, Apr. 2018.
-  J. Collins, S. Chand, A. Vanderkop, and D. Howard, “A review of physics simulators for robotic applications,” IEEE Access, vol. 9, pp. 51 416–51 431, 2021.
-  J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, “Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control,” arXiv:2103.02142 [cs.RO], 2021.
-  S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics. Springer Int’l Publishing, 2018, pp. 621–635.
-  Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, “Flightmare: A flexible quadrotor simulator,” in Proc. of the 4th Conference on Robot Learning. Cambridge MA, USA.: PMLR, 2020.
-  B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, no. 1, pp. 253–279, 2019.
-  J. P. How, “Benchmarks [from the editor],” IEEE Control Systems Magazine, vol. 35, no. 1, pp. 6–7, 2015.
-  A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, 1983.
-  R. V. Florian, “Correct equations for the dynamics of the cart-pole system,” 2007.
-  L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning. N.p.: PMLR, 06–11 Aug 2017, vol. 70, pp. 2817–2826.
-  J. Buchli, F. Farshidian, A. Winkler, T. Sandy, and M. Giftthaler, “Optimal and learning control for autonomous robots,” arXiv:1708.09342 [cs.SY], 2017.
-  J. B. Rawlings, D. Q. Mayne, and M. M. Diehl, Model Predictive Control: Theory, Computation, and Design, 2nd ed. Nob Hill Publishing, 2020.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347 [cs.LG], 2017.
-  T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. N.p.: PMLR, 10–15 Jul 2018, pp. 1861–1870.
-  A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann, “Stable baselines3,” https://github.com/DLR-RM/stable-baselines3, 2019.
-  L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2020.
-  G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv:1801.08757 [cs.AI], 2018.
-  E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. Bayen, “Robust reinforcement learning using adversarial populations,” arXiv:2008.01825 [cs.LG], 2020.
-  K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” in 2018 IEEE Conference on Decision and Control (CDC). Piscataway, NJ: IEEE, 2018, pp. 7130–7135.
-  A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European Control Conference (ECC). Piscataway, NJ: IEEE, 2019, pp. 3420–3431.
-  A. Taylor, A. Singletary, Y. Yue, and A. Ames, “Learning for safety-critical control with control barrier functions,” in Proceedings of the 2nd Conference on Learning for Dynamics and Control. N.p.: PMLR, 10–11 Jun 2020, vol. 120, pp. 708–717.
-  L. Wang, E. A. Theodorou, and M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: IEEE, 2018, pp. 2460–2465.