safe-control-gym: a Unified Benchmark Suite for Safe Learning-based Control and Reinforcement Learning

09/13/2021, by Zhaocong Yuan, et al.

In recent years, reinforcement learning and learning-based control – as well as the study of their safety, crucial for deployment in real-world robots – have gained significant traction. However, to adequately gauge the progress and applicability of new results, we need the tools to equitably compare the approaches proposed by the controls and reinforcement learning communities. Here, we propose a new open-source benchmark suite, called safe-control-gym. Our starting point is OpenAI's Gym API, which is one of the de facto standards in reinforcement learning research. Yet, we highlight the reasons for its limited appeal to control theory researchers – and safe control, in particular – for example, the lack of analytical models and constraint specifications. Thus, we propose to extend this API with (i) the ability to specify (and query) symbolic models and constraints and (ii) the ability to introduce simulated disturbances in the control inputs, measurements, and inertial properties. We provide implementations for three dynamic systems – the cart-pole, 1D, and 2D quadrotor – and two control tasks – stabilization and trajectory tracking. To demonstrate our proposal – and in an attempt to bring research communities closer together – we show how to use safe-control-gym to quantitatively compare the control performance, data efficiency, and safety of multiple approaches from the areas of traditional control, learning-based control, and reinforcement learning.




I Introduction

Robots carry the promise of being the future backbone of transport, warehousing, and manufacturing. However, for ubiquitous robotics to materialize, we need to devise methods to develop robotic controllers faster and autonomously—leveraging machine learning and scaling up current design approaches. Top computing hardware and software companies (including Nvidia [1], Google [2], and intrinsic [3]) are now working towards fast physics-based simulations for robot learning. At the same time, because safety is a crucial component of cyber-physical systems operating in the real world, safe learning-based control and safe reinforcement learning (RL) have become bustling areas of academic research over the past few years [4].

Nonetheless, the fast-paced progress of the field risks exacerbating some of the open problems of safe learning control. The continuous influx of new contributions can hamper the ability to discern the more significant results. We need to establish ways to fairly compare results yielded by learning-based controllers that leverage very different methodologies (as well as shared tools for the development and debugging of these controllers). We also need shared definitions and—more importantly—quantitative benchmarks to assess these controllers’ safety and robustness.

Fig. 1: safe-control-gym allows the effortless configuration of uncertain inertial properties as well as the specification of constraints and reference trajectories, essential for the development of safe control algorithms.

Our work was motivated by the lack of open-source, re-usable tools for the comparison of learning-based control research and RL, as observed in [4]. While we acknowledge the importance of eventual real-robot experiments, here, we focus on simulation as a way to lower the barrier of entry and appeal to a larger fraction of the control and RL communities. To develop safe learning-based robot control, we need a simulation API that can (i) support model-based approaches, (ii) express safety constraints, and (iii) capture real-world non-idealities (such as uncertain physical properties and state estimation). Our ambition is that our software could bring closer, support, and speed up the work of control and RL researchers, allowing them to easily compare results. We strive for simple, modular, and reusable code which leverages two open-source tools popular with each of the two communities: PyBullet's physics engine [5] and CasADi's symbolic framework [6].

| Suite | Physics Engine | Rendering Engine | Robots | Tasks | Uncertain Conditions | Constraints | Disturbances | Gym API | Symb. API |
|---|---|---|---|---|---|---|---|---|---|
| safe-control-gym | Bullet | TinyRenderer, OpenGL | Cart-pole, Quadrotor | Stabilization, Traj. Track. | Inertial Param., Initial State | State, Input | State, Input, Dynamics | Yes | Yes |
| ai-safety-gridworlds [7] | n/a | Terminal | n/a | Grid Navigation | Initial State, Reward | State | Dynamics, Adversaries | No | No |
| safety-gym [8] | MuJoCo | OpenGL | Point, Car, Quadruped | Navig., Push Buttons, Box | Initial State | State | Adversaries | Yes | No |
| realworldrl-suite [9] | MuJoCo | OpenGL | Cart-pole to Humanoid | Stabilization, Locomotion | Inertial Param., Initial State | State | State, Input, Reward | No | No |
TABLE I: Feature comparison of safe-control-gym and other safety-oriented reinforcement learning environments

The contributions and features of safe-control-gym (Figure 1) can be summarized as follows:

  • we provide open-source simulation environments with a novel, augmented Gym API—with symbolic dynamics, trajectory generation, and quadratic cost—designed to seamlessly interface with both RL and control approaches;

  • safe-control-gym allows specifying constraints and the randomization of a robot’s initial state and inertial properties through a portable configuration system—this is crucial to simplify the development and comparison of safe learning-based control approaches;

  • finally, our codebase includes open-source implementations of several baselines from traditional control, RL, and learning-based control (that we use to demonstrate how safe-control-gym supports insightful quantitative comparisons across fields).

II Related Work

Simulation environments such as OpenAI’s gym [10] and DeepMind’s dm_control have been proposed as a way to standardize the development of RL algorithms. However, these often comprise toy or highly abstracted problems that do not necessarily support meaningful comparisons with traditional control approaches. Furthermore, recent work [11] has highlighted that, even using these tools, RL research is often difficult to reproduce as it might hinge on careful hyper-parameterizations or random seeds.

Faster and more accurate physics-based simulators—such as Google's Brax [2] and Nvidia's Isaac Gym [1]—are becoming increasingly more popular in robotics research [12]. While MuJoCo has been the dominant force behind many of the physics-based RL environments, it is not an open-source project. In this work, we leverage the Python bindings of the open-source C++ Bullet Physics [5] engine instead—which currently powers several re-implementations of the original MuJoCo tasks as well as additional robotic simulations, including quadrotors [13] and quadrupeds.

The aspect of safety has been touched upon by previous RL environment suites although, we believe, in ways not entirely satisfactory for the development of safe robot control. DeepMind's ai-safety-gridworlds [7] is a set of RL environments meant to assess the safety properties (including distributional shift, robustness to adversaries, and safe exploration) of intelligent agents. However, it is not specific to robotics, as these environments are purely grid worlds. OpenAI's safety-gym [8] and Google's realworldrl_suite [9] both augment typical RL environments with constraint evaluation. They include a handful of—albeit simplified—robotic platforms such as 2-wheeled robots and quadrupeds. Similarly to our work, realworldrl_suite [9] also includes perturbations of actions, observations, and physical quantities. However, unlike our work, [8, 9] leverage MuJoCo and lack support for a symbolic framework to express a priori knowledge of a system's dynamics or its constraints.

While safe-control-gym also includes a Gym-style quadrotor environment, it is worth clarifying that this is especially intended for safe, low-level control rather than vision-based applications—like AirSim [14] or Flightmare [15]—or multi-agent coordination—like gym-pybullet-drones [13].

Our work advances the state-of-the-art (summarized in Table I) by providing (i), for the first time, symbolic models of the dynamics, cost, and constraints of an RL environment (to support traditional control and model-based approaches); (ii) customizable, portable, and reusable constraints and physics disturbances (to facilitate comparisons and enhance repeatability); (iii) traditional control and learning-based control baselines (beyond just RL baselines).

III Bridging Reinforcement Learning and Learning-based Control Research

Fig. 2: Block diagram of safe-control-gym’s Python module architecture, highlighting the backward compatibility with OpenAI’s Gym API (teal components) and the extended API (blue components) we propose for the development of safe learning-based control and reinforcement learning.

As pointed out in [16, 4], despite the undeniable similarities in their setup, there still exist terminology gaps and disconnects in how optimal control and reinforcement learning research address safe robot control. In [4], as we reviewed the last half-decade of research in safe robot control, we observed significant differences in the use and reliance on prior models and assumptions. We also found a distinct lack of open-source simulations and control implementations—which are essential for repeatability and comparisons across fields and methodologies. With this work, we intend to make it easier for both RL and control researchers to (i) publish their results based on open-source simulations, (ii) easily compare against both RL and traditional control baselines, and (iii) quantify safety against shared sets of constraints or dynamics disturbances.

IV Environments

Fig. 3: Schematics, state, and input vectors of the cart-pole, 1D, and 2D quadrotor environments in safe-control-gym.
Our open-source suite safe-control-gym comprises 3 dynamical systems based on 2 platforms (cart-pole, 1D, and 2D quadrotors) and 2 control tasks (stabilization and trajectory tracking). It can be downloaded and installed as:

$ git clone -b submission <repository URL>
$ cd safe-control-gym/
$ pip3 install -e .

As advised in [17], "benchmark problems should be complex enough to highlight issues in controller design […] but simple enough to provide easily understood comparisons." We include the cart-pole as a dynamic system that has been widely adopted to showcase traditional control as well as RL since the mid-80s [18]. All three systems in safe-control-gym are unstable. The 1D quadrotor is linear, the 2D one is nonlinear, and the cart-pole is non-minimum phase. The 1D quadrotor is a simpler system that can also be used for didactic purposes.

IV-A Cart-pole System

A description of the cart-pole system is given in Figure 3: a cart with mass $m_c$ connects via a prismatic joint to a 1D track; a pole, of mass $m_p$ and length $l$, is hinged to the cart. The state vector for the cart-pole is $\mathbf{x} = [x, \dot{x}, \theta, \dot{\theta}]^T$, where $x$ is the horizontal position of the cart, $\dot{x}$ is the velocity of the cart, $\theta$ is the angle of the pole with respect to vertical, and $\dot{\theta}$ is the angular velocity of the pole. The input to the system is a force $F$, applied to the center of mass (COM) of the cart. In the frictionless case, the equations of motion for the cart-pole system are given in [19] as:

$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left( \frac{-F - m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p} \right)}{l \left( \frac{4}{3} - \frac{m_p \cos^2\theta}{m_c + m_p} \right)}, \qquad \ddot{x} = \frac{F + m_p l \left( \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right)}{m_c + m_p} \quad (1)$$

where $g$ is the acceleration due to gravity.
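As a minimal sketch (not part of the suite), the frictionless dynamics above can be stepped with forward Euler integration; the parameter values for $m_c$, $m_p$, and $l$ below are illustrative assumptions, not safe-control-gym's defaults:

```python
import math

def cartpole_derivatives(state, F, m_c=1.0, m_p=0.1, l=0.5, g=9.8):
    """Frictionless cart-pole accelerations, following equations (1)."""
    x, x_dot, theta, theta_dot = state
    total_mass = m_c + m_p
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Pole angular acceleration.
    temp = (-F - m_p * l * theta_dot**2 * sin_t) / total_mass
    theta_ddot = (g * sin_t + cos_t * temp) / (
        l * (4.0 / 3.0 - m_p * cos_t**2 / total_mass))
    # Cart acceleration (depends on the angular acceleration above).
    x_ddot = (F + m_p * l * (theta_dot**2 * sin_t - theta_ddot * cos_t)) / total_mass
    return x_dot, x_ddot, theta_dot, theta_ddot

def euler_step(state, F, dt=0.01):
    """One forward Euler integration step of the cart-pole dynamics."""
    d = cartpole_derivatives(state, F)
    return [s + dt * ds for s, ds in zip(state, d)]
```

Starting slightly off the upright equilibrium with no input, the pole angle grows, consistent with the system being unstable.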

IV-B 1D and 2D Quadrotor Systems

The second and third robotic systems in safe-control-gym are the 1D and the 2D quadrotor. These correspond to the cases in which the movement of a quadrotor is constrained to 1D motion in the $z$-direction and 2D motion in the $x$-$z$ plane, respectively. For a physical quadrotor, these motions can be achieved by setting the four motor thrusts to balance out the forces and torques along the redundant dimensions (i.e., all thrusts identical, for the 1D case, or pairwise identical with respect to the $x$-$z$ plane symmetry, for the 2D case). Schematics of the 1D and 2D quadrotor environments are given in Figure 3.

In the 1D quadrotor case, the state of the system is $\mathbf{x} = [z, \dot{z}]^T$, where $z$ and $\dot{z}$ are the vertical position and velocity of the COM of the quadrotor. The input to the system is the overall thrust $T$ generated by the motors of the quadrotor. The equation of motion for the 1D quadrotor system is

$$\ddot{z} = \frac{T}{m} - g \quad (2)$$

where $m$ is the mass of the quadrotor and $g$ is the acceleration due to gravity.

In the 2D quadrotor case, the state of the system is $\mathbf{x} = [x, \dot{x}, z, \dot{z}, \theta, \dot{\theta}]^T$, where $x$, $z$ and $\dot{x}$, $\dot{z}$ are the translational position and velocity of the COM of the quadrotor in the $x$-$z$ plane, and $\theta$ and $\dot{\theta}$ are the pitch angle and the pitch angle rate, respectively. The inputs of the system are the thrusts $T_1$, $T_2$ generated by two pairs of motors (one on each side of the body's $x$-axis). The equations of motion for the 2D quadrotor system are as follows:

$$\ddot{x} = \frac{(T_1 + T_2) \sin\theta}{m}, \qquad \ddot{z} = \frac{(T_1 + T_2) \cos\theta}{m} - g, \qquad \ddot{\theta} = \frac{(T_2 - T_1)\, d}{I_{yy}} \quad (3)$$

where $m$ is the mass of the quadrotor, $g$ is the acceleration due to gravity, $d = l/\sqrt{2}$ is the effective moment arm (with $l$ being the arm length of the quadrotor, i.e., the distance from each motor pair to the COM), and $I_{yy}$ is the moment of inertia about the $y$-axis.
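The 2D quadrotor model (3) lends itself to a similarly compact sketch; the Crazyflie-like parameter values below are assumptions for illustration only, not the suite's configured defaults:

```python
import math

def quad2d_derivatives(state, T1, T2, m=0.027, l=0.0397, Iyy=1.4e-5, g=9.8):
    """2D quadrotor accelerations, following equations (3)."""
    x, x_dot, z, z_dot, theta, theta_dot = state
    d = l / math.sqrt(2.0)          # effective moment arm
    thrust = T1 + T2
    x_ddot = thrust * math.sin(theta) / m
    z_ddot = thrust * math.cos(theta) / m - g
    theta_ddot = (T2 - T1) * d / Iyy
    return x_dot, x_ddot, z_dot, z_ddot, theta_dot, theta_ddot
```

At hover, with $\theta = 0$ and $T_1 = T_2 = mg/2$, all accelerations vanish.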

IV-C Stabilization and Trajectory-tracking Tasks

All three systems in Sections IV-A and IV-B can be assigned one of two control tasks: (i) stabilization and (ii) trajectory tracking. In RL, an agent/controller's performance is expressed by the total collected reward $R = \sum_i r_i$. The traditional reward function for cart-pole stabilization [18, 10] is simply a positive instantaneous reward for each time step in which the pole is upright (as episodes are terminated when $|\theta|$ exceeds a threshold $\theta_{max}$):

$$r_i \equiv 1 \quad (4)$$
For control-based approaches, safe-control-gym allows replacing the RL reward with the quadratic cost:

$$J = \sum_{i} \left( \mathbf{x}_i - \mathbf{x}^{goal} \right)^T Q \left( \mathbf{x}_i - \mathbf{x}^{goal} \right) + \left( \mathbf{u}_i - \mathbf{u}^{goal} \right)^T R \left( \mathbf{u}_i - \mathbf{u}^{goal} \right) \quad (5)$$

where $\mathbf{x}^{goal}$, $\mathbf{u}^{goal}$ is an equilibrium pair to which we want to stabilize the system and $Q$, $R$ are parameters of the cost function. The negated quadratic cost can also be used as the RL reward for the quadrotor stabilization task.
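In code, the cost (5) is a straightforward sum over a state/input trajectory. This standalone sketch (ours, not the suite's implementation) assumes NumPy arrays for states, inputs, and weight matrices:

```python
import numpy as np

def quadratic_cost(xs, us, x_goal, u_goal, Q, R):
    """Sum of quadratic state and input penalties over a trajectory, as in (5)."""
    J = 0.0
    for x, u in zip(xs, us):
        dx, du = x - x_goal, u - u_goal
        J += dx @ Q @ dx + du @ R @ du
    return J
```

For example, a single step with state error $[1, 0]^T$, input error $[2]$, and identity weights gives $J = 1 + 4 = 5$.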

For trajectory tracking, safe-control-gym includes a trajectory generation module capable of generating circular, sinusoidal, lemniscate, or square trajectories for episodes with an arbitrary length of $L$ control steps. The module returns references $\mathbf{x}^{ref}_i$, $\mathbf{u}^{ref}_i$ for $i = 0, \ldots, L-1$. To run a quadrotor example, tracking different trajectories, try:

$ cd safe-control-gym/examples/
$ python3 <tracking example script> --overrides tracking.yaml

The quadratic cost for trajectory tracking is computed as in (5), replacing $\mathbf{x}^{goal}$, $\mathbf{u}^{goal}$ with $\mathbf{x}^{ref}_i$, $\mathbf{u}^{ref}_i$. Again, the negated cost also serves as the RL reward function. The RL state is further augmented with the target position in the trajectory to define a valid Markov decision process.
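As an illustration of what such a reference generator returns, here is a hypothetical circular trajectory in the $x$-$z$ plane (the function name, signature, and defaults are ours, not the suite's API):

```python
import numpy as np

def circle_reference(L, radius=0.5, center=(0.0, 1.0), ctrl_freq=50):
    """L reference positions, evenly spaced along a circle in the x-z plane."""
    t = np.arange(L) / ctrl_freq
    angles = 2.0 * np.pi * t / t[-1] if L > 1 else np.zeros(1)
    x_ref = center[0] + radius * np.cos(angles)
    z_ref = center[1] + radius * np.sin(angles)
    return np.stack([x_ref, z_ref], axis=1)  # shape (L, 2)
```

Every returned point lies at distance `radius` from the circle's center, which makes the reference easy to sanity-check.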

IV-D safe-control-gym Extended API

To provide native support to open-source RL libraries, safe-control-gym adopts OpenAI Gym’s interface. However, to the best of our knowledge, we are the first to extend this API with the ability to provide a learning agent/controller with a priori knowledge of the dynamical system. This is of fundamental importance to also support the development of and comparison with learning-based control approaches (that typically leverage insights about the physics of a robotic system). We believe this prior information should not be discarded but rather integrated into the learning process. An overview of safe-control-gym’s features—and how to interact with a learning-based controller—is presented in Figure 2. Our benchmark suite can be used, for example, to answer the question of how much data-efficiency—which is crucial in robot learning—is forfeited by model-free RL approaches that do not exploit prior knowledge (see Section VI). To run one of safe-control-gym’s environments—in headless mode—with printouts from the original Gym API (in blue) and our new API (in red), try:

$ cd safe-control-gym/examples/
$ python3 <verbose API example script> --system cartpole \
> --overrides verbose_api.yaml
| Environment | GUI | Control Freq. | PyBullet Freq. | Constr. & Disturb. | Speed-up |
|---|---|---|---|---|---|
| cartpole | Yes | … Hz | … Hz | No | … |
| cartpole | No | … Hz | … Hz | No | … |
| cartpole | No | … Hz | … Hz | Yes | … |
| quadrotor | Yes | … Hz | … Hz | No | … |
| quadrotor | No | … Hz | … Hz | No | … |
| quadrotor | No | … Hz | … Hz | Yes | … |

Running the environments with default constraints and disturbances; 2.30GHz Quad-Core i7-1068NG7; 32GB 3733MHz LPDDR4X
TABLE II: Simulation speed-ups for varying configurations

IV-D1 Symbolic Models

We use CasADi [6], an open-source symbolic framework for nonlinear optimization and algorithmic differentiation, to include symbolic models of (i) our systems' a priori dynamics—i.e., those in Sections IV-A and IV-B, not accounting for the disturbances in Section IV-D3—as well as (ii) the quadratic cost function from Section IV-C and (iii) optional constraints (see Section IV-D2). As shown by the printouts of the snippet above, these models, together with the initial state of the system $\mathbf{x}_0$ and the task references $\mathbf{x}^{goal}$ or $\mathbf{x}^{ref}_i$, $\mathbf{u}^{ref}_i$, are exposed by our API in a reset_info dictionary returned by each reset of an environment.

IV-D2 Constraints

The ability to specify, evaluate, and enforce one or more constraints on the state $\mathbf{x}$ and input $\mathbf{u}$:

$$c(\mathbf{x}, \mathbf{u}) \leq 0$$
is essential for safe robot control. While previous RL environments including state constraints exist [8, 9], our implementation is the first to also provide (i) their symbolic representation and (ii) the ability to create bespoke ones while creating an environment (see Section IV-D4). Our current implementation includes default constraints and supports user-specified ones in multiple forms (linear, bounded, quadratic) on either the system’s state, input, or both. Constraint evaluations are included in the info dictionary returned at each environment’s step.
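A bounded state constraint of the kind configured in Section IV-D4 can be expressed in the canonical form $c(\cdot) \leq 0$ by stacking lower- and upper-bound residuals. This is an illustrative sketch, not the suite's implementation:

```python
import numpy as np

def bounded_constraint(lower, upper):
    """Return c(x) such that c(x) <= 0 elementwise iff lower <= x <= upper."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    def c(x):
        x = np.asarray(x)
        # Stack residuals: both lower - x and x - upper must be non-positive.
        return np.concatenate([lower - x, x - upper])
    return c
```

A state inside the box yields all non-positive entries; any violated bound shows up as a positive entry.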

IV-D3 Disturbances

In developing safe control approaches, we are often confronted with the fact that models like the ones in Sections IV-A and IV-B are not a complete or fully truthful representation of the system under test. safe-control-gym provides several ways to implement non-idealities that mimic real-life robots, including:

  • the randomization (from a given probability distribution) of the initial state of the system;

  • the randomization (from given probability distributions) of the inertial parameters—i.e., $m_c$, $m_p$, $l$ for the cart-pole and $m$, $I_{yy}$ for the quadrotor;

  • disturbances (in the form of white noise, step, or impulse) applied to the action input sent from the controller to the robot;

  • disturbances (in the form of white noise, step, or impulse) applied to the observations of the state returned by an environment to the controller;

  • dynamics disturbances, including additional forces applied to a robot using PyBullet APIs; these can also be set deterministically from outside the environment, e.g., to implement adversarial training as in [20].
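A white-noise action disturbance of the kind listed above can be sketched as a small wrapper; the class name and interface are hypothetical, not the suite's API:

```python
import numpy as np

class WhiteNoiseDisturbance:
    """Additive zero-mean Gaussian noise applied to an action before stepping."""
    def __init__(self, std=0.05, seed=0):
        self.std = std
        self.rng = np.random.default_rng(seed)
    def apply(self, action):
        action = np.asarray(action, dtype=float)
        return action + self.rng.normal(0.0, self.std, size=action.shape)
```

Averaged over many samples, the disturbed actions have mean close to the nominal action and standard deviation close to the configured `std`.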

IV-D4 Configuration System

To facilitate the reproduction and portability of experiments with identical environment setups, safe-control-gym includes a YAML configuration system that supports all of the features discussed in Sections IV-C and IV-D:

    cost: quadratic  # The reward function.
    ctrl_freq: 50    # The control input frequency.
    pyb_freq: 1000   # PyBullet’s stepping freq.
    constraints:     # Constraints on the system.
        - constraint_form: bounded_constraint
          lower_bounds: [-1, -0.2, -0.3, -0.05]
          upper_bounds: [1, 0.2, 0.3, 0.05]
          constrained_variable: STATE
    disturbances:    # Disturbances on the system.
        - disturbance_func: white_noise
          std: 0.05

IV-E Computational Performance

Because deep learning methods can be especially data-hungry—and the ability to collect experimental datasets or generate simulated ones is one of the bottlenecks of learning-based robotics—we assessed the computational performance of safe-control-gym on a system with a 2.30GHz Quad-Core i7-1068NG7 CPU and 32GB of 3733MHz LPDDR4X memory, running Python 3.7 under macOS 11. Table II summarizes the obtained simulation speed-ups (with respect to the wall-clock) for the cart-pole and 2D quadrotor environments, in headless mode or using the GUI, with or without constraint evaluation, and for different choices of control and physics integration frequencies. In headless mode, a single instance of safe-control-gym allows collecting data 10 to 20 times faster than in real life, with accurate physics stepped by PyBullet at 1000Hz.

V Control Algorithms

The codebase of safe-control-gym also comprises an array of implementations of control approaches, from traditional control to safety-certified control, by way of learning-based control and safe reinforcement learning.

V-A Control and Safe Control Baselines

As baselines, our benchmark suite includes standard state-feedback control approaches such as the linear quadratic regulator (LQR) and iterative LQR (iLQR) [21]. The LQR controller deals with systems having linear dynamics (as in (2)) and quadratic cost (as in (5)). For nonlinear systems (e.g., the 2D quadrotor in (3) and the cart-pole in (1)), the LQR controller uses local linear approximations of the nonlinear dynamics. The iLQR controller is similar to the LQR but iteratively improves performance by finding better local approximations of the cost function (5) and system dynamics, using the state and input trajectories from the previous iteration. All the environments in safe-control-gym expose the symbolic model of their a priori dynamics, facilitating the computation of its Jacobians and the Jacobians and Hessians of the cost function. While we include LQR and iLQR to showcase the model-based aspect of our benchmark, the symbolic expressions of the first-order and second-order terms included in each environment can be equivalently leveraged by other model-based control approaches.
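For intuition, the LQR gain can be obtained by iterating the discrete-time Riccati equation to a fixed point. This NumPy sketch (ours, not the suite's implementation) stabilizes a double-integrator model similar to the linearized 1D quadrotor:

```python
import numpy as np

def dlqr_gain(A, B, Q, R, iters=500):
    """LQR gain K via fixed-point iteration of the discrete-time Riccati equation."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K
```

Simulating the closed loop `x <- (A - B K) x` from a nonzero initial state drives the state to the origin.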

We also include two predictive control baselines: Linear Model Predictive Control (LMPC) and Nonlinear Model Predictive Control (NMPC) [22]. At every control step, Model Predictive Control (MPC) solves a constrained optimization problem to find a control input sequence, over a finite horizon, that minimizes the cost of the system’s predicted dynamics—possibly subject to input and state constraints. Then, the first optimal control input from the sequence is applied. While NMPC uses the nonlinear system model, LMPC uses the linearized approximation to predict the evolution of the system, sacrificing prediction accuracy for computational efficiency. In our codebase, CasADi’s opti framework is used to formulate the optimization problem. As explained in Section IV-D, safe-control-gym provides all the system’s components required by MPC (a priori dynamics, constraints, cost function) as CasADi models.
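The receding-horizon principle (optimize over a finite horizon, apply only the first input) can be illustrated in the unconstrained linear-quadratic case, where the finite-horizon problem reduces to a single linear solve; the actual LMPC/NMPC baselines additionally handle constraints via CasADi's opti. This sketch is ours, under those simplifying assumptions:

```python
import numpy as np

def mpc_first_input(A, B, Q, R, x0, horizon=10):
    """Solve an unconstrained finite-horizon LQ problem in batch form and
    return only the first optimal input (receding-horizon MPC)."""
    n, m = B.shape
    # Stacked prediction: X = F x0 + G U, with X = [x_1; ...; x_N].
    F = np.vstack([np.linalg.matrix_power(A, k) for k in range(1, horizon + 1)])
    G = np.zeros((n * horizon, m * horizon))
    for k in range(1, horizon + 1):
        for j in range(k):
            G[n*(k-1):n*k, m*j:m*(j+1)] = np.linalg.matrix_power(A, k-1-j) @ B
    Qbar = np.kron(np.eye(horizon), Q)
    Rbar = np.kron(np.eye(horizon), R)
    # Minimize (F x0 + G U)^T Qbar (F x0 + G U) + U^T Rbar U over U.
    H = G.T @ Qbar @ G + Rbar
    f = G.T @ Qbar @ F @ x0
    U = np.linalg.solve(H, -f)
    return U[:m]
```

Applied in closed loop to a double integrator, re-solving at every step, the controller drives the state toward the origin.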

V-B Reinforcement Learning Baselines

As safe-control-gym extends the original Gym API, any compatible RL algorithm can directly be applied to our environments. In our codebase, we include two of the most well-known RL baselines: Proximal Policy Optimization (PPO) [23] and Soft Actor-Critic (SAC) [24]. These are model-free approaches that map sensor/state measurements to control inputs (without leveraging a dynamics model) using neural network (NN)-based policies. Both PPO and SAC have been shown to work on a wide range of simulated robotics tasks, some of which involve complex dynamics. We adapt their implementations from stable-baselines3 [25] and OpenAI's Spinning Up, with a few modifications to also support our suite's configuration system. PPO and SAC are not natively safety-aware approaches and do not guarantee constraint satisfaction nor robustness (beyond the generalization properties of NNs).

V-C Safe Learning-based Control

Safe learning-based control approaches use past data to improve the estimate of a system's true dynamics—and thus a robot's performance—while providing guarantees on stability and/or constraint satisfaction. One of these approaches, included in safe-control-gym, is GP-MPC [26]. This method models the uncertain dynamics using a Gaussian process (GP), which it uses to better predict the future evolution of the system as well as to tighten constraints, based on the confidence of the dynamics along the prediction horizon. GP-MPC has been demonstrated for the control of ground-based mobile robots [26]. Our implementation leverages the LMPC controller, based on the environments' symbolic a priori model, and uses gpytorch for the GP modelling and optimization. GP-MPC can accommodate both environment and controller-specific constraints.

V-D Safe and Robust Reinforcement Learning

Building upon the RL baselines, we implemented three safe RL approaches that address the problems of constraint satisfaction and robust generalization. The safety layer-based approach in [27] pre-trains NN models to approximate linearized state constraints. These learned constraints are then used to filter potentially unsafe inputs from an RL controller via least-squares projection. We add such a safety layer to PPO and apply it to our benchmark tasks with simple bound constraints. Robust RL aims to learn policies that generalize across systems or tasks. We adapt two methods based on adversarial learning: RARL [20] and RAP [28]. These model dynamics disturbances as a learning adversary and train the policy against increasingly stronger ones. The resulting controllers are shown, in simulation [20, 28], to be robust against parameter mismatch. These methods can be directly trained in safe-control-gym, thanks to its dynamics disturbances API (see Section IV-D3).
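For a single linearized constraint $g^T a + c \leq 0$, the least-squares projection used by safety-layer approaches has a closed form; the following is a sketch under that single-constraint assumption, not the implementation of [27]:

```python
import numpy as np

def project_action(action, g, c):
    """Least-squares projection of an action onto the half-space g.a + c <= 0."""
    action = np.asarray(action, dtype=float)
    violation = g @ action + c
    if violation <= 0:
        return action                              # already safe: unchanged
    return action - (violation / (g @ g)) * g      # project onto the boundary
```

A violating action is moved to the nearest point on the constraint boundary, while safe actions pass through untouched.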

Fig. 4: Control performance (absolute error from reference , ) on the cart-pole stabilization task for different controllers and RL agents.

V-E Safety Certification of Learned Controllers

Learned controllers lacking formal guarantees can be rendered safe by safety filters. These filters minimally modify unsafe control inputs so that the applied control input keeps the system's state within a safe set. Model predictive safety certification (MPSC) uses a finite-horizon constrained optimization problem with a discrete-time predictive model to prevent a learning-based controller from violating constraints [29]. In [4], we presented an implementation of MPSC for PPO, simultaneously leveraging safe-control-gym's CasADi a priori dynamics and constraints and its Gym RL interface.

Control barrier functions (CBF) are safety filters for continuous-time nonlinear control-affine systems using quadratic programming (QP) with a constraint on the CBF’s time derivative with respect to the system dynamics [30]. In the case of model errors, the resulting errors in the CBF’s time derivative can be learned by an NN [31]. Learning-based CBF filters have been applied to safely control a segway [31] and a quadrotor [32]. Again, our CBF implementation relies on the a priori model and constraints exposed by the API of safe-control-gym. The CBF’s time derivative is also efficiently determined using CasADi. Constraints can be handled as long as the constraint set contains the CBF’s superlevel set.
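For a scalar toy system, the CBF-QP reduces to clipping the nominal input against the barrier condition. This sketch uses $\dot{x} = u$ with $h(x) = 1 - x^2$ (our example, not one of the suite's systems):

```python
def cbf_filter(u_nom, x, alpha=1.0):
    """CBF-QP safety filter for x_dot = u with barrier h(x) = 1 - x^2:
    min (u - u_nom)^2  s.t.  (dh/dx) * u >= -alpha * h(x).
    With a scalar input, the QP solution is a one-sided clip."""
    h = 1.0 - x**2
    dh_dx = -2.0 * x
    if dh_dx > 0:    # barrier condition is a lower bound on u
        return max(u_nom, -alpha * h / dh_dx)
    if dh_dx < 0:    # barrier condition is an upper bound on u
        return min(u_nom, -alpha * h / dh_dx)
    return u_nom     # at x = 0, the condition 0 >= -alpha*h always holds
```

Near the boundary of the safe set, a nominal input pushing outward is capped, while inputs pointing back into the set are left unmodified.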

VI Results

To demonstrate how our work supports the development and test of all the families of control algorithms discussed in Section V, we present their control performance (Figures 4 and 5), learning efficiency (Figure 6), and constraint satisfaction (Figure 7) across identical safe-control-gym task environments. Figure 8 also demonstrates how to use our suite to test a controller’s robustness to disturbances and parametric uncertainty.

As we did not focus on tuning each approach's parameters, the goal here is not to claim the superiority of one approach over another but rather to show how safe-control-gym allows plotting RL and control results on a common set of axes.

Fig. 5: Control performance (absolute error from reference , ) on the 2D quadrotor tracking task for different controllers and RL agents.

VI-A Control Performance

In Figures 4 and 5, we show that the LQR (with true parameters as well as parameters overestimated by 50%), GP-MPC, PPO, and SAC are able to stabilize the cart-pole and track the quadrotor trajectory reference. For the stabilization task, GP-MPC closely matches the closed-loop trajectory of the LQR with true parameters, even though its a priori model was the same one given to the LQR with overestimated parameters. This shows how GP-MPC can overcome imperfect initial knowledge through learning. Both PPO and SAC yield substantially different closed-loop trajectories when compared to LQR and GP-MPC. This is likely a result of the difference in reward (4) and cost functions (5), for the stabilization task, and of the RL observations, for trajectory tracking. Indeed, the choices made in the expression of the objective add a layer of complexity to the equitable comparison of RL and learning-based control. Tracking the sinusoidal trajectories (Figure 5) introduces low-frequency oscillations in $x$ and $z$ for PPO and SAC. GP-MPC's planning horizon, on the other hand, effectively avoids these.

Fig. 6: Comparison of the learning regimes (on a logarithmic $x$-axis) of learning-based control (GP-MPC) and RL agents (PPO, SAC).

VI-B Learning Performance and Data Efficiency

Figure 6 shows how much data GP-MPC, PPO, and SAC require to achieve comparable performance on an identical evaluation cost. This plot showcases the type of interdisciplinary comparisons enabled by safe-control-gym. In both plots, the untrained GP-MPC displays a performance that the RL approaches only match after collecting substantially more seconds of simulated data, and GP-MPC converges to its optimal performance with roughly one tenth of the data. This highlights how learning-based control approaches can be orders of magnitude more data-efficient than model-free RL. However, this is largely the result of knowing a reasonable a priori model (whether accurate or not). The evaluation costs of PPO and SAC exhibit large oscillations and learning instability, which are not uncommon in deep RL [11]. Once converged, SAC and PPO reach performance comparable to GP-MPC in the stabilization task. PPO also matches GP-MPC on the tracking task, albeit less consistently.

Fig. 7: In the top plot, fraction of time spent incurring a constraint violation by learning-based control, vanilla RL, safety-augmented RL, and safety-certified control; in the bottom plot, trajectories for an “impossible” tracking task (with constraints narrower than the reference) for traditional control, learning-based control, and safety-augmented RL.

VI-C Safety: Constraint Satisfaction

In Figure 7, we investigate the impact of learning and training data on the constraint violations of a learning-based controller or safe RL agent. The top plot summarizes the data efficiency of these approaches on the cart-pole stabilization task. Again, leveraging an a priori model, GP-MPC and the learning-based CBF require much fewer training examples to minimize the number of constraint violations than PPO with a safety layer. After training, GP-MPC, learning-based CBF, and safety layer PPO all achieve similar constraint satisfaction performance. Vanilla PPO also reduces the number of constraint violations but cannot match the performance of the GP-MPC and the learning-based CBF.

The bottom plot of Figure 7 shows reduced constraint violations for GP-MPC and PPO with a safety layer for the 2D quadrotor tracking task. Compared to a linear MPC with overestimated parameters, the GP-MPC meets the constraints, finding a compromise between performance and constraint satisfaction. PPO with a safety layer, on the other hand, neither tracks the desired trajectory nor is it able to fully guarantee constraint satisfaction.

Fig. 8: Robustness of cart-pole stabilization policies learned by traditional and learning-based control as well as vanilla and robust RL with domain randomization against variations in the length of the pole and the white noise disturbance applied to the action input.

VI-D Safety: Robustness

Figure 8 shows how robust the controllers and RL agents are with respect to parametric uncertainty (in the pole length) and white noise (on the input) for cart-pole stabilization. PPO with domain randomization is trained with randomized pole lengths and improves on the robustness of baseline PPO. RAP is trained against adversarial input disturbances and, as expected, shows robust performance with respect to input noise but not to parameter mismatch. The model-based approaches, LQR and GP-MPC, appear less affected by parameter uncertainty than model-free RL but are equally or more hindered by input noise.

VII Conclusions and Future Work

In this letter, we introduced safe-control-gym, a suite of simulation and evaluation environments for safe learning-based control. We were motivated by the lack of an easy-to-use software benchmark exposing all the features required to support the development of approaches from both the RL and control theory communities. In safe-control-gym, we combine (i) a physics engine-based simulation with (ii) the description of the available prior knowledge and safety constraints using a symbolic framework. By doing so, we allow the development and testing of a wide range of approaches, from model-free RL to learning-based MPC. We believe that safe-control-gym will make it easier for researchers from the RL and control communities to compare their progress, especially in the quantification of safety and robustness. Our next steps will include extending safe-control-gym to more robotic platforms and tasks, as well as implementing additional safe learning-based control approaches.

VIII Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs Program, the CIFAR AI Chair, and Mitacs’s Elevate Fellowship program.