
Reactive and Safe Road User Simulations using Neural Barrier Certificates

by   Yue Meng, et al.

Reactive and safe agent models are important for today's traffic simulator designs and safe planning applications. In this work, we propose a reactive agent model that ensures safety without compromising the agents' original purposes, by learning only high-level decisions from expert data and a low-level decentralized controller guided by jointly learned decentralized barrier certificates. Empirical results show that our learned road user simulation models achieve a significant improvement in safety compared to state-of-the-art imitation learning and pure control-based methods, while remaining similar to human agents, with smaller errors relative to the expert data. Moreover, our learned reactive agents are shown to generalize better to unseen traffic conditions and to react better to other road users, and can therefore help understand challenging planning problems pragmatically.





I Introduction

Understanding how different road participants (e.g. human drivers, pedestrians, cyclists) act in different traffic scenes plays a crucial role in developing modern self-driving techniques: it can give insights into how humans behave in each scenario, how to learn from human reactions, and how to team autonomous and human agents, and it can also provide a more realistic simulation environment for autonomous car developers. Most non-ego road participants in today’s self-driving simulators (e.g. CARLA [15], TORCS [42], SUMO [24]) can only “replay” pre-defined trajectories or be guided by carefully designed cost functions and thresholds [6]. Such non-reactive or handcrafted agents are unrealistic, and therefore create a big gap between simulators and real-world scenarios. On the other hand, imitation learning (IL) techniques [4, 36, 30, 7, 19] try to “mimic” humans from existing traffic datasets by learning actions or reward functions, instead of directly enforcing basic safety properties of the road users (e.g. avoiding collisions, staying within lanes and below speed limits). Therefore, the learned agents may exhibit unrealistic or unexplainable behaviors. See a more detailed discussion in the related work.

In this paper, we start from a different perspective and use a safety-driven learning-based control method to build reactive agent models for a variety of road users in different traffic scenarios. Control-theoretical tools exist to guide the synthesis of desired controllers using a certificate as the guidance. For example, control barrier functions (CBF) can be used to find controllers such that the closed-loop system satisfies safety properties defined by the corresponding barrier function [2, 1]. The biggest challenge of using CBF is the function design: handcrafting a CBF for systems with complex dynamics and safety requirements is nearly impossible. Inspired by recent advances in learning decentralized CBF for the safety of homogeneous multi-agent systems [34], we learn only high-level decisions (e.g. when to switch lanes) from data and use neural CBF to guide the (low-level) safe actions of each agent for a group of heterogeneous road users.

Ideally, for a specific type of road user, a CBF that works across all scenarios would give a strong safety certificate. However, in practice, we notice that such a CBF is extremely difficult to learn, as the expected observations (or sensing inputs) of the agent vary significantly across traffic scenarios. Therefore, we restrict our neural CBF to handle specific scenarios, where each of the traffic datasets provides agents’ observations in the corresponding scenario. This is analogous to having different control policies for different road and traffic conditions.

We use the same learning framework to build simulation models for different road users in various traffic scenarios, including dense highway (NGSIM [31]), normal highway (HighD [22]), roundabouts (RounD [23]), mixed pedestrian-cyclist (SDD [35]), and pedestrian-vehicle interaction at intersections (VCI-DUT [43]). We show that our safety-driven learning approach can significantly reduce unsafe behaviors compared to both state-of-the-art IL-based methods and traditional control methods. As a by-product, our learned road user simulations are more similar to expert behaviors (i.e. the trajectories from the datasets with the same starting positions and destinations) than IL-based methods, even though we do not directly optimize the similarity to the experts. Our learned simulation models generalize well: models trained using 20 agents give similarly safe results when used on up to 100 agents in the same scenario. Moreover, compared to traditional model-based methods like model predictive control (MPC), we achieve a 50X speedup when executing the model, as we do not need to solve online optimization problems. In addition, we show that our agents react better in challenging traffic scenes: our learned pedestrians will make way (as normal pedestrians would do) to unfreeze the vehicle at crowded intersections.

The major contributions of our paper are: 1. To our knowledge, we are the first to use decentralized neural CBF for heterogeneous multi-agent systems to build safe and reactive simulation models for road users from real-world traffic data. 2. Our CBF-based agents are proved to ensure safety, while being even more similar to the real-world behaviors (from the data) than leading IL-based methods, in a wide range of traffic scenarios. 3. Our learned agents can generalize to unseen traffic conditions in the same scenario, and can react to and coordinate with other road users in solving challenging planning problems. All of the above shows the great potential of our proposed approach for building safe and realistic traffic simulations, which will benefit autonomous-driving planning applications. This is our first step in using safety as the primary goal when building road user simulation models. We seek to combine this idea with IL as the next step to further improve the similarity to actual humans.

II Related Work

Imitation learning: IL is a popular method for mimicking an expert’s behaviors from demonstrations. Behavioral cloning [4] learns the state-action pairs from the expert data via supervised learning, and inverse reinforcement learning [30] seeks a cost function which prioritizes expert behaviors over other possible policies. Generative adversarial imitation learning (GAIL) [20] uses a discriminator to distinguish the expert’s policy from non-expert behaviors generated by the policy network, and has been extended to share parameters [8] and augmented with rewards [7] to discourage undesirable behaviors when modeling multi-agent driving on highways. However, those methods do not directly enforce safety, which is a critical factor in traffic simulations. Model-free safe RL methods [28, 5, 13, 27] can also be extended to multi-agent systems, but they draw less information from the expert trajectories. Moreover, finding appropriate RL methods and reward functions is extremely challenging, and may lead to reward hacking, where agents learn undesired behaviors that exploit the reward function [14].

Safe control via barrier certificates: Many learning methods are combined with barrier certificates [32] and CBF [41, 11, 2] to ensure safety. ShieldNN [16] designs a NN controller guided by a given barrier function to rectify unsafe controllers’ behaviors. [40] uses Sum-of-Squares to learn permissive barrier certificates and uses a QP-based controller to ensure the system’s safety. [37] uses an SVM to model a CBF for safe control. [29] constructs the CBF online via objects’ signed-distance functions to perform safe navigation. Frameworks for multi-agent CBF safe control are proposed in [10] and [39], and recently a decentralized multi-agent system using a backup strategy was proposed by [12]. The closest work to ours is the decentralized CBF for multi-agent control [34]. But in [34], all agents are of the same type and run the same simple reach-avoid controller, in an extremely simple environmental setup without real-world considerations. Instead, our learned simulation model needs to handle real-world traffic scenarios and other users on the road, including other types of agents and agents that are non-reactive and only follow pre-defined trajectories. Moreover, our controllers are constructed in a hierarchical way to handle complex tasks.

III Preliminaries and Problem Statement

We model the real-world traffic scenes as a heterogeneous multi-agent system in a specific scenario, where road participants (aka agents) choose their actions depending on their states and observations and follow some dynamics to move forward. A scenario defines what the road participants expect to observe. Formally,

Definition 1.

A heterogeneous multi-agent system in a scenario is defined as a tuple $\langle \{\mathcal{A}_i\}_{i=1}^{N}, \mathcal{S} \rangle$, where

  1. Each agent $\mathcal{A}_i$ is defined by its state space $\mathcal{X}_i$ (e.g. pose, velocity), input space $\mathcal{U}_i$ (e.g. acceleration and angular velocity), dynamic function $f_i$, and observation space $\mathcal{O}_i$. Moreover,

    • The semantics of agent dynamics are defined by trajectories, which describe the evolution of states over time. Given an input trajectory $u_i(t) \in \mathcal{U}_i$, the state trajectory $s_i(t) \in \mathcal{X}_i$ follows a differential equation: $\dot{s}_i(t) = f_i(s_i(t), u_i(t))$. (1)

    • At any time $t$, each agent can obtain a local observation $o_i(t) \in \mathcal{O}_i$, which contains sensing information of its surroundings such as close-by neighbors’ positions, lane markers, and static obstacles.

    • The agents are reactive, as the control policies are functions of the observations, which means that the agents react to environmental changes.

  2. A scenario is defined as a set of possible values for the (state, observation) pairs of all agents: $\mathcal{S} = \{\mathcal{S}_i\}_{i=1}^{N}$ with $\mathcal{S}_i \subseteq \mathcal{X}_i \times \mathcal{O}_i$, such that for each agent $i$, the pair $(s_i, o_i)$ has to be contained in $\mathcal{S}_i$.

We introduce the notion of scenarios because the possible values for the (state, observation) pairs vary significantly in different traffic scenes. For example, the possible observations of cars on the highway (which do not contain pedestrians or cyclists) are different from those of cars driving across intersections (which do not contain high-speed cars). As we will see later in the experiment section, each dataset can be seen as a collection of samples from a specific scenario.

Agents sharing the same dynamics and the same structure of observations are called the same type of agents. For example, cars driving on the highway can observe the leading and following cars in the same and neighboring lanes, and pedestrians at intersections can observe other pedestrians and cars that are not blocked by other agents. In this paper, we develop one control policy and one safety certificate for each type of agent. In what follows, we define the concept of safety for a multi-agent system in a given scenario, and how to enforce safety using decentralized barrier certificates.

The safety of the road participants means being collision-free, staying within the lane, etc. As we are working on a heterogeneous system, the measurements of safety differ among agents. Therefore, we use functions $c_i(s_i, o_i)$ of agent $i$'s state $s_i$ and observation $o_i$ to describe safety. That is, the $i$-th agent is safe if $c_i(s_i, o_i) \le 0$; otherwise the agent is dangerous (or unsafe).

The explicit form of $c_i$ can vary for different types of agents: for example, the safety of pedestrians in crowds might be tested by the Euclidean distance from their locations (encoded in $s_i$) to their neighbors (encoded in $o_i$), and $c_i$ can be constructed as a predefined safety threshold minus this Euclidean distance. In contrast, safety checking for vehicles at intersections or roundabouts might require much more complicated functions, such as detecting the overlap of the vehicles’ contours from the bird’s-eye view. Fixing a control policy, the multi-agent system is safe if all of the agents in the system are safe in the given scenario: $c_i(s_i(t), o_i(t)) \le 0$ for every agent $i$ and all $t \ge 0$. (2)


Control barrier functions (CBF) [2, 1] are a control theoretical tool to find a safe controller and enforce the states of dynamic systems to stay within the safe set (in a forward invariant sense). Recently decentralized CBF [34, 12] are proposed to give decentralized controllers that can guarantee the safety of multi-agent systems. In this paper, we follow this idea and propose decentralized CBF for our multi-agent system in traffic scenes as defined in Definition 1.

Proposition 1.

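Given a multi-agent system in a scenario $\mathcal{S}$ as in Definition 1, if a function $h_i(s_i, o_i)$ of agent $i$'s state $s_i$ and observation $o_i$ has a time derivative almost everywhere (denoted as $\dot{h}_i$) and satisfies the following CBF conditions wherever the time derivative exists:

$h_i(s_i, o_i) \ge 0 \ \ \forall (s_i, o_i) \in \mathcal{X}_{i,s}, \qquad h_i(s_i, o_i) < 0 \ \ \forall (s_i, o_i) \in \mathcal{X}_{i,d}, \qquad \dot{h}_i + \alpha(h_i) \ge 0 \ \ \forall (s_i, o_i) \in \{ h_i \ge 0 \}$, (3)

where $\mathcal{X}_{i,s}$ is the safe set for agent $i$, $\mathcal{X}_{i,d}$ is the unsafe set for agent $i$, and $\alpha$ is an extended class-$\mathcal{K}$ function, defined as a strictly increasing continuous function $\alpha: (-b, a) \to \mathbb{R}$ for some $a, b > 0$ with $\alpha(0) = 0$, then we call $h_i$ a valid CBF for the scenario $\mathcal{S}$, and the safety of this multi-agent system is guaranteed as long as the initial state and observation pairs are in the safe set.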


The proof for Proposition 1 follows a standard procedure of proving safety by CBFs. We first prove the agent’s safety guarantee, and then prove the guarantee for the system’s safety.

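1. If the initial state and observation pair $(s_i(0), o_i(0))$ is in the safe set $\mathcal{X}_{i,s}$, we prove that $(s_i(t), o_i(t))$ will never enter the dangerous set $\mathcal{X}_{i,d}$, which ensures the agent’s safety.

Write $h_i(t)$ for the value of the CBF $h_i$ at $(s_i(t), o_i(t))$. Since $(s_i(0), o_i(0)) \in \mathcal{X}_{i,s}$, we have $h_i(0) \ge 0$. Let $\dot{y} = -\alpha(y)$ and $y(0) = h_i(0)$. Because an extended class-$\mathcal{K}$ function is locally Lipschitz, solutions exist and are unique, and according to [21, Lemma 4.4], the solution is $y(t) = \sigma(y(0), t)$, where $\sigma$ is a class-$\mathcal{KL}$ function (a function $\sigma(r, t)$ is class-$\mathcal{KL}$ if it is class-$\mathcal{K}$ in its first argument and, for each fixed $r$, $\sigma(r, t)$ is strictly decreasing with respect to $t$ and $\lim_{t \to \infty} \sigma(r, t) = 0$). Now since $\dot{h}_i \ge -\alpha(h_i)$, we have $h_i(t) \ge \sigma(h_i(0), t)$ by [25, Theorem 1.10.2]. Thus $h_i(t) \ge 0$ for all $t \ge 0$. So we have $(s_i(t), o_i(t)) \notin \mathcal{X}_{i,d}$.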

2. Now if every agent in the system has the safety guarantee, from the definition in (2), we can show that the safety of this system is guaranteed under scenario . ∎
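For intuition, the comparison argument can be worked out in closed form for one special case. The block below assumes a linear class-$\mathcal{K}$ function $\alpha(h) = \kappa h$ with $\kappa > 0$ (an illustrative choice, not one stated in the paper) and writes $h_i(t)$ for the CBF value along agent $i$'s trajectory.

```latex
\dot{y}(t) = -\kappa\, y(t), \quad y(0) = h_i(0) \ge 0
\;\Longrightarrow\;
y(t) = h_i(0)\, e^{-\kappa t} \ge 0,
\qquad
\dot{h}_i(t) \ge -\kappa\, h_i(t)
\;\Longrightarrow\;
h_i(t) \ge h_i(0)\, e^{-\kappa t} \ge 0 \quad \forall t \ge 0 .
```

In this case the lower bound decays toward zero but never crosses it, so a trajectory that starts with a non-negative CBF value cannot reach the region where the CBF is negative.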

Notice that the decentralized CBF in Proposition 1 is only defined on the given scenario . This is because handcrafting a CBF for a complex heterogeneous multi-agent system is impossible, while learning a single valid CBF for all possible scenarios is extremely difficult. The latter requires the CBF to be powerful and expressive enough to handle any traffic condition and therefore hard (if not impossible) to learn. In this paper, we take a middle-ground and only require the agents to learn CBFs that are valid for each scenario and therefore guarantee the safety only for the specific scenario.

IV Building safe and reactive road participant models through learning neural CBF

Given the multi-agent system as in Definition 1 and guided by Proposition 1, we propose a decentralized learnable framework to construct autonomous reactive agents with safety as the primary objective. As shown in Fig. 1, for each type of agent under the shared simulation environment, we use a hierarchical control framework that contains a high-level decision controller, a set of reference controllers, a CBF network (serving as the barrier function in (3)) and a CBF-guided controller. The high-level controller is pre-trained offline using expert trajectories to capture high-level decisions made by humans that are hard to define in a CBF (e.g. when to switch lanes). Then, during training in simulation, the CBF network and CBF controller jointly learn to satisfy the CBF conditions in (3). At the inference stage, the high-level controller selects a reference controller for different purposes (e.g. staying at the center of the lane, reaching the destination, etc.), and the CBF controller rectifies the reference control command to ensure the system’s safety while avoiding diverging too much from the agent’s original purposes, hence remaining realistic. Detailed implementations and training procedures are presented in the rest of this section.

Fig. 1: Overview of the learning framework for reactive agents. Under each scenario, each type of agents share the same learning framework and control policies.

IV-A Reference controllers and high-level controllers

CBF only enforces safety and therefore could lead to overly safe actions. For example, agents using only CBF-based control could choose to drive off the road to stay far from others. Therefore, we need a reference controller to serve the original purposes of the agents. We adopt a linear quadratic regulator (LQR) as the reference controller in our experiments, but it can take any other form. Moreover, some high-level decisions reflect human intentions and are thus better learned from data, for example, whether to switch lanes on the highway given the current traffic condition. Therefore, we further construct a high-level controller for scenarios that need such a high-level decision. Here the high-level controller returns a discrete value $k$ that represents the behavior pattern or decision, based on which the corresponding reference controller $\pi_{\mathrm{ref}}^{k}$ is chosen. Finally, the CBF controller $\pi_{\mathrm{CBF}}$ learns to rectify the reference control commands to maintain the system’s safety. The overall control input for agent $i$, with state $s_i$ and observation $o_i$, can thus be written as: $u_i = \pi_{\mathrm{ref}}^{k}(s_i, o_i) + \pi_{\mathrm{CBF}}(s_i, o_i)$. (4)
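The hierarchical composition described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the CBF correction is assumed to be added to the selected reference command, the reference controllers are simple proportional stand-ins for LQR, and all names (`lane_reference`, `overall_control`, the `"keep"`/`"switch"` decisions, the state layout) are hypothetical.

```python
import numpy as np

def lane_reference(target_y, gain=0.5):
    """Hypothetical stand-in for an LQR reference: steer toward a lane centre."""
    def controller(state):
        # Assumed state layout: [x, y, heading, speed]; control: [accel, turn rate].
        return np.array([0.0, -gain * (state[1] - target_y)])
    return controller

def overall_control(state, obs, high_level, references, cbf_controller):
    """Reference command chosen by the discrete decision, plus a CBF correction."""
    decision = high_level(state, obs)        # discrete behavior pattern
    u_ref = references[decision](state)      # command serving the agent's purpose
    return u_ref + cbf_controller(state, obs)  # assumed additive rectification

# Usage with placeholder components (all hypothetical).
references = {"keep": lane_reference(0.0), "switch": lane_reference(3.5)}
state = np.array([0.0, 1.0, 0.0, 20.0])
u = overall_control(state, None,
                    high_level=lambda s, o: "switch",
                    references=references,
                    cbf_controller=lambda s, o: np.array([-1.0, 0.0]))
```

With these placeholders the reference asks for a turn toward the adjacent lane while the correction term adds braking, which mirrors the division of labor between purpose and safety.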


IV-B Agent safety checking decomposition and vectorized CBF

Ideally, if a learnable barrier certificate together with a learnable controller can satisfy the conditions in (3) for a given system, scenario, and pair of safe and unsafe sets, then the safety of the system is guaranteed. However, in practice the observation often lies in a high-dimensional space, which makes learning barrier certificates from data a challenging problem.

Notice that checking the agent safety in the system can normally be further decomposed into pairwise checks between the agent and each of its neighbors or static obstacles [12]. For example, in a crowd scenario, a pedestrian is safe if it doesn’t collide with any of its neighbors or static facilities. Pairwise checking with neighbors renders more information (labels) in the learning process than just checking the agent safety. Inspired by this, we propose a method to solve this problem efficiently by using a vectorized version of CBF, which (if learned correctly) can still guarantee the system’s safety.

For each agent, we assume the observation can be decomposed into instance-specific observations with respect to its neighboring agents and other static obstacles, i.e. $o_i = \{o_{i,j}\}_{j=1}^{N_i}$, where $o_{i,j}$ is the observation of the $j$-th instance and $N_i$ denotes the maximum number of instances the $i$-th agent can perceive from its local observation at one time. The safety of an agent-neighbor pair is measured by the agent’s state and its instance-specific observation: $c_{i,j}(s_i, o_{i,j})$. The $i$-th agent with its $j$-th neighbor is in the safe set $\mathcal{X}_{i,j,s}$ if $c_{i,j}(s_i, o_{i,j}) \le 0$; otherwise it is in the dangerous set $\mathcal{X}_{i,j,d}$.

We denote by $h_{i,j}$ a function of the form $h_{i,j}(s_i, o_{i,j})$. Now if, for all $j$, $h_{i,j}$ satisfies the CBF conditions defined in (3) under the corresponding sets $\mathcal{X}_{i,j,s}$ and $\mathcal{X}_{i,j,d}$, then we can construct $h_i(s_i, o_i) = \min_j h_{i,j}(s_i, o_{i,j})$, which will satisfy the CBF conditions defined in (3) on the agent-level safe and dangerous sets. This can be proved using Proposition 5 from [18], which discusses the sufficient conditions to construct a valid barrier function via min/max operations on top of component barrier functions.

Thus we can decompose an agent’s observation into instance-specific observations, and learn a collection of “smaller” CBFs to guarantee system safety. In this way, the dimension of the input space for each CBF is greatly reduced, and each CBF is therefore easier to learn. In addition, for the same type of neighbor agents, this method provides $\bar{n}_{i,k}$ times more training samples to the corresponding CBF network, where $\bar{n}_{i,k}$ is the average number of neighbor agents of the $k$-th type that can be observed by the $i$-th agent in the scenario, which greatly accelerates the learning process.
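The decomposition can be illustrated with a small Python sketch: one CBF value per observed instance, aggregated by a minimum so that the agent-level value is non-negative only if every pairwise value is. The pairwise function below is a handcrafted distance-based stand-in for the learned network, and the names, the threshold, and the use of absolute neighbor positions as instance-specific observations are all assumptions made for illustration.

```python
import numpy as np

def toy_pair_cbf(state, neighbor_pos, threshold=1.0):
    """Handcrafted stand-in for a learned pairwise CBF: distance minus a threshold."""
    return np.linalg.norm(neighbor_pos - state[:2]) - threshold

def vectorized_cbf(state, instance_obs, pair_cbf=toy_pair_cbf):
    """One CBF value per observed instance (assumes at least one instance)."""
    return np.array([pair_cbf(state, o) for o in instance_obs])

def agent_cbf(state, instance_obs, pair_cbf=toy_pair_cbf):
    """Agent-level value: the minimum over all pairwise values."""
    return float(vectorized_cbf(state, instance_obs, pair_cbf).min())

# Usage: one neighbor far away, one inside the threshold.
state = np.array([0.0, 0.0])
neighbors = np.array([[3.0, 0.0], [0.5, 0.0]])
values = vectorized_cbf(state, neighbors)   # per-neighbor values
worst = agent_cbf(state, neighbors)         # dominated by the closest neighbor
```

Each neighbor contributes its own labeled value, which is why the pairwise form yields more training signal per time step than a single agent-level label.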

IV-C Neural network implementations

In this paper, the high-level controller, the barrier certificate and the CBF controller are constructed as neural networks (NN). The high-level policy is a gated recurrent network which takes the agent’s state and observation as inputs, and outputs discrete high-level decisions. As for the CBF controller, inspired by [34], we use a PointNet [33] structured network to ensure the NN is invariant to the permutation and length of its inputs, and it outputs the control signal. The CBF network is a multilayer perceptron, which takes the agent’s state and observation as input and outputs a vector representing the neighbor-wise CBF values, as discussed in Section IV-B.
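To show why a PointNet-style structure is insensitive to the order and number of observed instances, the following NumPy sketch applies one shared layer to every instance and then pools with a maximum. It is only an illustration of the pooling idea: the layer sizes, the random untrained weights, and the name `pooled_controller` are assumptions, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical untrained weights: 4 features per instance, 16 hidden units, 2 control outputs.
W_shared = rng.normal(size=(4, 16))
W_out = rng.normal(size=(16, 2))

def pooled_controller(instance_obs):
    """Map a (num_instances, 4) observation array to a 2-D control signal."""
    per_instance = np.maximum(instance_obs @ W_shared, 0.0)  # same layer for every instance
    pooled = per_instance.max(axis=0)                        # symmetric pooling over instances
    return pooled @ W_out

# Usage: the output does not depend on neighbor ordering or neighbor count.
obs = np.random.default_rng(1).normal(size=(5, 4))
u_forward = pooled_controller(obs)
u_reversed = pooled_controller(obs[::-1])
u_fewer = pooled_controller(obs[:3])
```

Because the maximum is taken over the instance axis, reordering the rows leaves the pooled feature unchanged, and any number of rows produces a feature of the same size.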

(a) Screenshot of VCI-DUT.
(b) CBF heatmap on VCI-DUT.
(c) Screenshot of RounD.
(d) CBF heatmap on RounD.
Fig. 2: CBF visualization on VCI-DUT/RounD datasets.

We first train the high-level controller from a collection of expert trajectories with manually labelled high-level decisions. During the training phase in simulation, agents controlled by (4) sample trajectory data containing the agent’s state, observation and action at each time step. To optimize the CBF networks and the controllers to satisfy the CBF conditions defined in (3) on those sampled trajectories, we introduce the following loss function:




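$\mathcal{L}_c = \mathcal{L}_s + \mathcal{L}_d + \mathcal{L}_{\dot{h}}$, with $\mathcal{L}_s = \sum_{\text{safe samples}} [\gamma - h_i]_+$, $\mathcal{L}_d = \sum_{\text{dangerous samples}} [\gamma + h_i]_+$, and $\mathcal{L}_{\dot{h}} = \sum [\gamma - \dot{h}_i - \alpha(h_i)]_+$,

where $h_i$ is the CBF value of a sample, $[\cdot]_+$ stands for the function $\max(\cdot, 0)$, and $\gamma$ is a margin to enforce training stability. $\mathcal{L}_s$ and $\mathcal{L}_d$ guide the neural barrier certificates to correctly distinguish samples from the safe and dangerous sets respectively, and $\mathcal{L}_{\dot{h}}$ leads the controller and neural barrier certificates to satisfy the derivative condition $\dot{h}_i + \alpha(h_i) \ge 0$. Due to the difficulty of directly computing the term $\dot{h}_i$ in $\mathcal{L}_{\dot{h}}$, we approximate it by numerical differentiation: $\dot{h}_i(t) \approx (h_i(t + \Delta t) - h_i(t)) / \Delta t$, where $\Delta t$ is the time interval for each simulation step. (In this paper, we assume observations are noise-free; the numerical derivative approximation might be inaccurate for applications with noisy sensor data, and we leave it for future work.) To sum up, $\mathcal{L}_c$ guides the control barrier certificates and the controller to ensure system safety.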

To avoid the CBF controller deviating too far from the normal behavior characterized by the reference controllers, we define a regularization term $\mathcal{L}_r$, the sum of the squared norms of the CBF controller's outputs over samples in the safe set, to penalize large output values of the controller when the sample is safe. Thus the total objective function becomes $\mathcal{L} = \mathcal{L}_c + \eta \mathcal{L}_r$, where $\mathcal{L}_c$ is the CBF loss and $\eta$ balances the weights between the system’s safety and the agents' other purposes, and is set to 0.1 in our experiments.
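The training objective described above can be sketched in NumPy as follows. This is an approximation under several assumptions that should be read as such: hinge terms with a margin for the safe, dangerous, and derivative conditions; a forward-difference estimate of the time derivative; a linear class-K function with slope `kappa`; means instead of sums; a squared-norm regularizer on the CBF controller output for safe samples; and the weight 0.1 applied to the regularizer. The function and argument names are hypothetical.

```python
import numpy as np

def masked_mean(values, mask):
    """Mean of values[mask], or 0.0 when the mask selects nothing."""
    return float(values[mask].mean()) if mask.any() else 0.0

def cbf_training_loss(h_now, h_next, safe, dangerous, u_cbf, dt,
                      gamma=0.1, kappa=1.0, eta=0.1):
    h_now, h_next = np.asarray(h_now, float), np.asarray(h_next, float)
    safe, dangerous = np.asarray(safe, bool), np.asarray(dangerous, bool)
    u_cbf = np.asarray(u_cbf, float)
    # Forward-difference approximation of the CBF time derivative.
    h_dot = (h_next - h_now) / dt
    # Safe samples should have h >= gamma; dangerous samples h <= -gamma.
    loss_safe = masked_mean(np.maximum(gamma - h_now, 0.0), safe)
    loss_danger = masked_mean(np.maximum(gamma + h_now, 0.0), dangerous)
    # Derivative condition h_dot + kappa * h >= gamma (assumed on all samples).
    loss_deriv = float(np.maximum(gamma - h_dot - kappa * h_now, 0.0).mean())
    # Penalize large CBF-controller outputs on safe samples.
    loss_reg = masked_mean((u_cbf ** 2).sum(axis=1), safe)
    return loss_safe + loss_danger + loss_deriv + eta * loss_reg

# Usage on two samples: one correctly positive (safe), one correctly negative (dangerous).
total = cbf_training_loss(h_now=[1.0, -1.0], h_next=[1.0, -1.0],
                          safe=[True, False], dangerous=[False, True],
                          u_cbf=np.zeros((2, 2)), dt=0.1)
```

In an actual training loop these quantities would be computed on network outputs with automatic differentiation; the sketch only shows how the terms fit together.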

To illustrate what can be learned using the above framework, we demonstrate contour plots of the learned CBF value in two scenarios (VCI-DUT and RounD datasets, to be discussed in detail in Section V) for pedestrians and cars respectively. As shown in Fig.2, the contour plots show the value of a new agent’s CBF for each location as if that agent were assigned to be in the scene. We see that under both scenarios, our CBF network outputs negative value when the location is close to other road participants (which is dangerous), and shows positive value at places far away from other agents (safe). Therefore our learned agents will prefer to go to locations that are “bluer”.

V Simulation models for real-world scenarios

(a) HighD [22] contains 110500 vehicle trajectories at six different locations on German highways.
(b) RounD [23] contains 6 hours of trajectories from 13746 road users at different German roundabouts.
(c) SDD [35] records pedestrians and bicyclists trajectories at recording sites on Stanford campus.
(d) VCI-DUT [43] contains 1793 pedestrians trajectories interacting with vehicles at two intersections.
(e) NGSIM [31] contains 45 minutes of highway driving data on US Highway 101 and Interstate 80.
Fig. 3: Real world traffic datasets used in this paper.

We demonstrate the performance of our approach in various simulations built on the datasets shown in Fig. 3. Compared to the baselines, our method achieves the lowest collision rate and RMSE, is 50X faster in execution time than a model-based approach like MPC, and can generalize and react reasonably to unseen conditions to help solve challenging planning problems.

V-A Implementation details, baselines and metrics

Agent types and dynamics: We categorize all agents into three major types: vehicles, cyclists and pedestrians. For vehicle-type and cyclist-type agents, we adopt a unicycle model with different constraints, whereas a double integrator model is used for pedestrian-type agents. An agent can only obtain observations within 30 meters of its surroundings.
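The two dynamics models named here can be sketched as single simulation steps. The state layouts, the control inputs (acceleration and angular velocity for the unicycle, planar acceleration for the double integrator), and the forward-Euler discretization are assumptions for illustration; the per-type constraints mentioned in the text are omitted.

```python
import numpy as np

def unicycle_step(state, control, dt):
    """Assumed state [x, y, heading, speed]; control [acceleration, angular velocity]."""
    x, y, theta, v = state
    accel, omega = control
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + omega * dt,
                     v + accel * dt])

def double_integrator_step(state, control, dt):
    """Assumed state [px, py, vx, vy]; control [ax, ay]."""
    px, py, vx, vy = state
    ax, ay = control
    return np.array([px + vx * dt, py + vy * dt, vx + ax * dt, vy + ay * dt])

# Usage: a vehicle driving straight and a pedestrian accelerating sideways.
car_next = unicycle_step(np.array([0.0, 0.0, 0.0, 1.0]), np.array([0.0, 0.0]), dt=0.1)
ped_next = double_integrator_step(np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.0, 2.0]), dt=0.5)
```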

Safety measures: The pairwise safety measure function between vehicles is defined as the area of intersection between the vehicles’ contours from the bird’s-eye view. The pairwise safety measure function for all other combinations of agent types is defined as $c_{i,j} = r_i + r_j - d_{i,j}$, where $r_i$ and $r_j$ are the radii of the minimum covering circles of the agents’ contours, and $d_{i,j}$ is the Euclidean distance between the two agents.
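A circle-based pairwise safety measure of this kind takes only a few lines. The sketch assumes the measure is the sum of the two covering-circle radii minus the center distance, with non-positive values meaning safe, consistent with the "threshold minus distance" construction described earlier; the function names are hypothetical, and the contour-intersection measure used between vehicles is not covered.

```python
import numpy as np

def circle_safety_measure(pos_i, pos_j, r_i, r_j):
    """Sum of covering-circle radii minus centre distance (positive means overlap)."""
    distance = np.linalg.norm(np.asarray(pos_i, float) - np.asarray(pos_j, float))
    return r_i + r_j - distance

def is_safe(measure):
    """Assumed convention: a pair is safe when the measure is non-positive."""
    return measure <= 0

# Usage: two agents 5 m apart, then two agents 1 m apart, all with 1 m radii.
far = circle_safety_measure([0.0, 0.0], [3.0, 4.0], 1.0, 1.0)
near = circle_safety_measure([0.0, 0.0], [1.0, 0.0], 1.0, 1.0)
```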

Reference / high-level controllers: LQR is used to give the reference control policy for the following purposes: 1. For SDD and VCI-DUT (pedestrian, cyclists and vehicles in free space/campus), LQR is designed for reaching the destination. 2. For RounD (vehicles at roundabouts), LQR is used on vehicles to keep tracking on the pre-defined road points to enter/exit and drive in the roundabout. 3. For HighD and NGSIM (vehicles on highways), LQR is designed for lane-keeping or lane-switching to the adjacent lane. High-level controllers are only used on HighD and NGSIM, where we learn from data when to switch the lane. This is because there could exist a safe controller no matter whether the vehicle switches the lane or keeps in the current lane. The lane-switch decision learned from data can better capture the intention of human drivers. For other datasets, the pedestrians and cyclists just move towards their pre-defined destinations.

Training and testing: In each simulation run, we control 10 to 100 agents in the scene for 5 to 20 seconds (equivalent to 50 to 200 time steps), and let the remaining agents in the scenario follow the expert trajectories. Training on a subset of one scenario for our approach takes 13 hours on an RTX 2080 Ti graphics card. During testing, we start from a different split of the dataset in the same scenario, perform 100 testing runs and average the metrics discussed below. For NGSIM, we follow the same testing configurations reported in [8].

Baseline approaches: For the HighD, RounD, SDD and VCI-DUT datasets, we compare our approach with behavioral cloning (BC) [4], PS-GAIL [8] and MPC [17]. On the NGSIM dataset, we compare with the results of PS-GAIL [8] and RAIL [7] reported in their papers. We do not report the results of RAIL [7] on the other datasets, as it is too sensitive to the handcrafted rewards. Details of the implementation of these baseline methods are as follows: Behavioral cloning is trained on the “state-observation, action” pairs from the expert trajectories using an L2 loss (against the ground-truth control signals) until convergence, using a fully-connected NN. PS-GAIL results on NGSIM are reproduced from the original implementation as reported in [8]. For the remaining datasets, we follow the network structure reported in [8] and train in simulation for the same number of epochs as for our method.

MPC is used to replace the neural-CBF controllers for safety, and the same pre-trained high-level controller and reference controllers are used. We use CasADi [3] for MPC solving, with a goal-reaching cost and collision-avoidance constraints. The constraints are relaxed by slack variables to ensure the feasibility of the problem. The MPC results reported are fine-tuned results. RAIL employs a handcrafted augmented reward in imitation learning, and provides the best results on collision rate, off-road duration and hard brake rate on the NGSIM dataset in the published literature. We use the original implementation of RAIL provided by the authors. Lastly, we compare with a state-of-the-art (non-learning) calibration method [9] on NGSIM and HighD. The models in [9] only work for highway car-following cases, so we do not compare with it on other datasets.

(a) Collision rate. (Lower is better. The y-axis is in log scale)
(b) Relative RMSE comparing to expert policy. (Lower is better)
Fig. 4: Results on HighD, RounD, SDD and VCI-DUT.

Metrics: We compute the following measurements for safety and similarity to real-world data: Collision rate reflects the system’s safety, and is calculated as the number of agent states that fall into the unsafe set divided by the total number of agent states. Root Mean Square Error (RMSE) reflects how much the agents’ policy diverges from the expert demonstrations. It is computed as the root mean square error of the position compared to the ground-truth position in the expert trajectories. For HighD, RounD, SDD and VCI-DUT, we present the relative RMSE, which is further normalized under each dataset by the highest RMSE among all approaches. Driving-related metrics: For NGSIM, we also report other metrics related to driver behaviors on highways, such as off-road rates, hard-brake rates and average lane changes per agent, as discussed in PS-GAIL and RAIL.
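These metrics translate directly into code. The sketch below assumes per-state boolean unsafe flags, position arrays of shape (steps, 2), and a dictionary of per-method RMSE values for one dataset; the function names and the example numbers are illustrative only and are not results from the paper.

```python
import numpy as np

def collision_rate(unsafe_flags):
    """Fraction of agent states that fall into the unsafe set."""
    return float(np.mean(np.asarray(unsafe_flags, bool)))

def position_rmse(simulated, expert):
    """Root mean square position error against the expert trajectory."""
    diff = np.asarray(simulated, float) - np.asarray(expert, float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=-1))))

def relative_rmse(rmse_by_method):
    """Normalize each method's RMSE by the highest RMSE (assumed non-zero)."""
    worst = max(rmse_by_method.values())
    return {name: value / worst for name, value in rmse_by_method.items()}

# Usage with made-up numbers.
rate = collision_rate([True, False, False, False])
rmse = position_rmse([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
rel = relative_rmse({"method_a": 2.0, "method_b": 4.0})
```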

V-B Main results

Results on HighD, RounD, SDD and VCI-DUT: Fig. 4 shows the collision rate and RMSE comparison with BC, PS-GAIL, and MPC. Our approach achieves a reduction in collision rates across all datasets. We can see that model-free methods (PS-GAIL and BC) only reach collision rates similar to MPC under simple scenarios such as HighD (where agents mostly move along the same direction with constant speeds) and SDD (large free space with low-speed pedestrians and cyclists). Though MPC is quite competitive, as it directly optimizes under collision-avoidance constraints, it needs to solve an online optimization problem with complex constraints for each agent. For each simulation step, the MPC solver takes 0.5 seconds (and therefore cannot really be used online), whereas our approach only needs 0.01 seconds, making it 50X faster. Note that our approach still does not achieve zero collisions, which is probably caused by: 1) noisy measurements in the expert trajectories (in particular, SDD is a very noisy dataset, and therefore our method has a higher collision rate on SDD than in other scenarios); 2) the learned CBF being valid only on training samples, with a distribution shift between the training and testing data.

Surprisingly, our approach also achieves a much lower RMSE compared to imitation learning. This implies that, with the initial position and destination fixed, moving safely yields simulation models that are even closer to human behaviors. Of course, the high-level decision also contributes to reducing the RMSE. Again, only on the HighD dataset can IL methods achieve a smaller RMSE, which shows their limitations in handling complex scenarios. Using the same high-level policy and reference controllers, we also obtain a smaller RMSE than the model-based approach MPC. This shows the great potential of combining model-based approaches and learning frameworks, which can produce both safer and more realistic models of human road users.

Metrics Dataset Ours IDM[9] IDM[38] GAIL [8]
Number of collisions NGSIM
TABLE I: Results on NGSIM and HighD under low-density scenarios (20 agents, 5 seconds).
(a) Comparison of driver-related metrics with the expert policy, PS-GAIL, and RAIL, using the same training/testing splits as in [8] and [7]. The y-axis is on a logarithmic scale.
(b) Comparison of generalizability: compared to RAIL, our method maintains a low collision rate when the testing traffic density differs from the one used in training.
Fig. 5: Comparisons on the NGSIM dataset.

Results on NGSIM and HighD: We first test our approach on the NGSIM and HighD datasets under low-density traffic (20 agents, no non-reactive agents, 5 seconds), as used in [9]. We compare with the approach of [9], PS-GAIL [8], and a classical car-following model, IDM [38]. As shown in Table I, without any calibration, our approach achieves zero collision rates on both datasets and obtains the lowest RMSE in position and velocity, which shows the power of our learned CBF in terms of providing safety guarantees. For challenging dense-traffic cases in NGSIM (100 agents together with around 70 other non-reactive agents following expert trajectories, 20 seconds), we further compare with PS-GAIL [8] and RAIL [7], as shown in Fig. 5. We also include the “expert policy” in the figure for reference. Here “RAIL-Binary” adopts a binary augmented reward, whereas “RAIL-Smooth” introduces a continuous reward. With similar lane-changing rates (see footnote 3), our collision rate is almost to the “RAIL-Smooth”. Our method also achieves a lower off-road duration. Compared to RAIL, our approach does result in a slightly higher hard-brake rate, which might be due to the collision-avoidance intention in our CBF controller. Our RMSE is similar to RAIL's and thus is not reported here.

Footnote 3: We calculate lane-change rates as the number of lane changes in the 20-second simulation divided by the total number of agents, which differs from the calculation in [7]. We believe our calculation is more accurate, as supported by the statistics shown in [26, Figure 3.2]. RAIL’s lane-changing rates are updated using our new calculation method.
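The lane-change rate described in footnote 3 (number of lane changes in the simulation divided by the total number of agents) can be sketched as follows. The input format, a per-step lane index for each agent, and the name `lane_change_rate` are assumptions for illustration; how lane indices are derived from positions is not specified here.

```python
def lane_change_rate(lane_ids_per_agent):
    """Total number of lane changes in the rollout divided by the number of agents.

    lane_ids_per_agent: list with one entry per agent, each a list of lane
    indices (one per simulation step). This input layout is hypothetical.
    """
    n_agents = len(lane_ids_per_agent)
    if n_agents == 0:
        return 0.0
    changes = 0
    for lane_ids in lane_ids_per_agent:
        # A lane change is counted whenever consecutive steps differ in lane index.
        changes += sum(1 for prev, cur in zip(lane_ids, lane_ids[1:]) if prev != cur)
    return changes / n_agents
```

For example, three agents with lane sequences `[1, 1, 2, 2]`, `[3, 3, 3, 3]`, and `[2, 1, 1, 2]` make 1, 0, and 2 lane changes respectively, giving a rate of 3 / 3 = 1.0 lane changes per agent over the rollout.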

To show the generalizability of our approach, we train the vehicle in NGSIM using only agents then test on an increasing number of agents from to . Our learned model can adapt to different traffic densities and maintain a similar level of collision rates, as shown in Fig. 5(b). In comparison, RAIL is trained using agents but behaves poorly when the density changes. When the density is low (), the collision rate for RAIL goes even higher.

V-C Planning for ego-vehicles with reactive and safe agents

(a) Pedestrians that follow expert trajectories.
(b) Pedestrians running PS-GAIL models.
(c) Our learned reactive and safe pedestrians.
Fig. 6: Simulations for a vehicle attempting to drive through a crowd with different pedestrian models.

To demonstrate the use of the reactive and safe agents we have learned, we design a challenging planning task in which a vehicle drives through a crowded street, with the scene taken from VCI-DUT. We consider three different pedestrian models: 1) agents that follow fixed expert trajectories, 2) agents built using PS-GAIL, and 3) our learned reactive agents. We use the same MPC controller for the car to drive through the crowd while enforcing safe distances to pedestrians. As shown in Fig. 6, when interacting with our learned reactive agents, the MPC-controlled car travels through the crowd safely, because our learned agents use a neural CBF to produce safe reactions and therefore make way for the car. In the other cases, the car freezes and even moves backward, because the pedestrians do not react to make way for it. This shows that safety-guided reactive agents can help address some of the conservativeness issues in developing self-driving algorithms. Note that whether pedestrians can be assumed to always keep safe in reality is a separate question that is beyond the scope of this paper.
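To give some intuition for how a barrier-certificate-guided agent can "make way" for the car, the sketch below shows a simplified safety filter. It is not the learned neural CBF used in this work: it assumes a hand-crafted distance barrier h = ||p_agent - p_obs||^2 - radius^2, a single-integrator pedestrian whose control is its velocity, and a single obstacle (the car). Under these assumptions the minimally invasive correction of a reference control has a closed form, namely the projection of the reference onto the half-space defined by dh/dt + alpha * h >= 0. All names (`cbf_filter`, `alpha`, `radius`) are hypothetical.

```python
def cbf_filter(p_agent, p_obs, v_obs, u_ref, radius, alpha=1.0):
    """Minimally modify u_ref so that dh/dt + alpha * h >= 0 holds.

    Hand-crafted barrier h = ||p_agent - p_obs||^2 - radius^2 (an assumption,
    not the learned barrier). The agent is a single integrator: u is its velocity.
    """
    dx = p_agent[0] - p_obs[0]
    dy = p_agent[1] - p_obs[1]
    h = dx * dx + dy * dy - radius * radius
    # dh/dt = a . (u - v_obs), with a = 2 * (p_agent - p_obs).
    ax, ay = 2.0 * dx, 2.0 * dy
    # The CBF condition becomes the linear constraint a . u >= b.
    b = ax * v_obs[0] + ay * v_obs[1] - alpha * h
    lhs = ax * u_ref[0] + ay * u_ref[1]
    norm_sq = ax * ax + ay * ay
    if lhs >= b or norm_sq == 0.0:
        # Reference control already satisfies the condition (or the gradient is degenerate).
        return u_ref
    # Closed-form solution of: min ||u - u_ref||^2  subject to  a . u >= b.
    scale = (b - lhs) / norm_sq
    return (u_ref[0] + scale * ax, u_ref[1] + scale * ay)
```

With a pedestrian at (1, 0) walking toward a stationary car at the origin with reference velocity (-1, 0), radius 0.5, and alpha 1, the filter slows the approach to (-0.375, 0), while a reference velocity pointing away from the car is returned unchanged. In the multi-agent setting each agent would face several such constraints at once, which generally requires a QP solver or a learned controller rather than this single-constraint projection.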

VI Conclusion

In this paper, we proposed a way to generate reactive and safe road participants for traffic simulation. Starting from a safety-guided learning approach, we achieve the best performance across several real-world traffic datasets and show the advantage of reactive agents in planning. Future work includes: 1) combining our CBF framework with imitation learning to further guide agents to behave like humans, and 2) integrating our models into more mature simulators such as CARLA.


The authors acknowledge support from the NASA University Leadership initiative (grant #80NSSC20M0163) and from the Defense Science and Technology Agency in Singapore. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of any NASA entity, DSTA Singapore, or the Singapore Government.


  • [1] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada (2019) Control barrier functions: theory and applications. In 2019 18th European Control Conference (ECC), pp. 3420–3431. Cited by: §I, §III.
  • [2] A. D. Ames, J. W. Grizzle, and P. Tabuada (2014) Control barrier function based quadratic programs with application to adaptive cruise control. In 53rd IEEE Conference on Decision and Control, pp. 6271–6278. Cited by: §I, §II, §III.
  • [3] J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl (2019) CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation 11 (1), pp. 1–36. Cited by: §V-A.
  • [4] M. Bain and C. Sammut (1995) A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129. Cited by: §I, §II, §V-A.
  • [5] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause (2017) Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551. Cited by: §II.
  • [6] A. Best, S. Narang, D. Barber, and D. Manocha (2017) Autonovi: autonomous vehicle planning with dynamic maneuvers and traffic constraints. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2629–2636. Cited by: §I.
  • [7] R. P. Bhattacharyya, D. J. Phillips, C. Liu, J. K. Gupta, K. Driggs-Campbell, and M. J. Kochenderfer (2019) Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 789–795. Cited by: §I, §II, 4(a), §V-A, §V-B, footnote 3.
  • [8] R. P. Bhattacharyya, D. J. Phillips, B. Wulfe, J. Morton, A. Kuefler, and M. J. Kochenderfer (2018) Multi-agent imitation learning for driving simulation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1534–1539. Cited by: §II, 4(a), §V-A, §V-A, §V-B, TABLE I.
  • [9] R. P. Bhattacharyya, R. Senanayake, K. Brown, and M. J. Kochenderfer (2020) Online parameter estimation for human driver behavior prediction. In 2020 American Control Conference (ACC), pp. 301–306. Cited by: §V-A, §V-B, TABLE I.
  • [10] U. Borrmann, L. Wang, A. D. Ames, and M. Egerstedt (2015) Control barrier certificates for safe swarm behavior. IFAC-PapersOnLine 48 (27), pp. 68–73. Cited by: §II.
  • [11] Y. Chen, H. Peng, and J. Grizzle (2017) Obstacle avoidance for low-speed autonomous vehicles with barrier function. IEEE Transactions on Control Systems Technology 26 (1), pp. 194–206. Cited by: §II.
  • [12] Y. Chen, A. Singletary, and A. D. Ames (2020) Guaranteed obstacle avoidance for multi-robot operations with limited actuation: a control barrier function approach. IEEE Control Systems Letters 5 (1), pp. 127–132. Cited by: §II, §III, §IV-B.
  • [13] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395. Cited by: §II.
  • [14] J. Clark and D. Amodei (2016) Faulty reward functions in the wild. Internet: https://blog.openai.com/faulty-reward-functions. Cited by: §II.
  • [15] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16. Cited by: §I.
  • [16] J. Ferlez, M. Elnaggar, Y. Shoukry, and C. Fleming (2020) Shieldnn: a provably safe nn filter for unsafe nn controllers. arXiv preprint arXiv:2006.09564. Cited by: §II.
  • [17] C. E. Garcia, D. M. Prett, and M. Morari (1989) Model predictive control: theory and practice—a survey. Automatica 25 (3), pp. 335–348. Cited by: §V-A.
  • [18] P. Glotfelter, J. Cortés, and M. Egerstedt (2017) Nonsmooth barrier functions with applications to multi-robot systems. IEEE control systems letters 1 (2), pp. 310–315. Cited by: §IV-B.
  • [19] D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, and A. Dragan (2017) Inverse reward design. arXiv preprint arXiv:1711.02827. Cited by: §I.
  • [20] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476. Cited by: §II.
  • [21] H. K. Khalil and J. W. Grizzle (2002) Nonlinear systems. Vol. 3, Prentice hall Upper Saddle River, NJ. Cited by: §III.
  • [22] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein (2018) The highd dataset: a drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125. Cited by: §I, 2(a).
  • [23] R. Krajewski, T. Moers, J. Bock, L. Vater, and L. Eckstein (2020) The round dataset: a drone dataset of road user trajectories at roundabouts in germany. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6. Cited by: §I, 2(b).
  • [24] D. Krajzewicz (2010) Traffic simulation with sumo–simulation of urban mobility. In Fundamentals of traffic simulation, pp. 269–293. Cited by: §I.
  • [25] V. Lakshmikantham and S. Leela (1969) Differential and integral inequalities: theory and applications: volume i: ordinary differential equations. Academic press. Cited by: §III.
  • [26] S. E. Lee, E. C. Olsen, W. W. Wierwille, et al. (2004) A comprehensive examination of naturalistic lane-changes. Technical report United States. National Highway Traffic Safety Administration. Cited by: footnote 3.
  • [27] X. Li and C. Belta (2019) Temporal logic guided safe reinforcement learning using control barrier functions. arXiv preprint arXiv:1903.09885. Cited by: §II.
  • [28] A. Liu, G. Shi, S. Chung, A. Anandkumar, and Y. Yue (2020) Robust regression for safe exploration in control. In Learning for Dynamics and Control, pp. 608–619. Cited by: §II.
  • [29] K. Long, C. Qian, J. Cortés, and N. Atanasov (2020) Learning barrier functions with memory for robust safe navigation. arXiv preprint arXiv:2011.01899. Cited by: §II.
  • [30] A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2. Cited by: §I, §II.
  • [31] NGSIM. next generation simulation. Note: Cited by: §I, 2(e).
  • [32] S. Prajna and A. Jadbabaie (2004) Safety verification of hybrid systems using barrier certificates. In International Workshop on Hybrid Systems: Computation and Control, pp. 477–492. Cited by: §II.
  • [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §IV-C.
  • [34] Z. Qin, K. Zhang, Y. Chen, J. Chen, and C. Fan (2021) Learning safe multi-agent control with decentralized neural barrier certificates. arXiv preprint arXiv:2101.05436. Cited by: §I, §II, §III, §IV-C.
  • [35] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese (2020) Learning social etiquette: human trajectory prediction in crowded scenes. In European Conference on Computer Vision (ECCV), Cited by: §I, 2(c).
  • [36] S. Russell (1998) Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 101–103. Cited by: §I.
  • [37] M. Srinivasan, A. Dabholkar, S. Coogan, and P. Vela (2020) Synthesis of control barrier functions using a supervised machine learning approach. arXiv preprint arXiv:2003.04950. Cited by: §II.
  • [38] M. Treiber and A. Kesting (2017) The intelligent driver model with stochasticity-new insights into traffic flow oscillations. Transportation research procedia 23, pp. 174–187. Cited by: §V-B, TABLE I.
  • [39] L. Wang, A. D. Ames, and M. Egerstedt (2017) Safety barrier certificates for collisions-free multirobot systems. IEEE Transactions on Robotics 33 (3), pp. 661–674. Cited by: §II.
  • [40] L. Wang, D. Han, and M. Egerstedt (2018) Permissive barrier certificates for safe stabilization using sum-of-squares. In 2018 Annual American Control Conference (ACC), pp. 585–590. Cited by: §II.
  • [41] P. Wieland and F. Allgöwer (2007) Constructive safety using control barrier functions. IFAC Proceedings Volumes 40 (12), pp. 462–467. Cited by: §II.
  • [42] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner (2000) Torcs, the open racing car simulator. Software available at http://torcs.sourceforge.net 4 (6), pp. 2. Cited by: §I.
  • [43] D. Yang, L. Li, K. Redmill, and Ü. Özgüner (2019) Top-view trajectories: a pedestrian dataset of vehicle-crowd interaction from controlled experiments and crowded campus. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 899–904. Cited by: §I, 2(d).