Simulation is a crucial tool for accelerating the development of autonomous driving software because it can generate adversarial interactions for training autonomous driving policies, play out counterfactual scenarios of interest, and estimate safety-critical metrics. In this way, simulation reduces reliance on real-world data, which can be expensive and/or dangerous to collect. As autonomous vehicles share public roads with human drivers, cyclists, and pedestrians, the underlying simulation tools require realistic models of these human road users.
Such models can be obtained by applying learning from demonstration (LfD) [3, 21] to example trajectories of human road use collected using sensors (e.g., cameras and LIDAR) already mounted on cars on the road. Such demonstrations are ideally suited to learning the realistic road use behaviour needed for autonomous driving simulation.
However, existing LfD methods such as behavioural cloning (BC) and generative adversarial imitation learning (GAIL) are typically insufficient for producing realistic models of human road users. Despite minimising their supervised or adversarial losses, the resulting policies frequently collide with other road users or drive off the road.
To address this problem, we propose Symphony, a new approach to LfD for autonomous driving simulation that can greatly improve the realism of the learned behaviour. A key idea behind Symphony is to combine conventional policies, represented as neural networks, with a parallel beam search that refines these policies on the fly. As simulations are rolled out, Symphony prunes branches that are unfavourably evaluated by a discriminator trained to distinguish agent behaviour from that in the data. Because the beam search is parallelised, promising branches are repeatedly forked, focusing computation on the most realistic rollouts. In addition, since the tree search is also performed during training, the pruning mechanism drives the agent towards more realistic states that increasingly challenge the discriminator. The results of each tree search can then be distilled back into the policy itself, yielding an adversarial algorithm.
However, simply learning realistic agents is not enough. They must also be diverse, i.e., cover the entire distribution of realistic behaviour, in order to enable a full evaluation of autonomous driving software. Unfortunately, while the use of beam search improves realism, it tends to harm diversity: repeated pruning can encourage mode collapse, where only the easiest to simulate modes are represented. To address this issue, Symphony takes a hierarchical approach, factoring agent behaviour into goal generation and goal conditioning. For the former, we train a generative model that proposes goals in the form of routes, which capture high-level intent. For the latter, we train goal-conditional policies that modulate their behaviour based on a goal provided as input. Generating and conditioning on diverse goals ensures that agent diversity neither disappears during adversarial training nor is pruned away by the beam search.
We evaluate Symphony agents with extensive experiments on run segments from the Waymo Open Motion Dataset and a proprietary Waymo dataset consisting of demonstration trajectories and their corresponding contexts, created by applying Waymo’s perception tools to sensor data collected by Waymo vehicles driving on public roads. We report performance on several realism and diversity metrics, including a novel diversity metric called curvature Jensen-Shannon divergence that indicates how well the high-level agent behaviour matches the empirical distribution. Our results confirm that combining beam search with hierarchy yields more realistic and diverse behaviour than several baselines.
II-A Problem Setting
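The sequential process that generates the demonstration behaviour is multi-agent and general sum and can be modelled as a Markov game $\mathcal{G} = \{\mathcal{S}, \mathcal{A}, P, \{r_i\}_{i=1}^{N}, \nu, \gamma\}$. $N$ is the number of agents; $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space of a given agent (the same for all agents), such that the joint action and observation spaces are $\mathcal{A}^N$ and $\mathcal{S}^N$; $\{r_i\}_{i=1}^{N}$ is the set of reward functions (one for each agent); $\nu$ is the initial state distribution; and $\gamma$ is a discount factor. The dynamics are described by a transition probability function $P(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$, where $\mathbf{a} \in \mathcal{A}^N$ is a joint action; $\mathbf{s}$ is also shown in bold because it factors similarly to $\mathbf{a}$. The agents’ actions are determined by a joint policy $\pi(\mathbf{a}_t \mid \mathbf{s}_t)$ that is agent-wise factored, i.e., $\pi(\mathbf{a}_t \mid \mathbf{s}_t) = \prod_{i=1}^{N} \pi^i(a^i_t \mid \mathbf{s}_t)$.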
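To make the agent-wise factorisation concrete, the following minimal Python sketch computes the probability of a joint action as the product of per-agent action probabilities. The function name, the three-action vocabulary, and the probability values are hypothetical and chosen only for illustration.

```python
from itertools import product


def joint_action_prob(per_agent_probs, joint_action):
    """Probability of a joint action under an agent-wise factored policy."""
    prob = 1.0
    for agent_probs, action in zip(per_agent_probs, joint_action):
        prob *= agent_probs[action]
    return prob


# Hypothetical per-agent action distributions, all conditioned on the same joint state.
per_agent_probs = [
    {"brake": 0.2, "keep": 0.5, "accelerate": 0.3},  # first agent
    {"brake": 0.6, "keep": 0.3, "accelerate": 0.1},  # second agent
]

p_keep_brake = joint_action_prob(per_agent_probs, ("keep", "brake"))

# The factored probabilities still sum to one over all joint actions.
actions = ["brake", "keep", "accelerate"]
total = sum(joint_action_prob(per_agent_probs, a) for a in product(actions, repeat=2))
```

Because each agent's distribution is conditioned on the full joint state, the factorisation does not prevent agents from reacting to one another; it only means that their actions at a given step are sampled independently.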
To avoid superfluous formalism, we do not explicitly consider partial observability. However, this is easily modelled by masking certain state features in the encoders upon whose output the agents condition their actions.
Because we have access to a simulator, the transition function $P$ is considered known. However, since we are learning from demonstration, the agents’ reward functions are unknown. Furthermore, we do not even have access to sample rewards. Instead, we can merely observe the agents’ behaviour, yielding a dataset $\mathcal{D}_E = \{\tau_k\}_{k=1}^{K}$ of trajectories where $\tau_k = (\mathbf{s}_0, \mathbf{a}_0, \mathbf{s}_1, \mathbf{a}_1, \ldots)$ is a trajectory generated by the ‘expert’ joint policy $\pi_E$. The goal of LfD is to set parameters $\theta$ such that a policy $\pi_\theta$ matches $\pi_E$ in some sense.
In our case, the data consists of LIDAR and camera readings recorded from an ego vehicle on public roads. It is partitioned into run segments and preprocessed by a perception system, yielding the dataset $\mathcal{D}_E$. The states and discrete actions in each run segment are fit greedily to approximate the logged trajectory. Each state is a tuple containing three kinds of features. The first is static scene features such as locations of lanes and sidewalks, including a roadgraph, a set of interconnected lane regions with ancestor, descendant, and neighbour relationships that describes how agents can move, change lanes, and turn. The second is dynamic scene features such as traffic light states. The third is features describing the position, velocity, and orientation of the agents. Together, these yield the tuple: $\mathbf{s}_t = (s^{\text{static}}, s^{\text{dyn}}_t, s^1_t, \ldots, s^N_t)$. By convention, $s^1_t$ is the state of the ego vehicle. Since agents can enter or leave the ego agent’s field of view during a run segment, we zero-pad states for missing agents to maintain fixed dimensionality.
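The zero-padding of missing agents can be sketched as follows. This is a minimal illustration, not the paper's preprocessing code: the function name is hypothetical, and returning a presence mask alongside the padded states is an added assumption so that downstream code can tell a padded slot from a real agent whose features happen to be zero.

```python
def pad_missing_agents(agent_states, feature_dim):
    """Replace missing agents (None) with zero vectors; also return a presence mask.

    The presence mask is an assumption of this sketch, not something the text specifies.
    """
    padded, present = [], []
    for state in agent_states:
        if state is None:
            padded.append([0.0] * feature_dim)
            present.append(False)
        else:
            padded.append(list(state))
            present.append(True)
    return padded, present


# Three agent slots at one timestep; the second agent is outside the field of view.
padded, present = pad_missing_agents([[1.0, 2.0], None, [3.0, 4.0]], feature_dim=2)
```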
To avoid the need to learn an explicit model of initial conditions, we couple each simulation to a reference trajectory. Each agent is initialised to the corresponding state $s^i_0$ in the reference trajectory, after which it can either be a playback agent that blindly replays the behaviour in the reference trajectory or an interactive agent that responds dynamically to the unfolding simulation using a policy learned from demonstration.
During a Symphony simulation, the state of an interactive agent is determined by sampling actions from its policy $\pi^i_\theta$ and propagating them through $P$. Given a reference trajectory $\tau$, this yields a new simulated trajectory $\hat{\tau}$ in which $s^{\text{static}}$ and $s^{\text{dyn}}_t$ remain as they were in the reference trajectory but the agent states of the interactive agents are altered.
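The distinction between playback and interactive agents can be illustrated with a toy rollout loop. This is a sketch under strong simplifying assumptions: each agent state is a single 1-D position, the policy and the transition function are hypothetical stand-ins passed in as plain Python callables, and none of the names come from the Symphony implementation.

```python
def simulate(reference, interactive, policy, step):
    """Roll out one simulation coupled to a reference trajectory.

    reference:   list over time of lists of per-agent states (the logged run segment).
    interactive: set of agent indices controlled by the policy.
    policy:      callable (agent_index, joint_state) -> action.
    step:        callable (agent_state, action) -> next agent state.
    """
    current = list(reference[0])  # every agent starts from its reference state
    simulated = [list(current)]
    for t in range(1, len(reference)):
        nxt = list(reference[t])  # playback agents replay the log
        for i in interactive:
            action = policy(i, current)  # interactive agents react to the simulation
            nxt[i] = step(current[i], action)
        current = nxt
        simulated.append(list(current))
    return simulated


# Toy example: agent 0 is interactive and advances only while the gap to agent 1 exceeds 5.
reference = [[0.0, 10.0], [1.0, 8.0], [2.0, 6.0], [3.0, 4.0]]
simulated = simulate(
    reference,
    interactive={0},
    policy=lambda i, state: 1.0 if state[1] - state[i] > 5.0 else 0.0,
    step=lambda position, velocity: position + velocity,
)
```

In this toy run the interactive agent follows the log for two steps and then stops short of it, while the playback agent reproduces the reference exactly, which is the kind of divergence the simulated trajectory is meant to capture.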
II-B Behavioural Cloning
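Behavioural cloning (BC) is an LfD method that solves a supervised learning problem: $\mathcal{D}_E$ is interpreted as a labelled training set and $\pi_\theta$ is trained to predict $\mathbf{a}_t$ given $\mathbf{s}_t$. If we take a maximum likelihood approach, then we can optimise $\theta$ as follows:

$$\max_\theta \sum_{\tau \in \mathcal{D}_E} \sum_{t} \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t). \tag{1}$$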
This approach is simple but limited. As it optimises only the conditional policy probabilities, it does not ensure that the underlying distributions of states visited by $\pi_\theta$ and $\pi_E$ match. Consequently it suffers from covariate shift, in which generalisation errors compound, leading to states far from those visited by $\pi_E$.
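As a minimal sketch of the maximum likelihood view of BC, the snippet below evaluates the average negative log-likelihood of logged actions under a policy. It assumes discrete actions and a policy exposed as a callable returning a dictionary of action probabilities; both are illustrative assumptions, and averaging rather than summing only rescales the objective.

```python
import math


def bc_negative_log_likelihood(policy, dataset):
    """Average negative log-likelihood of demonstrated actions under the policy.

    policy:  callable state -> dict mapping each action to its probability.
    dataset: list of trajectories, each a list of (state, action) pairs.
    """
    total, count = 0.0, 0
    for trajectory in dataset:
        for state, action in trajectory:
            total -= math.log(policy(state)[action])
            count += 1
    return total / count


# Hypothetical demonstrations and a uniform policy over two actions.
dataset = [[("s0", "keep"), ("s1", "brake")]]
uniform_policy = lambda state: {"keep": 0.5, "brake": 0.5}
loss = bc_negative_log_likelihood(uniform_policy, dataset)
```

Note that the loss is evaluated only on states that appear in the demonstrations, which is exactly why it gives no signal about states the learned policy reaches on its own.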
II-C Generative Adversarial Imitation Learning
BC is a strictly offline method because it does not require interaction with an environment or simulator. Given $\mathcal{D}_E$, it simply estimates $\pi_\theta$ offline using supervised learning. By contrast, most LfD methods are interactive, repeatedly executing $\pi_\theta$ in the environment and using the resulting trajectories to estimate a gradient with respect to $\theta$.
Interactive methods include inverse reinforcement learning [29, 1, 31, 44] and adversarial methods such as generative adversarial imitation learning (GAIL). GAIL borrows ideas from GANs and employs a discriminator $D_\phi$ that is trained to distinguish states and actions generated by the agents from those observed in $\mathcal{D}_E$. The discriminator is then used as a cost (i.e., negative reward) function by the agents, yielding increasingly log-like behaviour. In our multi-agent setting, the GAIL objective can be written as:

$$\min_\theta \max_\phi \; \mathbb{E}_{\mathbf{s} \sim p_\theta}\left[\log D_\phi(\mathbf{s})\right] + \mathbb{E}_{\mathbf{s} \sim \mathcal{D}_E}\left[\log\left(1 - D_\phi(\mathbf{s})\right)\right], \tag{2}$$

where $p_\theta$ is the distribution over states induced by $\pi_\theta$ and $\mathcal{D}_E$ is here treated as an empirical distribution over states. Although the agents who generated $\mathcal{D}_E$ are not cooperative (as modelled by the different reward functions in $\{r_i\}_{i=1}^{N}$), the learned agents controlled by $\pi_\theta$ are cooperative because they all aim to minimise the same discriminator $D_\phi$, i.e., they share the goal of realistically imitating $\pi_E$.
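The adversarial objective can be illustrated numerically with a small sketch. It assumes the convention that the discriminator outputs the probability that a state was generated by the agents (so that a high output acts as a cost), and it uses scalar toy states and hand-written discriminators that are purely hypothetical.

```python
import math


def gail_objective(discriminator, agent_states, expert_states):
    """Empirical value of the objective the discriminator maximises.

    discriminator: callable state -> probability that the state is agent-generated.
    """
    agent_term = sum(math.log(discriminator(s)) for s in agent_states) / len(agent_states)
    expert_term = sum(math.log(1.0 - discriminator(s)) for s in expert_states) / len(expert_states)
    return agent_term + expert_term


# Toy scalar states: agent states are positive, logged states are negative.
agent_states = [1.0, 2.0]
expert_states = [-1.0, -2.0]

uninformed = gail_objective(lambda s: 0.5, agent_states, expert_states)
informed = gail_objective(lambda s: 0.9 if s > 0 else 0.1, agent_states, expert_states)
```

A discriminator that separates the two sets attains a higher value than an uninformed one; the agents, in turn, lower the first term by visiting states the discriminator attributes to the data.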
Differentiating through $p_\theta$ is typically not possible because $P$ is unknown. Hence, updating $\theta$ requires using a score-function gradient estimator, which suffers from high variance. However, in our setting, $P$ is both known and differentiable, so we can employ model-based GAIL (MGAIL), which exploits end-to-end differentiability by directly propagating gradients from $D_\phi$ to $\pi_\theta$ through $P$.
Symphony is a new approach to LfD for autonomous driving simulation that builds upon a base method such as BC or MGAIL by adding a parallel beam search to improve realism and a hierarchical policy to ensure diversity. Figure 1 gives an overview of the training process for the case where the base method is BC. Training proceeds by sampling a batch of run segments from a training set and using them as reference trajectories to perform new rollouts with the current policy. At the start of each rollout, a goal generating policy proposes a goal, based on initial conditions, that remains fixed for the rollout and is input to the goal-conditional policy that proposes actions. These actions are used to generate nodes in a parallel beam search (see Figure 2), which periodically prunes away branches deemed unfavourable by a discriminator and copies the rest to maintain a fixed-width beam.
Unlike in model-predictive control or reinforcement learning methods that employ online tree search, e.g., [34, 40], in which an agent ‘imagines’ various futures before selecting a single action, each rollout in Symphony is executed directly in the simulator. However, because simulations happen in parallel, promising branches can be duplicated during execution to replace unpromising ones, focusing computation on the most realistic rollouts.
Finally, we use the resulting rollouts to compute losses and update the goal generating policy, the goal-conditional policy, and the discriminator. During inference at test time, the process is the same except that run segments are sampled from a test set and no parameters are updated.
In the rest of this section, we provide more details about the parallel beam search, hierarchical policy, network architectures, and learning rules.
III-A Parallel Beam Search
For each reference trajectory in the batch, we first sample $B$ joint actions from the joint policy, yielding $B$ branches that roll out in parallel. We then call the discriminator at each simulation step to score each of the $N_I$ interactive agents in each branch, yielding a tensor of dimension $[B, T, N_I]$, where $B$ is the batch size and $T$ is the number of time steps between pruning/resampling. After every $T$ simulation steps, we aggregate the discriminator scores across the time and interactive agent dimensions, yielding a tensor of shape $[B]$ containing a score for each sample in the batch. We aggregate by maximising across time and summing across agents. We then rank samples by aggregate score and prune away the top half (i.e., the least realistic). We tile the remaining samples such that $B$ remains constant throughout the simulation. We use $T = 10$, i.e., pruning and resampling occurs every 2 seconds of simulation. During training, we use $B = 4$ but during inference $B = 16$.
Pruning based on aggregate scores means that the simulation at a given timestep can be subtly influenced by future events, i.e., actions are pruned away because they lead to unrealistic future states, and those states include observations of playback agents. In other problem settings, such as behaviour prediction [26, 12, 13] such leakage would be problematic because information about the future would not be available at inference time (the whole point is to predict the future given only the past). However, in our setting the reference trajectory is available even at inference time, i.e., the goal is simply to generate realistic and diverse simulations given the reference trajectory. While leakage can in principle yield useful hints, it can also be misleading as any leaked hints can become obsolete when interactive agents diverge from the reference trajectory. In practice, as we show in Section V-C, refining simulation on the fly through beam search can drastically improve realism but tends to harm diversity: repeated pruning can encourage mode collapse, where only the easiest to simulate modes are represented. Next we discuss a hierarchical approach to remedy this issue.
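The pruning and tiling step of the beam search described in this subsection can be sketched in plain Python. The sketch makes several assumptions that the text does not pin down: the maximum over time is taken per agent before summing over agents, higher discriminator scores mean less realistic behaviour, the beam width is even, and survivors are duplicated with a deep copy. All function names are hypothetical.

```python
import copy


def aggregate_scores(scores):
    """Reduce scores[branch][time][agent] to one score per branch.

    Assumed order of operations: max over time for each agent, then sum over agents.
    """
    aggregated = []
    for branch_scores in scores:
        num_agents = len(branch_scores[0])
        per_agent_max = [max(step[n] for step in branch_scores) for n in range(num_agents)]
        aggregated.append(sum(per_agent_max))
    return aggregated


def prune_and_tile(branches, scores):
    """Drop the highest-scoring (least realistic) half and duplicate the survivors."""
    if len(branches) % 2 != 0:
        raise ValueError("this sketch assumes an even beam width")
    aggregated = aggregate_scores(scores)
    ranked = sorted(range(len(branches)), key=lambda b: aggregated[b])  # most realistic first
    survivors = [branches[b] for b in ranked[: len(branches) // 2]]
    return survivors + [copy.deepcopy(s) for s in survivors]


# Four branches, two time steps, two interactive agents (toy discriminator scores).
branches = ["a", "b", "c", "d"]
scores = [
    [[0.1, 0.2], [0.3, 0.1]],  # a
    [[0.9, 0.8], [0.7, 0.9]],  # b
    [[0.2, 0.1], [0.1, 0.1]],  # c
    [[0.5, 0.5], [0.6, 0.4]],  # d
]
aggregated = aggregate_scores(scores)
new_branches = prune_and_tile(branches, scores)
```

Taking the maximum over time means a single highly unrealistic step is enough to penalise a branch, which fits the use of pruning to remove rollouts that contain rare but severe failures.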
III-B Hierarchical Policy
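To mitigate mode collapse, we employ hierarchical agent policies. At the beginning of each rollout, a high-level goal generating policy $h_\psi(g \mid \mathbf{s}_0)$ proposes a goal $g$, based on an initial state $\mathbf{s}_0$, that remains fixed throughout the rollouts and is provided to both the low-level goal-conditional policy $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t, g)$ and the discriminator $D_\phi(\mathbf{s}_t, g)$. The goal generating policy is trained to match the distribution of goals in the training data:

$$\max_\psi \; \mathbb{E}_{(\mathbf{s}_0, g) \sim \mathcal{D}_E}\left[\log h_\psi(g \mid \mathbf{s}_0)\right]. \tag{3}$$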
Because the same goal is used for all rollouts within the search tree, it cannot be biased by the discriminator.
We use routes, represented as sequences of roadgraph lane segments, as goals because they capture high-level intent and are a primary source of multi-modality. A feasible set of routes is generated by following all roadgraph branches, beginning at the lane segment corresponding to the agent’s initial state. From this set, routes with minimal displacement error from the observed trajectory are used as ground truth to train $h_\psi$ and as input to $\pi_\theta$ during training. Hence, $h_\psi$ and $\pi_\theta$ can be seen as learned versions of the router and planner, respectively, in a conventional control stack.
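Route generation and ground-truth route selection can be sketched as a depth-first enumeration followed by a nearest-route lookup. The roadgraph encoding (a dictionary of descendants and a dictionary of centreline points), the cap on route length, and the particular displacement error (mean distance from each observed point to the closest centreline point on the route) are all assumptions made for this illustration.

```python
import math


def enumerate_routes(descendants, start, max_len):
    """Follow every roadgraph branch from the start lane, up to max_len lane segments."""
    routes = []

    def extend(route):
        children = descendants.get(route[-1], [])
        if not children or len(route) == max_len:
            routes.append(route)
            return
        for child in children:
            extend(route + [child])

    extend([start])
    return routes


def route_error(route, centrelines, trajectory):
    """Mean distance from each observed point to the closest centreline point on the route."""
    points = [p for lane in route for p in centrelines[lane]]
    return sum(min(math.dist(q, p) for p in points) for q in trajectory) / len(trajectory)


def ground_truth_route(routes, centrelines, trajectory):
    """Pick the feasible route with minimal displacement error from the observed trajectory."""
    return min(routes, key=lambda r: route_error(r, centrelines, trajectory))


# Toy intersection: one incoming lane branching into three descendants.
descendants = {"in": ["left", "straight", "right"]}
centrelines = {
    "in": [(0.0, 0.0), (0.0, 5.0)],
    "left": [(-5.0, 10.0)],
    "straight": [(0.0, 10.0)],
    "right": [(5.0, 10.0)],
}
observed = [(0.0, 1.0), (0.0, 4.0), (4.0, 9.0)]

routes = enumerate_routes(descendants, "in", max_len=2)
best = ground_truth_route(routes, centrelines, observed)
```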
III-C Architecture and Learning
For each interactive agent, objects (such as other cars, pedestrians and cyclists), as well as static and dynamic features are all encoded individually using MLPs, followed by max-pooling across inputs of the same type. The resulting type-specific embeddings are, together with an encoding of features of the interactive agent, concatenated and provided to the policy head as input. Spatial information such as location or velocities of other objects are normalised with respect to the agent before being passed into the network. Furthermore, roads and lanes are represented as a set of points. In large scenes, only the nearest 16 objects and 1static and dynamic features are included.
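Three of the input-processing steps described here (restricting to the nearest objects, expressing positions relative to the agent, and max-pooling a set of embeddings) can be sketched without any learned components. The snippet treats embeddings as given lists of floats and assumes that normalisation means translating and rotating into the agent's frame; the rotation is an assumption, since the text only says that spatial information is normalised with respect to the agent.

```python
import math


def nearest_objects(objects_xy, agent_xy, limit=16):
    """Keep only the objects closest to the agent."""
    return sorted(objects_xy, key=lambda o: math.dist(o, agent_xy))[:limit]


def agent_relative(obj_xy, agent_xy, agent_heading):
    """Express a position in the agent's frame (x forward), assuming translate-then-rotate."""
    dx, dy = obj_xy[0] - agent_xy[0], obj_xy[1] - agent_xy[1]
    c, s = math.cos(-agent_heading), math.sin(-agent_heading)
    return (c * dx - s * dy, s * dx + c * dy)


def max_pool(embeddings):
    """Element-wise maximum across a set of same-sized embeddings."""
    return [max(column) for column in zip(*embeddings)]


# An agent at (1, 1) facing the +y direction sees an object two metres ahead.
relative = agent_relative((1.0, 3.0), (1.0, 1.0), math.pi / 2)
kept = nearest_objects([(float(i), 0.0) for i in range(20)], (0.0, 0.0))
pooled = max_pool([[1.0, 5.0], [3.0, 2.0]])
```

Max-pooling makes the resulting embedding invariant to the order in which objects are listed, which suits inputs that are naturally unordered sets.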
For BC, the goal-conditional policy head maps the concatenated embeddings to an action space of discretised accelerations and steering angles. For MGAIL, we use a continuous action space specifying $x$-$y$ displacement to facilitate end-to-end differentiation. The goal generating policy maps to softmax logits for each feasible route, up to a limit of 200 routes. The goal generating and goal-conditional policies use separate encoders. The discriminator uses a similar but simpler encoding by max-pooling across all objects and point features within 20 metres. We train both policies and the discriminator simultaneously. We train the goal generating policy using eq. 3 and the discriminator using:

$$\max_\phi \; \mathbb{E}_{(\mathbf{s}, g) \sim p_\beta}\left[\log D_\phi(\mathbf{s}, g)\right] + \mathbb{E}_{(\mathbf{s}, g) \sim \mathcal{D}_E}\left[\log\left(1 - D_\phi(\mathbf{s}, g)\right)\right], \tag{4}$$

where $p_\beta$ is generated by the tree search with $B = 4$. We train the goal-conditional policy using either BC or MGAIL. In the case of BC, we increase its robustness to covariate shift by training not only on expert data, but also on additional data sampled from $p_\beta$, i.e., the beam search is distilled back into the goal-conditional policy, yielding an adversarial method even without the use of MGAIL. Each training batch contains 16 run segments of 10 seconds each, for which actions are recomputed every 0.2 seconds.
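One way to realise softmax logits over a variable number of feasible routes with a fixed limit is to pad the logit vector and restrict the softmax to the feasible entries. The sketch below illustrates that idea together with the resulting log-likelihood loss for a ground-truth route; the padding scheme, the function names, and the convention that feasible routes occupy the first slots are assumptions, not details given in the text.

```python
import math

MAX_ROUTES = 200  # limit on the number of feasible routes, as stated in the text


def route_log_probs(logits, num_feasible):
    """Log-softmax over the first num_feasible logits; padded slots get -inf."""
    valid = logits[:num_feasible]
    peak = max(valid)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in valid))
    padding = [float("-inf")] * (len(logits) - num_feasible)
    return [l - log_z for l in valid] + padding


def goal_loss(logits, num_feasible, target_index):
    """Negative log-likelihood of the ground-truth route under the goal generating policy."""
    return -route_log_probs(logits, num_feasible)[target_index]


# Hypothetical scene with three feasible routes and uninformative logits.
logits = [0.0] * MAX_ROUTES
log_probs = route_log_probs(logits, num_feasible=3)
loss = goal_loss(logits, num_feasible=3, target_index=1)
```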
IV Related Work
IV-A Coping with Covariate Shift
One way to address covariate shift in BC is to add actuator noise when demonstrations are performed, forcing the demonstrator to label a wider range of states. However, this requires intervening when demonstrations are collected, which is not possible in our setting. Another solution is DAgger, where the demonstrator labels the states visited by the agent as it learns, which also requires access to the demonstrator that is not available in our setting. Adversarial methods such as GAIL and MGAIL avoid covariate shift by repeatedly trying policies in the environment and minimising a divergence between the resulting trajectories and the demonstrations. When the environment is not available, methods that match state distributions can retain BC’s strictly offline feature while minimising covariate shift. However, in our setting, remaining strictly offline is not necessary as we have access to a high quality simulator that is itself the target environment for the learned agents.
IV-B Combining Planning and Learning
When a model of the environment dynamics is available, deliberative planning can help to predict the value of different actions. Model-based reinforcement learning typically uses planning during training to reduce the variance of value estimates. When the model is differentiable, this can also be exploited, as in MGAIL [5, 6], to reduce the variance of gradient estimates [39, 25, 16]. By contrast, online planning typically uses tree search to refine policies on the fly during inference by focusing computation on the most relevant states [40, 38]. By distilling the results of the tree search back into the policy, online planning also serves as an extended policy improvement operation [2, 36, 35, 10, 19, 33]. Recently, sequential decision making problems have been reformulated as auto-regressive models using transformer architectures. Most related to Symphony is the Trajectory Transformer, which is fully differentiable and uses beam search but without a discriminator.
IV-C Autonomous Driving Applications
As early as 1989, ALVINN, a neural network trained with BC, autonomously controlled a vehicle on public roads. More recently, deep learning has been used to train autonomous driving software end-to-end with BC and perturbation-based augmentations have been used to mitigate covariate shift. As simulation emerges as a crucial tool in autonomous driving, interest is turning to how to populate simulators with realistic agents. ViBe learns such models from CCTV data collected at intersections, using GAIL but without the tree search or hierarchical components of Symphony. SimNet produces such models using only BC but uses GANs instead of reference trajectories to generate initial simulation conditions. TrafficSim also uses hierarchical control like Symphony but with a latent variable model and without a tree search for online refinement. AdvSim is similar but generates adversarial perturbations to challenge the full autonomous driving stack. Like Symphony, SMARTS considers the realism and diversity of agents in a driving simulator, but employs only reinforcement learning, not LfD, to learn such agents. nuPlan is a planning benchmark that uses a set of reference trajectories but does not simulate agent observations, feeding the observations from the reference trajectory even if they diverge from the simulation.
V Experiments & Results
V-A Experimental Setup
Datasets. We use two datasets. The first, a proprietary dataset created by applying Waymo’s perception tools to sensor data collected by Waymo vehicles driving on public roads, contains run segments each with 30 s of features at 15 Hz. The second dataset consists of run segments from the Waymo Open Motion Dataset (WOMD), which we extract to 10 s run segments sampled at 15 Hz. (While we use the same run segments as the WOMD, states contain the features described in Section II-A, not those in the WOMD.) Both datasets exclude run segments containing more than 256 playback agents, 10 roadgraph points, or fewer than agents at the initial timestep. Unlike the proprietary dataset, WOMD’s run segments were selected to contain pairwise interactions such as merges, lane changes, and intersection turns. Both datasets split the demonstration data into disjoint sets $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ with and . For the proprietary dataset, and and for the WOMD, and .
Simulation setup. Unless stated otherwise, each simulation lasts for 10 s, with initial conditions set by the reference trajectory and actions taken at 5 Hz. Unless stated otherwise, the ego vehicle and one other vehicle are interactive, i.e., controlled by our learned policy, while the rest are playback agents. The interactive agent is chosen heuristically depending on the context, e.g., in merges it is the vehicle with which the ego vehicle is merging. If no such context applies, the nearest moving vehicle is chosen.
We consider the following three realism metrics. Collision Rate is the percentage of run segments that contain at least one collision involving an interactive agent. A collision is detected when two bounding boxes overlap. Off-road Time is the percentage of time that an interactive agent spends off the road. ADE is the average displacement error between each joint reference trajectory and the corresponding trajectory generated in simulation:

$$\text{ADE} = \frac{1}{|\mathcal{D}_{\text{test}}| \, |\mathcal{I}| \, L} \sum_{j=1}^{|\mathcal{D}_{\text{test}}|} \sum_{i \in \mathcal{I}} \sum_{t=1}^{L} \delta\left(\hat{s}^{i,j}_t, s^{i,j}_t\right), \tag{5}$$

where $\mathcal{I}$ is the set of indices of the interactive agents, $\hat{s}^{i,j}_t$ and $s^{i,j}_t$ are the states of the $i$th interactive agent in the $j$th simulated and reference trajectories respectively, $L$ is the number of simulated time steps, and $\delta$ is a Euclidean distance function.
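The ADE computation described above can be sketched as a triple loop over run segments, time steps, and interactive agents. The nested-list data layout and the function name are assumptions of this illustration, and agent states are reduced to 2-D positions.

```python
import math


def average_displacement_error(simulated, reference, interactive):
    """Mean Euclidean distance between simulated and reference interactive-agent positions.

    simulated, reference: list over run segments of list over time of per-agent (x, y).
    interactive:          indices of the interactive agents.
    """
    total, count = 0.0, 0
    for sim_traj, ref_traj in zip(simulated, reference):
        for sim_state, ref_state in zip(sim_traj, ref_traj):
            for i in interactive:
                total += math.dist(sim_state[i], ref_state[i])
                count += 1
    return total / count


# One run segment, two time steps, two agents; only agent 0 is interactive.
reference = [[[(0.0, 0.0), (9.0, 9.0)], [(0.0, 0.0), (9.0, 9.0)]]]
simulated = [[[(0.0, 0.0), (9.0, 9.0)], [(3.0, 4.0), (9.0, 9.0)]]]
ade = average_displacement_error(simulated, reference, interactive=[0])
```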
We also consider two diversity metrics. MinSADE is the minimum scene-level average displacement error, which extends ADE to measure diversity instead of just realism. During evaluation, the simulator populates a set $\mathcal{K}_j$ by simulating $k$ trajectories for each reference trajectory in $\mathcal{D}_{\text{test}}$ and then computes minSADE as follows:

$$\text{minSADE} = \frac{1}{|\mathcal{D}_{\text{test}}| \, |\mathcal{I}| \, L} \sum_{j=1}^{|\mathcal{D}_{\text{test}}|} \min_{\hat{\tau} \in \mathcal{K}_j} \sum_{i \in \mathcal{I}} \sum_{t=1}^{L} \delta\left(\hat{s}^{i}_t, s^{i,j}_t\right), \tag{6}$$

where $\hat{s}^{i}_t$ is the state of the $i$th interactive agent in the candidate trajectory $\hat{\tau}$. When $k = 1$, minSADE reduces to ADE; when $k > 1$, minimising minSADE requires populating $\mathcal{K}_j$ with diverse but realistic trajectories. Our experiments use $k = 16$. Low minSADE therefore implies good coverage of behaviour modes. However, it does not imply actually matching the empirical distribution of behaviours, e.g., low-probability modes may be over-represented. Curvature JSD is a novel diversity metric that aims to measure how well the high-level behaviour matches the empirical distribution. It is computed using the roadgraph features in $s^{\text{static}}$. Multiple lane regions that share a common ancestor are called branching regions because they represent places where agents have multiple, branching choices, e.g., a lane approaching a four-way stop may branch into three descendant regions each going in a different direction at the intersection. For each branching region, we compute the average curvature across the region. The curvature JSD is then the Jensen-Shannon divergence between the distribution of average curvatures of the branching regions visited by the policy and reference trajectories. These distributions are approximated with histograms with bins of width in the range , yielding bins.
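The final step of the curvature JSD, comparing two histograms with the Jensen-Shannon divergence, can be sketched as follows. The bin range and bin count used in the example are hypothetical, the natural logarithm is an arbitrary choice of base, and the sketch assumes that at least one value falls inside the histogram range.

```python
import math


def histogram(values, low, high, num_bins):
    """Normalised histogram of the values that fall inside [low, high]."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), num_bins - 1)  # put v == high in the last bin
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]


def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions, using the natural logarithm."""
    mixture = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical average curvatures of branching regions visited by the policy and by the logs.
policy_hist = histogram([-0.5, 0.5, 0.5, 0.9], low=-1.0, high=1.0, num_bins=2)
log_hist = histogram([-0.5, -0.2, 0.5, 0.9], low=-1.0, high=1.0, num_bins=2)
jsd = jensen_shannon_divergence(policy_hist, log_hist)
```

With the natural logarithm the divergence lies between 0 (identical histograms) and $\ln 2$ (disjoint support), so it stays finite even when the policy never visits some curvature bins that appear in the logs.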
| Method | Proprietary: Collision rate (%) | Proprietary: Off-road time (%) | Proprietary: ADE | Proprietary: minSADE | Proprietary: Curvature JSD | WOMD: Collision rate (%) | WOMD: Off-road time (%) | WOMD: ADE | WOMD: minSADE | WOMD: Curvature JSD |
|---|---|---|---|---|---|---|---|---|---|---|
| BC | 16.65 ± 0.41 | 2.16 ± 0.08 | 5.80 ± 0.07 | 2.16 ± 0.04 | 2.82 ± 0.26 | 24.65 ± 0.32 | 2.75 ± 0.12 | 4.81 ± 0.10 | 1.76 ± 0.04 | 1.32 ± 0.28 |
| BC + H | 17.25 ± 1.07 | 1.69 ± 0.12 | 5.18 ± 0.09 | 2.01 ± 0.03 | 1.38 ± 0.11 | 23.27 ± 0.38 | 2.40 ± 0.05 | 4.38 ± 0.04 | 1.67 ± 0.03 | 3.50 ± 1.25 |
| BC + TS | 1.84 ± 0.13 | 0.35 ± 0.03 | 4.83 ± 0.09 | 2.07 ± 0.04 | 5.28 ± 1.14 | 4.94 ± 0.65 | 1.23 ± 0.04 | 4.20 ± 0.15 | 1.82 ± 0.07 | 5.84 ± 1.02 |
| BC + TS + H | 1.80 ± 0.07 | 0.34 ± 0.01 | 4.30 ± 0.05 | 1.96 ± 0.04 | 1.24 ± 0.14 | 4.86 ± 0.24 | 1.30 ± 0.06 | 3.70 ± 0.09 | 1.66 ± 0.04 | 2.82 ± 1.13 |
| MGAIL | 5.34 ± 0.32 | 0.83 ± 0.08 | 7.32 ± 0.52 | 3.95 ± 0.42 | 1.88 ± 0.18 | 9.48 ± 0.91 | 1.62 ± 0.15 | 5.70 ± 0.44 | 3.13 ± 0.26 | 4.14 ± 1.36 |
| MGAIL + H | 4.16 ± 0.18 | 0.76 ± 0.03 | 4.52 ± 0.17 | 2.48 ± 0.09 | 1.55 ± 0.17 | 7.39 ± 0.37 | 1.65 ± 0.12 | 3.82 ± 0.12 | 2.15 ± 0.08 | 3.88 ± 0.80 |
| MGAIL + TS | 2.97 ± 0.16 | 0.72 ± 0.20 | 6.83 ± 0.48 | 3.80 ± 0.39 | 4.14 ± 1.08 | 4.36 ± 0.21 | 1.22 ± 0.02 | 4.26 ± 0.19 | 2.51 ± 0.11 | 5.42 ± 1.26 |
| MGAIL + TS + H | 2.40 ± 0.19 | 0.70 ± 0.06 | 4.69 ± 0.15 | 2.73 ± 0.12 | 2.35 ± 0.53 | 4.89 ± 0.38 | 1.65 ± 0.25 | 3.86 ± 0.17 | 2.22 ± 0.13 | 2.80 ± 0.60 |

Table I: Proprietary and Waymo Open Motion Dataset results and standard errors.
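| Method | Longer rollouts (20 seconds): Collision rate (%) | Longer rollouts: Off-road time (%) | Longer rollouts: ADE | Longer rollouts: minSADE | Longer rollouts: Curvature JSD | More interactive agents ($N_I = 8$): Collision rate (%) | More interactive agents: Off-road time (%) | More interactive agents: ADE | More interactive agents: minSADE | More interactive agents: Curvature JSD |
|---|---|---|---|---|---|---|---|---|---|---|
| BC | 25.56 ± 0.69 | 2.99 ± 0.09 | 11.60 ± 0.21 | 4.08 ± 0.05 | 2.61 ± 0.20 | 17.46 ± 0.79 | 1.06 ± 0.15 | 6.40 ± 0.51 | 1.86 ± 0.14 | 1.49 ± 0.25 |
| BC + H | 30.33 ± 0.56 | 3.14 ± 0.22 | 9.83 ± 0.18 | 3.83 ± 0.05 | 4.17 ± 0.37 | 18.60 ± 0.81 | 0.95 ± 0.06 | 5.85 ± 0.08 | 1.83 ± 0.08 | 1.73 ± 0.36 |
| BC + TS | 6.05 ± 0.31 | 0.72 ± 0.10 | 8.92 ± 0.30 | 3.97 ± 0.11 | 4.90 ± 0.59 | 4.17 ± 0.23 | 0.38 ± 0.02 | 6.22 ± 0.38 | 2.10 ± 0.20 | 2.76 ± 0.50 |
| BC + TS + H | 7.76 ± 0.12 | 0.66 ± 0.02 | 7.74 ± 0.15 | 3.72 ± 0.09 | 2.69 ± 0.22 | 5.18 ± 0.24 | 0.36 ± 0.03 | 5.68 ± 0.16 | 2.03 ± 0.05 | 0.96 ± 0.12 |
| MGAIL | 14.15 ± 0.83 | 1.03 ± 0.19 | 14.18 ± 0.97 | 9.34 ± 0.45 | 6.79 ± 1.70 | 11.08 ± 1.29 | 0.47 ± 0.02 | 12.30 ± 0.62 | 6.55 ± 0.65 | 4.14 ± 1.06 |
| MGAIL + H | 14.73 ± 1.93 | 1.72 ± 0.24 | 10.52 ± 1.17 | 6.58 ± 0.60 | 3.09 ± 0.62 | 6.79 ± 0.11 | 0.37 ± 0.03 | 5.74 ± 0.34 | 2.80 ± 0.26 | 1.61 ± 0.40 |
| MGAIL + TS | 10.52 ± 0.94 | 0.99 ± 0.03 | 15.76 ± 1.17 | 11.80 ± 1.22 | 5.93 ± 0.71 | 9.50 ± 0.93 | 0.51 ± 0.06 | 7.46 ± 0.79 | 3.66 ± 0.53 | 2.65 ± 0.41 |
| MGAIL + TS + H | 7.81 ± 0.78 | 0.88 ± 0.03 | 9.11 ± 0.73 | 6.39 ± 0.64 | 2.66 ± 0.59 | 5.80 ± 0.47 | 0.36 ± 0.02 | 5.47 ± 0.07 | 2.69 ± 0.04 | 1.24 ± 0.15 |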
We compare BC and MGAIL as is, with hierarchy (BC+H, MGAIL+H), with tree search (BC+TS, MGAIL+TS), and with both (BC+TS+H, MGAIL+TS+H). We train each method for update steps with run segments sampled uniformly from $\mathcal{D}_{\text{train}}$ and save checkpoints every steps. We then select the checkpoint with the lowest sum of collision rate and off-road time on a validation set of 200 run segments and test it using all of $\mathcal{D}_{\text{test}}$. For each run segment in $\mathcal{D}_{\text{test}}$, we generate 16 rollouts and report the average (or minimum for minSADE). We average all results over five independent seeds per method. For each metric, we indicate the best performing BC and MGAIL methods in bold. For collision rate and off-road time, we also report values for playing back the logs without interactive agents but computing these metrics for the agents that would have been interactive. These values are slightly positive due to, e.g., perception errors on objects far from the ego vehicle or the use of bounding boxes instead of contours.
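The checkpoint selection rule described above amounts to an argmin over validation metrics. A minimal sketch, with hypothetical checkpoint steps and metric values used purely for illustration:

```python
def select_checkpoint(validation_metrics):
    """Return the checkpoint with the lowest sum of collision rate and off-road time.

    validation_metrics: dict mapping checkpoint step -> (collision_rate, off_road_time).
    """
    return min(validation_metrics, key=lambda step: sum(validation_metrics[step]))


# Hypothetical validation results (percentages) for three saved checkpoints.
validation_metrics = {
    1000: (5.0, 1.0),
    2000: (2.0, 0.5),
    3000: (2.2, 0.2),
}
best_step = select_checkpoint(validation_metrics)
```

Note that this rule selects on realism metrics only, so diversity metrics play no role in choosing which checkpoint is evaluated on the test set.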
Table I shows our main results, comparing all methods across all metrics on both datasets. Comparing BC methods first, it is clear that tree search dramatically improves realism (especially with respect to collision rate and off-road time) but reduces diversity due to mode collapse. This loss is detected by curvature JSD, which measures distribution matching, but not by minSADE, which only requires coverage. However, the addition of hierarchy improves diversity in nearly all cases. In particular, hierarchy is crucial for addressing the mode collapse from tree search. BC+TS+H is the only BC method that gets the best of both worlds with strong performance on both realism and diversity metrics.
Figure 3 shows the histograms used to compute the curvature JSD values in Table I for three BC methods, with learned policies in orange and the log reference trajectories in blue. While BC matches distributions well, adding tree search leads to under-representation of positive curvature, i.e., right turns, a deficiency repaired with hierarchical policies.
Turning now to MGAIL methods, similar trends emerge. Tree search improves realism, though the effect is less dramatic as MGAIL is already adversarial even without tree search. Similarly, while tree search also increases curvature JSD in MGAIL, the effect is smaller. This is also to be expected given that MGAIL is already adversarial and can thus experience mode collapse even without tree search. On both datasets, the addition of hierarchy substantially improves MGAIL’s diversity metrics. Overall, the best BC methods perform better than the best MGAIL methods on nearly all metrics, though the differences are modest.
To see if we can maintain realism for longer time horizons, we repeat our experiments on the proprietary dataset with the rollout length doubled to 20 s in both training and testing. The left side of Table II shows the results. As expected, all methods obtain higher values on nearly all metrics in this more challenging setup. However, the relative performance of the methods remains similar to that shown in Table I. Tree search methods perform much better with respect to realism than those without. While longer rollouts give more time to accumulate error, tree search repeatedly prunes problematic rollouts, greatly mitigating this effect. BC+TS has worse curvature JSD than BC but the use of hierarchy prevents mode collapse, enabling BC+TS+H to approach the best of both worlds. MGAIL shows less diversity loss from tree search than BC (only moderately worse minSADE) but also sees much better diversity when hierarchy is used.
To assess whether we can maintain realism when more agents are replaced, we repeat our experiments on the proprietary dataset with eight interactive agents ($N_I = 8$) in both training and testing. Two agents are selected as before and an additional six are selected that are nearest to the ego vehicle and whose distance travelled in the reference trajectory exceeds a threshold. Again, all methods obtain higher values on most metrics, as with the longer rollouts discussed above. Relative performance remains similar, with tree search improving realism but harming diversity and hierarchy improving diversity. In this case, MGAIL sees no loss of diversity from tree search, as any mode collapse already happens in MGAIL training, but still sees substantial diversity improvements when hierarchy is used.
VI Conclusions & Future Work
This paper presented Symphony, which learns realistic and diverse simulated agents and performs parallel multi-agent simulations with them. Symphony is data driven and combines hierarchical policies with a parallel beam search. Experiments on both open and proprietary Waymo data confirmed that Symphony learns more realistic and diverse behaviour than a number of baselines. Future work will investigate alternative pruning rules to shape simulation to various ends, augmenting goals to model driver persona, and developing additional diversity metrics that capture distributional realism in, e.g., agents’ aggregate pass/yield behaviour.
-  (2004) Apprenticeship learning via inverse reinforcement learning. In Twenty-first International Conference on Machine Learning (ICML ’04), New York, New York, USA. Cited by: §II-C.
-  (2017) Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439. Cited by: §IV-B.
-  (2009-05) A survey of robot learning from demonstration. Rob. Auton. Syst. 57 (5), pp. 469–483. Cited by: §I.
-  (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079. Cited by: §IV-C.
-  (2017) End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pp. 390–399. Cited by: §II-C, §IV-B.
-  (2016) Model-based adversarial imitation learning. arXiv preprint arXiv:1612.02179. Cited by: §IV-B.
-  (2018-11) Learning from demonstration in the wild. Cited by: §IV-C.
-  (2021) SimNet: learning reactive self-driving simulations from real-world observations. arXiv preprint arXiv:2105.12332. Cited by: §IV-C.
-  (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. Cited by: §IV-C.
-  (2020) Combining deep reinforcement learning and search for imperfect-information games. arXiv preprint arXiv:2007.13544. Cited by: §IV-B.
-  (2021) NuPlan: A closed-loop ml-based planning benchmark for autonomous vehicles. CoRR abs/2106.11810. Cited by: §IV-C.
-  (2018) Intentnet: learning to predict intention from raw sensor data. In Conference on Robot Learning, pp. 947–956. Cited by: §III-A.
-  (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRR abs/1910.05449. Cited by: §III-A.
-  (2021) Decision transformer: reinforcement learning via sequence modeling. Cited by: §IV-B.
-  (2021) Large scale interactive motion forecasting for autonomous driving: the waymo open motion dataset. CoRR abs/2104.10133. Cited by: §I, §V-A.
-  (2018) TreeQN and atreec: differentiable tree-structured models for deep reinforcement learning. Cited by: §IV-B.
-  (1989) Model predictive control: theory and practice—a survey. Automatica 25 (3), pp. 335–348. Cited by: §III.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §II-C.
-  (2020) On the role of planning in model-based deep reinforcement learning. External Links: Cited by: §IV-B.
-  (2016-06) Generative adversarial imitation learning. External Links: Cited by: §I, §II-C.
-  (2017-04) Imitation learning: a survey of learning methods. ACM Comput. Surv. 50 (2), pp. 1–35. Cited by: §I.
-  (2021) Reinforcement learning as one big sequence modeling problem. External Links: Cited by: §IV-B.
-  (2020) Strictly batch imitation learning by energy-based distribution matching. arXiv preprint arXiv:2006.14154. Cited by: §IV-A.
-  (2017) Dart: noise injection for robust imitation learning. In Conference on robot learning, pp. 143–156. Cited by: §IV-A.
-  (2018) Gated path planning networks. External Links: Cited by: §IV-B.
-  (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In , pp. 336–345. Cited by: §III-A.
-  (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157–163. Cited by: §II-A.
-  (1990) Cognitive models from subcognitive skills. IEE control engineering series 44, pp. 71–99. Cited by: §I, §II-B.
-  (2000-06) Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, San Francisco, CA, USA, pp. 663–670. Cited by: §II-C.
-  (1989-12) ALVINN: an autonomous land vehicle in a neural network. In Advances in neural information processing systems 1, pp. 305–313. Cited by: §I, §II-B, §IV-C.
-  (2007) Bayesian inverse reinforcement learning.. In IJCAI, Vol. 7, pp. 2586–2591. Cited by: §II-C.
A reduction of imitation learning and structured prediction to no-regret online learning.
Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §II-B, §IV-A.
-  (2020-12) Mastering atari, go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609. External Links: Cited by: §IV-B.
-  (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §III.
-  (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. Cited by: §IV-B.
-  (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: §IV-B.
-  (2021) TrafficSim: learning to simulate realistic multi-agent behaviors. External Links: Cited by: §IV-C, §V-B.
-  (2018) Reinforcement learning: an introduction. MIT press. Cited by: §IV-B.
-  (2016-02) Value iteration networks. External Links: Cited by: §IV-B.
-  (1996) On-line policy improvement using monte-carlo search. In Proceedings of the 9th International Conference on Neural Information Processing Systems, pp. 1068–1074. Cited by: §III, §IV-B.
-  (2021) AdvSim: generating safety-critical scenarios for self-driving vehicles. CoRR abs/2101.06549. External Links: Cited by: §IV-C.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3), pp. 229–256. Cited by: §II-C.
-  (2020) SMARTS: scalable multi-agent reinforcement learning training school for autonomous driving. arXiv preprint arXiv:2010.09776. Cited by: §IV-C.
-  (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, Chicago, IL, USA, pp. 1433–1438. Cited by: §II-C.