Generalized dynamic cognitive hierarchy models for strategic driving behavior

09/20/2021 ∙ by Atrisha Sarkar, et al. ∙ University of Waterloo

While there has been an increasing focus on the use of game theoretic models for autonomous driving, empirical evidence shows that there are still open questions around dealing with the challenges of common knowledge assumptions as well as modeling bounded rationality. To address some of these practical challenges, we develop a framework of generalized dynamic cognitive hierarchy for both modelling naturalistic human driving behavior as well as behavior planning for autonomous vehicles (AV). This framework is built upon a rich model of level-0 behavior through the use of automata strategies, an interpretable notion of bounded rationality through safety and maneuver satisficing, and a robust response for planning. Based on evaluation on two large naturalistic datasets as well as simulation of critical traffic scenarios, we show that i) automata strategies are well suited for level-0 behavior in a dynamic level-k framework, and ii) the proposed robust response to a heterogeneous population of strategic and non-strategic reasoners can be an effective approach for game theoretic planning in AV.




1 Introduction

One of many challenges of autonomous vehicles (AV) in an urban setting is the safe handling of other human road users who show complex and varied driving behaviors. As AVs integrate into human traffic, there has been a move from a predict-and-plan approach of behavior planning to a more strategic approach where the problem of behavior planning of an AV is set up as a non-zero sum game between road users and the AV; and such models have shown efficacy in simulation of different traffic scenarios fisac2019hierarchical; tian2018adaptive; li2019decision. However, when game theoretic models are evaluated on naturalistic human driving data, studies have shown that human driving behavior is diverse, and therefore developing a unifying framework that can both model the diversity of human driving behavior as well as plan a response from the perspective of an AV is challenging sun2020game; sarkar2021solution.

The first challenge is dealing with common knowledge geanakoplos1992common assumptions. Whether in the case of Nash equilibria based models (rational agents assuming everyone else is a rational agent) schwarting2019social; Geiger_Straehle_2021, in Stackelberg equilibrium based models (common understanding of the leader-follower relationship) fisac2019hierarchical, or in level-k models (the level of reasoning of AVs and humans), there has to be a consensus between an AV planner and other road users on the type of reasoning everyone is engaging in. How to reconcile this assumption with the evidence that human drivers engage in different types of reasoning processes is a key question.

Whereas the challenge of common knowledge deals with questions around how agents model other agents, the second challenge is dealing with bounded rational agents, i.e. modelling sub-optimal behavior in their own response. With the general understanding that human drivers are bounded rational agents, the Quantal level-k model (QLk) has been widely proposed as a model of strategic planning for AV tian2018adaptive; li2018game; li2019decision. One way the QLk model deals with bounded rationality is by mapping each agent into a hierarchy () of bounded cognitive reasoning. In this model, the choice of level-0 model becomes critical, since the behavior of every agent in the hierarchy depends on the assumption about the behavior of level-0 agents. The main models proposed for level-0 behavior include simple obstacle avoidance tian2021anytime, maxmax, and maxmin models sarkar2021solution. Although such elementary models may be acceptable in other domains of application, it is not clear why human drivers, especially in a dynamic game setting, would cognitively bound themselves to such elementary models. Another way the QLk model deals with bounded rationality is through the use of a precision parameter, where agents, instead of selecting the utility maximizing response, make cost proportional errors modelled by the precision parameter wright2010beyond. While the precision parameter provides a convenient way to fit the model to empirical data, it also runs the risk of being a catch-all for different etiologies of bounded rationality, including people being indifferent to choices, uncertainty around multiple objectives, and the model not being a true reflection of the decision process. This impairs the explainability of the model, since it is hard to encode and communicate such disparate origins of bounded rationality through a single parameter.

The primary contribution of our work is a framework that addresses the aforementioned challenges by unifying modeling of heterogeneous human driving behavior with strategic planning for AV. In this framework, behavior models are mapped into three layers of increasing capacity to reason about other agents’ behavior: non-strategic, strategic, and robust. Within each layer, the possibility of different types of behavior models lends support for a population of heterogeneous behavior, with a robust layer on top addressing the problem of behavior planning with relaxed common knowledge assumptions. Standard level-k type and equilibrium models are nested within this framework, and in the context of those models, secondary contributions of our work are a) the use of automata strategies as a model of level-0 behavior in dynamic games, resulting in behavior that is rich enough to capture naturalistic human driving (dLk() model), and b) an interpretable support for bounded rationality based on two modalities of satisficing: safety and maneuver. Finally, the efficacy of the approach is demonstrated with evaluation on two large naturalistic driving datasets as well as simulation of critical traffic scenarios.

2 Game tree, utilities, and agent types

Figure 1: Schematic representation of the dynamic game. Each node is embedded in a spatio-temporal lattice and nodes are connected with a cubic spline trajectory.

The dynamic game is constructed as a sequence of simultaneous move games played starting at time at a period of secs. over a horizon of secs. Each vehicle ’s state at time is a vector representing positional co-ordinates () on , lateral and longitudinal velocity () in the body frame, acceleration (), and yaw (). The nodes of the game tree are the joint system states embedded in a spatio-temporal lattice ziegler2009spatiotemporal, and the actions are cubic spline trajectories kelly2003reactive; al2018spline generated based on kinematic limits of vehicles with respect to bounds on lateral and longitudinal velocity, acceleration, and jerk bae2019toward. A history of the game consists of the sequence of nodes traversed by both agents along the game tree until time . We also use a hierarchical approach in the trajectory generation process michon1985critical; fisac2019hierarchical; sarkar2021solution, where at each node, the trajectories are generated with respect to high-level maneuvers, namely, wait and proceed maneuvers. For wait maneuver trajectories, a moving vehicle decelerates (or remains stopped if it is already stopped), and for proceed maneuvers, a vehicle maintains its moving velocity or accelerates to a range of target speeds. Strategies are presented in the behavior strategy form, where is a pure strategy response (a trajectory) of an agent that maps a history to a trajectory in the set , which is the set of valid trajectories that can be generated at the node . The associated maneuver for a trajectory is represented as . Depending on the context where the response depends on only the current node instead of the entire history, we use the notation ; we also drop the time subscript on the history when a formulation holds true for all . The overall strategy of the dynamic game is the cross product of behavior strategies along all possible histories of the game . We use the standard game-theoretic notation of and to refer to an agent and other agents respectively in a game. A glossary of all notations along with specific parameter values is included in the technical appendix.

The utilities in the game are formulated as multi-objective utilities consisting of two components: safety (modelled as a sigmoidal function that maps the minimum distance gap between trajectories to a utility interval [-1,1]) and progress (a function that maps the trajectory length in meters to a utility interval [0,1]). In general, these two are the main utilities (often referred to as inhibitory and excitatory utilities, respectively) upon which different driving styles are built sagberg2015review. Agent types are numeric in the range [-1,1] representing each agent’s risk tolerance, with lower values indicating lower tolerance. This construction is motivated by traffic behavior models such as the Risk Monitoring Model vaa2011drivers and Task Difficulty Homeostasis theory fuller2008recent, where based on traffic situations, drivers continually compare each possible action to their own risk tolerance threshold and make decisions accordingly. To avoid dealing with complexities that arise out of agent types being continuous, for the scope of this paper, we discretize the types by increments of 0.5 for our experiments. How each agent selects specific actions at each node based on their type (i.e. their risk tolerance) depends on the particular behavior model, and is elaborated in Sec. 3 when we discuss the specifics of each behavior model.
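As a rough illustration of the two utility components described above, the following Python sketch maps a minimum distance gap to [-1,1] with a sigmoidal function and a trajectory length to [0,1]. The function names, the midpoint and steepness of the sigmoid, the linear form of the progress map, and the 100 m normalizing length are all hypothetical choices made for illustration; the text does not give them.

```python
import math

# Agent types in [-1, 1], discretized in increments of 0.5 as stated in the text.
AGENT_TYPES = [-1.0, -0.5, 0.0, 0.5, 1.0]


def safety_utility(min_gap_m, midpoint_m=5.0, steepness=1.0):
    """Sigmoidal map from the minimum distance gap (meters) to [-1, 1].

    tanh(z / 2) equals 2 * logistic(z) - 1, so this is a rescaled logistic curve.
    midpoint_m and steepness are assumed values, not parameters from the paper.
    """
    return math.tanh(0.5 * steepness * (min_gap_m - midpoint_m))


def progress_utility(traj_length_m, max_length_m=100.0):
    """Map trajectory length (meters) to [0, 1].

    A clipped linear map is an assumption; the text only states the range.
    """
    return min(max(traj_length_m / max_length_m, 0.0), 1.0)
```

Larger gaps give safety utilities closer to 1 and longer trajectories give progress utilities closer to 1, which is the only behavior the sketch is meant to convey.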

Unless mentioned otherwise, the utilities (both safety and progress) at a node with associated history are calculated as discounted sum of utilities over the horizon of the game conditioned on the strategy , type and discount factor as follows


are the continuation utilities beyond the horizon of the game and are estimated based on agents continuing on with the same chosen trajectory as undertaken in the last decision node of the game tree for another seconds, and is a normalization constant to keep the sum of utilities in the same range as the step utilities.
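The discounted-sum construction described above can be sketched as follows. This is only one plausible reading: it assumes the continuation utility is discounted as one additional step and that the normalization constant is the reciprocal of the sum of discount weights. Both are assumptions made here, as are the function name and the default discount factor.

```python
def discounted_utility(step_utils, continuation_util, discount=0.9):
    """Normalized discounted sum of step utilities plus a continuation term.

    step_utils: utilities at successive decision nodes within the horizon.
    continuation_util: utility estimated for continuing the last chosen trajectory.
    Assumption: dividing by the sum of discount weights keeps the result in the
    same range as the individual step utilities.
    """
    values = list(step_utils) + [continuation_util]
    weights = [discount ** t for t in range(len(values))]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Under this normalization the result is a weighted average, so it can never leave the interval spanned by the step and continuation utilities.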

3 Generalized dynamic cognitive hierarchy model

Figure 2: (a) Organization of models in the generalized dynamic cognitive hierarchy framework. Dashed arrows indicate agents’ belief about the population. (b) Automata (accommodating) and (non-accommodating)

In the generalized dynamic cognitive hierarchy model, agents belong to one of three layers: a) non-strategic, where agents do not reason about the other agents’ strategies, b) strategic, where agents can reason about the strategies of other agents, and c) robust, where agents not only reason about the strategies of other agents but also the behavior model wright2020formal that other agents may be following (Fig. 2a). All of these layers operate in a setting of a dynamic game and we present the models and the solution concepts used in each layer in order.

Non-strategic layer

Similar to level-0 models in the standard level-k reasoning camerer2004cognitive, non-strategic agents form the base of the cognitive hierarchy in our model. However, in our case, we extend the behavior for the dynamic game. The main challenge of constructing the right level-0 model for a dynamic game is that it has to adhere to the formal constraints of non-strategic behavior, i.e. not reason over other agents’ utilities wright2020formal, while at the same time cannot be too elementary for the purpose of modeling human driving, which implies allowing for agents to change their behavior over time in the game based on the game state.

We propose that automata strategies, which were introduced as a way to address bounded rational behavior that results from agents having limited memory capacity to reason over the entire strategy space in a dynamic game rubinstein1986finite; marks1990repeated, address the above problem by striking a balance between adequate sophistication and non-strategic behavior. To this end, we extend the standard level-k model for a dynamic game setting with level-0 behavior mediated by automata strategies (referred to as dLk() henceforth in the paper). In this framework, a level-0 agent has two reactive modes of operation, each reflecting a specific driving style modeled as automata strategies: accommodating () and non-accommodating () (Fig. 2b). An agent playing as type in state randomizes uniformly between trajectories belonging to the wait maneuver that have step safety utility at least . If no such wait trajectories are available, the agent moves to state and randomizes uniformly between trajectories belonging to the proceed maneuver. is similar, but the states are reversed, thereby resulting in a predisposition that prefers the proceed state. The switching between the states of the automata is mediated by the preference conditions and shown in the equations below.


is true when there is at least one wait trajectory (and conversely, at least one proceed trajectory in the case of ) available whose step safety utility () at the current node is above (and for ). Recall that agent types are in the range [-1,1] and are reflective of risk tolerance.
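One reactive step of such an automaton can be sketched in Python as below. The function name, the `(trajectory_id, step_safety_utility)` tuple layout, and passing the safety threshold in directly are illustrative assumptions; how the threshold is derived from the agent type is given by the preference conditions, which are not reproduced in this sketch. The fallback list is assumed to be non-empty.

```python
import random


def automaton_step(mode, threshold, wait_trajs, proceed_trajs, rng=random):
    """One step of a level-0 automaton; returns (state, trajectory_id).

    mode: "AC" (accommodating, prefers wait) or "NAC" (non-accommodating,
    prefers proceed).
    wait_trajs / proceed_trajs: lists of (trajectory_id, step_safety_utility).
    threshold: safety threshold implied by the agent type (passed in directly;
    its exact relation to the type is not modeled here).
    """
    if mode == "AC":
        preferred, fallback = ("W", wait_trajs), ("P", proceed_trajs)
    elif mode == "NAC":
        preferred, fallback = ("P", proceed_trajs), ("W", wait_trajs)
    else:
        raise ValueError(f"unknown automaton: {mode}")

    state, trajs = preferred
    # Preference condition: at least one preferred-maneuver trajectory is safe enough.
    acceptable = [tid for tid, safety in trajs if safety >= threshold]
    if not acceptable:
        # Otherwise switch state and randomize over all fallback trajectories.
        state, trajs = fallback
        acceptable = [tid for tid, _ in trajs]
    return state, rng.choice(acceptable)
```

For example, with one sufficiently safe wait trajectory available, an accommodating agent stays in the wait state, whereas raising the threshold above every wait trajectory's safety utility pushes it to the proceed state.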

Along with switching between states, agents are also free to switch between the two automata; however, we leave open the question of modeling the underlying process and the causal factors. We envision them to be non-deterministic and a function of the agent’s affective states, such as impatience, attitude, etc., which are an indispensable component of modeling driving behavior sagberg2015review. For example, one can imagine a left turning vehicle approaching the intersection starting with an accommodating strategy, but changing its strategy to non-accommodating along the game play on account of impatience or other endogenous factors. Although leaving open the choice of mode switching makes the level-0 model in this paradigm partly descriptive rather than predictive, as we will see in the next section, such a choice does not compromise the ability of higher-level agents to form consistent beliefs about the level-0 agent and respond accordingly to their strategies. This richer model of level-0 behavior not only imparts more realism in the context of human driving, but also allows level-0 agents to adapt their behavior based on the game situation in a dynamic game.

Strategic layer

The difference between strategic and non-strategic models is that the former adhere to the two properties of other responsiveness, i.e. reasoning over the utilities of the other agents, and dominance responsiveness, i.e. optimizing over their own utilities wright2020formal. Agents in the strategic layer in our framework adhere to the above two properties and we include three models of behavior, namely, level-k () behavior based on the dLk() model and two types of bounded rational equilibrium behavior, i.e. SSPE (Safety satisfied perfect equilibria) and MSPE (Maneuver satisfied perfect equilibria). We note that the choice of models in the strategic layer is not exhaustive; however, we select popular game-theoretic models (level-k and equilibrium based) that have been used in the context of autonomous driving and address some of the gaps within the use of those models as secondary contributions.

For the strategic models to use the two properties and compute a game strategy, the two multi-objective utilities presented in Sec. 2 need to be combined into one utility. To that end, we use a lexicographic thresholding approach to combine the two multi-objective utilities LiChangjian19, and furthermore, we connect the lexicographic threshold to an agent’s type . Specifically, the combined utility is equal to , i.e., the safety utility when , and otherwise , i.e., the progress utility. The threshold is negated to align lower risk tolerance with lower value of . In the following sections, we also index the agent types with the specific models for clarity.
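The lexicographic thresholding described above can be illustrated with a small Python sketch. The negated type as threshold follows the text; the strict inequality used to decide when safety is unsatisfied, and the function name, are assumptions made for illustration.

```python
def combined_utility(safety_u, progress_u, gamma):
    """Lexicographic thresholding of safety and progress utilities.

    gamma: agent type in [-1, 1]; the threshold is the negated type, so a lower
    risk tolerance (lower gamma) yields a higher safety threshold.
    Assumption: safety counts as unsatisfied when safety_u < -gamma (strict).
    """
    threshold = -gamma
    # Safety dominates until it clears the threshold; after that, progress counts.
    return safety_u if safety_u < threshold else progress_u
```

With type 0.5 the threshold is -0.5, so a trajectory with safety utility 0.2 is evaluated on progress, whereas an agent of type -0.5 (threshold 0.5) would still evaluate the same trajectory on safety.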

dLk() model ( behavior)

A level-1 agent believes that the population consists solely of level-0 agents, and generates a best response to those (level-0) strategies. In a dynamic setting, however, a level-1 agent has to update its belief about the level-0 agent based on observation of the game play. This means that in order to best respond to the level-0 strategy, a level-1 agent should form a consistent belief based on the observed history of the game play about the type () a level-0 agent plays in each automaton. (Footnote 1: We focus on level-1 behavior in this section as well as later in the experiments, but the best response behavior can be extended to higher levels similar to a standard level-k model.)

Definition 1.

A belief of a level-1 agent at history is consistent iff , and , where , , and (dropping the superscripts) is the trace of the automaton, defined as the set of all valid sequences of actions generated by a level-0 agent playing an automaton as type and encountering the nodes .

Before the level-1 agent has observed any action by the level-0 agent, their estimates for and is the default set of types, i.e. with range 2 (recall ). However, over time with more observations of level-0 actions, level-1 agent forms a tighter estimate, i.e. , of level-0’s type. The following theorem formulates the set of consistent beliefs based on observed level-0 actions with history more formally.

Theorem 1.

For any consistent belief and , where and the following inequalities hold true for any history of the game,

where and are set of game nodes where level-0 agent chose a proceed and wait maneuver respectively, and , are the available trajectories at node belonging to the two maneuvers, proceed and wait, respectively.


There are two parts to the equation, one for and another . We prove the bounds of corresponding to automata , and the proof for corresponding to automata follows in an identical manner.

By construction of automata , proceed trajectories are only generated in state , which follows one of the three transitions . Therefore, , for the transition to happen. Based on eqn. 1, this means that ,


Since stays constant throughout the play of the game, for to be consistent for the set of all nodes , the lower bound (i.e., at least as high to be true for all nodes ) of based on eqn A.1 is


Similarly, by construction of automata, wait trajectories are only generated in state , which follows one of the three transitions . Therefore, , for the transition to happen. Based on eqn. 1, this means that ,


and the upper bound (i.e., at least as low to be true for all nodes ) of based on eqn A.3


Since and , equations A.2 and A.4 in conjunction prove the case for bounds.

The proof for follows in the identical manner as , but with the condition reversed based on the P and W states. ∎

The above theorem formalizes the idea that, by looking at the maneuver choices made by a level-0 agent at each node in the history, as well as the range of step safety utility at that node for both maneuvers (recall that preference conditions of the automata are based on step safety utility), a level-1 agent can calculate ranges for from Eqns 1 and 2, for which the observed actions were consistent with the corresponding automata. With respect to the set of consistent beliefs about the level-0 agent’s strategy, the level-1 agent now needs to generate a best response that is consistent with . Dropping the AC/NAC superscripts, let be the union of all actions when the automata is played by the types in ; then is the set of all actions that the level-0 agent can play based on level-1’s consistent belief . The response to those actions by the level-1 agent (indexed as ) is as follows.


where is the type of the level-1 agent. Note that the strategy of the level-1 agent, unlike level-0 agent depends on the history instead of just the state of the node ; since the history influences the belief which in turn influences the response.
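The belief-forming step that precedes this response can be sketched for the discretized type set as a simple filter: keep every candidate type whose automaton would have produced each observed maneuver. The sketch below covers only the accommodating automaton and assumes that its wait threshold equals the type value itself, an assumption chosen so that observed proceed maneuvers raise the lower bound on the type, as in the proof above. The function name and the observation layout are hypothetical.

```python
def consistent_ac_types(observations, candidate_types=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Types for which the accommodating automaton reproduces the observed maneuvers.

    observations: list of (observed_maneuver, wait_safety_utils), one per node,
    where observed_maneuver is "wait" or "proceed" and wait_safety_utils holds
    the step safety utilities of the wait trajectories available at that node.
    Assumption: the automaton waits iff some wait trajectory has step safety
    utility >= gamma (the type used directly as the threshold).
    """
    consistent = []
    for gamma in candidate_types:
        matches_history = True
        for maneuver, wait_utils in observations:
            predicted = "wait" if any(u >= gamma for u in wait_utils) else "proceed"
            if predicted != maneuver:
                matches_history = False
                break
        if matches_history:
            consistent.append(gamma)
    return consistent
```

With no observations the full default type set is returned, and each additional observed maneuver can only shrink it, which mirrors the tightening of the estimate described earlier.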

Equilibrium models.

Along with the level-k (k) behavior, another notion of strategic behavior that has been proposed as a model of behavior planning in AV pruekprasert2019decision; schwarting2019social is based on an equilibrium. However, when it comes to modelling human driving behavior, an equilibrium model needs to accommodate bounded rational agents in a principled manner, and ideally should provide a reasonable explanation of the origin of the bounded rationality. Based on the idea that drivers are indifferent as long as an action achieves their own subjective risk tolerance threshold lewis2012testing, we use the idea of satisficing as the main framework for bounded rationality in our equilibrium models stirling2003satisficing. Specifically, we develop two notions of satisficing: one based on safety satisficing (SSPE), where agents choose actions close to the Nash equilibria as long as the actions are above their own risk tolerance threshold, and another based on maneuver satisficing (MSPE), where agents choose actions close to the Nash equilibria as long as the actions are of the same high-level maneuver as the optimal action.

Safety-satisfied perfect equilibrium (SSPE).

The main idea behind satisficing is that a bounded rational agent, instead of always selecting the best response action, selects a response that is good enough, where good enough is defined as an aspiration level where the agent is indifferent between the optimal response and the response in question. In the case of Safety-satisfied perfect equilibrium (SSPE), we define a response good enough for agent if the response is above their own safety threshold determined by their type. A more formal definition is as follows

Definition 2.

A strategy , is in safety satisfied perfect equilibria for a combination of agent types if for every history of the game and

where , is a subgame perfect Nash equilibrium in pure strategies of the game for agents with type and . (Footnote 2: We calculate the SPNE using backward induction with the combined utilities (lexicographic thresholding) in a complete information setting where agent types are known to each other.)

Based on the above definition, if the safety utility of the best response of agent to agent ’s subgame perfect Nash equilibrium (SPNE) strategies at history is less than agent ’s own safety threshold as expressed by their type , then the SSPE response is any trajectory that matches the safety utility of the SPNE response. However, if the SPNE response is higher than their safety threshold, then any suboptimal response that has safety utility higher than is a satisfied response, and thus in SSPE.
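The two cases described above can be sketched as a set-valued function in Python. The function name and the dictionary layout are hypothetical, the safety threshold is passed in directly (under the lexicographic convention used earlier it would be the negated type, which is an assumption here), and treating a utility exactly at the threshold as satisfied is also an assumption.

```python
import math


def sspe_responses(safety_utils, spne_traj, safety_threshold):
    """Trajectories in safety-satisfied equilibrium at one node.

    safety_utils: dict trajectory_id -> safety utility of that trajectory when
    played against the other agent's SPNE strategy.
    spne_traj: the agent's own SPNE trajectory at this node.
    """
    spne_u = safety_utils[spne_traj]
    if spne_u < safety_threshold:
        # The SPNE response is below the threshold: only trajectories that
        # match its safety utility are satisfied responses.
        return {t for t, u in safety_utils.items() if math.isclose(u, spne_u)}
    # Otherwise any response that clears the threshold is good enough.
    return {t for t, u in safety_utils.items() if u >= safety_threshold}
```

The second branch is what makes the model permissive for agents with a low threshold: the more trajectories clear it, the larger the set of satisfied responses.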

Maneuver-satisfied perfect equilibrium (MSPE).

In this model of satisficing, agents choose actions that belong to the same maneuver as that of the equilibrium action, with some additional constraints. To illustrate with a simple example: at any node, if the trajectory of the equilibrium action belongs to the wait maneuver, then all trajectories belonging to the wait maneuver will be in MSPE. However, to avoid selection of wait trajectories that have utility lower than a non-equilibrium maneuver (in this case, a proceed trajectory), we add the constraint that the utility of an MSPE trajectory (in this case, a wait trajectory) has to be higher than that of all proceed trajectories. A formal definition is as follows.

Definition 3.

A strategy , is in maneuver satisfied perfect equilibria for a combination of agent types if for every history of the game and , and

where , is a subgame perfect equilibrium in pure strategies of the game with agent types , and or in other words, the set of available trajectories at node that do not belong to the maneuver corresponding to the equilibrium trajectory and is the last node in the history .
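A minimal Python sketch of maneuver satisficing at a single node is given below. The function name and the `(maneuver, utility)` layout are hypothetical; the utilities are assumed to be the combined utilities of each trajectory against the other agent's equilibrium strategy.

```python
def mspe_responses(trajs, spne_traj):
    """Trajectories in maneuver-satisfied equilibrium at one node.

    trajs: dict trajectory_id -> (maneuver, utility).
    spne_traj: the agent's own equilibrium trajectory at this node.
    """
    eq_maneuver = trajs[spne_traj][0]
    # Utilities of trajectories that do not belong to the equilibrium maneuver.
    other_utils = [u for m, u in trajs.values() if m != eq_maneuver]
    bar = max(other_utils) if other_utils else float("-inf")
    # Same maneuver as the equilibrium action, and strictly better than every
    # trajectory of the non-equilibrium maneuver.
    return {t for t, (m, u) in trajs.items() if m == eq_maneuver and u > bar}
```

In the example from the text, a wait trajectory whose utility falls below that of some proceed trajectory is excluded even though it shares the equilibrium maneuver.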

Robust layer

While the presence of multiple models in the strategic layer allows for a population of heterogeneous reasoners, an agent following one of those models still has specific assumptions about the reasoning process of other agents, e.g. level-1 agents believing that the population consists of level-0 agents and equilibrium responders believing that other agents adhere to a common knowledge of rationality. Based on the position that a planner for AV should not hold such strict assumptions, we develop the robust layer. What differentiates the robust layer from the strategic layer is that along with the two properties of other responsiveness and dominance responsiveness for the strategic layer, agents in the robust layer also adhere to the property of model responsiveness, i.e., the ability to reason over the behavior models of other agents. This gives them the ability to reason about (forming beliefs about and responding to) a population of different types of reasoners including strategic, non-strategic, as well as agents following different models within each layer. The overall behavior of a robust agent can be broken down into three sequential steps as follows.

i. Type expansion: Since the robust agent not only has to reason over the types of other agents, but also the possible behavior models, we augment the initial set of agent types that were based on agents’ risk tolerance with the corresponding agent models. Let be the augmented type of an agent, where is the set of models presented earlier, i.e., {accommodating, non-accommodating, level-1, SSPE, MSPE} and is the (non-augmented) agent type () within each model.

ii. Consistent beliefs: Similar to strategic agents, based on the observed history of the game, a robust agent forms a belief such that the observed actions of the other agent in the history are consistent with the augmented types (i.e., model as well as the agent type) in . The process of checking whether a history is consistent with a combination of a model and agent type was already developed earlier for the two non-strategic models (Def. 1). For level-1 models, the history is consistent if at each node in the history, the response of the other agent adheres to equation 3; and for the equilibrium models, a history is consistent if, based on definitions 2 and 3 for SSPE and MSPE respectively, the actions are along the equilibrium paths of the game tree. Assuming that in driving situations agents behave truly according to their types, is then constructed as a union of all the consistent beliefs for each model.

iii. Robust response: The idea of a robust response to heterogeneous models is along the lines of the robust game theory approach of aghassi2006robust. The belief set represents the uncertainty over the possible models the other agents may be using along with the corresponding agent types within those models. A robust response to that is the optimization over the worst possible outcome that can happen with respect to that set. Eqn. 4 formulates this response of a robust agent playing as agent .


where are the possible actions of the other agent based on the augmented type and is the robust agent’s own type. In this response, the minimization happens over the agent types (inner operator), rather than over all the actions as is common in a maxmin response. Since driving situations are not, in most cases, purely adversarial, this is a less conservative, yet robust, response compared to a maxmin response.
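The structure of this response, with the inner minimization running over augmented types rather than over individual actions, can be sketched in Python as follows. The names are hypothetical, and averaging the utility over the actions attributed to each augmented type is an assumption made for illustration, since the exact inner term is not reproduced here. Each type's action list is assumed to be non-empty.

```python
def robust_response(own_actions, belief, utility):
    """Pick the own action with the best worst case over augmented types.

    own_actions: list of the robust agent's candidate actions.
    belief: dict augmented_type -> list of actions that type may play, where an
    augmented type is, e.g., a (model, agent_type) tuple.
    utility: function (own_action, other_action) -> float.
    Assumption: within one augmented type, utility is averaged over its actions.
    """
    def worst_case(action):
        return min(
            sum(utility(action, other) for other in others) / len(others)
            for others in belief.values()
        )

    return max(own_actions, key=worst_case)
```

Because the minimum is taken per augmented type, a single unlikely adversarial action inside an otherwise benign type does not dominate the choice, which is what makes this less conservative than a maxmin over all actions.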

4 Experiments and evaluation

(a) Snapshot of naturalistic datasets (WMA and inD)
(b) Simulation of critical scenarios: intersection clearance, merge before intersection, parking pullout.
Figure 3: Evaluation setups

In this section we present the evaluation of the models under two different experiment setups. First, we compare the models with respect to large naturalistic observational driving data using a) the Intersection dataset from the Waterloo multi-agent traffic dataset (WMA) recorded at a busy Canadian intersection sarkar2021solution, and b) the inD dataset recorded at intersections in Germany inDdataset (Fig. 3(a)). From both datasets, which include around 10k vehicles in total, we extract the long duration unprotected left turn (LT) and right turn (RT) scenarios, and instantiate games between left (and right) turning vehicles and oncoming vehicles with s and s, resulting in a total of 1678 games. The second part of the evaluation is based on simulation of three critical traffic scenarios derived from the NHTSA pre-crash database najm2007pre, where we instantiate agents with a range of risk tolerances as well as initial game states, and evaluate the outcome of the game based on each model. All the games in the experiments are 2 agent games with the exception of one of the critical scenario simulations (intersection clearance), which is a 3 agent game.

Baselines. We select multiple baselines depending on whether a model is strategic or non-strategic. For non-strategic models, we compare the automata based model with a maxmax model, shown to be most promising from a set of alternate elementary models with respect to naturalistic data sarkar2021solution. For the strategic models (level-1 in dLk(), SSPE, MSPE), we select a QLk model used in multiple works within the context of autonomous driving li2018game; tian2018adaptive; tian2021anytime; li2019decision. We use the same parameters used in tian2021anytime for the precision parameters in the QLk model.

Naturalistic data

We evaluate the model performance on naturalistic driving data based on accuracy measure, i.e. the number of games where the observed strategy of the human driver matched a model’s strategy divided by the total number of games in the dataset. More formally, let be the set of games in the dataset, an indicator function is 1 if in the game , there exists a combination of agent types (), such that the observed strategy is in the set of strategies as predicted by the model, or 0 otherwise. The overall accuracy of a model is given by

. QLk (baseline) models being mixed strategy models, we count a match if the observed strategy is assigned a probability of


Model       LT (1103)         RT (311)          LT (187)          RT (77)
maxmax      0.33438 (-0.02)   0.43023 (-0.27)   0.37975 (0.2)     0.43506 (-0.1)
AC          0.82053 (-0.82)   0.90698 (-0.84)   0.92089 (-0.75)   0.81818 (-0.79)
NAC         0.17947 (-0.87)   0.09302 (-0.82)   0.07911 (-0.88)   0.18182 (-0.85)
QLk(=1)     0.18262 (0.07)    0.43265 (-0.33)   0.37658 (-0.37)   0.43506 (0.1)
QLk(=0.5)   0.34131 (0.03)    0.43023 (-0.26)   0.37658 (-0.37)   0.43506 (0.1)
dLk()       0.5529 (-0.19)    0.65449 (-0.3)    0.51266 (-0.51)   0.53247 (0.4)
SSPE        0.69144 (0.84)    0.90033 (0.94)    0.6962 (0.86)     0.53247 (0.85)
MSPE        0.30479 (0.1)     0.44518 (0.13)    0.21519 (-0.21)   0.27273 (0.7)
Robust      0.56045 (-0.20)   0.66944 (-0.34)   0.51582 (-0.51)   0.53247 (0.4)
Table 1: Overall accuracy of the models for each dataset and scenario. Mean agent type () noted in parenthesis. LT: Left turn, RT: Right turn. Number of games noted in the header.
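The accuracy measure behind Table 1 can be sketched in a few lines of Python. The function name and the data layout are hypothetical: each game is assumed to carry a single observed strategy, and the model is assumed to supply, per game, the set of strategies it predicts over all combinations of agent types.

```python
def accuracy(games, model_strategies):
    """Fraction of games whose observed strategy is among the model's predictions.

    games: non-empty list of (game_id, observed_strategy).
    model_strategies: dict game_id -> set of strategies predicted by the model
    for at least one combination of agent types.
    """
    matches = sum(
        1 for game_id, observed in games
        if observed in model_strategies.get(game_id, set())
    )
    return matches / len(games)
```

A game with no prediction recorded for it counts as a miss, so the denominator is always the total number of games in the dataset.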

Table 1 shows the accuracy of each model for each dataset and scenario. It also shows in parenthesis the mean , i.e. the agent type value for each model when the strategy matched the observation. The overall numbers in the table are in line with the converging consensus from recent literature that there is heterogeneity in driving behavior sun2020game; sarkar2021solution. However, a major takeaway is that for non-strategic models, automata models show much higher accuracy, thereby reflecting high alignment with human driving behavior compared to the maxmax model. In fact, as the table shows, the entries for AC and NAC sum to 1 in every column: the combination of AC and NAC, although non-strategic, can capture all observed driving decisions in the dataset, which indicates that automata models are very well suited for modelling level-0 behavior in a dynamic game setting for human driving.

For the strategic models, the dLk() and SSPE models show better performance than the QLk and MSPE models. However, when we compare the mean agent types, we observe that when SSPE strategies match the observation, it is based on agents with very high risk tolerance (reflected in high mean agent type values). If we assume that the population of drivers on average have moderate risk tolerance, say in the range [-0.5,0.5] (estimating the true distribution is out of the current scope), dLk() is a more reasonable model of strategic behavior compared to SSPE. We include the robust model comparison for the sake of completeness (and it shows performance comparable to the dLk() model), but as mentioned earlier, the robust model is a model of response of an AV, and therefore ideally needs to be evaluated on criteria beyond just comparison to naturalistic human driving, which we discuss in the next section.

Critical scenarios

While evaluation based on a naturalistic driving dataset helps in understanding how well a model matches human driving behavior, in order to evaluate the suitability of a model for behavior planning of an AV, the models need to be evaluated on specific scenarios that encompass the operational design domain (ODD) of the AV ilievski2020wisebench. Since the models developed in this paper are not specific to an ODD, we select three critical scenarios from the ten most frequent crash scenarios in the NHTSA pre-crash database najm2007pre.

Intersection clearance (IC): Left turn across path (LTAP) scenario where the traffic signal has just turned from green to yellow at the moment of game initiation. A left-turning vehicle is in the intersection, and two oncoming vehicles from the opposite direction are close to the intersection and may choose either to speed up and cross or to wait for the next cycle. The expectation is that the left-turning vehicle should be able to clear the intersection by the end of the game horizon without crashing into either oncoming vehicle, and no vehicle should be stuck in the middle of the intersection.

Merge before intersection (MBI): Merge scenario where a left-turning vehicle (designated as the merging vehicle) finds itself in the wrong lane just prior to entering the intersection, and wants to merge into the turn lane in front of another left-turning vehicle (designated as the on-lane vehicle). The expectation is that the on-lane vehicle should allow the other vehicle to merge.

Parking pullout (PP): Merge scenario where a parked vehicle is pulling out of a parking spot and merges into traffic while there is a vehicle coming along the same direction from behind. The expectation is that the parked vehicle should wait for the coming vehicle to pass before merging into traffic.

For each scenario, we run simulations over a range of approach speeds and over all combinations of agent types (parameters and simulation videos are included in the supplementary material).
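The sweep described above can be sketched as a Cartesian product over approach speeds and agent type combinations. The speed values, the agent type values, and the function name below are placeholders chosen for illustration; the actual parameters are in the paper's supplementary material and are not reproduced here.

```python
from itertools import product

# Placeholder values, NOT the paper's parameters.
approach_speeds_mps = [5.0, 10.0, 15.0]
agent_types = [-1.0, 0.0, 1.0]

def simulation_grid(speeds, types, n_agents=2):
    """Enumerate every (approach speed, agent type combination) pair.

    Each combination assigns one agent type to each of the n_agents vehicles.
    """
    return [
        (speed, combo)
        for speed in speeds
        for combo in product(types, repeat=n_agents)
    ]

grid = simulation_grid(approach_speeds_mps, agent_types)
```

With three speeds and three agent types for two vehicles, this yields 27 simulation configurations; the grid grows exponentially with the number of agents.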

Figure 4: Mean and SD of success for each model in each scenario across all agent types.

One way to compare the models is to evaluate them based on the mean success rate across all initiating states and agent types. Fig. 4 shows the mean success rate (success defined as the desired outcome based on the expectation given in the description of each scenario) for all the strategic and robust models. We see that the mean success rate of the robust and dLk(A) models is higher than that of the equilibrium models or the QLk model. However, this is only part of the story. With varying initiation conditions, it may be harder or easier for a model to lead to a successful outcome. For example, in the parking pullout scenario, a highly risk-tolerant vehicle approaching at a higher speed is very likely to succeed under all models when facing a parked vehicle with low risk tolerance at zero speed. Therefore, to tease out the stability of the models across different risk tolerances (i.e. agent type combinations), Fig. 4 also plots, on the y-axis, the standard deviation of the mean success rate across different agent types. Ideally, a model should have a high success rate with low SD across types, indicating that the success rate stays stable across different combinations of agent types in the population (from extremely low risk tolerance to very high). As we see from Fig. 4, the robust and dLk(A) models are broadly in the ideal lower right quadrant (high mean success rate, low SD) for the parking pullout and merge before intersection scenarios. For the IC scenario, however, we observe that the success rate comes at the price of high SD (for all models), as indicated by the linearly increasing relation between the mean success rate and its SD across agent types. This means that the successful outcomes are skewed towards a specific combination of agent types; specifically, the case where the left-turning vehicle has high risk tolerance. It is intuitive to imagine that in a situation like IC, agents with low risk tolerance would be stuck in the intersection instead of being able to navigate out of it quickly.
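The two axes of this comparison can be sketched as follows: a success rate is computed for each agent type combination (averaging over initial states), and the mean and standard deviation are then taken across combinations. The data layout and function name are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev

def success_summary(outcomes):
    """Mean and SD of the success rate across agent type combinations.

    outcomes: dict mapping an agent type combination to a list of booleans,
              one per simulated initial state (hypothetical layout).
    """
    # Success rate per agent type combination, averaged over initial states.
    rates = [
        sum(1 for ok in runs if ok) / len(runs)
        for runs in outcomes.values()
    ]
    # Population SD is an assumption of this sketch.
    return mean(rates), pstdev(rates)

# Toy outcomes, not from the paper.
outcomes = {
    (-1.0, 1.0): [True, True],
    (0.0, 0.0): [True, False],
    (1.0, -1.0): [False, False],
}
mean_rate, sd_rate = success_summary(outcomes)
```

In this toy case the mean success rate is moderate but the spread is large, which is the pattern the text describes for the IC scenario: success concentrated in particular agent type combinations.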

Finally, the failure of a model to achieve the expected outcome can also be due to a crash (minimum distance gap between trajectories m) instead of an alternate outcome (e.g. getting stuck in the intersection). For the parking pullout and intersection clearance scenarios, we did not observe a crash in any simulation for any of the models. However, the merge before intersection scenario starts from a riskier situation than the other two in terms of the chance of a crash, and the crash rates (ratio of crashes across all simulations) of the models across all initial states and agent types were as follows: (dLk(A): 0.052, MSPE: 0.022, SSPE: 0.007, QLk(1): 0.026, Robust: 0.053).
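A crash rate of this kind can be sketched as the fraction of simulations in which the minimum gap between two trajectories falls at or below a threshold. The threshold value is not recoverable from the text here, so `gap_threshold_m` is a hypothetical parameter and the value used in the example is a placeholder; representing a trajectory as time-aligned (x, y) points and using a less-than-or-equal comparison are also assumptions.

```python
import math

def min_gap(traj_a, traj_b):
    """Smallest Euclidean distance between two time-aligned trajectories.

    Each trajectory is a list of (x, y) positions sampled at the same time steps.
    """
    return min(math.dist(p, q) for p, q in zip(traj_a, traj_b))

def crash_rate(simulations, gap_threshold_m):
    """Fraction of simulations whose minimum gap is at or below the threshold.

    simulations: list of (traj_a, traj_b) pairs (hypothetical layout).
    gap_threshold_m: crash threshold in metres (hypothetical parameter).
    """
    if not simulations:
        raise ValueError("no simulations")
    crashes = sum(1 for a, b in simulations if min_gap(a, b) <= gap_threshold_m)
    return crashes / len(simulations)

# Toy trajectories and a placeholder threshold, not from the paper.
simulations = [
    ([(0.0, 0.0), (1.0, 0.0)], [(0.0, 5.0), (1.0, 0.05)]),  # vehicles nearly touch
    ([(0.0, 0.0), (1.0, 0.0)], [(0.0, 5.0), (1.0, 5.0)]),   # vehicles stay apart
]
rate = crash_rate(simulations, gap_threshold_m=0.1)
```

This point-to-point check ignores vehicle dimensions and heading, so a realistic implementation would likely compare vehicle footprints rather than centre points.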

Overall, whether or not an AV planner can achieve its desired outcome depends on a variety of factors, such as the assumptions the vehicle and the human drivers hold about each other, the risk tolerance of each agent, and the specific state of the traffic situation. The analysis presented above helps in quantifying the relation between the desired outcome and the conditions under which it is achievable.

5 Conclusion

We developed a unifying framework for modeling human driving behavior and strategic behavior planning of AVs that supports heterogeneous models of strategic and non-strategic behavior. We also extended the standard level-k model into a dynamic form (dLk(A)) in which, through the use of automata strategies, level-0 behavior is sophisticated yet remains within non-strategic constraints. The evaluation on two large naturalistic datasets attests to the consensus that there is diversity in human driving behavior; however, a combination of rich level-0 behaviors can capture most of the driving behavior observed in naturalistic data. On the other hand, with the awareness that there can be different types of reasoners in the population, a robust response approach is not only effective but also stable across a population of drivers with different levels of risk tolerance.