Autonomous systems are becoming more capable, better accepted, and more commonplace. Many autonomous systems, including collaborative robots [9] and self-driving cars [18], operate in dynamic and interactive environments. As an example, a self-driving car may operate in traffic with multiple other cars, which form the environment. The environment is dynamic and interactive because the other cars not only operate concurrently with the ego car but also respond to its actions.
Decision making for an autonomous system operating in a dynamic and interactive environment needs to take the interactions between the system and its environment into account. For autonomous systems interacting with humans, it is especially important to account for these interactions to ensure safety. Despite much progress made, this problem remains challenging and unsolved in many application scenarios [18, 8].
Game theory is a useful tool for modeling strategic interactions between intelligent agents [16]. Among various game theoretic frameworks, cognitive hierarchy theory (CHT) has drawn attention from game theorists and practitioners since the 1990s [17, 19] because, in many experimental studies, it predicts human behavior more accurately than equilibrium-based theories [6, 5]. CHT describes human thought processes in strategic games by characterizing human behavior based on levels of iterated rationalizability. In particular, it assumes bounded rationality of decision makers, in contrast to the assumption of unbounded/perfect rationality in many equilibrium-based theories. The assumption of bounded rationality can be more realistic than that of unbounded rationality in many practical situations because the reasoning capability of a decision maker is often limited by the complexity of the decision problem and the time available to make a decision [7].
In this paper, we describe a framework based on CHT for autonomous system/intelligent agent decision making when operating in a dynamic and interactive environment. The framework synergizes CHT, Bayesian inference, and receding-horizon optimal control to solve for decision strategies. In particular, hard constraints are probabilistically enforced over the planning horizon to incorporate safety requirements. The interactive decision making process is formulated as a constrained partially observable Markov decision process, and a recently developed algorithm [13, 11] is applied to solve it.
We note that although CHT has been utilized for modeling multi-agent interactions in the literature [14, 22, 15, 10], most of the existing works exploit the “level-k” framework [17, 19], which assumes that a level-$k$ decision maker treats the interactive environment as a level-$(k-1)$ decision maker and responds to it accordingly. Unfortunately, level-$k$ decision rules may lead to poor decisions if the $(k-1)$-assumption about the interactive environment’s cognitive level is incorrect.
In this paper, we consider a different framework called the “cognitive hierarchy” framework [4], where CH-$k$ decisions are optimized to strategically respond to the interactive environment by modeling it using a mixture of level-$k'$, $k' < k$, decision-maker models. The cognitive hierarchy framework enables the autonomous agent to strategically interact with environments with different cognitive levels. When the autonomous agent has good knowledge about its operating environment, the mixing ratio of level-$k'$ models could be pre-specified. When operating in an uncertain environment, reasoning about the interactive environment’s cognitive level could be incorporated in the decision making process, which is one of the key developments of this paper. Although heuristic techniques for estimating the environment’s cognitive level, similar in spirit to the level reasoning algorithm in this paper, have been proposed in [10, 12, 21] for specific applications, the approach in this paper is based on Bayesian inference and hence has a firmer theoretical foundation and broader applicability.
This paper is organized as follows: In Section II, we formulate decision making in a dynamic and interactive environment as a dynamic game. In Section III, we review two related but distinct frameworks of cognitive hierarchy theory, the level-k framework and the cognitive hierarchy framework. In Section IV, we describe the decision making process based on cognitive hierarchy theory and its solution method based on a constrained partially observable Markov decision process formulation. In Section V, we apply the proposed decision making algorithm to autonomous vehicle control in three interactive traffic scenarios. Discussion and conclusions are given in Section VI.
II Problem formulation
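In this paper, we consider a decision making process by an intelligent agent operating in a dynamic and interactive environment. The interactions between the ego agent and the environment are modeled as a two-player dynamic game represented as a 6-tuple,

$$\mathcal{G} = \langle \mathcal{P}, \mathcal{X}, \mathcal{U}, \mathcal{T}, \mathcal{R}, \mathcal{C} \rangle, \tag{1}$$

where $\mathcal{P} = \{1, 2\}$ represents the two players, with $1$ denoting the ego agent and $2$ denoting the environment; $\mathcal{X}$ is a finite set of states, with $x_t \in \mathcal{X}$ denoting the state of the agent-environment system at the discrete time instant $t$; $\mathcal{U} = \mathcal{U}^1 \times \mathcal{U}^2$ is a finite set of actions, with $\mathcal{U}^1$ denoting the action set of the ego agent and $\mathcal{U}^2$ denoting the action set of the environment; $\mathcal{T}$ represents a transition of the state as a result of an action pair $(u_t^1, u_t^2) \in \mathcal{U}$ and, in particular, is defined by the following dynamic model,

$$x_{t+1} = f(x_t, u_t^1, u_t^2); \tag{2}$$

$\mathcal{R} = \{R^1, R^2\}$ are reward functions of the two players representing their decision objectives, in particular,

$$R_t^i = R^i(x_t, u_t^1, u_t^2), \quad i \in \mathcal{P}, \tag{3}$$

i.e., each player’s reward at the time instant $t$ depends on the state $x_t$ and both players’ actions $(u_t^1, u_t^2)$; and $\mathcal{C}$ denotes the requirement $x_t \in \mathcal{X}_{\text{safe}}$, with $\mathcal{X}_{\text{safe}} \subseteq \mathcal{X}$ being a set of “safe” states, representing hard constraints for decision making of the ego agent.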
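We let the ego agent make decisions based on a receding-horizon optimization of the form,

$$\max_{u_{0|t}^1, \dots, u_{N-1|t}^1} \; \sum_{\tau=0}^{N-1} \lambda^{\tau} R^1\big(x_{\tau|t}, u_{\tau|t}^1, u_{\tau|t}^2\big) \tag{4a}$$

$$\text{subject to} \quad x_{\tau|t} \in \mathcal{X}_{\text{safe}}, \quad \tau = 1, \dots, N, \tag{4b}$$

where $u_{\tau|t}^i$ denotes a prediction of player $i$’s action at the time instant $t+\tau$ over a planning horizon of length $N$ with the prediction made at the current time instant $t$, $x_{\tau|t}$ denotes a prediction of the system state under the sequence of action pairs $(u_{0|t}^1, u_{0|t}^2), \dots, (u_{\tau-1|t}^1, u_{\tau-1|t}^2)$, and $\lambda$ is a factor to discount future rewards.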
Clearly, the above optimization problem is not yet well-defined, as the uncontrolled variables $u_{\tau|t}^2$ are unknown. One way to proceed is to consider worst-case scenarios, i.e., replace (4a) with

$$\max_{u_{0|t}^1, \dots, u_{N-1|t}^1} \; \min_{u_{0|t}^2, \dots, u_{N-1|t}^2} \; \sum_{\tau=0}^{N-1} \lambda^{\tau} R^1\big(x_{\tau|t}, u_{\tau|t}^1, u_{\tau|t}^2\big). \tag{5}$$

However, as (5) assumes an adversarial player $2$, rather than a rational player that pursues its own objectives and is not necessarily against the ego agent, (5) may lead to overly conservative decisions for the ego agent. Therefore, we pursue an alternative solution, which is based on cognitive hierarchy theory (CHT) and is described in the next section.
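To make the conservativeness of the worst-case formulation concrete, the following is a minimal sketch of an open-loop max-min plan computed by exhaustive enumeration. The function name `maximin_first_action`, the toy gap dynamics, the reward, and the horizon and discount values are all illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def maximin_first_action(x0, f, reward, actions1, actions2, is_safe,
                         horizon=2, discount=0.9):
    """Open-loop max-min plan: return (first ego action, worst-case value).

    Returns (None, -inf) when every ego plan can be driven into an unsafe state.
    """
    best_value, best_plan = float("-inf"), None
    for plan1 in product(actions1, repeat=horizon):
        worst = float("inf")
        for plan2 in product(actions2, repeat=horizon):
            x, total = x0, 0.0
            for tau, (u1, u2) in enumerate(zip(plan1, plan2)):
                total += discount ** tau * reward(x, u1, u2)
                x = f(x, u1, u2)
                if not is_safe(x):
                    total = float("-inf")  # a constraint violation makes the plan infeasible
                    break
            worst = min(worst, total)
        if worst > best_value:
            best_value, best_plan = worst, plan1
    first = best_plan[0] if best_plan is not None else None
    return first, best_value


# Hypothetical car-following toy model: the state is the gap to the car ahead.
def gap_dynamics(gap, u1, u2):
    return gap - u1 + u2  # the ego closes the gap by u1, the other car opens it by u2


first_action, value = maximin_first_action(
    x0=2, f=gap_dynamics, reward=lambda gap, u1, u2: float(u1),
    actions1=(0, 1), actions2=(0, 1), is_safe=lambda gap: gap >= 1)
```

In this toy setting the ego agent advances once and then stops, because the adversarial model assumes that the car ahead never moves, even though a car pursuing its own objective would most likely keep driving.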
III Two cognitive hierarchy frameworks
Cognitive hierarchy theory (CHT) is concerned with behavioral models describing human thought processes in strategic games. It characterizes human behavior based on levels of iterated rationalizability. Two frameworks have been developed based on CHT: the level-k framework [17, 19] and the cognitive hierarchy framework [4], which are closely related but have distinct features. They are reviewed in what follows.
III-A The level-k framework
In the level-k framework, it is assumed that each player in a strategic game bases its decision on a finite depth of reasoning about the likely actions of the other players, which is referred to as its “cognitive level.” In particular, the reasoning hierarchy starts from some non-strategic behavioral model, called level-$0$. Then, a level-$k$ player, $k \geq 1$, assumes that all of the other players are level-$(k-1)$, based on which it predicts the actions of the other players and makes its own decision as the optimal response to these predicted actions. In short, a level-$k$ decision optimally responds to level-$(k-1)$ decisions. Note, however, that a level-$k$ decision may turn out to be poor if the other players execute level-$k'$ decisions with $k' \neq k-1$.
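The level-k recursion can be illustrated on a small matrix game. The sketch below is a simplified, deterministic instance (a one-shot game with pure best responses) rather than the dynamic-game formulation used in this paper; the payoff values, the action encoding, and the function name `level_k_actions` are hypothetical.

```python
import numpy as np


def level_k_actions(R1, R2, level0_action_1, level0_action_2, k_max):
    """Deterministic level-k actions for a two-player matrix game.

    R1[a1, a2] and R2[a1, a2] are the payoffs of players 1 and 2.
    Entry k of each returned list is that player's level-k action.
    """
    a1, a2 = [level0_action_1], [level0_action_2]
    for k in range(1, k_max + 1):
        best_1 = int(np.argmax(R1[:, a2[k - 1]]))  # best response to level-(k-1) player 2
        best_2 = int(np.argmax(R2[a1[k - 1], :]))  # best response to level-(k-1) player 1
        a1.append(best_1)
        a2.append(best_2)
    return a1, a2


# Hypothetical intersection game: action 0 = yield, action 1 = go.
R1 = np.array([[-1.0, 0.0],
               [2.0, -10.0]])
R2 = np.array([[-1.0, 2.0],
               [0.0, -10.0]])
ego_levels, env_levels = level_k_actions(R1, R2, 1, 1, k_max=2)

# A level-2 ego meeting a level-2 (instead of level-1) opponent: both choose "go".
mismatch_payoff = R1[ego_levels[2], env_levels[2]]
```

With a level-0 model that always goes, level-1 yields and level-2 goes, so a level-2 ego agent facing a level-2 opponent collects the collision payoff. This is the kind of mismatch noted at the end of the paragraph above.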
III-B The cognitive hierarchy framework
The cognitive hierarchy (CH) framework is similar to the level-k framework in that it also characterizes each player’s behavior by a bounded cognitive level $k$. The unique feature of the CH framework is the hypothesis that a player can act under the assumption that some percentage of the population fits each archetype. More specifically, a CH-$k$ player assumes that each of the other players is level-$k'$ for some $k' < k$ and optimizes its decision according to its beliefs about the other players’ levels. This feature makes a CH-$k$ player “smarter” than a level-$k$ player by enabling a CH-$k$ player to optimally respond to level-$k'$ decisions for all $k' < k$, as long as it has the correct beliefs about the other players’ levels.
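A minimal sketch of the CH idea on a one-shot matrix game follows: the ego agent maximizes its expected payoff against a belief-weighted mixture of lower-level opponent models. The payoff matrix, the per-level opponent actions, and the belief values are hypothetical, and the deterministic per-level actions are a simplifying assumption.

```python
import numpy as np


def ch_best_response(R1, env_action_by_level, belief):
    """Best response of the ego agent to a mixture of opponent levels.

    R1[a1, a2] is the ego payoff, env_action_by_level[j] is the action of the
    j-th opponent model, and belief[j] is the probability assigned to that model.
    """
    expected = np.zeros(R1.shape[0])
    for prob, a2 in zip(belief, env_action_by_level):
        expected += prob * R1[:, a2]
    return int(np.argmax(expected)), expected


# Hypothetical intersection game: action 0 = yield, action 1 = go.
R1 = np.array([[-1.0, 0.0],
               [2.0, -10.0]])
env_action_by_level = [0, 1]  # assumed: the level-1 model yields, the level-2 model goes

action_uncertain, _ = ch_best_response(R1, env_action_by_level, [0.5, 0.5])
action_confident, _ = ch_best_response(R1, env_action_by_level, [0.95, 0.05])
```

Under an even belief the ego agent yields, whereas it goes once it is fairly sure the opponent is the yielding type, which shows how the decision depends on the beliefs about the other player’s level.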
The level-k framework has been exploited by various researchers for modeling human-human, human-machine, and machine-machine interactions (the last on the basis that many machine systems pursue human-like decision making, e.g., in [23]) in automotive systems [14, 12, 21], aerospace systems [22, 15], as well as in cyber-physical security [10]. It has been shown that an ego agent using a decision strategy corresponding to a level-$k$ model with some fixed $k$ may behave poorly when interacting with another agent using a decision strategy of level-$k'$ with $k' \neq k-1$. On the basis of this observation, algorithms that estimate the level of the other agent according to its historical behavior have been proposed in [10, 12, 21] so that the ego agent can adapt its decision strategy to the level estimate. In the next section, we present a rigorous formulation of a decision making process based on the CH framework, motivated, in part, by preliminary developments in [10, 12, 21]. We then recast the problem as a partially observable Markov decision process (POMDP). Compared to the level estimation algorithms in [10, 12, 21], which are rather heuristic although shown to be effective in specific applications, the decision making framework introduced in the following is based on Bayesian statistics and hence has a firmer theoretical foundation.
IV Decision making based on CHT and its POMDP-based solution method
IV-A Level-k models of the environment
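A policy $\pi^i$, $i \in \{1, 2\}$, is a stochastic map from states $x \in \mathcal{X}$ to actions $u^i \in \mathcal{U}^i$. Specifically, $\pi^i$ is such that

$$\pi^i(u^i \mid x) \geq 0, \qquad \sum_{u^i \in \mathcal{U}^i} \pi^i(u^i \mid x) = 1, \tag{6}$$

for all $x \in \mathcal{X}$, where $\pi^i(u^i \mid x) = \mathbb{P}(u_t^i = u^i \mid x_t = x)$ denotes conditional probabilities.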
To define the level-$k$ models of the environment for arbitrary $k \geq 1$, we start from defining a level-$0$ model of the ego agent, defined by a policy $\pi_0^1$, and a level-$0$ model of the environment, defined by a policy $\pi_0^2$. The level-$k$ model of the environment, $\pi_k^2$, with $k \geq 1$ is then constructed based on the “softmax decision rule” [20], which captures the suboptimality and variability in decision making [3, 2], as follows,

$$\pi_k^2(u^2 \mid x) = \frac{\exp\big(Q_k^2(x, u^2)\big)}{\sum_{\tilde{u}^2 \in \mathcal{U}^2} \exp\big(Q_k^2(x, \tilde{u}^2)\big)}, \tag{7}$$

in which the $Q$-function of state-action pairs, $Q_k^2(x, u^2)$, is defined as
where $\pi_{k-1}^1$ is the level-$(k-1)$ model of the ego agent, which for $k \geq 2$ is defined as
In short, the level-$k$ model of the environment is constructed based on the level-$(k-1)$ model of the ego agent, which is constructed based on the level-$(k-2)$ model of the environment. Therefore, the level-$k$ models of the environment as well as the level-$k$ models of the ego agent are constructed recursively for $k = 1, 2, \dots$. We note that when constructing the level-$k$ models of the ego agent, we drop the hard constraints (4b) to reduce computational complexity but can promote their satisfaction through imposing penalties in the reward function $R^1$.
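As a minimal sketch of the softmax decision rule, the function below maps a table of $Q$-values to a stochastic policy. The `temperature` parameter and the example values are assumptions added for illustration; the $Q$-values themselves would come from the recursive construction described above, which is not reproduced here.

```python
import numpy as np


def softmax_policy(q_values, temperature=1.0):
    """Turn Q-values into action probabilities with a softmax rule.

    q_values has shape (..., n_actions); the result has the same shape and
    sums to one along the last axis. Subtracting the row maximum leaves the
    probabilities unchanged and avoids overflow in exp.
    """
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max(axis=-1, keepdims=True)) / temperature
    weights = np.exp(z)
    return weights / weights.sum(axis=-1, keepdims=True)


# Hypothetical Q-table: two states, three actions.
q_table = np.array([[1.0, 2.0, 3.0],
                    [0.0, 0.0, 0.0]])
policy = softmax_policy(q_table)
```

Actions with higher $Q$-values receive higher probability but suboptimal actions keep nonzero probability, which is how the rule represents variability in decision making.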
IV-B Ego decision making based on the CH framework
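After the level-$k$ models of the environment for $k = 1, \dots, k_{\max}$ have been constructed, we define the augmented state of the agent-environment system as $\bar{x}_t = (x_t, \sigma_t)$, where $\sigma_t \in \{1, \dots, k_{\max}\}$ represents the actual cognitive level of the environment and is assumed to be unknown to the ego agent. Then, we consider the following augmented dynamic model of the agent-environment system,

$$x_{t+1} = f(x_t, u_t^1, u_t^2), \tag{11a}$$

$$\sigma_{t+1} = \sigma_t, \tag{11b}$$

$$y_t = x_t, \tag{11c}$$

where $y_t$ is referred to as an “observation.” On the basis of the level-$k$ policies, the action of the environment, $u_t^2$ in (11a), can be viewed as a stochastic disturbance satisfying

$$\mathbb{P}\big(u_t^2 = u^2 \mid x_t = x, \sigma_t = k\big) = \pi_k^2(u^2 \mid x), \tag{12}$$

for all $(x, u^2) \in \mathcal{X} \times \mathcal{U}^2$ and $k \in \{1, \dots, k_{\max}\}$.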
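It is assumed that the ego agent has a prior belief about the environment’s cognitive level $\sigma_0$, as a probability distribution, $\mathbb{P}(\sigma_0 = k)$, defined over $\{1, \dots, k_{\max}\}$. Let us collect all historical observations up to time $t$ and all previously executed actions by the ego agent up to time $t-1$ into a data vector,

$$\xi_t = \big(y_0, \dots, y_t, u_0^1, \dots, u_{t-1}^1\big), \tag{13}$$

which, roughly speaking, will be used by the ego agent as evidence to infer the actual cognitive level of the environment, i.e., to obtain the posterior belief about $\sigma_t$, $\mathbb{P}(\sigma_t = k \mid \xi_t)$, defined for all $k \in \{1, \dots, k_{\max}\}$.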
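The inference step described above can be sketched as a one-step Bayesian update of the belief over cognitive levels. The sketch assumes deterministic dynamics and a fully observed state, so the likelihood of an observed transition under a given level is the total probability of the environment actions that explain it. The function name, the toy dynamics, and the two policies are hypothetical, and keeping the prior when no model explains the observation is a design choice of this sketch rather than part of the paper.

```python
import numpy as np


def update_level_belief(belief, policies, f, x_prev, u1_prev, x_now, actions2):
    """One Bayesian update of the belief over the environment's cognitive level.

    belief[j]   : prior probability of the j-th level model
    policies[j] : function x -> probabilities over actions2 under that model
    f           : deterministic dynamics, x_next = f(x, u1, u2)
    """
    belief = np.asarray(belief, dtype=float)
    likelihood = np.zeros(len(belief))
    for j, policy in enumerate(policies):
        probs = policy(x_prev)
        # Probability mass of environment actions consistent with the observed transition.
        likelihood[j] = sum(p for u2, p in zip(actions2, probs)
                            if f(x_prev, u1_prev, u2) == x_now)
    posterior = belief * likelihood
    total = posterior.sum()
    # Assumption of this sketch: keep the prior if no model explains the observation.
    return posterior / total if total > 0 else belief.copy()


# Hypothetical example: gap dynamics and two environment models.
def gap_dynamics(gap, u1, u2):
    return gap - u1 + u2

policies = [lambda x: np.array([0.9, 0.1]),   # model A: mostly chooses u2 = 0
            lambda x: np.array([0.1, 0.9])]   # model B: mostly chooses u2 = 1
posterior = update_level_belief([0.5, 0.5], policies, gap_dynamics,
                                x_prev=2, u1_prev=0, x_now=3, actions2=(0, 1))
```

After observing a transition that only `u2 = 1` can explain, the belief shifts from an even split to 0.1 and 0.9 in favor of the model that chooses that action more often.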
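Then, we consider the following decision making process by the ego agent,

$$\max_{u_{0|t}^1, \dots, u_{N-1|t}^1} \; \mathbb{E}\Big[ \sum_{\tau=0}^{N-1} \lambda^{\tau} R^1\big(x_{\tau|t}, u_{\tau|t}^1, u_{\tau|t}^2\big) \,\Big|\, \xi_t \Big] \tag{14a}$$

$$\text{subject to} \quad \mathbb{P}\big( x_{\tau|t} \in \mathcal{X}_{\text{safe}}, \; \tau = 1, \dots, N \,\big|\, \xi_t \big) \geq \beta, \tag{14b}$$

where $\beta \in [0, 1]$ defines a required level of confidence in constraint satisfaction.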
Comparing the processes (4) and (14), we can observe two major differences. Firstly, the unknowns $u_{\tau|t}^2$ in (4) have been modeled as stochastic disturbances in (14), which is achieved in (12) by exploiting the level-$k$ models of the environment, $\pi_k^2$. Secondly, to account for the stochasticity, the objective has been changed from maximizing the value of a function in (4a) to maximizing the expected value of the function in (14a), and the hard constraint (4b) has been changed to a probabilistic requirement of satisfaction, i.e., the chance constraint (14b) with the required confidence level $\beta$ being a design parameter. We note that (14) is a well-defined optimization problem [13, 11]. We present a solution method for it in the following section.
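To illustrate what the time-joint chance constraint (14b) measures, the sketch below computes, for a fixed open-loop ego plan, the probability that every predicted state stays in the safe set. It propagates the joint distribution of state and level forward and discards probability mass that leaves the safe set. This is an illustrative computation under assumed toy dynamics and policies; it is not the matrix implementation of Proposition 2, and all names and numbers are hypothetical.

```python
from collections import defaultdict


def joint_safety_probability(x0, level_belief, policies, f, plan1, actions2, is_safe):
    """Probability that all predicted states are safe under a fixed ego plan.

    mass[(x, j)] holds the probability of being in state x with level model j
    and of having stayed inside the safe set at every step so far.
    """
    mass = {(x0, j): p for j, p in enumerate(level_belief) if p > 0}
    for u1 in plan1:
        next_mass = defaultdict(float)
        for (x, j), p in mass.items():
            for u2, q in zip(actions2, policies[j](x)):
                x_next = f(x, u1, u2)
                if q > 0 and is_safe(x_next):
                    next_mass[(x_next, j)] += p * q  # unsafe successors are dropped
        mass = next_mass
    return sum(mass.values())


# Hypothetical example: gap dynamics, two level models, safe set gap >= 1.
def gap_dynamics(gap, u1, u2):
    return gap - u1 + u2

policies = [lambda x: (0.9, 0.1),   # model A: mostly chooses u2 = 0
            lambda x: (0.1, 0.9)]   # model B: mostly chooses u2 = 1

p_push = joint_safety_probability(2, [0.5, 0.5], policies, gap_dynamics,
                                  plan1=(1, 1), actions2=(0, 1),
                                  is_safe=lambda gap: gap >= 1)
p_wait = joint_safety_probability(2, [0.5, 0.5], policies, gap_dynamics,
                                  plan1=(1, 0), actions2=(0, 1),
                                  is_safe=lambda gap: gap >= 1)
```

With a required confidence of, say, 0.9, the plan `(1, 1)` would be rejected because its joint safety probability is 0.59, whereas `(1, 0)` is safe with probability one.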
IV-C POMDP-based solution method
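Firstly, we define $\gamma_{\tau|t}$, $\tau = 0, \dots, N-1$, as a probability distribution over the set $\mathcal{U}^1$, based on which the predicted action $u_{\tau|t}^1$ is chosen, i.e.,

$$\mathbb{P}\big(u_{\tau|t}^1 = u^1\big) = \gamma_{\tau|t}(u^1), \quad u^1 \in \mathcal{U}^1. \tag{15}$$

Then, we reformulate (14) as the following optimization problem:

$$\max_{\gamma_{0|t}, \dots, \gamma_{N-1|t} \in \Delta} \; \mathbb{E}\Big[ \sum_{\tau=0}^{N-1} \lambda^{\tau} R^1\big(x_{\tau|t}, u_{\tau|t}^1, u_{\tau|t}^2\big) \,\Big|\, \xi_t, \gamma_{0|t}, \dots, \gamma_{N-1|t} \Big] \tag{16a}$$

$$\text{subject to} \quad \mathbb{P}\big( x_{\tau|t} \in \mathcal{X}_{\text{safe}}, \; \tau = 1, \dots, N \,\big|\, \xi_t, \gamma_{0|t}, \dots, \gamma_{N-1|t} \big) \geq \beta, \tag{16b}$$

where $\Delta$ is the $(|\mathcal{U}^1| - 1)$-dimensional probability simplex.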
The problem (14) along with its transformation (16) is referred to as a partially observable Markov decision process (POMDP) with a time-joint chance constraint, where the partial observability comes from the unobservability of the hidden state $\sigma_t$, i.e., the environment’s cognitive level. A solution method for general problems in the form of (16) has been described in [13, 11]. In particular, the following Propositions 1 and 2, which represent matrix-computational implementations of the corresponding mathematical expressions in [13, 11] applied to the specific problem setting of this paper, are used in solving (16).
Proposition 1: Suppose that the reward function can be written as , where represents the next state transitioned from through the dynamic model (2). Then, for any given , it holds that
where is a vector collecting the reward values associated with every , in which , and is a vector representing the predicted distribution of the augmented state.
In particular, the reward vector is constructed offline and the distribution vector is computed online using the recursive formula,
in which denotes the -dimensional identity matrix, denotes the -dimensional all-ones vector, $\otimes$ represents the Kronecker product, and
with representing the transition kernel of the augmented state, constructed offline as
with denoting the set-membership indicator function.
The recursive formula (18) starts with the initial term , the posterior belief about the augmented state inferred according to the evidence , which is updated at every decision step using the Bayesian inference formula,
Proposition 2: For any given , the left-hand side of the constraint (16b) can be evaluated using the following algorithm:
Initialize , , and .
If then go to Step 5); otherwise update
and go to Step 2).
On the basis of Propositions 1 and 2, a standard nonlinear programming solver exploiting gradient and Hessian information of the cost and constraint functions of (16), which can be numerically estimated based on function evaluations for any of interest, can be used to solve for .
V Application to autonomous driving in interactive traffic scenarios
In the near to medium term, autonomous vehicles will operate in traffic together with human-driven vehicles. Ensuring safety in the associated interactive traffic scenarios remains a challenging problem for autonomous vehicle control [18]. In this section, we apply the decision making framework based on cognitive hierarchy theory described in the previous sections to controlling an autonomous ego vehicle in various traffic scenarios where it needs to interact with a human-driven vehicle. The traffic scenarios we consider include a four-way intersection scenario, a highway overtaking scenario, and a highway forced merging scenario.
Decision making of the human-driven vehicle is modeled based on the level-k framework described in Section III-A. Experimental studies [6, 5] suggest that humans are most commonly level-$1$ and level-$2$ reasoners. Therefore, we consider level-$1$ and level-$2$ models in the form of (7). Note that different human drivers may have different cognitive levels, and the autonomous ego vehicle does not know in advance the specific level, $\sigma$, of the human driver it is interacting with but has to infer it from its observations. Without any information at the initial step, we initialize the ego vehicle’s beliefs in the level-$1$ and level-$2$ models of the human-driven vehicle as and .
In all of the three traffic scenarios discussed in this section, we use the following discrete-time model to represent vehicle kinematics in the longitudinal direction,
where denotes position, denotes velocity, denotes acceleration, the subscript represents the discrete time, the first superscript distinguishes the autonomous ego vehicle from the human-driven vehicle , the second superscript denotes the or -direction, and [s] is the sampling period. We model lane changes as instantaneous events, i.e., completed in one time step. The acceleration , taking values in a finite acceleration set , and the lane change command are the actions to be decided on.
As described in Section III-A, in order to formulate the level-$k$ models of the human-driven vehicle, $\pi_k^2$, we need to define the level-$0$ models of both vehicles, $\pi_0^1$ and $\pi_0^2$. Following [12, 21], we let a level-$0$ vehicle select actions to maximize the same reward function as that for level-$1$ and level-$2$ vehicles but treat other vehicles on the road as stationary obstacles. Note that “as stationary obstacles” defines a way to predict the other vehicle’s actions in the decision making process (4), so the ego vehicle’s optimal actions, and hence the level-$0$ policy, can be determined.
As shown in Fig. 1, the autonomous ego vehicle (blue car) encounters a human-driven vehicle (red car) at an unsignalized four-way intersection. Both vehicles are driving straight through the intersection. Such an objective is represented by the following reward function,
where for the autonomous ego vehicle and for the human-driven vehicle.
In the formulated receding-horizon optimization problem in the form of (4), we choose the planning horizon as .
Moreover, given the safety requirement to maintain the positions in the safe set,
where , represents the Euclidean norm, and [m] is the car length, we impose the following chance constraint over the planning horizon,
Fig. 1(a-1) and (a-2) show two subsequent steps in the simulation of the autonomous ego vehicle interacting with a level-$1$ human-driven vehicle, and Fig. 1(b-1) and (b-2) show those of interacting with a level-$2$ human-driven vehicle. When interacting with a level-$1$ human-driven vehicle, which, on the basis of our level-$0$ model introduced above, represents a cautious/conservative driver, the autonomous ego vehicle decides to drive through the intersection first. When interacting with a level-$2$ human-driven vehicle (aggressive, based on our specified level-$0$ model), the autonomous ego vehicle yields the right of way to the human-driven vehicle. The autonomous ego vehicle responds to the two different human drivers in different ways because it gains knowledge of the human driver’s cognitive level by observing his/her actions for the first few steps, after which it can predict his/her future actions and respond optimally.
The second scenario we consider is shown in Fig. 2, where the autonomous ego vehicle (blue car) is overtaking a human-driven vehicle (red car). A similar scenario has been considered in [13, 11] but not in a game theoretic formulation.
We consider the following reward function,
where the first term is used to encourage the autonomous ego vehicle to overtake the human-driven vehicle, and the second term is used to penalize the autonomous ego vehicle for driving in the left passing lane so that it is encouraged to come back to the right traveling lane after the overtaking as quickly as it can.
We choose the planning horizon as and impose a chance constraint over the planning horizon in the form of (25) where the safe set is now defined as
in which [m] is the lane width. The safe set (27) represents the requirement that overtaking can occur only when the two vehicles are traveling in different lanes, otherwise they shall keep a reasonable distance in the longitudinal direction to improve safety.
Fig. 2(a-1)-(a-4) show four subsequent steps in the simulation of the autonomous ego vehicle interacting with a level-$1$ human-driven vehicle, and Fig. 2(b-1)-(b-4) show those of interacting with a level-$2$ human-driven vehicle. We note that in this simulation the maximum speed of the human-driven vehicle is restricted to be smaller than that of the autonomous ego vehicle so that overtaking is possible. When interacting with a level-$1$ driver, the autonomous ego vehicle completes the overtaking relatively quickly because, as can be seen in Fig. 2(a-2), the level-$1$ driver drives slowly to let it cut in. When interacting with a level-$2$ driver, the autonomous ego vehicle needs to drive in the passing lane for a longer period of time before it can come back to the traveling lane.
The last scenario we consider is a highway forced merging scenario. Unlike overtaking, which may improve travel speed but is usually unnecessary, merging often has to be accomplished within a certain road section. We consider the scenario shown in Fig. 3, where the autonomous ego vehicle (blue car), originally driving in the right lane, needs to merge into the traffic in the left lane. In particular, the merging can only be, and has to be, accomplished within the road section with the grey-dashed lane marking.
We consider the reward function
in the receding-horizon optimization (14), where the first term is used to encourage the autonomous ego vehicle to maintain a reasonable travel speed and the second term is used to encourage it to merge into the left lane. For the safe set, we choose
Subsequent steps in the simulation with a level-$1$ human-driven vehicle (red car) traveling in the left lane are shown in Fig. 3(a-1)-(a-4), and those with a level-$2$ human-driven vehicle are in Fig. 3(b-1)-(b-4). When the human-driven vehicle is level-$1$, which, on the basis of our level-$0$ model introduced at the beginning of Section V, represents a cautious/conservative driver, the autonomous ego vehicle decides to merge into the left lane ahead of the human-driven vehicle. When the human-driven vehicle is level-$2$, which represents an aggressive driver, the autonomous ego vehicle merges behind the human-driven vehicle as it predicts that the human-driven vehicle will likely not yield.
VI Discussion and Conclusions
In this paper, we described a framework synergizing cognitive behavioral models, Bayesian inference, and receding-horizon optimal control for autonomous decision making in a dynamic and interactive environment with uncertainty.
In the current version of the framework, the environment, which responds to the ego agent’s actions, is modeled as a single intelligent agent with a certain cognitive level $\sigma$. Simulation examples representing traffic scenarios where an autonomous ego vehicle interacts with a human-driven vehicle illustrate the application of the current version of the framework. When the environment is composed of multiple intelligent agents, the proposed framework may be extended so that each of the other agents, $j$, is modeled separately as a level-$\sigma^j$ decision maker and the ego agent estimates each $\sigma^j$ according to agent $j$’s historical behavior. We envision that such an extension is mainly a computational challenge rather than a theoretical one. Addressing it is left as a topic for future research.
References
[1] (2016) Cognitive hierarchy theory for heterogeneous uplink multiple access in the internet of things. In International Symposium on Information Theory (ISIT), pp. 1252–1256.
[2] (2014) On the origins of suboptimality in human probabilistic inference. PLoS Computational Biology 10 (6), pp. e1003661.
[3] (2012) Not noisy, just wrong: the role of suboptimal inference in behavioral variability. Neuron 74 (1), pp. 30–39.
[4] (2004) A cognitive hierarchy model of games. The Quarterly Journal of Economics 119 (3), pp. 861–898.
[5] (2009) Comparing models of strategic thinking in Van Huyck, Battalio, and Beil’s coordination games. Journal of the European Economic Association 7 (2-3), pp. 365–376.
[6] (2006) Cognition and behavior in two-person guessing games: an experimental study. American Economic Review 96 (5), pp. 1737–1768.
[7] (2002) Bounded rationality: the adaptive toolbox. MIT Press.
[8] (2008) Human–robot interaction: a survey. Foundations and Trends® in Human–Computer Interaction 1 (3), pp. 203–275.
[9] (2017) Safety-critical advanced robots: a survey. Robotics and Autonomous Systems 94, pp. 43–52.
[10] (2019) Non-equilibrium dynamic games and cyber–physical security: a cognitive hierarchy approach. Systems & Control Letters 125, pp. 59–66.
[11] (2019) Stochastic predictive control for partially observable Markov decision processes with time-joint chance constraints and application to autonomous vehicle control. Journal of Dynamic Systems, Measurement, and Control 141 (7), pp. 071007.
[12] (2018) Game theoretic modeling of vehicle interactions at unsignalized intersections and application to autonomous vehicle control. In American Control Conference (ACC), pp. 3215–3220.
[13] (2018) Tractable stochastic predictive control for partially observable Markov decision processes with time-joint chance constraints. In Conference on Decision and Control (CDC), pp. 3276–3282.
[14] (2018) Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems. IEEE Transactions on Control Systems Technology 26 (5), pp. 1782–1797.
[15] (2016) Unmanned aircraft systems airspace integration: a game theoretical framework for concept evaluations. Journal of Guidance, Control, and Dynamics.
[16] (2013) Game theory. Harvard University Press.
[17] (1995) Unraveling in guessing games: an experimental study. The American Economic Review 85 (5), pp. 1313–1326.
[18] (2018) Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems 1, pp. 187–210.
[19] (1995) On players’ models of other players: theory and experimental evidence. Games and Economic Behavior 10 (1), pp. 218–254.
[20] (2018) Reinforcement learning: an introduction. MIT Press.
[21] (2018) Adaptive game-theoretic decision making for autonomous vehicle control at roundabouts. In Conference on Decision and Control (CDC), pp. 321–326.
[22] (2014) Predicting pilot behavior in medium-scale scenarios using game theory and reinforcement learning. Journal of Guidance, Control, and Dynamics.
[23] (2018) A human-like game theory-based controller for automatic lane changing. Transportation Research Part C: Emerging Technologies 88, pp. 140–158.