## I Introduction

Human-centered robots (e.g., self-driving cars, assistive robots, etc.) operate in close proximity to humans. When designing planning algorithms for human-centered robots, it is critical for the robot to reason about the mutual influence between itself and human actors. Such a mutual dependency can be formulated as a general-sum game, in which a standard approach is to assume that each agent is a perfectly rational, expected utility maximizer, who simultaneously responds optimally to all the others (i.e., operates under equilibrium strategies) [myerson2013game, schwarting2019social]. However, experimental studies [goeree2001ten, crawford2007level, wright2014level] suggest that human behaviors often systemically deviate from equilibrium behaviors due to their latent cognitive states: bounded intelligence (cognitive limitation) and irrationality (tendency to make errors). Therefore, a robot must account for its human partner’s cognitive states for seamless and safe interactions.

Recent works exploited the leader-follower model [Sadighinformation, sadigh2016planning, fisac2019hierarchical, stefansson2019human] and the level- model [LiUnsignalized, tian2018adaptive, Sisi2019] to equip robots with the ability to reason about humans’ non-equilibrium behaviors. These planners either assign humans’ latent states a priori

, omitting humans’ distinct cognitive characteristics, or passively adapt to the humans’ latent states, sacrificing the benefits from actively learning the latent states (Sec.

II).In this work, we propose an anytime game-theoretic planning framework that integrates iterative reasoning models, a partially observable Markov decision process (POMDP), and chance-constrained Monte-Carlo belief tree search. Drawing inspiration from behavioral game theory, we model humans’ intelligence levels and degrees of rationality as their latent cognitive states, capturing their heterogeneous cognitive limitations and tendencies to make errors. Rather than passively adapting to humans’ latent states when planning, our approach enables the robot to actively and safely learn the latent states to achieve its goal more effectively

without losing the ability of real-time execution. Our key insight is: Human-centered robots can exploit the mutual influence in interactions to design actions that reveal their human partners’ cognitive limitations and degrees of rationality. By actively reasoning about these latent states, robots can achieve more effective planning. Overall, we make the following contributions:An active anytime game-theoretic planner. We formalize the robot planning problem in a human-robot team as a POMDP with the human’s cognitive states as latent states. The POMDP is approximately solved by an open-loop Monte-Carlo belief tree search algorithm in an anytime manner. Coupled with explicit realization of active information gathering on the latent states, chance-constrained branching, and tailored terminal value functions, our planner enables the robot to safely and adaptively balance between exploration (learning the human’s latent states) and exploitation (maximizing utility).

Application of our framework to autonomous driving. The proposed behavioral planner is connected with an off-the-shelf low-level motion control layer [borrelli2017predictive] to achieve feedback control for an autonomous car represented by a high-fidelity model. Simulations and user studies are conducted to show the effectiveness of our planner compared to baselines.

## Ii Related Work

Game-theoretic planning for human-centered robots. Our work is related to [Sadighinformation, sadigh2016planning, fisac2019hierarchical, stefansson2019human, LiUnsignalized, tian2018adaptive, Sisi2019]. In [sadigh2016planning, fisac2019hierarchical, stefansson2019human], the robot exploits the Stackelberg game [simaan1973additional] and models its human partner as a follower who accommodates the robot’s planned actions. The follower model is homogeneous (the human is always the follower w.r.t. the robot), thus the robot may behave poorly when the human does not behave like a follower. The approach in [Sadighinformation] allows the robot to actively “probe” the human’s latent states, but the underlying human model is still a pure follower model. In addition, the safety of the planner is not explicitly enforced in [Sadighinformation]. In [LiUnsignalized, tian2018adaptive, Sisi2019], the robot exploits the level- model [costa2001cognition] to model the human as an agent who reasons under various intelligence levels. While the level- model is heterogeneous, it assumes that humans with lower-levels of intelligence best respond to higher-level humans, omitting humans’ potential irrationality. Furthermore, the planners in [LiUnsignalized, tian2018adaptive, Sisi2019] passively adapt to the human’s intelligence level, leading to less effective plans. Our approach is different from [Sadighinformation, sadigh2016planning, fisac2019hierarchical, stefansson2019human] since our planner does not assume humans’ cognitive characteristics in interactions a priori but reasons about them when planning. Our work is also distinguished from [LiUnsignalized, tian2018adaptive, Sisi2019] by enabling the robot to safely and actively learn the human’s cognitive states to maximize utility more effectively without losing the ability of real-time execution.

Solution methods for POMDP. A POMDP is a framework for planning under uncertainty. Various approaches have been proposed to approximate POMDPs, including point-based methods [pineau2003point, porta2006point, kurniawati2008sarsop], open-loop strategies [yu2005open, weinstein2013open, perez2015open, phan2019memory], and Monte-Carlo tree search [silver2010monte, lim2019sparse]. Partially observable Monte-Carlo planning (POMCP) [silver2010monte] performs well, though building a closed-loop search tree in large games is computationally expensive. Open-loop strategies condition action selection on previous action sequences. They use a much smaller search space by sacrificing the ability of active information gathering, and achieve competitive performance compared to closed-loop planning under computational constraints [weinstein2013open]. Our approach combines the strengths from POMCP and open-loop strategies, achieving real-time active game-theoretic planning.

## Iii Problem formulation

Human-robot team formalization. We formalize the human-robot team as a two-player dynamic game represented by the tuple , where represents the two players with denoting the robot and denoting the human; and are, respectively, the joint fully-observable state and action spaces of the two agents; the function governs the evolution of the joint fully-observable state and is defined by the following dynamic model: ( , , ); is the reward function of an agent.

Robot planning as a POMDP. We consider planning from the robot’s perspective. The fully-observable state of the human-robot team represents measurable variables (e.g., position, speed, etc.). In addition, the human also has latent states (e.g., preference, trust, cognitive limitation, etc.) that characterize his/her cognition and reasoning; such latent states cannot be observed directly, and therefore must be inferred from interactions. We let denote the human’s latent states, and we consider an augmented state space . As the robot’s knowledge about the augmented state is incomplete, it maintains a belief distribution over at each discrete time step , namely, the robot maintains the belief state . We formulate the robot planning problem as a POMDP defined by the tuple , where denotes the dynamic game model defined above; is the augmented state space;

represents the space of probability distributions over

(); is the finite observation space;is a probability function specifying the probability of receiving an observation in a state; the belief dynamics function

governs the belief state evolution and is defined as . Given an initial belief state , the robot executes the action , receives the observation at time step , and updates its belief accordingly. denotes the reward function of the robot in belief space (defined in Sec. V-A); represents the set of safe states of the robot.We let denote a robot’s deterministic policy. Given belief state , the robot maximizes its value:

(1) |

where is the value function representing the robot’s expected return starting from , subject to its policy and the belief dynamics function. Note the robot needs to reason about both its own actions and its human partner’s responses due to the mutual dependence. The POMDP formulation allows us to condition the robot behaviors on the inferred latent states.

## Iv Human latent states modeling

### Iv-a Iterative Reasoning Model

When studying interactions in games, players are commonly assumed to adopt the Nash equilibrium solution: each player has unlimited computational resources and responds optimally to the others. In real life however, humans are known to act irrationally and have cognitive limitations. The iterative reasoning models from behavioral game theory have been proven to show better performance in characterizing humans’ reasoning capabilities in simultaneous games [wright2014level]. Examples of iterative reasoning models include: the level- model [costa2001cognition], the cognitive hierarchy model [camerer2004cognitive], and the quantal level- model [stahl1994experimental]. All these models aim to capture humans’ cognitive limitations and share a common feature: they model humans as agents with heterogeneous bounds on their reasoning abilities, i.e., human agents can only perform a finite number of iterations of reasoning, and such an intelligence bound is referred as the intelligence level. Among the various integrative reasoning models, the quantal level- model is the state-of-the-art [WRIGHT201716].

### Iv-B Human Quantal Level-k Reasoning Model

Quantal best response and rationality coefficient. One of the key components of the quantal level- (ql-) model is quantal best response. The notion behind quantal best response is that human players are more likely to select actions with higher expected future rewards [mckelvey1995quantal]. Formally, we define the quantal best response function as follows: let denote agent ’s expected total reward () when executing in against an action from his/her opponent . Then a quantal best response by agent to agent is a mixed policy:

(2) |

where is the rationality coefficient that controls the degree of agent conforming to optimal behaviors. In general, the larger the is, the more rational the human is.

Human quantal level- policies. In the ql- model, the iterative reasoning process starts from ql- agents who are non-strategic reasoners. Then, a ql- agent, , assumes the other agents are ql- agents, predicts their ql- policies, and quantally best responds to the predicted ql- policies. On the basis of ql- policies, the ql- policies are defined for every , for every , and for every through a sequential and iterative process. Specifically, given an initial state , a ql- agent maximizes the following objective: , where is the ql- value function of agent and is the predicted ql- policy of agent . The optimal value function satisfies the following Bellman equation: , and can be determined via value iteration. Then, we define the -value function as:

(3) |

and (2) is adopted to define agent ’s ql- policy. We note that when agent predicts its opponent’s ql- policy, it assumes that its opponent’s rationality coefficients is also (i.e., ) and forms its policy based on . We summarize the algorithm that computes the ql- policies of the agents in in Alg. 1.

## V Anytime Active game-theoretic planning

In this section, we embed the human ql- model in the POMDP through belief dynamics following the procedure in [Sisi2019, chen2020trust] and present our anytime game-theoretic planner.

### V-a Embed the Human Behavioral Model in Robot Planning

Observation function. We define an observation made by the robot as , i.e., the robot can measure the joint physical state . Then the observation function is defined as: , where is the joint physical state in , and is an indicator function, taking if the event is true and taking otherwise.

Prior belief state. We define the probability of arriving to state from state after executing as:

(4) | ||||

where is the dynamics function described in Sec. III, represents an explicit probabilistic model that governs the dynamics of the latent states ( denotes the model parameters), and denotes the human’s ql- policy with rationality coefficient (recall ). Then, we can define a prior belief state prediction function that predicts the future belief state without accounting for the possible observations: , where each element in is computed following . Then, the robot’s reward function in belief space can be defined as: .

Prior probability of future observations. With the observation function and the prior belief prediction function defined, given an initial belief state , we can predict the probabilities of the robot’s future observations after executing an action following:

(5) |

Posterior belief update. After executing an action and receiving an observation, the robot can update its posterior belief state through the belief dynamics equation . More specifically, each element in

can be computed using the Bayesian inference equation

[sarkka2013bayesian]:(6) |

### V-B Robot Planning Algorithm

Closed-loop policy. In stochastic environments, the robot’s actions can help “actively learn” the latent states for the benefits of the future. [bar1974dual] shows that only the closed-loop policy has the capability of active learning as it optimally balances between uncertainty reduction and maximizing utility. Solving (1) via stochastic dynamic programming to obtain a closed-loop policy is computationally intractable [bellman1966dynamic]. To achieve real-time computation, we exploit open-loop strategies, but compensate for the loss of active learning.

Open-loop feedback strategy. In contrast to solving for the optimal policy in (1), we let the robot solve for an optimal action sequence at each :

(7) |

where is the planning horizon, is a planned action sequence, and denotes the terminal value of the predicted belief state . The robot plans in a feed-back manner by applying the first action in and re-planing at the next time step. As opposed to the closed-loop policy, (7) fixes the action plan ahead and omits the benefits that can be propagated back from future observations. Consequently, (7) only passively learns the latent states and yields conservative actions. Hence, explicit methods can be used to actively learn the latent states [WITTENMARK199567].

Active learning of the latent states. We exploit the Shannon entropy [shannon2001mathematical]

to measure the estimation uncertainty of a belief state, and augment the robot’s reward function with the expected information gain:

(8) |

where , and is the observation prediction function defined in (5). In general, the higher the is, the more expected information the robot obtains about the human’s latent states if the robot executes in . With the augmented reward function , (7) becomes:

equationparentequation

(9a) | |||

(9b) | |||

(9c) |

where is an adaptive term that enables the robot to learn the latent states as needed. The constraint (9c) requires that the probability that the predicted state is in the safe set is larger than for all steps over the planning horizon, with being a design parameter. The overall safety is bounded by via risk allocation [blackmore2009convex, ono2012joint] (evaluation of risk is shown in Alg. 2). [Sisi2019] exploited continuous relaxation techniques to solve (7) with a time-joint chance constraint. However, with the information reward (9b), the approach in [Sisi2019] could become intractable.

Open-loop chance-constrained Monte-Carlo belief tree search. POMCP [silver2010monte] combines Monte-Carlo simulation and game-tree search. Building upon POMCP, we propose an algorithm (Alg. 2) to solve (9) in an anytime manner. Our algorithm differs from POMCP in the following ways: 1) a search node, , stores an action sequence and its associated statistics: is the mean return of all simulations that execute , and counts the number of times that has been visited; 2) leaf node expansions must enforce the safety chance constraint (line 14); 3) terminal values are estimated using the pre-computed ql- values (line 11); 4) the active information gathering on latent states is explicitly realized via the augmented reward function .

Benefits of Algorithm 2. A search node only stores an action sequence since the open-loop optimization (9) searches for an action sequence. In contrast to POMCP, in which a node stores a history of actions and observations, the search space in Alg. 2 is significantly reduced. Hence Alg. 2 can run in real-time under computational constraints. Quantal level- reasoning (Sec. IV) is exploited to estimate the terminal value when the maximum planning depth is reached (line 11). Specifically, the terminal belief state is used to determine the human’s intelligence level distribution , then, we assume the robot behaves as a ql- agent (recall that a ql- agent quantally best responds to a ql- agent) with rationality coefficient and estimates the terminal value using the pre-computed robot’s ql- value weighted by . By actively learning the belief state, the planner can quickly reduce the estimation error on the terminal values and maximize its performance.

Infeasibility handling mechanism. When the root node has no safe successors, we relax the chance constraint and find the action that minimizes the degree of constraint violation.

## Vi Experiments

### Vi-a Implementation Details

Test domain. We use autonomous driving as the test domain. In particular, we consider a forced merging scenario [isele2019interactive], where an autonomous car must merge to an adjacent (upper) lane that is occupied by a human-driven car (Fig. 1).

Reward functions. The reward function of a car is a linear combination of features ,satisfying . The features encode safety, comfort, progress, etc. [tian2018adaptive]. The weights can be recovered via an inverse learning algorithm [ziebart2008maximum, sun2019interpretable, tian2020bounded]. Note that when the autonomous car runs Alg. 2, the safety feature in its reward function is deactivated since the safety is handled by (9c).

Level-0 policy. Alg. 1 requires a ql- policy to initiate the iterative reasoning process. Similar to [LiUnsignalized, tian2018adaptive], we let a ql- agent be a non-strategic agent who treats others as static obstacles when making decisions.

Latent state dynamics. Recall (4) that an explicit probabilistic model is required to govern the dynamics of the latent states. In this work, we consider single-shot games, i.e., the human’s latent states are assumed to be constant during interaction. Hence, the latent states’ transition model is reduced to

. In general repeated games, the transition model can be represented as a Markov chain and its parameters

can be embedded in the POMDP and learned simultaneously as in [tianbeating].High-level robot planning. The dynamics of the human-robot team are represented as

(10) |

where () is the longitudinal (lateral) position, () is the longitudinal (lateral) speed, and is the acceleration. The sampling period is . We use a state grid of the size to represent the discrete states of the human-robot system. The safety set includes states in which the boundaries of the two agents do not overlap. We use and as the chance constraint thresholds in (9c). We let the highest intelligence level of the human be based on experimental results in [costa2006cognition, costa2009comparing]. The rationality coefficients take value from the set . The planning horizon in (9) is .

Hierarchical planning and control. Behavioral planning and control of the autonomous car are hierarchically connected. The planning layer (Alg. 2) uses a low-fidelity model (10) to generate behavioral commands, and runs at on a laptop with 2.8 GHz CPU. In the low-level control layer (running at ), the vehicle dynamics are represented by a high-fidelity bicycle model [kong2015kinematic], and we use a model predictive controller [borrelli2017predictive, cvxgen] to generate continuous controls that drive the system to the desired states generated by Alg. 2. In Alg. 2, when the actual system state deviates from the state grid, the nearest neighboring grid will be exploited.

Baselines. We consider two baseline planners: 1) our planner without the feature of active information gathering, i.e., the autonomous car passively infers the human’s latent states (BLP-1); 2) the strategic game-theoretic planner in [fisac2019hierarchical] that treats the human-driven car as a follower who accommodates the actions from the autonomous car (BLP-2). Both BLP-2 and our planner use a closed-loop feedback structure when building human behavioral models, but our planner reasons about the heterogeneity in the human’s cognitive limitations and irrationality through active inference rather than treating the human as a follower.

### Vi-B The Human Behavioral Model

Human intelligence level interpretation. The ql- model is exploited to reason about human behaviors under bounded intelligence. Recall that the level- agent represents a non-strategic agent who treats others as static obstacles. Thus, a ql- agent can be interpreted as a cautious agent since it believes that its opponent is an aggressive non-strategic agent. On the contrary, a ql- agent behaves aggressively since it believes its opponent is a cautious ql- agent.

Fig. 2 shows the interactions between two cars modeled as ql- agents in the forced merging task. The heat-map displays the ql- value function ( described in Sec. IV-B) of the lower-lane car, indicating the preferred states; colder color means higher value. It can be observed in (a-b) that the interactions are seamless between a ql- agent and a ql- agent, which is expected since the ql- agent’s belief in the model of the ql- agent matches the ground truth (note that the high-value region in the upper lane encourages the red car to merge ahead of the white car (a); the low-value region in front of the white car guides it to yield (b)). However, when an agent’s true model deviates from the other agent’s belief, conflicts may occur. For instance, (c) shows the interaction between two ql- agents with a dead-lock because both agents prefer to yield. The low-value region in front of the yellow car discourages it to merge although it is safe. Similarly, when two ql- agents interact (d), collisions may occur as both agents think their opponents will be likely to yield. Note that the high value region in the upper lane encourages the yellow car to merge even if it is not safe.

### Vi-C Case Studies and Quantitative Results

We compare our planner against the baselines via simulations. The human driven-car is modeled as a ql- agent, and is also controlled by the hierarchical planning and control scheme described in Sec. VI-A, with behavioral commands obtained directly from the corresponding ql- policies.

Hypotheses. We state the following two hypotheses: 1). active exploration improves efficiency of the robot’s planning; 2). our planner is robust to the human’s heterogeneous intelligence levels and irrational behaviors.

We first show two case studies, highlighting the benefits that naturally emerged from our planner.

Scenario 1. Fig. 3(a-b) show a scenario where the autonomous car and a simulated cautious human-driven car (ql- agent) with rationality coefficient start at the same initial speed and longitudinal position. It can be observed that with our planner, the autonomous car actively indicates an intention to merge by nudging into the target lane, as it predicts that, the human’s reactions triggered by the “probing” action can help disambiguate the human’s latent states. With the baseline planner (BLP-1), the autonomous car takes a longer time to merge, since passive inference requires additional observations to infer the latent states.

Scenario 2. Fig. 3(c-d) show a case that is similar to scenario 1 but the human-driven car is simulated by an aggressive ql- agent who starts behind the autonomous car. With our planner, the autonomous car nudges in to explore the human’s latent states, then quickly decides to yield after observing the humans’ aggressive reactions. With BLP-2, the autonomous car initiates a dangerous merge as BLP-2 incorrectly assumes the human-driven car will likely yield. Note that when humans behave under heterogeneous cognitive states, it is critical for a robot to reason about such a heterogeneity to better predict human behaviors.

Metrics for quantitative results. In quantitative studies, we use two metrics for validating the hypotheses : 1) the rate of success (RS), which measures the percentage of simulations in which no collision or dead-lock occurs; 2) the time used by the autonomous car to complete the merge (TM).

Results. We run simulations for each of the scenarios in the case studies using our planner and two baseline planners. We first compare our planner against BLP-1 in terms of inference performance (BLP-2 has no inference capability). In Fig. 4 (a-b), we show the time history of the autonomous car’s belief in the ground truth human-driven car’s latent states. It can be observed that our planner can more effectively identify the human’s latent states by exploiting the mutual influence to make its human partner reveal his/her hidden states. In Fig. 4 (c), we show the average TM for the two scenarios. Note that our planner achieves the lowest TM, due to the active inference. Furthermore, our planner achieves the highest confidence on TM since our planner strategically generates actions that aim for triggering the most informative reactions from the human, this regulates human behaviors in a game-theoretic sense (such a phenomena is also observed in user studies Fig. 5(b)).

Next, we evaluate the robustness of our planner under various human intelligence levels and degrees of irrationality. We let the simulated ql- human take his/her rationality coefficient from and run simulations for each () combination. The human starts at a random position within 10 around the autonomous car. In Fig. 4(d-e), we show the RS of each planner for each scenario. Note that in all cases, our planner and BLP-1 achieve more than 95% RS. This is attributed to reasoning about the human’s bounded intelligence and irrationality, and enforcing the safety chance constraint. BLP-2 shows satisfactory RS when the simulated human is a ql- agent, but the RS decreases as the human becomes more irrational. When the human is a ql- agent, BLP-2 performs noticeably worse. The results indicate that our planner enables the robot to learn the human’s latent states, plan more effectively, and be robust to the human’s various intelligence levels and irrational behaviors.

### Vi-D User Study

Objective. We conduct user studies in which we let the autonomous car interact with a real human driver in a simulator (Fig. 1), showcasing the effectiveness of our planner.

Experiment setup. We recruited 10 human participants. For each participant, we ran Scenario 1 for times using each of the planners. Here, the accelerations of the upper lane car are provided by human participants directly through a pedal.

Results. Fig. 5(a) shows the RS of each planner. It can be observed that both our planner and BLP-1 outperform the BLP-2 in terms of safety. Fig. 5(b) shows the time histories of the step reward collected by the autonomous car. Note that although the autonomous car pays more costs in the first few steps using our planner (due to the probing actions) it is able to complete the task much more effectively and safely compared with the baselines. In Fig. 5(c), we show the distributions of inferred intelligence levels and rationality coefficients. It can be observed that roughly 60% of the participants are identified as ql- agents, which is aligned with the experiment study on other forms of games [costa2009comparing]. In addition, Fig. 5(c) suggests that, under the reward function assumed by the autonomous car, most of the participants demonstrated behaviors that align with the behaviors produced by the rationality coefficient .

## Vii Conclusion

We proposed an anytime game-theoretic planning framework that integrates iterative reasoning models, POMDP, and chance-constrained Monte-Carlo belief tree search for robot behavioral planning. Our planner enables a robot to safely and actively reason about its human partner’s latent cognitive states in real-time to maximize its utility more effectively. We applied the proposed approach to an autonomous driving domain where our behavioral planner and a low-level motion controller hierarchically control an autonomous car to negotiate traffic merges. Both simulation and user study results demonstrated the effectiveness of our planner compared with baseline planners.

0.98