1 Author Summary
Agents interacting in games with multiple rounds must model their partners’ thought processes over extended time horizons. This poses a substantial computational challenge that has restricted previous behavioural analyses. By taking advantage of recent advances in algorithms for planning in the face of uncertainty, we demonstrate how these formal methods can be extended. We use a well studied social exchange game called the trust task to illustrate the power of our method, showing how agents with particular cognitive and social characteristics can be expected to interact, and how to infer the properties of individuals from observing their behaviour.
2 Introduction
Successful social interactions require individuals to understand the consequences of their actions on the future actions and beliefs of those around them. Mapping these processes is a complex challenge in at least three different ways. The first is that other people’s preferences or utilities are not known exactly. Even if the various components of the utility functions are held in common, the actual values of the parameters of partners, e.g., their degrees of envy or guilt [1, 2, 3, 4, 5, 6], could well differ. This ignorance decreases through experience, and can be modeled using the framework of a partially observable Markov decision process (POMDP). However, normal mechanisms for learning in POMDPs involve probing or running experiments, which has the potential cost of partners fooling each other. The second complexity lies in characterizing the form of the model that agents have of others. In principle, agent A’s model of agent B should include agent B’s model of agent A; and in turn, agent B’s model of agent A’s model of agent B, and so forth. The beautiful theory of Nash equilibria [7], extended to the case of incomplete information via so-called Bayes-Nash equilibria [8], dispenses with this so-called cognitive hierarchy [9, 10, 11, 12], looking instead for an equilibrium solution. However, a wealth of work (see for instance [13]) has shown that people deviate from Nash behaviour. It has been proposed that people instead model others to a strictly limited, yet non-negligible, degree [9, 10].
The final complexity arises when we consider that, although it is common in experimental economics to create one-shot interactions, many of the most interesting and richest aspects of behaviour arise with multiple rounds of interactions. Here, for concreteness, we consider the multi-round trust task, which is a social exchange game that has been used with hundreds of pairs (dyads) of subjects, including both normal and clinical populations [14, 15, 16, 17, 18]. This game has been used to show that characteristics that only arise in multi-round interactions, such as defection (agent A increases their cooperation between two rounds; agent B responds by decreasing theirs), have observable neural consequences that can be measured using functional magnetic resonance imaging (fMRI) [19, 14, 20, 21, 22].
The interactive POMDP (IPOMDP) [23] is a theoretical framework that formalizes many of these complexities. It characterizes the uncertainties about the utility functions and planning over multiple rounds in terms of a POMDP, and constructs an explicit cognitive hierarchy of models about the other (hence the moniker ‘interactive’). This framework has previously been used with data from the multi-round trust task [24, 20]. However, solving IPOMDPs is computationally extremely challenging, restricting those previous investigations to a rather minuscule degree of forward planning (just two rounds out of what is actually a ten-round interaction). Our main contribution is the adaptation of an efficient Monte Carlo tree search method, called partially observable Monte Carlo planning (POMCP), to IPOMDP problems. Our second contribution is to illustrate this algorithm through examination of the multi-round trust task. We show characteristic patterns of behaviour to be expected for subjects with particular degrees of inequality aversion, other-modeling and planning capacities, and consider how to invert observed behaviour to make inferences about the nature of subjects’ reasoning capacities.
3 Materials and Methods
We first briefly review Markov decision processes (MDPs), their partially observable extensions (POMDPs), and the POMCP algorithm invented to solve them approximately, but efficiently. These concern single agents. We then discuss IPOMDPs and the application of POMCP to solving them when there are multiple agents. Finally, we describe the multiround trust task.
3.1 Partially Observable Markov Decision Processes
A Markov decision process (MDP) [25] is defined by sets of “states” $\mathcal{S}$ and of “actions” $\mathcal{A}$, and several components that evaluate and link the two, including transition probabilities $\mathcal{T}$ and information $\mathcal{R}$ about possible rewards. States describe the position of the agent in the environment, and determine which actions can be taken, accounting, at least probabilistically, for the consequences for rewards and future states. Transitions between states are described by means of a collection of transition probabilities $\mathcal{T}$, assigning to each possible state $s \in \mathcal{S}$ and each possible action $a \in \mathcal{A}$ from that state a transition probability distribution or measure $\mathcal{T}(\tilde{s} \mid s, a) := P[\tilde{s} \mid s, a]$, which encodes the likelihood of ending in state $\tilde{s}$ after taking action $a$ from state $s$. The Markov property requires that the transition (and reward) probabilities depend only on the current state (and action), and are independent of past events. An illustration of these concepts can be found in figure 1.

By contrast, in a partially observable MDP (i.e., a POMDP [26]), the agent can also be uncertain about its state $s$. Instead, there is a set of observations $o \in \mathcal{O}$ that incompletely pin down states, depending on the observation probabilities $\mathcal{W}(o \mid \tilde{s}, a) := P[o \mid \tilde{s}, a]$. These report the probability of observing $o$ when action $a$ has occasioned a transition to state $\tilde{s}$. See figure 2 for an illustration of the concept.
We use the notation $s_t$, $a_t$ or $o_t$ to refer explicitly to the outcome state, action or observation at a given time $t$. The history $h_t$ is the sequence of actions and observations, wherein each action from the point of view of the agent moves the time index ahead by $1$, $h_t := \{o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t\}$. Here $o_0$ may be trivial (deterministic or empty). The agent can perform Bayesian inference to turn its history $h_t$ at time $t$ into a distribution $P[S_t = s_t \mid h_t]$ over its state at time $t$, where $S_t$ denotes the random variable encoding the uncertainty about the current state at time $t$. This distribution is called its belief state $\mathcal{B}(h_t)$, with $P_{\mathcal{B}(h_t)}[S_t = s_t] := P[S_t = s_t \mid h_t]$. Inference depends on knowing $\mathcal{T}$, $\mathcal{W}$ and the distribution over the initial state $S_0$, which we write as $\mathcal{B}(h_0)$.

Information about rewards $\mathcal{R}$ comprises a collection of utility functions $u(a, s, o)$, a discount function $\Gamma$ (footnote 5: a more general definition would allow the discount function to be conditional on the precise present and future histories) and a survival function $H$. The utility functions determine the immediate gain associated with executing action $a$ at state $s$ and observing $o$ (sometimes writing $u_t$ for the reward following the $t$-th action). From the utilities, we define the reward function $r(a, s)$ as the expected gain for taking action $a$ at state $s$, $r(a, s) := \mathbb{E}_{o}[u(a, s, o)]$, where this expectation is taken over all possible observations $o$. Since we usually operate on histories, rather than fixed states, we define the expected reward from a given history $h$ as $r(a, h) := \sum_{s \in \mathcal{S}} r(a, s)\, P[s \mid h]$. The discount function weights the present impact of a future return, depending only on the separation between present and future. We use exponential discounting with a fixed number $\gamma \in [0, 1]$ to define our discount function:

$$\Gamma(\tau, t) := \gamma^{\tau - t}, \qquad \tau \ge t. \tag{3.1}$$

Additionally, we define $H$ such that $H(\tau, t)$ is $0$ for $\tau > K$ and $1$ otherwise. $K$ in general is a random stopping time. We call the second component $t$ the reference time of the survival function.

The survival function allows us to encode the planning horizon of an agent during decision making: if $H(\tau, t)$ is $0$ for $\tau - t > P$, we say that the local planning horizon at $t$ is less than or equal to $P$.
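As a minimal sketch of how exponential discounting and a survival function jointly truncate a stream of expected rewards, consider the following Python fragment. The function names, the argument order `(tau, t)` and the combination of a planning horizon with a hard final step are illustrative assumptions, not notation taken from the original text.

```python
def discount(tau: int, t: int, gamma: float) -> float:
    """Exponential discount: gamma raised to the separation tau - t."""
    return gamma ** (tau - t)


def survival(tau: int, t: int, horizon: int, last_step: int) -> int:
    """1 while tau is within `horizon` steps of t and not past `last_step`, else 0."""
    return int(tau - t <= horizon and tau <= last_step)


def truncated_return(rewards, t, gamma, horizon, last_step):
    """Sum rewards[tau] for tau >= t, weighted by discount and survival."""
    return sum(
        discount(tau, t, gamma) * survival(tau, t, horizon, last_step) * r
        for tau, r in enumerate(rewards)
        if tau >= t
    )
```

With a horizon of two steps, only the present reward and the two following rewards contribute, whatever the length of the reward sequence.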
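The policy $\pi \in \Pi$, $\pi(a, h) := P[a \mid h]$, is defined as a mapping of histories to probabilities over possible actions. Here $\Pi$ is called the set of admissible policies. For convenience, we sometimes write the distribution function as $\pi(h)$. The value function of a fixed policy $\pi$ starting from present history $h_t$ is

$$V^{\pi}(h_t) := \sum_{\tau = t}^{\infty} \mathbb{E}\left[\Gamma(\tau, t)\, H(\tau, t)\, r(a_\tau, h_\tau) \mid h_t, \pi\right], \tag{3.2}$$

i.e., a sum of the discounted future expected rewards (note that $h_\tau$ is a random variable here, not a fixed value). Equally, the state-action value is

$$Q^{\pi}(a, h_t) := r(a, h_t) + \sum_{\tau = t+1}^{\infty} \mathbb{E}\left[\Gamma(\tau, t)\, H(\tau, t)\, r(a_\tau, h_\tau) \mid h_t, a, \pi\right]. \tag{3.3}$$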
Definition 1 (Formal Definition - POMDP).
Using the notation of this section, a POMDP is defined as a tuple of components as outlined above.
Convention 1 (Softmax Decision Making).
A wealth of experimental work (for instance [27, 28, 29]) has found that the choices of humans (and other animals) can be well described by softmax policies based on the agents’ state-action values, to encompass the stochasticity of observed behaviour in real subject data. See [30] for a behavioural economics perspective and [11] for a neuroscience perspective. In view of using our model primarily for experimental analysis, we will base our discussion on the decision making rule:

$$\pi(a, h) := P[a \mid h] = \frac{e^{\beta Q(a, h)}}{\sum_{b \in \mathcal{A}} e^{\beta Q(b, h)}}, \tag{3.4}$$

where $\beta > 0$ is called the inverse temperature parameter and controls how diffuse the probabilities are. The policy

$$\pi(a, h) := \begin{cases} 1 & \text{if } a = \arg\max_{b} Q(b, h), \\ 0 & \text{otherwise,} \end{cases} \tag{3.5}$$

can be obtained as a limiting case for $\beta \to \infty$.
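The softmax rule and its greedy limit can be illustrated with a short Python sketch. The function name and the list-based representation of action values are assumptions made for illustration; subtracting the maximum before exponentiating is a standard numerical safeguard and does not change the resulting probabilities.

```python
import math


def softmax_policy(q_values, beta):
    """Return action probabilities proportional to exp(beta * Q)."""
    q_max = max(q_values)  # shift by the maximum for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

For `beta = 0` the policy is uniform, while for large `beta` almost all probability mass sits on the action with the highest value, which corresponds to the limiting greedy policy.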
Convention 2.
From now on, we shall denote by $Q(a, h)$ the state-action value $Q^{\pi}(a, h)$ with respect to the softmax policy.
3.2 POMCP
POMCP was introduced by [31] as an efficient approximation scheme for solving POMDPs. Here, for completeness, we describe the algorithm; later, we adapt it to the case of an IPOMDP.
POMCP is a generative model-based sampling method for calculating history-action values. That is, it builds a limited portion of the tree of future histories starting from the current history $h_t$, using a sample-based search algorithm (called upper confidence bounds for trees (UCT); [32]) which provides guarantees as to how far from optimal the resulting action can be, given a certain number of samples (based on results in [33] and [34]). Algorithm 1 provides pseudocode for the adapted POMCP algorithm. The procedure is presented schematically in figure 3.
The algorithm is based on a tree structure $T$, wherein nodes $T(h) = \langle N(h), V(h), \mathcal{B}(h) \rangle$ represent possible future histories explored by the algorithm, and are characterized by the number $N(h)$ of times history $h$ was visited in the simulation, the estimated value $V(h)$ for visiting $h$ and the approximate belief state $\mathcal{B}(h)$ at $h$. Each new node in $T$ is initialized with initial action exploration counts $N(a, h)$ for all possible actions $a$ from $h$, an initial action value estimate $Q(a, h)$ for all possible actions $a$ from $h$, and an empty belief state $\mathcal{B}(h)$.

The value $N(h)$ is then calculated from all action counts from the node, $N(h) = \sum_{a} N(a, h)$. $V(h)$ denotes the mean of obtained values, for simulations starting from node $T(h)$.

$\mathcal{B}(h)$ can either be calculated analytically, if it is computationally feasible to apply Bayes’ theorem, or be approximated by the so-called root sampling procedure (see below).

In terms of the algorithm, the generative model $\mathcal{G}$ of the POMDP determines $(r, o, \tilde{s}) \sim \mathcal{G}(s, a)$, the simulated reward, observation and subsequent state for taking $a$ at $s$; $s$ itself is sampled from the current history $h$. Then, every (future) history of actions and observations defines a node $T(h)$ in the tree structure $T$, which is characterized by the available actions and their average simulated action values under the policy SoftUCT at future states.

If the node $T(h_t)$ has been visited for the $N(h_t)$-th time, with action $a$ being taken for the $N(a, h_t)$-th time, then the average simulated value is updated (starting from its initial estimate) using sampled simulated rewards $r_\tau$ up to terminal time $K$, when the current simulation/tree traversal ends, as:

$$Q(a, h_t) \leftarrow Q(a, h_t) + \frac{R - Q(a, h_t)}{N(a, h_t)}, \qquad R := \sum_{\tau = t}^{K} \gamma^{\tau - t}\, r_\tau. \tag{3.6}$$
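The averaging of simulated returns at a node can be written as an incremental update, so that no list of past returns has to be stored. The following Python sketch is one standard way to do this; the function name and the convention that the visit count has already been incremented before the update are assumptions.

```python
def update_mean(q_old: float, n_visits: int, sampled_return: float) -> float:
    """Running average after the n-th visit (n_visits >= 1) with a new return."""
    return q_old + (sampled_return - q_old) / n_visits


# Usage: fold three simulated returns into a single estimate.
q = 0.0
for n, ret in enumerate([2.0, 4.0, 6.0], start=1):
    q = update_mean(q, n, ret)
```

After the loop, `q` equals the arithmetic mean of the three returns.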
The search algorithm has two decision rules, depending on whether a traversed node has already been visited or is a leaf of the search tree. In the former case, a decision is reached using SoftUCT by defining

$$Q_{\mathrm{SoftUCT}}(a, h) := Q(a, h) + c \sqrt{\frac{\log N(h)}{N(a, h)}}, \tag{3.7}$$

where $c$ is a parameter that favors exploration (analogous to an equivalent parameter in UCT).
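One possible reading of the SoftUCT rule is sketched below in Python: a UCT-style exploration bonus is added to each action value and a softmax with inverse temperature `beta` is applied to the result. The exact form of the bonus, the handling of untried actions (they are selected first, with equal probability) and all names are assumptions for illustration and are not taken from the original algorithm listing.

```python
import math


def soft_uct_probs(q_values, action_counts, beta, c):
    """Softmax over Q plus an assumed UCT bonus c * sqrt(log N(h) / N(a, h))."""
    untried = [i for i, n in enumerate(action_counts) if n == 0]
    if untried:
        # Assumption: untried actions are explored before any other action.
        return [1.0 / len(untried) if i in untried else 0.0
                for i in range(len(q_values))]
    n_node = sum(action_counts)  # N(h) as the sum of the action counts
    scores = [beta * (q + c * math.sqrt(math.log(n_node) / n))
              for q, n in zip(q_values, action_counts)]
    s_max = max(scores)
    weights = [math.exp(s - s_max) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal action values, the less frequently tried action receives the larger bonus and hence the higher selection probability.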
If the node is new, a so-called “rollout” policy is used to provide a crude estimate of the value of the leaf. This policy can be either very simple (uniform or greedy based on a very simple model) or specifically adjusted to the search space, in order to optimize performance.
The rollout value estimate together with the SoftUCT exploration rule is the core mechanism for efficient tree exploration. In this work, we only use an $\epsilon$-greedy mechanism, as is described in the section on the multi-round trust game.
Another innovation in POMCP that underlies its dramatically superior performance is called root sampling. This procedure allows the belief state at later states to be formed, as long as the initial belief state $\mathcal{B}(h_0)$ is known. This means that, although it is necessary to perform inference to draw samples from the belief state at the root of the search tree, one can then use each sample as if it were (temporarily) true, without performing inference at states that are deeper in the search tree to work out the new transition probabilities that pertain to the new belief states associated with the histories at those points. The reason for this is that the probabilities of getting to the nodes in the search tree represent exactly what is necessary to compensate for the apparent inferential infelicity [31]; i.e., the search tree performs as a probabilistic filter. The technical details of the root sampling procedure can be found in [31].
In the presence of analytically tractable updating rules (or at least analytically tractable approximations) the belief state at a new node can instead be calculated by Bayes’ theorem. In the case of the multi-round trust game below, we follow the approximating updating rule in [20].
3.3 Interactive Partially Observable Markov Decision Processes
An Interactive Partially Observable Markov Decision Process (IPOMDP) is a multi-agent setting in which the actions of each agent may observably affect the distribution of expected rewards for the other agents.
Since IPOMDPs may be less familiar than POMDPs, we provide more detail about them; consult [23] for a complete reference formulation and [35] for an excellent discussion and extension.
We define the IPOMDP such that the decision making process of each agent becomes a standard (albeit large) POMDP, allowing the direct application of POMDP methods to IPOMDP problems.
Definition 2 (Formal Definition - IPOMDP).
An IPOMDP is a collection of POMDPs such that the following holds:
Agents are indexed by the finite set $\mathcal{I}$. Each agent $i \in \mathcal{I}$ is described by a single POMDP $(\mathcal{S}^i, \mathcal{A}^i, \mathcal{O}^i, \mathcal{T}^i, \mathcal{W}^i, \mathcal{R}^i, \Pi^i, \mathcal{B}^i_0)$ denoting its actual decision making process. We first define the physical state space $\mathcal{S}^i_{\mathrm{phys}}$: an element $s_{\mathrm{phys}} \in \mathcal{S}^i_{\mathrm{phys}}$ is a complete setting of all features of the environment that determine the action possibilities and obtainable rewards of $i$ for the present and all possible following histories, from the point of view of $i$. The physical state space is augmented by the set $\mathcal{D}^i$ of models of the partner agents $j \in \mathcal{I} \setminus \{i\}$, called intentional models, which are themselves POMDPs $\theta^{ij} = (\mathcal{S}^{ij}, \mathcal{A}^{ij}, \mathcal{O}^{ij}, \mathcal{T}^{ij}, \mathcal{W}^{ij}, \mathcal{R}^{ij}, \Pi^{ij}, \mathcal{B}^{ij}_0)$. These describe how agent $i$ believes agent $j$ perceives the world and reaches its decisions. The possible state space of agent $i$ can be written $\mathcal{S}^i = \mathcal{S}^i_{\mathrm{phys}} \times \mathcal{D}^i$ and a given state can be written $s = (s_{\mathrm{phys}}, \theta)$, where $s_{\mathrm{phys}}$ is the physical state of the environment and $\theta$ are the models of the other agents. Note that the intentional models $\theta^{ij}$ themselves contain state spaces that encode the history of the game as observed by agent $j$ from the point of view of agent $i$. The elements of $\mathcal{S}^i$ are called interactive states. Agents themselves act according to the softmax function of history-action values, and assume that their interactive partner agents do the same. The elements of the definition are summarized in figure 4.
Convention 3.
We denote by capital $S_{\mathrm{phys}}$ and capital $S$ the random variables that encode uncertainty about the physical state and the interactive state respectively.
When choosing the set of intentional models, we consider agents and their partners to engage in a cognitive hierarchy of successive mentalization steps [10, 9], depicted in figure 5. The simplest agent can try to infer what kind of partner it faces (level $0$ thinking). The next simplest agent could additionally try to infer what the partner might be thinking of it (level $1$). Next, the agent might try to understand their partner’s inferences about the agent’s thinking about the partner (level $2$). Generally, this would enable a potentially unbounded chain of mentalization steps. It is a tenet of cognitive hierarchy theory [10] that the hierarchy terminates finitely and for many tasks after only very few steps (e.g., Poisson, with a mean of around $1.5$).
We formalize this notion as follows.
Definition 3 (A Hierarchy Of Intentional Models).
Since models of the partner agent may contain interactive states in which it in turn models the agent , we can specify a hierarchical intentional structure , built from what we call the level intentional models . is defined inductively from
This means that any level intentional model reacts strictly to the environment, without holding any further intentional models. The higher levels are obtained as
Here denotes the intentional models, that agent thinks agent might hold of the other players. These level intentional models arise by the same procedure applied to the level models that agent thinks agent might hold.
Definition 4 (Theory of Mind (ToM) Level).
We make an assumption similar to so-called level-$k$ thinking (see [9]), in that we assume that each agent operates at a particular level $k$ (called the agent’s theory of mind (ToM) level, which it is assumed to know), and models all partners as being at level $k-1$.
Convention 4.
It is necessary to be able to calculate the belief state in every POMDP that is encountered. An agent updates its belief state in a Bayesian manner, following an action $a_t$ and an observation $o_{t+1}$. This leads to a sequential update rule operating over the belief state $\mathcal{B}^i(h_t)$ of a given agent $i$ at a given time $t$:

$$P[S_{t+1} = \tilde{s} \mid h_{t+1}] = \eta\, \mathcal{W}^i(o_{t+1} \mid \tilde{s}, a_t) \sum_{s \in \mathcal{S}^i} \mathcal{T}^i(\tilde{s} \mid s, a_t)\, P[S_t = s \mid h_t]. \tag{3.8}$$

Here $\eta$ is a normalization constant associated with the joint distribution of transition and observation probability, conditional on $h_t$, $a_t$ and $o_{t+1}$. The observation $o_{t+1}$ in particular incorporates any results of the actions of the other agents, before the next action of the given agent.

We note that the above rule applies recursively to every intentional model in the nested structure, as every POMDP has a separate belief state.
This is slightly different from [23] so that the above update is conventional for a POMDP.
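For a finite state space, the sequential Bayesian belief update of a POMDP can be sketched in a few lines of Python. The dictionary representation of the belief and the callable signatures `trans(s_next, s, a)` and `obs_prob(o, s_next, a)` are assumptions made for illustration.

```python
def belief_update(belief, action, observation, trans, obs_prob):
    """One Bayesian filtering step.

    belief:   dict mapping state -> probability
    trans:    trans(s_next, s, a) = P[s_next | s, a]
    obs_prob: obs_prob(o, s_next, a) = P[o | s_next, a]
    """
    states = list(belief)
    unnormalized = {
        s_next: obs_prob(observation, s_next, action)
        * sum(trans(s_next, s, action) * belief[s] for s in states)
        for s_next in states
    }
    eta = sum(unnormalized.values())  # normalization constant
    if eta == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / eta for s, p in unnormalized.items()}
```

In a nested setting, the same step would be applied separately to each intentional model, since every POMDP carries its own belief state.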
Convention 5 (Expected Utility Maximisation).
The decision making rule in our IPOMDP treatment is based on expected utility as encoded in the reward function. The explicit formula for the action value under a softmax policy (equation 3.4) is:

$$Q^{j}(a, h_t) := r^{j}(a, h_t) + \gamma^{j} \sum_{o \in \mathcal{O}^j} P[o \mid a, h_t] \sum_{b \in \mathcal{A}^j} \pi(b, h_{t+1})\, Q^{j}_{t}(b, h_{t+1}). \tag{3.9}$$

Here $h_{t+1} := h_t \cup \{a, o\}$ and $Q^{j}_{t}(b, h_{t+1})$ denotes the action value at $h_{t+1}$ with the survival function conditioned to reference time $t$. $\gamma^{j}$ is the discount factor of agent $j$, rather than the $j$-th power. This defines a recursive Bellman equation, with the value of taking action $a$ given history $h_t$ being the expected immediate reward $r^{j}(a, h_t)$ plus the expected value of future actions conditional on $a$ and its possible consequences $o$, discounted by $\gamma^{j}$.
The belief state $\mathcal{B}^{j}(h_t)$ allows us to link $h_t$ to a distribution of interactive states and use $\mathcal{W}^{j}$ to calculate $P[o \mid a, h_t]$, in particular including the reactions of other agents to the actions of one agent. We call the resulting policy the “solution” to the IPOMDP.
3.4 Equilibria and IPOMDPs
Our central interest is in the use of the IPOMDP to capture the interaction amongst human agents with limited cognitive resources and time for their exchanges. It has been noted in [10] that the distribution of subject levels favours rather low values (e.g., Poisson, with a mean of around $1.5$). In the opposite limit, sufficient conditions are known under which taking the cognitive hierarchy out to infinity for all involved agents allows for at least one Bayes-Nash equilibrium solution (part II, theorem II, p. of Harsanyi [8]), and sufficient conditions have been shown in [36] under which a solution to the infinite hierarchy model can be approximated by the sequence of finite hierarchy model solutions. A discussion of a different condition can be found in [37]; however, this condition does assume an infinite time horizon in the interaction. In general, as [10], p. notes, it is not true that the infinite hierarchy solution will be a Nash equilibrium. For the purposes of computational psychiatry, we find the very mismatches and limitations that prevent subjects’ strategies from evolving to a (Bayes-)Nash equilibrium in the given time frame to be of particular interest. Therefore we restrict our attention to quantal-response-equilibrium-like behaviours [30], based on potentially inconsistent initial beliefs by the involved agents with ultimately very limited cognitive resources and finite time exchanges.
3.5 Applying POMCP to an IPOMDP
An IPOMDP is a collection of POMDPs, so POMCP is, in principle, applicable to each encountered POMDP.
However, unlike the examples in [31], an IPOMDP contains the intentional model POMDPs as part of the state space, and these themselves contain a rich structure of beliefs. So, the state that is sampled from the belief state at the root for agent $i$ is an $n$-tuple of a physical state and $n-1$ POMDPs, one for each partner. (This is also akin to the random instantiation of players in [8].) Since the intentional models still contain belief states in their own right, it is still necessary to do some explicit inference during the creation of each tree. Indeed, explicit inference is hard to avoid altogether during simulation, as the interactive states require the partner to be able to learn [23]. Nevertheless, a number of performance improvements that we detail below still allow us to apply the POMCP method involving substantial planning horizons.
3.6 Simplifications for dyadic repeated exchange
Many social paradigms based upon game theory, including the iterated ultimatum game, prisoners’ dilemma, iterated “rock, paper, scissors” (for $2$ agents) and the multi-round trust game, involve repeated dyads. In these, each interaction involves the same structure of physical states and actions (see below), and all discount functions are $0$ past a finite horizon.

Definition 5 (Dyadic Repeated Exchange without state uncertainty).
Consider a two-agent IPOMDP framework in which there is no physical state uncertainty: both agents fully observe each other’s actions and there is no uncertainty about environmental influence; and in which agents vary their play only based on intentional models and an agent does not believe that the partner can be made to transition between different intentional models by the agent’s actions. Additionally, the framework is assumed to reset after each exchange (i.e., after both agents have acted once).
Formally this means: there is a fixed setting such that physical states, actions from these states, transitions in the physical state and hence also obtainable rewards differ only by a changing time index; there is no observational uncertainty; and an agent does not believe that the partner can be made to transition between different intentional models by the agent’s actions. After each exchange, the framework is then assumed to reset to the same distribution of physical initial states within this setting (i.e., the game begins anew).
Games of this sort admit an immediate simplification:
Theorem 1 (Level $0$ Recombining Tree).
In the situation of definition 5, level $0$ action values at any given time only depend on the total set of actions and observations so far and not the order in which those exchanges were observed.
Proof.
The level $-1$ partner model only acts on the physical state it encounters, and the physical state space variable is reset at the beginning of each round in the situation of definition 5. Therefore, given a state $s$ in the current round and an action $a$ by a level $0$ agent, the likelihood of each transition to some state $\tilde{s}$, $\mathcal{T}(\tilde{s} \mid s, a)$, and of making observation $o$, $\mathcal{W}(o \mid \tilde{s}, a)$, is the same at every round from the point of view of the level $0$ agent. It follows that the cumulative belief update from equation 3.8, from the initial beliefs to the current beliefs, will not depend on the order in which the action-observation pairs were observed.∎
This means that, depending on the size of the state space and the depth of planning of interest, we may analytically calculate level $0$ action values even online, or use precalculated values for larger problems. Furthermore, because their action values will only depend on past exchanges and not on the order in which they were observed, their decision making tree can be reformulated as a recombining tree.
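The order independence that underlies the recombining tree can be checked numerically in a toy setting. The Python sketch below assumes a static hidden partner type and fixed per-exchange likelihoods; the type names, exchange labels and the use of a sorted multiset as a node key are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import permutations


def posterior(prior, likelihood, exchanges):
    """Sequential Bayes update over a static hidden partner type."""
    post = dict(prior)
    for ex in exchanges:
        post = {k: post[k] * likelihood[k][ex] for k in post}
        total = sum(post.values())
        post = {k: v / total for k, v in post.items()}
    return post


def recombining_key(exchanges):
    """Order-free node key: the multiset of exchanges observed so far."""
    return tuple(sorted(Counter(exchanges).items()))


def is_order_independent(prior, likelihood, exchanges, tol=1e-12):
    """Check that every ordering of the exchanges yields the same posterior."""
    ref = posterior(prior, likelihood, exchanges)
    return all(
        abs(posterior(prior, likelihood, perm)[k] - ref[k]) < tol
        for perm in permutations(exchanges)
        for k in ref
    )
```

Histories that share the same key can then share one node, which turns the search tree into a recombining tree.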
Sometimes, an additional simplification can be made:
Theorem 2 (Trivialised Planning).
In the situation of definition 5, if the two agents do not act simultaneously and the state transition of the second agent is entirely dependent on the action executed by the first agent (as in the multi-round trust task); and additionally the intentional model of the partner cannot be changed through the actions of the second agent, then a level $0$ second agent can gain no advantage from planning ahead, since their actions will not change the action choices of the first agent.
Proof.
In the scenario described in theorem 2 the physical state variable of the agent is entirely dependent on the action of the other agent. If the agent is level $0$, they model their partner as level $-1$, and by the additional assumption the second agent does not believe that the partner can be made to transition between different intentional models by the second agent’s actions. Hence their partner will not change their distribution of state transitions depending on the agent’s actions, and hence their distribution of future obtainable rewards will not change either. ∎
Theorem 3 (Trivialised Theory of Mind Levels).
Proof.
In the scenario described in theorem 2, the second-to-go level $0$ agent behaves like a level $-1$ agent, as it does not benefit from modeling the partner. This implies that the first-to-go agent gains no additional information at level $1$ thinking, since the partner behaves like level $-1$, which was modeled by the level $0$ first-to-go agent already. In turn, the level $2$ second-to-go agent gains no additional information over the level $1$ second-to-go agent, as their partner model does not change between modeling the partner at level $0$ or level $1$. By induction, we get the result. ∎
3.7 The Trust Task
The multi-round trust task, illustrated in figure 6, is a paradigm social exchange game. It involves two people, one playing the role of an ‘investor’, the other that of a ‘trustee’, over $10$ sequential rounds, expressed by a time index $t$. Both agents know all the rules of the game. In each round, the investor receives an initial endowment of $20$ monetary units. The investor can send any of this amount to the trustee. The experimenter trebles this quantity and then the trustee decides how much to send back to the investor, between $0$ points and the whole amount that she receives. The repayment by the trustee is not increased by the experimenter. After the trustee’s action, the investor is informed, and the next round starts. We consider the trust task as an IPOMDP with two agents, i.e., the index set contains just $I$ for the investor and $T$ for the trustee. We consider the state to contain two components: one physical and observable (the endowment and investments), the other non-physical and non-observable (in our case, parameters of the utility function). It is the latter that leads to the partial observability in the IPOMDP. Following [24], we reduce complexity by quantizing the actions and the (non-observable) states of both investor and trustee (shown for one complete round in figure 7). The actions are quantized into fractional categories shown in figure 7. For the investor, we consider a set of quantized investment fractions $a^{I}$ (with investments of up to $20, and encompassing even investment ranges). For the trustee, we consider a set of quantized return fractions $a^{T}$ (encompassing even return ranges). Note that the trustee’s action is degenerate if the investor gives $0$. Writing $a^{I}$ for the fraction of the endowment invested and $a^{T}$ for the fraction of the trebled investment returned, the pure monetary payoffs for both agents in each round are

$$\chi^{I}(a^{I}, a^{T}) = 20 - 20\, a^{I} + 3 \cdot 20\, a^{I} a^{T}, \qquad \chi^{T}(a^{I}, a^{T}) = 3 \cdot 20\, a^{I} (1 - a^{T}).$$
The payoffs of all possible combinations and both partners are depicted in figure 8. In IPOMDP terms, the investor’s physical state is static, whereas the trustee’s state space is conditional on the previous action of the investor. The investor’s possible observations are the trustee’s responses, with a likelihood that depends entirely on the investor’s intentional model of the trustee. The trustee observes the investor’s action, which also determines the trustee’s new physical state, as shown in figure 9.
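The payoff structure of a single round follows directly from the rules of the game and can be sketched in Python. The default endowment of 20 units, the representation of both actions as fractions and the function name are assumptions for illustration; the multiplier of 3 reflects the trebling by the experimenter.

```python
def payoffs(invest_frac, return_frac, endowment=20, multiplier=3):
    """Monetary payoffs (investor, trustee) for one round of the trust task.

    invest_frac: fraction of the endowment sent by the investor
    return_frac: fraction of the multiplied amount sent back by the trustee
    """
    invested = invest_frac * endowment
    received = multiplier * invested   # the experimenter trebles the investment
    repaid = return_frac * received    # the repayment is not multiplied again
    investor = endowment - invested + repaid
    trustee = received - repaid
    return investor, trustee
```

If the investor sends nothing, the trustee’s choice has no effect on either payoff, which is the degenerate case mentioned above.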
3.7.1 Inequality Aversion - Compulsion to Fairness
The aspects of the states of investor and trustee that induce partial observability are assumed to arise from differential levels of cooperation.
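One convenient (though not unique) way to characterize this is via the Fehr-Schmidt inequality aversion utility function (figure 10). This allows us to account for the observation that many trustees return an even split even on the last exchange of the $10$ rounds, even though no further gain is possible. We make no claim that this is the only explanation for such behaviour, but it is a tractable and well-established mechanism that has been used successfully in other tasks [27, 1, 18]. Writing $\chi^{I}$ and $\chi^{T}$ for the monetary payoffs of investor and trustee in a round, this suggests for the investor that:

$$r^{I}(a^{I}, a^{T}, \alpha^{I}) := \chi^{I}(a^{I}, a^{T}) - \alpha^{I} \max\{\chi^{I}(a^{I}, a^{T}) - \chi^{T}(a^{I}, a^{T}),\, 0\}. \tag{3.10}$$

Here, $\alpha^{I}$ is called the “guilt” parameter of the investor and quantifies their aversion to unequal outcomes in their favor. We quantize guilt into three concrete guilt types. Similarly, the trustee’s utility is

$$r^{T}(a^{I}, a^{T}, \alpha^{T}) := \chi^{T}(a^{I}, a^{T}) - \alpha^{T} \max\{\chi^{T}(a^{I}, a^{T}) - \chi^{I}(a^{I}, a^{T}),\, 0\}, \tag{3.11}$$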
with the same possible guilt types. We choose these particular values, as guilt values above tend to produce similar behaviours as and the values below tend to behave very similar to . Thus we take to represent guilt values in , to represent guilt values in and to represent guilt values in . We assume that neither agents’ actual guilt type changes during the exchanges.
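A guilt-only inequality aversion utility of the Fehr-Schmidt type can be sketched as follows in Python. The functional form (own payoff minus guilt times the advantageous payoff difference) is the standard advantageous-inequality term and is an assumption about the exact equations used here; the function name and the numerical values in the usage lines are illustrative only.

```python
def guilt_utility(own: float, other: float, guilt: float) -> float:
    """Own payoff, reduced by `guilt` times any inequality in one's own favour."""
    return own - guilt * max(own - other, 0.0)


# Usage: a trustee who received 60 units compares keeping everything
# (60 vs. 0 for the partner) with an even split (30 vs. 30).
keep_all = guilt_utility(60.0, 0.0, guilt=1.0)
even_split = guilt_utility(30.0, 30.0, guilt=1.0)
```

Under this sketch, a sufficiently guilty trustee prefers the even split even when no future exchange can reward it, which matches the last-round behaviour described above.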
3.7.2 Planning Behaviour
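The survival functions $H^{I}$ and $H^{T}$ are used to delimit the planning horizon. The agents are required not to plan beyond the end of the game at time $10$, and within that constraint they are supposed to plan $P$ steps ahead into the interaction. This results in the following form for the survival functions (regardless of whether for investor or trustee):

$$H^{I}(\tau, t) = H^{T}(\tau, t) = \begin{cases} 1 & \text{if } \tau - t \le P \text{ and } \tau \le 10, \\ 0 & \text{otherwise.} \end{cases} \tag{3.12}$$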
The value is called the planning horizon. We consider for immediate, medium and long planning types. We chose these values as covers the range of behaviours from to , while planning yields compatibility to earlier works ([24, 20]) and allows to have short planning but high level agents, covering the range of behaviours for planning to . We confirm later that the behaviour of and agents is almost identical; and the former saves memory and processing time. Agents are characterized to assume their opponents have the same degree of planning as they do. The discounting factors and are set to in our setting.
3.8 Belief State
Since all agents use their own planning horizon in modeling the partner and level $k$ agents model their partner at level $k-1$, inference in intentional models in this analysis is restricted to the guilt parameter $\alpha$. Using a categorical distribution on the guilt parameter and a Dirichlet prior on the probabilities of the categorical distribution, we get a Dirichlet-Multinomial distribution for the probabilities of an agent having a given guilt type at some point during the exchange. Hence $\mathcal{B}(h_t)$ is a Dirichlet-Multinomial distribution,

$$\mathcal{B}(h_t) \sim \mathrm{DirMult}(\nu_{1,t}, \nu_{2,t}, \nu_{3,t}),$$

with the initial belief state $\mathcal{B}(h_0) \sim \mathrm{DirMult}(\nu_{1,0}, \nu_{2,0}, \nu_{3,0})$.
Keeping consistent with the model in [20], our approximation of the posterior distribution is also a Dirichlet-Multinomial distribution, with the parameters of the Dirichlet prior being updated to

$$\nu_{i,t+1} = \nu_{i,t} + P[o_{t+1} \mid \theta_i, h_t], \tag{3.13}$$

writing $\theta_i$ for the intentional models.
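One reading of this approximate update is that each guilt type’s Dirichlet pseudo-count is incremented by the likelihood of the observed partner action under that type. The Python sketch below implements this reading; the rule itself, the function names and the example numbers are assumptions for illustration rather than a statement of the exact procedure in [20].

```python
def update_dirichlet(pseudo_counts, likelihoods):
    """Add the likelihood of the observed action under each type to its count."""
    return [c + l for c, l in zip(pseudo_counts, likelihoods)]


def type_probabilities(pseudo_counts):
    """Predictive probability of each guilt type under the current counts."""
    total = sum(pseudo_counts)
    return [c / total for c in pseudo_counts]


# Usage: three guilt types, uniform pseudo-counts, one observed action.
counts = update_dirichlet([1.0, 1.0, 1.0], [0.1, 0.3, 0.6])
probs = type_probabilities(counts)
```

Because each update only adds to the counts, the belief stays within the same distributional family throughout the exchange.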
3.8.1 Theory of Mind Levels and Agent Characterization
Since the physical state transition of the trustee is fully dependent on the investor’s action and one agent’s guilt type cannot be changed by the actions of the other agent, theorem 2 implies that the level $0$ trustee is trivial, gaining nothing from planning ahead. Conversely, the level $0$ investor can use a recombining tree as in theorem 1. Therefore, the chain of cognitive hierarchy steps for the investor is $\{0\} \cup \{2n \mid n \in \mathbb{N}\}$, and for the trustee, it is $\{0\} \cup \{2n - 1 \mid n \in \mathbb{N}\}$. Trustee planning is trivial until the trustee does at least reach theory of mind level $1$. Assuming a fixed inverse temperature $\beta$ in equation 3.4, determined empirically from real subject data [20] for suitably noisy behaviour, our subjects are then characterized via the triplet $(k, \alpha, P)$ of theory of mind level $k$, guilt parameter $\alpha$ and planning horizon $P$.
3.9 Level and POMCP rollout mechanism
The level models are obtained by having the level agent always assume all partner types to be equally likely (), setting the planning horizon to , meaning the partner acts on immediate utilities only, and calculating the agent’s expected utilities after marginalizing over partner types and their respective response probabilities based on their immediate utilities.
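The construction just described can be sketched as follows. This is an illustrative reading only: the utility tables are hypothetical inputs (the paper's utility function is defined in an earlier section and is not reproduced here), and the function names are not taken from the authors' code.

```python
import math


def softmax(values, beta):
    """Choice probabilities from utilities with inverse temperature beta."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def expected_utilities(own_utility, partner_utility, beta):
    """Expected utility of each own action for the lowest-level agent.

    own_utility[a][r]: agent's payoff for own action a and partner response r.
    partner_utility[t][a][r]: immediate utility of response r for a partner
    of type t after own action a.
    Partner types are weighted equally (uniform belief), and each type
    responds via a softmax over its immediate utilities only.
    """
    n_types = len(partner_utility)
    result = []
    for a, payoffs in enumerate(own_utility):
        value = 0.0
        for t in range(n_types):
            probs = softmax(partner_utility[t][a], beta)
            value += sum(p * u for p, u in zip(probs, payoffs)) / n_types
        result.append(value)
    return result
```

Marginalizing with fixed, equal type weights is what keeps this lowest level cheap: no belief update and no look-ahead are needed.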
In the POMCP treatment of the multi-round trust game, if a simulated agent reaches a given history for the first time, a value estimate for the new node is derived by treating the agent as level and using an ε-greedy decision-making mechanism on the expected utilities to determine their actions until the present planning horizon.
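A rollout of this kind might look like the sketch below. It is simplified to a single agent and assumes a hypothetical callback, expected_utils_fn, that supplies the lowest-level expected utilities at each step; the alternation between investor and trustee in the actual game is omitted.

```python
import random


def epsilon_greedy(expected_utils, epsilon, rng=random):
    """Pick the best action with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return rng.randrange(len(expected_utils))
    return max(range(len(expected_utils)), key=expected_utils.__getitem__)


def rollout(expected_utils_fn, steps, epsilon, rng=random):
    """Simulate `steps` actions and return the summed expected utility.

    expected_utils_fn(step) is a hypothetical callback returning the
    lowest-level agent's expected utility for each action at that step.
    """
    total = 0.0
    for step in range(steps):
        utils = expected_utils_fn(step)
        action = epsilon_greedy(utils, epsilon, rng)
        total += utils[action]
    return total
```

The returned total serves as the initial value estimate of the newly created tree node.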
4 Results
We adapted the POMCP algorithm [31] to solve IPOMDPs [23], and cast the multi-round trust task as an IPOMDP that could thus be solved. We made a number of approximations that were prefigured in past work in this domain [24, 20], and also made various observations that dramatically simplified the task of planning, without altering the formal solutions. This allowed us to look at longer planning horizons, which is important for the full power of the intentional modeling to become clear.
Here, we first seek to use this new and more powerful planning method to understand the classes of behaviour that arise from different settings of the parameters in section 4.3. From the study of human interactions [14], the importance of coaxing (returning more than the fair split) has been established. From our own study of the data collected so far, we define four coarse types of “pure” interactions, which we call “Cooperation”, “Coaxing to Cooperation”, “Coaxing to Exploitation” and “Greedy”; we conceptualize how these might arise. We also delimit the potential consequences of having overly restricted the planning horizon in past work in this domain, and examine the qualitative interactive signatures (such as how quickly average investments and repayments rise or fall) that might best capture the characteristics of human subjects playing the game.
We then continue to discuss the quality of statistical inference, by carrying out model inversion for our new method in section 4.7 and comparing to earlier work in this domain [24].
Finally, we treat real subject data collected for an earlier study ([20]) in section 4.9 and show that our new approach recovers significant behavioural differences not obtained by earlier models and offers a significant improvement in the classification of subject behaviour through the inclusion of the planning parameter in the estimation and the quality of estimation on the trustee side.
The materials used in this section, as well as the code used to generate them, can be found on Andreas Hula’s GitHub repository. All material was generated on the local WTCN cluster. We used R [38] and Matlab [39] for data analysis and the boost C++ libraries [40] for code generation.
4.1 Modalities
All simulations were run on the local cluster at the Wellcome Trust Centre for Neuroimaging. For sample paths and posterior distributions, for each pairing of investor guilt, investor sophistication and trustee guilt and trustee sophistication, full games of exchanges each were simulated, totaling games. Additionally, in order to validate the estimation, a uniform mix of all parameters was used, implying a total of full games.
To reduce the variance of the estimation, we employed a presearch method. Agents with ToM greater than
first explored the constant strategies (offering/returning a fixed fraction) to obtain a minimal set of values from which to start searching for the optimal policy using SoftUCT. This ensures that inference will not “get stuck” in a close-to-optimal initial offer just because another initial offer was not adequately explored. This is more specific than just increasing the exploration bonus in the SoftUCT rule, which would diffuse the search during all stages, rather than helping search from a stable initial grid. We set a number of simulations for the initial step, where the beliefs about the partner are still uniform and the time horizon is still furthest away. We then reduce the number of simulations as the time horizon approaches .
4.2 Simulation And Statistical Inference
Unless stated otherwise, we employ an inverse temperature in the softmax of (noting the substantial scale of the rewards). The exploration constant for POMCP was set to . The initial beliefs were uniform , for each subject. In the text, we refer to the possible guilt types as follows: is “greedy”, is “pragmatic” and is “guilty”. However, on all the graphs, we give the exact model classification in the form for the investor and for the trustee.
We present average results over multiple runs generated stochastically from each setting of the parameter values. In the figures, we report the actual characteristics of investor and trustee; however, in keeping with the overall model, although each agent knows their own parameters, they are each inferring their opponents’ degree of guilt based on their initial priors.
As a consequence of the observation in section 3.8.1, we only consider for the investor and for the trustee. Planning horizons are restricted to , as noted in section 3.7, with the level trustee always having a planning horizon of .
Actions for both agents are parametrized as in section 3.7 and averaged across identical parameter pairings. In the graphs, we show actions in terms of the percentages of the available points that are offered or returned. For the investor, the numerical amounts can be read directly from the graphs; for the trustee, these amounts depend on the investor’s action.
The dual of generating behaviour from the model is inverting it to find the parameter settings that best explain observed interactions [24, 20]. Conceptually, this can be done by simulating exchanges between partners of given parameter settings , taking the observed history of investments and responses, and using a maximum likelihood estimation procedure which finds the settings for both agents that maximise the chance that simulated exchanges between agents possessing those values would match the actual, observed exchange. We calculate the action likelihoods through the POMCP method outlined in the earlier section 3.2 and accumulate the negative log-likelihoods, looking for the combination that produces the smallest negative log-likelihood. This is carried out for each combination of guilt and sophistication for both investor and trustee.
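The inversion procedure can be sketched as a grid search over parameter combinations. The sketch is simplified to a single agent, and make_model is a hypothetical factory standing in for the POMCP computation of action likelihoods, which is not reproduced here.

```python
import itertools
import math


def negative_log_likelihood(history, action_prob):
    """Sum of -log P(observed action) over all exchanges in the history."""
    return -sum(math.log(action_prob(step, action))
                for step, action in enumerate(history))


def invert_model(history, guilt_values, tom_levels, horizons, make_model):
    """Grid search for the parameter triplet with the smallest NLL.

    make_model(guilt, tom, horizon) is a hypothetical factory returning a
    function action_prob(step, action) with the model's action likelihoods.
    Returns (best_nll, (guilt, tom, horizon)).
    """
    best = None
    for params in itertools.product(guilt_values, tom_levels, horizons):
        nll = negative_log_likelihood(history, make_model(*params))
        if best is None or nll < best[0]:
            best = (nll, params)
    return best
```

In the two-agent setting of the text, the same loop would run over the joint combinations of investor and trustee parameters.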
4.3 Paradigmatic Behaviours
Figures 11 (with the additional outcome comparison in figure 12), and figures 13 and 14 show the three characteristic types of behaviour, in each case for two sets of parameters for investor and trustee. The upper graphs show the average histories of actions of the investor (blue) and trustee (red) across the rounds; the middle graphs show the mean posterior distributions over the three guilt parameters () as estimated by the investor and the lower graphs show the mean posterior distribution by the trustee (right) at four stages in the game (rounds 0, 3, 6 and 9). These show how well the agents of each type are making inferences about their partners.
Figure 11 shows evidence for strong cooperation between two agents who are characterized by high inequity aversion (i.e., guilty). Cooperation develops more slowly for agents with shorter (left) than longer (right) planning horizons, enabling a reliable distinction between different guilty pairs. This is shown more explicitly in figure 12 in terms of the total amount of money made by both participants. Both cases can be seen as instances of a tit-for-tat-like approach by the players, although, unlike a strict tit-for-tat mechanism, the process leading to high-level cooperation is generally robust against subsequent below-par actions by either player. Rather, high level players would employ coaxing to reinforce cooperation in this case. This is true even for lower level players: once they have formed beliefs about the partner, they will not immediately reduce their offers after a few low offers or returns, due to the Bayesian updating mechanism.
The posterior beliefs show both partners ultimately inferring the other’s guilt type correctly in both pairings. However, the investors remain aware of the possibility that the partners may actually be pragmatic, and therefore the high level, long horizon investors are prone to reduce their offers preemptively towards the end of the game. This feature of the data was noted in particular in the study [20], and our generative model provides an explanation for it, based on the posterior beliefs of higher level agents explained above.
Figure 13 shows that level trustees employ coaxing (returning more than the fair split) to get the investor to give higher amounts over extended periods of time. In the example settings, the level investor completely falls for the trustee’s initial coaxing (left), coming to believe that the trustee is guilty rather than pragmatic until towards the very end. However, the level investor (right) remains cautious and starts reducing offers soon after the trustee gets greedy, decreasing their offers faster than if playing a truly guilty type. The level investor on average remains ambiguous between the partner being guilty or pragmatic. Either inference prevents them from being as badly exploited as the level investor.
In these plots, investor and trustee both have long planning horizons; we later show what happens when a trustee with a shorter horizon () attempts to deceive.
A level trustee can also get pragmatic investors to cooperate through coaxing, as demonstrated in figure 14. The returns are a lot higher than for a level guilty trustee, who lacks a model of their influence on the investor, and hence does not return enough to drive up cooperation. This initial coaxing is a very common behaviour of high level healthy trustees, trying to get the investor to cooperate more quickly, for both guilty and pragmatic high level trustees.
4.4 Inconsistency or Impulsivity
Trustees with planning horizon tend to find it difficult to maintain deceptive strategies. As can be seen in figure 15, even when both agents have a planning horizon of , a short-sighted trustee builds significantly less trust than a long-sighted one. This is because it fails to see sufficiently far into the future, and exploits too early. This planning horizon thus captures cognitive limitations or impulsive behaviour, while the planning horizon of generally describes the consistent execution of a strategy during play. Such a distinction may be very valuable for the study of clinical populations suffering from psychiatric disorders such as attention deficit hyperactivity disorder (ADHD) or borderline personality disorder (BPD), who might show high level behaviours, but then fail to maintain them over the course of the entire game. Inferring this requires the ability to capture long horizons, something that had eluded previous methods. This type of behaviour shows how important the availability of different planning horizons is for modeling, as earlier implementations such as [24] would treat this impulsive type as the default setting.
4.5 Greedy Behaviour
Another behavioural phenotype with potential clinical significance arises with fully greedy partners, see figure 16. Greedy low level investors only invest very little, even if trustees try to convince them of a high guilt type on their part as described above (coaxing). Cooperation repeatedly breaks, which is reflected in the high variability of the investor trajectory. Two high level greedy types initially cooperate, but since the greedy trustee egregiously overexploits, cooperation usually breaks down quickly over the course of the game, and is not repaired before the end. In the present context, the greedy type appears quite pathological in that they seem to hardly care at all about their partners’ type. The main exception to this is the level greedy investor (an observation that underscores how theory of mind level and planning can change behaviour that would seem at first to be hard coded in the inequality aversion utility function). The level greedy investor will cause cooperation to break down regardless of their beliefs: in figure 16, the posterior beliefs of the level investor show that they believe the trustee to be guilty, but they do not alter their behaviour in the light of this inference.
4.6 Planning Mismatch - High Level Deceived By Lower Level
In figure 17, the investor is level , and so should have the wherewithal to understand the level trustee’s deception. However, the trustee’s longer planning horizon permits her to play more consistently, and thus exploit the investor for almost the entire game. This shows that the advantage of sophisticated thinking about other agents can be squandered given insufficient planning, and poses an important question about the efficient deployment of cognitive resources to the different demands of modeling and planning of social interactions.
4.7 Confusion
4.7.1 Model Inversion
A minimal requirement for using the proposed model to fit experimental data is self-consistency. That is, it should be possible to recover the parameters from behaviour that was actually generated from the model itself. This can alternatively be seen as a test of the statistical power of the experiment, i.e., whether rounds suffice in order to infer subject parameters. Figure 18 shows the confusion matrix which indicates the probabilities of the inferred guilt (top), ToM (middle) and planning horizon (bottom) for investor (left) and trustee (right), in each case marginalizing over all the other factors. We discuss a particular special case of the obtained confusion in figure 19. Said confusion relates to observations made in empirical studies (see [19, 20]) and suggests the notion of the planning parameter as a measure of consistency of play. Later, we show comparative data reported in the study [24], which only utilized a fixed planning horizon of and guilt states (and did not exploit the other simplifications that we introduced above); see figure 20 for a depiction of the levels of confusion in that study. These simplifications implied that the earlier study would find recovery of theory of mind in particular to be harder.
Guilt is recovered in a highly reliable manner. By contrast, there is a slight tendency to overestimate ToM in the trustees. The greatest confusion turns out to be inferring a investor as having when playing an impulsive trustee (), a problem shown more directly in Figure 19.
The issue is that when the trustee is impulsive, farsighted investors () can gain no advantage over nearsighted ones (), and so the choices of this dyad lead to misestimation. Alternatively put, an impulsive trustee brings the investor down to his or her level. This has been noted in previous empirical studies, notably [19, 20]’s observations of the effect on investors of playing erratic trustees. The same does not apply on the trustee side, since the reactive nature of the trustee’s tactics makes them far less sensitive to impulsive investor play.
Given the huge computational demands of planning, it seems likely that investors could react to observing a highly impulsive trustee by reducing their own actual planning horizons. Thus, the inferential conclusion shown in figure 19 may in fact not be erroneous. However, this possibility reminds us of the necessity of being cautious in making such inferences in a two-player compared to a one-player setting.
4.7.2 Confusion Comparison To Earlier Work
We compare our confusion analysis to the one carried out in the grid based calculation in [24]. In [24] the authors do not report exact confusion metrics for the guilt state, only noting that it is possible to reliably recover whether a subject is characterized by high guilt () or low guilt (). We can however compare to the reported ToM level recovery. The comparison with [24] faces an additional difficulty in that, despite using the same formal framework as the present work, the indistinguishability of the level and trustees and the level and investors had not yet been identified. This explains the somewhat higher amount of confusion when classifying ToM levels reported in [24]. Also, since calculation of the Dirichlet-Multinomial probability was done numerically in this study, some between-level differences will only derive from changes in quadrature points for higher levels. As can be seen in figure 20 (left), almost all of the level trustees at low guilt are misclassified. This is due to them being classified as level instead, since both levels have the same behavioral features, but apparently the numerical calculation of the belief state favored the level classification over the level classification. The tendency to overestimation is present on the investor side as well, with considerable confusion between level and level investors, who should be behaviorally equivalent. In sum, this leads to the reported overestimation of the theory of mind level. We have depicted the confusion levels reported in [24] in figure 20.
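As an illustration of how a confusion matrix of the kind discussed in this section can be tabulated from simulated games, the following sketch row-normalizes counts of (true, inferred) parameter values. The function name and the data layout are assumptions for illustration and are not taken from the authors' code.

```python
from collections import Counter


def confusion_matrix(true_labels, inferred_labels, categories):
    """Row-normalized confusion matrix as a nested dict.

    matrix[t][i] is the fraction of simulated agents with true value t
    whose inferred value was i. Rows without any samples are all zero.
    """
    counts = Counter(zip(true_labels, inferred_labels))
    matrix = {}
    for t in categories:
        row_total = sum(counts[(t, i)] for i in categories)
        matrix[t] = {i: (counts[(t, i)] / row_total if row_total else 0.0)
                     for i in categories}
    return matrix
```

Marginalizing over the other factors, as in the text, corresponds to pooling all simulated games that share the same true value of the parameter of interest before counting.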
4.8 Computational Issues
The viability of our method rests on the running time and stability of the obtained behaviours. In figure 21, we show these for the case of the first action, as a function of the number of simulation paths used. All these calculations were run at the local Wellcome Trust Centre for Neuroimaging (WTCN) cluster. Local processor cores were of Intel Xeon E312xx (Sandy Bridge) type clocked at GHz and no process used more than GB of RAM. Note that, unless more than paths are used, calculations take less than minutes.
We quantify simulation stability by comparing simulations for a level investor (a reasonable upper bound, because the action value calculation for this incorporates the level trustee responses) based on varying numbers of paths with a simulation involving paths that has converged. We calculate the between (simulated) subject discrepancies of the probabilities for the first action for :
where are the converged probabilities, and is the action likelihood of simulated subject . If the sum of squares of the entries in the discrepancy matrix is low, then the probabilities will be close to their converged values.
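The stability check can be sketched as follows. Since the defining formula is not available in the text above, the exact form used here (entry-wise difference between each simulated subject's first-action probabilities and the converged probabilities) is an assumption, as are the function names.

```python
def discrepancy_matrix(subject_probs, converged_probs):
    """Entry [k][a]: subject k's probability of action a minus the converged one."""
    return [[p - q for p, q in zip(row, converged_probs)]
            for row in subject_probs]


def sum_of_squares(matrix):
    """Sum of squared entries; small values indicate near-converged estimates."""
    return sum(x * x for row in matrix for x in row)


# Two hypothetical simulated subjects, two possible first actions.
subjects = [[0.5, 0.5], [0.6, 0.4]]
converged = [0.5, 0.5]
print(sum_of_squares(discrepancy_matrix(subjects, converged)))
```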
As can be seen from figure 21 (right), for k paths even planning steps ahead agents have converged in their initial action probabilities, such that their action probabilities vary from the converged value by no more than about . However, note that this convergence is not always monotonic in either the planning horizon or the number of sample paths. The former is influenced by the differing complexity of preferences for different horizons – sometimes, actions are harder to resolve for short than long horizons. The latter is influenced by the initial presearch using constant strategies.
Although k steps suffice for convergence even when planning steps ahead, this horizon remains computationally challenging. We thus considered whether it is possible to use a shorter horizon of steps, without materially changing the preferred choices. Figure 22 illustrates that the difference is negligible compared with the fluctuations of the Monte Carlo approach, even for the worst case involving the pairing of pragmatic types, with high ToM levels and long planning horizons. At the same time, the calculation for is twice as fast as for the level investor, which even just for the first action is a difference of seconds.
4.9 Comparison To Earlier Subject Classifications
We will show below, using real subject data taken from [20], that our reduction to guilt states does not render likelihoods worse and only serves to improve classification quality. We compared the results of our new method with the results obtained in earlier studies ([24], [20]).
4.9.1 Dataset
We performed inference on the same data sets as in Xiang et al. [20] (which were partially analysed in [24], [14] and [15]). This involved dyads playing the trust game over exchanges. The investor agent was always a healthy subject; the trustees comprised various clinical groups, including anonymous, healthy trustees (the “impersonal” group; subjects), healthy trustees who were briefly encountered before the experiment (the “personal” group; subjects), trustees diagnosed with Borderline Personality Disorder (BPD) (the “BPD” group; subjects), and anonymous healthy trustees matched in socioeconomic status (SES) to the (lower than healthy) SES distribution of BPD trustees (the “low SES” group; subjects).
4.9.2 Models Used
We compared our models to the results of the model used in [20] on the same data set (which incorporates the data set used in [24]). The study [20] uses guilt states compared to our , a planning horizon of and an inverse temperature of ; otherwise the formal framework is exactly the same as in section 3.7. Action values in [20] were calculated by an exact grid search over all possible histories and a numerical integration for the calculation of the belief state. For comparison purposes we built a “clamped” model in which the planning horizon was fixed at the value , with guilt states and an inverse temperature set to . Additionally, we compared to the outcome for the full method in this work, including estimation of the planning horizon. We noted that in the analysis in [20], an additional approximation had been made at the level investor level, which set those investors as non-learning. This kept their beliefs uniform and yielded much better negative log-likelihoods within said model than if they were learning.
4.9.3 Subject Fit
A minimal requirement to accept subject results as significant is that the negative log-likelihood is significantly better than random on average at ; otherwise we would not trust a model based analysis over random chance and the estimated parameters would be unreliable. This criterion is numerically expressed as a negative log-likelihood of for exchanges, calculated from possible actions at a probability of each, with independent actions each round.
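The random baseline follows directly from independence: a subject choosing uniformly among n actions in each of T exchanges has a negative log-likelihood of T times ln(n). A short sketch of this arithmetic is given below; the example call uses the 10 rounds mentioned in the discussion together with an action count of 5, which is an illustrative value and an assumption here rather than a figure taken from the text above.

```python
import math


def random_baseline_nll(n_exchanges, n_actions):
    """NLL of a subject choosing uniformly and independently every round."""
    return n_exchanges * math.log(n_actions)


# Illustrative values only (the action count is an assumption).
print(round(random_baseline_nll(10, 5), 2))
```

A fitted model is only informative when its negative log-likelihood falls significantly below this value.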
For the analysis in [20], we found that the special approximation made in [20] allowed for significantly better negative log-likelihoods (mean ); if this approximation is removed, the investor data fit at an inverse temperature of would be worse than random for this data set. Additionally, the model used in [20] did not fit the trustee data significantly better than random at (mean negative log-likelihoods and standard deviation of ).
Conversely, for both our clamped and full model analysis at , the trustee likelihood is significantly better than random ( at the full model) and the investor negative log-likelihood is slightly better on average (smaller) than found in [20] with guilt states ( for our method, vs ). This confirms that reducing the number of guilt states to only reduces confusion and does not worsen the fit of real subject data. Additionally, it becomes newly possible to perform model-based analyses on the BPD trustee guilt state distribution, since the old model did not fit trustees significantly better than random at .
The seemingly low inverse temperature at is a consequence of the size of the rewards and the quick accumulation of higher expectation values with more planning steps, as the inverse temperature needs to counterbalance the expectation size to keep choices from becoming nearly deterministic. Average investor reward expectations (at the first exchange) for planning steps stand at with an average being added at each planning step.
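The interplay between reward scale and inverse temperature can be illustrated with a small softmax sketch. The action values and temperatures below are hypothetical numbers chosen only to show the effect; they are not the values used in the study.

```python
import math


def softmax(values, beta):
    """Choice probabilities from action values with inverse temperature beta."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical action values on a large reward scale.
values = [100.0, 110.0, 120.0]
print(softmax(values, 1.0))   # nearly deterministic choice of the best action
print(softmax(values, 0.05))  # graded preferences across the three actions
```

With large expected values, differences between actions are also large in absolute terms, so only a small inverse temperature leaves room for the noisy choices seen in real subjects.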
4.9.4 Marginal Parameter Distributions Significant Features
Figure 23 shows the significant parameter distribution differences (Kolmogorov-Smirnov two sample test, ). For investor theory of mind and trustee guilt distribution, many of the same differences are significant for the analysis reported in [20] (see Fig. 23, upper panels), for an analysis using our model with a “clamped” planning horizon of steps ahead (see Fig. 23, middle panels, to match with the approach of [24]) and for our full model, using guilt states, ToM level up to and planning horizons (see Fig. 23, bottom panels and Fig. 24). We find significantly lowered ToM in most other groups, compared to the impersonal control group. We find a significantly lowered guilt distribution in BPD trustees; however, the guilt difference was not used for fMRI analysis in [20], because, as noted above, the trustee was not fit significantly better than random at in the earlier model. For our full model with planning values, we find additional significant differences on the investor side: while all ToM distributions are significantly different from the impersonal condition, the planning difference between the personal and impersonal conditions is not significant at , while it is significant for the other groups (see Fig. 24). Thus, this is the only model keeping the parameter distribution of the personal group distinct from both the impersonal group (from which it is not significantly different in the clamped model) and the low SES playing controls and BPD playing controls (from which it is not significantly different based on the parameters in [20]) at the same time.
This supports the planning horizon as a “consistency of play” and additional rationality measure, as the subjects do not think about possible partner deceptions as much in the personal condition, having just met the person they will be playing (resulting in lowered ToM). However, their play is non-disruptive, if low level, and consistent exchanges result. BPD and low SES trustees, however, disrupt the partners’ play, lowering their planning horizon.
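For readers unfamiliar with the two sample Kolmogorov-Smirnov test used for the group comparisons in this section, the following sketch computes its test statistic, the largest gap between the two empirical distribution functions. It is a plain illustration with hypothetical data and does not compute the p-value; the analysis in the paper was carried out in R and Matlab, not with this code.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical estimated planning horizons in two subject groups.
group_a = [0, 0, 1, 2]
group_b = [1, 2, 2, 2]
print(ks_statistic(group_a, group_b))
```

For discrete parameters with few levels, as here, many ties occur, so the resulting significance levels should be read with some care.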
5 Discussion
We adapted the Monte Carlo tree search algorithm designed for partially observable Markov decision processes [31] to the interactive, game-theoretic, case [23]. We provide significant simplifications to the case of dyadic social exchange, which benefit any IPOMDP based method. We illustrated the power of this method by extending the computationally viable planning horizon in a complex, multi-round, social exchange game to be able to encompass characteristic behaviours that have been seen in human play [14].
We also showed that the 10 rounds that had been used empirically suffice to license high quality inference about parameter values, at least in the case that the behaviour was generated from the model itself. We exhibited three fundamental forms of dynamical behaviour in the task: cooperation, and two different varieties of coaxing. The algorithm generates values, state-action values and posterior beliefs, all of which can be used for such methods as model-based fMRI.
We find that the results in 4.4, 4.6 and figures 19 and 24 confirm the planning horizon as a consistency of play parameter that encodes the capability of a subject to execute a consistent strategy throughout play. As such, it may be disrupted by the behavior of shorter planning partners, as can be seen in figures 19 and 24.
Furthermore, comparing to earlier data used in the work [20] we can confirm the relevance of the planning parameter in the treatment of real subject data, classifying subject groups along the new axis of consistency of play.
The finer classification of subjects along the three axes of theory of mind, planning horizon and guilt should provide a rich framework to classify deficits in clinical populations, such as an inability to model other people’s beliefs or intentions, ineffective model-based reasoning, and a lack of empathy. Such analyses can be done at speed, of the order of 10s of subjects per hour.
One might ask whether the behavioural patterns derived in this work might be obtained without invoking the cognitive hierarchy and instead using a large enough state space, which encodes the preferences and sophistication of the other agent as many separate states, rather than a few type parameters plus the cognitive hierarchy. This is in principle possible; however, we prefer ToM for reasons: Firstly, the previous study [20] and others have found neural support for the distinction between high ToM and low ToM subjects in real play, suggesting that this distinction is not merely a mathematical convenience (cf. [20], p. and for a neural representation of prediction errors associated to level and level thinking). Secondly, we can specify features of interest, such as inequality aversion and planning at the lowest level, then generate high level behaviours in a way that yields an immediate psychological interpretation in terms of the mentalization steps encoded in the ToM level.
The algorithm opens the door to finer analysis of complicated social exchanges, possibly allowing optimization over initial prior values in the estimation or the analysis of higher levels of theory of mind, at least on tasks with lower fan-out in the search tree. It would also be possible to search over the inverse temperature .
One important lacuna is that although it is straightforward to use maximum likelihood to search over fixed parameters (such as ToM level, planning horizon or indeed temperature), it is radically harder to perform the computations that become necessary when these factors are incorporated into the structure of the intentional models. That is, our subjects were assumed to make inferences about their opponent’s guilt, but not about their theory of mind level or planning horizon.
It is possible that additional tricks would make this viable for the trust task, but it seems more promising to devise or exploit a simpler game in which this would be more straightforward.
6 Acknowledgements
The authors would like to thank James Lu, Johannes Heinrich, Terry Lohrenz and Arthur Guez for helpful discussions and Xiaosi Gu, Michael Moutoussis, Tobias Nolte and Iris Vilares for comments on the manuscript. Special thanks go to Andreas Morhammer, who brilliantly advised on several issues with C++, as well as the IT support staff at the Wellcome Trust Centre for Neuroimaging and Virginia Tech Carilion Research Institute. The authors gratefully acknowledge funding by the Wellcome Trust (Read Montague) under a Principal Research Fellowship, the Kane Foundation (Read Montague) and the Gatsby Charitable Foundation (Peter Dayan). Andreas Hula is supported by the Principal Research Fellowship of Professor Read Montague.
References
 [1] E. Fehr and K.M. Schmidt. A theory of fairness, competition, and cooperation. Q J Econ, 114:817–868, 1999.
 [2] E. Fehr and S. Gächter. Fairness and Retaliation: The Economics of Reciprocity. J Econ Perspect, 14:159–181, 2000.
 [3] E. Fehr and U. Fischbacher. The nature of human altruism. Nature, 425:785–791, 2003.
 [4] E. Fehr and U. Fischbacher. Social norms and human cooperation. Trends Cogn Sci, 8(4):185–190, 2004.
 [5] C.F. Camerer. Behavioral Game theory: Experiments in Strategic Interaction. Princeton University Press, Princeton, New Jersey, 2003.
 [6] K. McCabe, M.L. Rigdon, and V. Smith. Positive Reciprocity and Intentions in Trust Games. J Econ Behav Organ, 52(2):267–275, 2003.
 [7] J.F. Nash. Equilibrium points in n-person games. Proc. Natl. Acad. Sci. USA, 36:48–49, 1950.
 [8] J.C. Harsanyi. Games with incomplete information played by “Bayesian” players. Manage Sci, 14:159–182, 1967.
 [9] M. Costa-Gomes, V. Crawford, and B. Broseta. Cognition and behavior in normal-form games: An experimental study. Econometrica, pages 1193–1235, 2001.
 [10] C.F. Camerer, T.H. Ho, and J.K. Chong. A cognitive hierarchy model of games. Q J Econ, 119:861–898, 2004.
 [11] W. Yoshida, R.J. Dolan, and K.J. Friston. Game theory of mind. PLoS Comput Biol, 4:e1000254, 2008.
 [12] A.G. Sanfey. Social decision-making: insights from game theory and neuroscience. Science, 318:598–602, 2007.
 [13] R.D. McKelvey and T.R. Palfrey. An Experimental Study of the Centipede Game. Econometrica, pages 803–836, 1992.
 [14] B. King-Casas, C. Sharp, L. Lomax, T. Lohrenz, P. Fonagy, et al. The rupture and repair of cooperation in borderline personality disorder. Science, 321, 2008.
 [15] M. Koshelev, T. Lohrenz, M. Vannucci, and P.R. Montague. Biosensor Approach to Psychopathology Classification. PLoS Comput Biol, 6(10):e1000966, 2010.
 [16] P.H. Chiu, M.A. Kayali, K.T. Kishida, D. Tomlin, L.G. Klinger, et al. Self responses along cingulate cortex reveal quantitative neural phenotype for high-functioning autism. Neuron, 57:463–473, 2008.
 [17] K.T. Kishida, B. King-Casas, and P.R. Montague. Neuroeconomic approaches to mental disorders. Neuron, 67(4):543–554, 2010.
 [18] E. Fehr and C.F. Camerer. Social neuroeconomics: the neural circuitry of social preferences. Trends Cogn Sci, 11:419–427, 2007.
 [19] B. King-Casas, D. Tomlin, C. Anen, C.F. Camerer, S.R. Quartz, et al. Getting to know you: Reputation and Trust in a two-person economic exchange. Science, 308:78–83, 2005.
 [20] T. Xiang, R. Debajyoti, T. Lohrenz, P.R. Montague, and P. Dayan. Computational Phenotyping of Two-Person Interactions Reveals Differential Neural Response to Depth-of-Thought. PLoS Comput Biol, 8(12):e1002841, 2012.
 [21] D. Lee. Game theory and neural basis of social decision making. Nat. Neurosci, 11:404–409, 2008.
 [22] K. McCabe, D. Houser, L. Ryan, V. Smith, and T. Trouard. A functional imaging study of cooperation in two-person reciprocal exchange. Proc. Natl. Acad. Sci. USA, 98(20):11832–35, 2001.
 [23] P.J. Gmytrasiewicz and P. Doshi. A Framework for Sequential Planning in Multi-Agent Settings. J Artif Intell Res, 24:49–79, 2005.
 [24] R. Debajyoti, B. King-Casas, P.R. Montague, and P. Dayan. Bayesian Model of Behaviour in Economic Games. NIPS, 21:1345–1353, 2008.
 [25] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 2005.
 [26] M.L. Kaelbling, M.L. Littman, and A.R. Cassandra. Planning and acting in partially observable stochastic domains. Artif Intell, 101:99–134, 1995.
 [27] T. Xiang, T. Lohrenz, and P. R. Montague. Computational substrates of norms and their violations during social exchange. J Neurosci, 33(3):1099–1108, 2013.

 [28] J. Gläscher, N. Daw, P. Dayan, and J.P. O'Doherty. States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning. Neuron, 66(4):585–595, 2010.
 [29] Q.J. Huys, R. Cools, M. Gölzer, E. Friedel, A. Heinz, et al. Disentangling the Roles of Approach, Activation and Valence in Instrumental and Pavlovian Responding. PLoS Comput Biol, 7(4):e1002028, 2011.
 [30] R. McKelvey and T. Palfrey. Quantal Response Equilibria for Extensive Form Games. Experimental Economics, 1:9–41, 1998.
 [31] D. Silver and J. Veness. Monte Carlo Planning in Large POMDPs. NIPS, 23:2164–2172, 2010.

 [32] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo Planning. 15th European Conference on Machine Learning, pages 282–293, 2006.
 [33] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
 [34] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48–77, 2002.
 [35] M. Wunder, M. Kaisers, J.R. Yaros, and M. Littman. Using iterated reasoning to predict opponent strategies. 10th International Conference on Autonomous Agents and Multiagent Systems, 2:593–600, 2011.
 [36] Y. Nyarko. Convergence in Economic Models with Bayesian Hierarchies of Beliefs. Journal of Economic Theory, 1997.

 [37] P.J. Gmytrasiewicz and P. Doshi. On the difficulty of achieving Equilibrium in Interactive POMDPs. International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.
 [38] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. ISBN 3-900051-07-0.
 [39] MATLAB. The MathWorks Inc., Natick, Massachusetts, 2010.
 [40] Boost Libraries. 2014. http://www.boost.org.