Introduction
In multiagent systems, the ideal behavior of an agent is contingent on the behaviors of coexisting agents. However, agents may adapt their behaviors depending on the contexts they encounter. Hence, it is critical for an agent to quickly predict or recognize the behaviors of other agents and compute a best response accordingly [Boutilier 1999, Powers and Shoham 2005, Gmytrasiewicz and Doshi 2005, Banerjee and Peng 2005, Fernández and Veloso 2006, Chakraborty and Stone 2010, Fernández, García, and Veloso 2010, Wunder et al. 2012, Bard et al. 2013, Elidrisi et al. 2014, Hernandez-Leal and Kaisers 2017, Albrecht and Stone 2018, Hernandez-Leal et al. 2017a].
One efficient way of recognizing the strategies of other agents is to leverage the idea of Bayesian Policy Reuse (BPR) [Rosman, Hawasly, and Ramamoorthy 2016], which was originally proposed for multitask learning problems to determine the best policy when faced with different tasks. [Hernandez-Leal et al. 2016] proposed BPR+ by extending BPR to multiagent learning settings to detect dynamic changes in an opponent's strategies. BPR+ also extends BPR with the ability to generate new policies online against an opponent using previously unseen policies. However, BPR+ is designed for single-state repeated games only. Recently, [Hernandez-Leal and Kaisers 2017] proposed Bayes-Pepper by combining BPR and the Pepper framework [Crandall 2012] to handle stochastic games. However, all these approaches assume that an opponent randomly switches its policies among a number of stationary policies. In practice, an opponent can exhibit more sophisticated behaviors by adopting more advanced strategies, e.g., using BPR as well. A BPR-based agent usually cannot perform well against itself, which is also confirmed in our experiments. In such situations, higher-level reasoning and more advanced techniques are required for an agent to predict and play well against such sophisticated opponents.
The above problem can be partially addressed by introducing the concept of Theory of Mind (ToM) [Goldman 2012]. ToM is a recursive reasoning technique [Hernandez-Leal et al. 2017a, Albrecht and Stone 2018] describing a higher cognitive mechanism of explicitly attributing unobservable mental contents, such as beliefs, desires, and intentions, to other players. Previous methods of recursive reasoning use explicit representations of nested beliefs and "simulate" the reasoning processes of other agents to predict their actions [Gmytrasiewicz and Durfee 2000, Gmytrasiewicz and Doshi 2005, Wunder et al. 2011, Wunder et al. 2012]. However, these approaches show no adaptation to non-stationary opponents. One exception is the reasoning framework proposed by [De Weerd, Verbrugge, and Verheij 2013], which enables an agent to predict the opponent's actions explicitly by building an abstract model of the opponent's behaviors using recursively nested beliefs. Additionally, they use a confidence value to help an agent adapt to different opponents. However, the main drawbacks of this model are: 1) it works only if an agent holds exactly one more layer of belief than its opponent; 2) it is designed for predicting the opponent's primitive actions instead of high-level strategies, resulting in slow adaptation to non-stationary opponents; 3) it shows poor performance against an opponent using previously unseen strategies.
To address the above challenges, we propose a novel algorithm, Bayesian Theory of Mind on Policy (Bayes-ToMoP), which leverages the predictive power of BPR and ToM to compete against sophisticated opponents. Bayes-ToMoP can quickly and accurately predict an opponent's behavior and compute a best response accordingly. Besides, Bayes-ToMoP also supports detecting whether an opponent is using a previously unseen policy and learning an optimal response against it. Furthermore, Bayes-ToMoP can be straightforwardly extended to Deep Reinforcement Learning (DRL) environments with a neural network as the value function approximator, termed deep Bayes-ToMoP. Experimental results show that both Bayes-ToMoP and deep Bayes-ToMoP outperform state-of-the-art BPR approaches when faced with different types of opponents in two-agent competitive games.
Background
Bayesian Policy Reuse BPR was originally proposed in [Rosman, Hawasly, and Ramamoorthy 2016] as a framework for an agent to quickly determine the best policy to execute when faced with an unknown task. Given a set of previously-solved tasks $\mathcal{T}$ and an unknown task $\tau^*$, the agent is required to select the best policy from its policy library $\Pi$ within as small a number of trials as possible. BPR uses the concept of a belief $\beta$, which is a probability distribution over the set of tasks $\mathcal{T}$, to measure the degree to which $\tau^*$ matches the known tasks based on a signal $\sigma$. A signal can be any information that is correlated with the performance of a policy (e.g., immediate rewards, episodic returns). BPR involves performance models of policies on previously-solved tasks, which describe the distribution of returns from each policy on those tasks. A performance model $P(U \mid \tau, \pi)$ is a probability distribution over the utility $U$ of a policy $\pi$ on a task $\tau$.

In order to select the best policy, a number of BPR variants with different exploration heuristics are available, e.g., the probability of improvement (BPR-PI) heuristic and the expected improvement (BPR-EI) heuristic. The BPR-PI heuristic utilizes the probability with which a specific policy can achieve a hypothesized increase in performance $U^+$ over the current best estimate $\bar{U}$. Formally, it chooses the policy most likely to achieve the utility $U^+$: $\pi = \arg\max_{\pi \in \Pi} \sum_{\tau \in \mathcal{T}} \beta(\tau) P(U^+ \mid \tau, \pi)$, where $U^+ > \bar{U}$. However, it is not straightforward to determine an appropriate value of $U^+$; the BPR-EI heuristic avoids this issue by selecting the policy most likely to achieve any possible utility of improvement $U^+ \in (\bar{U}, U_{\max}]$: $\pi = \arg\max_{\pi \in \Pi} \int_{\bar{U}}^{U_{\max}} \sum_{\tau \in \mathcal{T}} \beta(\tau) P(U^+ \mid \tau, \pi)\, dU^+$. [Rosman, Hawasly, and Ramamoorthy 2016] showed that the BPR-EI heuristic performs best among all BPR variants; we therefore adopt BPR-EI for playing against different opponents.

Theory of Mind The ToM model [De Weerd, Verbrugge, and Verheij 2013] predicts an opponent's actions explicitly by building an abstract model of the opponent's behaviors using recursively nested beliefs. The ToM model is described in the context of a two-player competitive game in which an agent and its opponent differ in their abilities to make use of ToM. The notion ToM$_k$ indicates an agent that has the ability to use ToM up to the $k$-th order. [De Weerd, Verbrugge, and Verheij 2013] showed that reasoning levels deeper than two do not provide significant benefits, so we briefly introduce the first two orders of ToM models.
A zero-order ToM (ToM$_0$) agent holds a zero-order belief in the form of a probability distribution over the action set of its opponent. The ToM$_0$ agent then chooses the action that maximizes its expected payoff. A first-order ToM (ToM$_1$) agent keeps both a zero-order belief and a first-order belief. The first-order belief is a probability distribution describing what the ToM$_1$ agent believes its opponent believes about itself. The ToM$_1$ agent first predicts its opponent's action under its first-order belief. Then, the ToM$_1$ agent integrates this first-order prediction with its zero-order belief and uses the integrated belief in its final decision.
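To make the background concrete before moving on, here is a minimal sketch of the BPR belief update and BPR-EI policy selection described above, assuming Gaussian performance models. All names and the data layout are illustrative choices of this sketch, not from the original BPR implementation.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical performance models: perf_models[(pi, tau)] = (mean, std) of
# the return U obtained when playing policy pi against task/strategy tau.

def update_belief(belief, pi, sigma, perf_models, tasks):
    """One Bayesian belief update: beta(tau) ∝ P(sigma | tau, pi) * beta(tau)."""
    likelihood = np.array([norm.pdf(sigma, *perf_models[(pi, tau)]) for tau in tasks])
    posterior = likelihood * belief
    return posterior / posterior.sum()

def select_policy_ei(belief, policies, tasks, perf_models, u_max, n_grid=50):
    """BPR-EI: pick the policy most likely to exceed any improvement level
    U+ in (u_bar, u_max], where u_bar is the current best expected utility."""
    expected = {pi: sum(b * perf_models[(pi, t)][0] for b, t in zip(belief, tasks))
                for pi in policies}
    u_bar = max(expected.values())
    u_grid = np.linspace(u_bar, u_max, n_grid)

    def ei(pi):
        # Belief-weighted tail probability P(U > U+), integrated on a grid
        # over the improvement range (numerical stand-in for the integral).
        tail = lambda u: sum(b * norm.sf(u, *perf_models[(pi, t)])
                             for b, t in zip(belief, tasks))
        return np.trapz([tail(u) for u in u_grid], u_grid)

    return max(policies, key=ei)
```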
Bayes-ToMoP
Motivation
Previous works [Hernandez-Leal et al. 2016, Hernandez-Leal and Kaisers 2017, Hernandez-Leal et al. 2017b] assume that an opponent randomly switches its policies among a number of stationary policies. However, a more sophisticated agent may choose when to change its policy in a more principled way. For instance, it may first predict the policy of its opponent and then best respond to the estimated policy. If the opponent's policy is estimated by simply counting action frequencies, this reduces to the well-known fictitious play algorithm [Shoham and Leyton-Brown 2009]. However, in general, an opponent's action information may not be observable during interactions. One way of addressing this problem is to use BPR [Rosman, Hawasly, and Ramamoorthy 2016], which applies Bayes' rule to predict the policy of the opponent from received signals (e.g., rewards) and can be regarded as a generalization of fictitious play.
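As a point of reference, here is a minimal sketch of one fictitious play step under the stated assumption that opponent actions are observable; the payoff matrix and function names are illustrative.

```python
import numpy as np

def fictitious_play_step(counts, payoff, opp_action):
    """One step of fictitious play: update empirical opponent action
    frequencies, then best-respond to the resulting mixed strategy.
    payoff[a, b] is our reward for our action a vs. opponent action b."""
    counts[opp_action] += 1
    opp_mixed = counts / counts.sum()          # empirical opponent strategy
    return int(np.argmax(payoff @ opp_mixed))  # best response action

# Example: rock-paper-scissors payoff for the row player.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
counts = np.ones(3)  # Laplace-smoothed action counts
action = fictitious_play_step(counts, rps, opp_action=0)
```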
Therefore, a question naturally arises: how can an agent effectively play against both simple opponents with stationary strategies and more advanced ones (e.g., opponents using BPR)? To address this question, we propose a new algorithm called Bayes-ToMoP, which leverages the predictive power of BPR and the recursive reasoning ability of ToM to predict the strategies of such opponents and behave optimally. We also extend Bayes-ToMoP to DRL scenarios with a neural network as the value function approximator, termed deep Bayes-ToMoP. In the following descriptions, we do not distinguish whether a policy is represented in tabular form or by a neural network unless necessary.
We use the notion Bayes-ToMoP$_k$ to indicate an agent with the ability of using theory of mind up to the $k$-th order. Intuitively, a Bayes-ToMoP$_n$ agent with a higher-order ToM could take advantage of any Bayes-ToMoP$_m$ agent with a lower-order ToM ($m < n$). Without loss of generality, we focus on Bayes-ToMoP$_0$ and Bayes-ToMoP$_1$; Bayes-ToMoP$_k$ ($k \geq 2$) can be naturally constructed by incorporating a higher-order theory of mind into our framework. Another reason we focus on zero- and first-order ToMoP is that the benefit of considering deeper recursion (higher-level ToM, $k \geq 2$) has been found to be marginal [De Weerd, Verbrugge, and Verheij 2013]. Bayes-ToMoP is capable of learning a new policy against its opponent if it detects that the opponent is using a previously unknown strategy. In contrast, these two features are not provided by the previous ToM model, which works only if an agent holds exactly one more layer of belief than its opponent.
Bayes-ToMoP$_0$ Algorithm
We start with the simplest case, Bayes-ToMoP$_0$, which extends ToM$_0$ by incorporating Bayesian reasoning techniques to predict the strategy of an opponent. Bayes-ToMoP$_0$ holds a zero-order belief $\beta^0$ over its opponent's strategy library $\mathcal{T}$: each $\beta^0(\tau)$ is the probability that its opponent adopts strategy $\tau$, with $\sum_{\tau \in \mathcal{T}} \beta^0(\tau) = 1$. Given a utility $U$, a performance model $P(U \mid \pi, \tau)$ describes the probability of achieving $U$ when using a policy $\pi$ against an opponent strategy $\tau$.
A Bayes-ToMoP$_0$ agent (Algorithm 1) starts with its policy library $\Pi$, its opponent's strategy library $\mathcal{T}$, its performance models, and the zero-order belief $\beta^0$. Then, in each episode, given the current belief $\beta^0$, the Bayes-ToMoP$_0$ agent evaluates the expected improvement utility, defined following the BPR-EI heuristic, for all policies and selects the optimal one (Line 2). Next, the agent updates its zero-order belief using Bayes' rule (Lines 4-6). At the end of the episode, Bayes-ToMoP$_0$ detects whether its opponent is using a previously unseen policy; if so, it learns a new policy against the opponent (Line 7). The new strategy detection and learning algorithm is described in detail in Section New Opponent Detection and Learning.
Finally, note that without the new strategy detection and learning phase (Line 7), a Bayes-ToMoP$_0$ agent is essentially equivalent to BPR [Rosman, Hawasly, and Ramamoorthy 2016], since both first predict the opponent's strategy (or task type) and then select the optimal policy; each opponent strategy here can be regarded as a task in the original BPR. Likewise, the full Bayes-ToMoP$_0$ is essentially equivalent to BPR+ [Hernandez-Leal et al. 2016], since both can handle previously unseen strategies.
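Putting the pieces together, one episode of the Bayes-ToMoP$_0$ loop might look as follows, reusing the hypothetical update_belief and select_policy_ei helpers sketched earlier; play_episode and on_new_opponent are caller-supplied stand-ins for the environment and for Algorithm 3.

```python
def bayes_tomop0_episode(belief, policies, tasks, perf_models, u_max,
                         play_episode, on_new_opponent):
    """One episode of a sketched Bayes-ToMoP0 loop (cf. Algorithm 1).

    play_episode(pi) -> (sigma, outcome) plays one episode with policy pi
    and returns the performance signal (e.g., the episodic return) plus the
    game outcome; on_new_opponent(outcome) stands in for Algorithm 3."""
    # Line 2: pick the policy maximizing the expected improvement (BPR-EI).
    pi = select_policy_ei(belief, policies, tasks, perf_models, u_max)
    sigma, outcome = play_episode(pi)
    # Lines 4-6: Bayesian update of the zero-order belief.
    belief = update_belief(belief, pi, sigma, perf_models, tasks)
    # Line 7: hand off to new-opponent detection and learning.
    on_new_opponent(outcome)
    return belief, pi
```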
Bayes-ToMoP$_1$ Algorithm
Next, we move to the Bayes-ToMoP$_1$ algorithm. Apart from its zero-order belief, Bayes-ToMoP$_1$ also maintains a first-order belief $\beta^1$: a probability distribution describing the probability that the agent believes its opponent believes it will choose each policy $\pi$. Bayes-ToMoP$_1$ predicts its opponent's policy based on this first-order belief. However, this prediction may conflict with its zero-order belief. To resolve this conflict, Bayes-ToMoP$_1$ holds a first-order confidence $c_1$ serving as a weighting factor to balance the influence of its first-order prediction against its zero-order belief.
The overall strategy of Bayes-ToMoP$_1$ is shown in Algorithm 2. Given the policy libraries $\Pi$ and $\mathcal{T}$, the performance models, the zero-order belief $\beta^0$, and the first-order belief $\beta^1$, the Bayes-ToMoP$_1$ agent first predicts the policy $\hat{\tau}$ of its opponent, assuming the opponent maximizes its utility following the BPR-EI heuristic under the first-order belief (Line 2). Then, given the first-order prediction $\hat{\tau}$ and the zero-order belief $\beta^0$, an integration function computes the final prediction as the linear combination of the first-order prediction and the zero-order belief, weighted by the confidence degree $c_1$ (Line 3):

$\beta^I(\tau) = c_1 \mathbb{1}(\tau = \hat{\tau}) + (1 - c_1)\,\beta^0(\tau)$ (1)

where $\mathbb{1}(\cdot)$ is the indicator function.
Next, the Bayes-ToMoP$_1$ agent computes the expected utility improvement based on the performance models and the integrated belief $\beta^I$ to derive an optimal policy (Line 4). Finally, the agent updates its first-order and zero-order beliefs using Bayes' rule (Lines 5-11).
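A minimal sketch of the first-order prediction and the integration function of Eq. 1 follows; names are illustrative, and opp_select_ei stands in for the opponent's assumed BPR-EI selection step.

```python
import numpy as np

def integrate_beliefs(first_order_belief, zero_order_belief, c1,
                      opp_policies, opp_select_ei):
    """Eq. 1: blend the first-order prediction with the zero-order belief,
    weighted by the first-order confidence c1."""
    # Predict the opponent's policy assuming it best-responds (via BPR-EI)
    # under what we believe it believes about us (the first-order belief).
    tau_hat = opp_select_ei(first_order_belief)
    one_hot = np.array([1.0 if tau == tau_hat else 0.0 for tau in opp_policies])
    return c1 * one_hot + (1.0 - c1) * zero_order_belief
```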
The next issue is how to adaptively update the first-order confidence degree $c_1$. Initially, we are not sure whether an opponent is making decisions using ToM reasoning or simply switching among stationary strategies. The value of $c_1$ can be understood as the exploration rate of using the first-order belief to predict the opponent's strategies. In the previous ToM model [De Weerd, Verbrugge, and Verheij 2013], the value of $c_1$ is increased when the first-order prediction is correct and decreased otherwise, based on the assumption that an agent can observe the actions and rewards of its opponent. In our setting, however, the prediction operates on a higher level of behavior (the policies), which is usually not observable (agents are unwilling to reveal their policies and risk being exploited in competitive environments). Therefore, we propose using game outcomes as the signal indicating whether our previous predictions were correct, and adjusting the first-order confidence degree accordingly. In a competitive environment, we can distinguish three game outcomes by comparing the two agents' rewards: win, lose, or draw. Thus, the value of $c_1$ is increased when the agent's reward $r$ is larger than its opponent's reward $r_o$ and decreased otherwise, by an adjustment rate $\delta$:

$c_1 = \begin{cases} \min(c_1 + \delta,\, 1) & \text{if } r > r_o \\ \max(c_1 - \delta,\, 0) & \text{otherwise} \end{cases}$ (2)
Following this heuristic, Bayes-ToMoP$_1$ can easily take advantage of Bayes-ToMoP$_0$, since it can predict in advance which policy Bayes-ToMoP$_0$ will take. However, this does not work when playing against less sophisticated agents, e.g., an agent switching among several stationary policies without the ability to use ToM. This is because the curve of $c_1$ oscillates when facing an agent that cannot make use of ToM, and the agent thus fails to predict the opponent's behaviors accurately. To this end, we propose an adaptive and generalized mechanism to adjust the confidence degree and further detect switches in the opponent's strategies.
We first introduce the winning rate $W_e$ over a fixed length $l$ of episodes to adaptively adjust the confidence degree $c_1$. Since a Bayes-ToMoP$_1$ agent initially assumes its opponent is Bayes-ToMoP$_0$, the value of $l$ controls the number of episodes before considering that its opponent may have switched to a less sophisticated type. If the current episode's performance is better than the previous episode's ($W_e > W_{e-1}$), we increase the weight of the first-order prediction, i.e., we increase $c_1$ by an adjustment rate $\delta$; if $W_e$ is less than $W_{e-1}$ but still higher than a threshold $P$, the performance of the first-order prediction is diminishing, so Bayes-ToMoP$_1$ decreases $c_1$ quickly by a decreasing factor $\rho$; if $W_e \leq P$, the rate of exploring the first-order belief is set to 0 and only the zero-order belief is used for prediction. Formally,

$c_1 = \begin{cases} \min(c_1 + \delta,\, 1) & \text{if } W_e > W_{e-1} \\ \rho\, c_1 & \text{if } P < W_e \leq W_{e-1} \\ 0 & \text{if } W_e \leq P \end{cases}$ (3)

where $P$ is the threshold of the winning rate $W_e$, reflecting the lower bound of the difference between the agent's predictions and its opponent's actual behaviors, and $F$ is an indicator controlling the direction of adjusting the value of $c_1$. Intuitively, Bayes-ToMoP$_1$ checks for a switch of its opponent's strategy at each episode and reverses the value of $F$ whenever its winning rate is no larger than $P$:

$F = -F \quad \text{if } W_e \leq P$ (4)
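A sketch of the winning-rate-based confidence adjustment (Eq. 3) and switch detection (Eq. 4), using the symbol choices above (delta, rho, and P are the adjustment rate, decreasing factor, and threshold); the window bookkeeping is an illustrative assumption of this sketch.

```python
from collections import deque

def adjust_confidence(c1, win_rate, prev_win_rate, delta, rho, P):
    """Eq. 3: raise c1 while the first-order prediction keeps paying off,
    shrink it quickly as performance diminishes, and zero it below P."""
    if win_rate > prev_win_rate:
        return min(c1 + delta, 1.0)
    elif win_rate > P:
        return rho * c1          # rho in (0, 1): fast decay
    return 0.0

class SwitchDetector:
    """Eq. 4: track the winning rate over the last l episodes and flip the
    direction indicator F whenever it drops to the threshold P or below."""
    def __init__(self, l, P):
        self.outcomes = deque(maxlen=l)  # 1 = win, 0 = loss/draw
        self.P, self.F = P, 1

    def record(self, won):
        self.outcomes.append(int(won))
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate <= self.P:
            self.F = -self.F     # opponent may have switched type;
        return rate              # a fuller version would reset the window
```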
Finally, at the end of each episode, Bayes-ToMoP$_1$ learns a new optimal policy following Algorithm 3 if it detects a new opponent strategy (detailed in the next section).
New Opponent Detection and Learning
The new opponent detection and learning component is the same for all Bayes-ToMoP$_k$ agents ($k \geq 0$) (Algorithm 3). Bayes-ToMoP first detects whether its opponent is using a new strategy (Line 1). This is achieved by recording game outcomes over a window of episodes and checking at each episode whether the opponent is using one of the known strategies under the performance models. In detail, Bayes-ToMoP keeps a memory of length $l$ recording the game outcome (win, lose, or draw) of each episode, and uses the winning rate $W_l$ over the most recent $l$ episodes as the signal indicating the performance of the current policy library. If the winning rate is lower than a given threshold ($W_l < P$), all existing policies perform poorly against the current opponent strategy even with the confidence-degree adjustment mechanism; in this case, the Bayes-ToMoP agent infers that the opponent is using a previously unseen policy outside the current policy library.
Since we can easily obtain the average winning rate of each policy $\pi$ against each known opponent strategy $\tau$, the lowest winning rate among the best-response policies can be seen as an upper bound on the threshold $P$. The value of $l$ controls the number of episodes before considering whether the opponent is using a previously unseen strategy. Note that detection accuracy increases with the memory length $l$; however, a larger value of $l$ necessarily increases the detection delay. The optimal memory length is determined empirically through extensive simulations.
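The detection test itself reduces to a sliding-window check; a sketch under the assumptions above (the table layout is illustrative):

```python
import numpy as np

def detect_new_opponent(outcomes, l, P):
    """Flag a previously unseen opponent strategy: the winning rate over
    the most recent l episodes (1 = win, 0 = loss/draw) falls below P."""
    if len(outcomes) < l:
        return False                       # not enough evidence yet
    return float(np.mean(outcomes[-l:])) < P

def threshold_upper_bound(win_rate_table):
    """Upper bound on P: the lowest winning rate achieved by the best
    response policy against each known opponent strategy.
    win_rate_table[tau][pi] = average winning rate of pi against tau."""
    return min(max(row.values()) for row in win_rate_table.values())
```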
After detecting that the opponent is using a new strategy, the agent begins to learn a new policy against it (Line 2). Following previous work [Hernandez-Leal et al. 2016], we adopt the same assumption that the opponent does not change its strategy during the learning phase (a number of rounds); otherwise, the learning process may not converge. We adopt a traditional model-based RL algorithm, R-max [Brafman and Tennenholtz 2002], to compute the optimal policy. Specifically, once a new strategy is detected, R-max estimates the state transition function and reward function from the observed experience, computes $Q(s, a)$ for all state-action pairs, and greedily selects the action that maximizes $Q(s, a)$. For deep Bayes-ToMoP, we apply DQN [Mnih et al. 2015] to perform off-policy learning using the obtained interaction experience. DQN is a deep Q-learning method with experience replay that approximates the Q-function with a neural network: it draws samples from a replay memory and trains the network to predict $Q(s, a)$ by minimizing the temporal-difference loss.
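For the deep variant, a minimal PyTorch sketch of the DQN update used to learn the new response policy; the replay buffer, network construction, and hyperparameters are illustrative assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: minimize the TD loss on a replay-buffer minibatch."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay memory
    # Q(s, a; theta) for the actions actually taken.
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from a periodically-updated target network;
        # done is a 0/1 float mask that cuts the bootstrap at terminal states.
        target = r + gamma * target_net(s_next).max(1).values * (1 - done)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```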
After obtaining the optimal policy, Bayes-ToMoP generates new performance models as Gaussian probability distributions for both itself and its opponent, based on the assumption that, in a competitive environment, it knows the rewards of its opponent (Lines 3-5). Next, Bayes-ToMoP updates its beliefs up to the $k$-th order (Line 6). Finally, it adds the new policy and the new opponent strategy to its policy library and the opponent's policy library, respectively (Lines 7-8).
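Generating the new performance models then amounts to fitting Gaussians to the observed episodic returns; a sketch (the helper name and floor on the standard deviation are illustrative):

```python
import numpy as np

def fit_performance_model(returns, min_std=1e-3):
    """Fit a Gaussian performance model P(U | pi, tau) to the episodic
    returns observed while the new policy played against the new strategy."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean(), max(returns.std(), min_std)  # (mu, sigma)

# In a competitive game the opponent's rewards are assumed observable,
# so its performance model can be fitted from the same episodes:
# own_model = fit_performance_model(own_returns)
# opp_model = fit_performance_model(opp_returns)
```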
Theoretical Analysis
In this section, we provide a theoretical analysis showing that Bayes-ToMoP can accurately detect the opponent's strategy from a known policy library and derive an optimal response policy accordingly.
Theorem 1
(Optimality of Strategy Detection) If the opponent plays a strategy from the known policy library, Bayes-ToMoP detects the strategy w.p.1 and selects an optimal response policy accordingly.
Proof 1
Suppose the opponent's strategy is $\tau^*$ and Bayes-ToMoP receives a signal $\sigma_t$ at time step $t$. The belief is updated following the equation

$\beta_t(\tau) = \dfrac{P(\sigma_t \mid \tau, \pi)\, \beta_{t-1}(\tau)}{\sum_{\tau' \in \mathcal{T}} P(\sigma_t \mid \tau', \pi)\, \beta_{t-1}(\tau')}$ (5)

Then there exists a policy $\pi$ such that the inequality $P(\sigma_t \mid \tau^*, \pi) \geq P(\sigma_t \mid \tau, \pi)$ holds for all $\tau \in \mathcal{T}$. Besides, since $\beta_t(\tau^*)$ is bounded ($\beta_t(\tau^*) \leq 1$) and monotonically increasing ($\beta_t(\tau^*) \geq \beta_{t-1}(\tau^*)$), the limit of the sequence $\{\beta_t(\tau^*)\}$ exists by the monotone convergence theorem. Thus, taking the limit on both sides of Equation 5 yields

$\lim_{t \to \infty} \beta_t(\tau^*) = \lim_{t \to \infty} \dfrac{P(\sigma_t \mid \tau^*, \pi)\, \beta_{t-1}(\tau^*)}{\sum_{\tau' \in \mathcal{T}} P(\sigma_t \mid \tau', \pi)\, \beta_{t-1}(\tau')}$ (6)

which holds iff $\lim_{t \to \infty} \beta_t(\tau^*) = 1$. Hence Bayes-ToMoP detects the strategy $\tau^*$ w.p.1 and selects an optimal response policy accordingly.
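To illustrate the convergence in Theorem 1 numerically, here is a toy check under assumed Gaussian performance models (the means, noise level, and library size are arbitrary): the belief on the true strategy rises toward 1 as signals accumulate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
means = np.array([1.0, 0.0, -1.0])   # assumed mean return of pi vs. each tau
true_tau, std = 0, 0.5               # opponent actually plays tau* = tau_0
belief = np.ones(3) / 3              # uniform prior over three strategies

for t in range(50):
    sigma = rng.normal(means[true_tau], std)          # observed signal
    likelihood = norm.pdf(sigma, means, std)          # P(sigma | tau, pi)
    belief = likelihood * belief / (likelihood * belief).sum()  # Eq. 5

print(belief)  # belief[0] approaches 1, as Theorem 1 predicts
```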
As guaranteed by Theorem 1, Bayes-ToMoP behaves optimally when the opponent uses a strategy from the known policy library. If the opponent uses a previously unseen strategy, the new-opponent detection heuristic identifies this when the winning rate of Bayes-ToMoP over a fixed-length window of episodes falls below the accepted threshold; Bayes-ToMoP then begins to learn an optimal response policy. The performance of Bayes-ToMoP against opponents using either known or previously unseen strategies is extensively evaluated through empirical simulations in the next section.
Simulations
Game Settings
We evaluate the performance of Bayes-ToMoP on various testbeds: rock-paper-scissors (RPS) [v. Neumann 1928, Shoham and Leyton-Brown 2009], soccer [Littman 1994, He and Boyd-Graber 2016], and modified Pac-Man [Goodrich, Crandall, and Stimpson 2003, Crandall 2012]. We consider two versions of each game with different state representations. In the first version, we discretize the state space into a small number of discrete states, so that Q-values can be represented in tabular form. We also consider the complete state space, which consists of several dimensions of information; for example, states in soccer include the coordinates of the two agents and the axis limits of the field. In this case, we evaluate the performance of deep Bayes-ToMoP. RPS is a two-player stateless game in which two players simultaneously choose one of three possible actions: 'rock' (R), 'paper' (P), or 'scissors' (S). If both choose the same action, the game ends in a tie. Otherwise, R wins against S, S wins against P, and P wins against R.
Soccer (Figure 2) is a stochastic game on a grid. Two players, A and B, start at one of the starting points on the left and right sides, respectively, and can choose one of five actions: go left, right, up, down, or stay. Any action that moves into a grey grid cell or beyond the border is invalid. The circle in Figure 2 represents the ball, which is randomly assigned to one of the two players initially. Possession of the ball switches when the two players move to the same grid cell. A player scores one point if it takes the ball into its opponent's goal. If neither player scores within 50 steps, the game ends in a tie. Six deterministic policies are generated for the opponent in both soccer versions, and the Bayes-ToMoP agent is equipped with a policy library of correspondingly pretrained response policies. For deep Bayes-ToMoP, DQN is used to train the response policy networks. The DQN has two fully-connected hidden layers, each with 20 hidden units; the output layer is a fully-connected linear layer with one output for each of the 5 actions.
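A sketch of the described network in PyTorch (two fully-connected hidden layers of 20 units, a linear output head with one value per action); the class name and activation choice are assumptions of this sketch.

```python
import torch.nn as nn

class SoccerQNet(nn.Module):
    """Q-network matching the described architecture: the input is the
    state vector (e.g., agent coordinates and field limits), the output
    is one Q-value for each of the 5 actions."""
    def __init__(self, state_dim, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```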
Modified Pac-Man (Figure 2), derived from the SGPD game [Goodrich, Crandall, and Stimpson 2003, Crandall 2012], is a grid game with two players, A and B, who start at one of the starting points on the left and right sides, respectively, and choose from the same five actions: go left, right, up, down, or stay. Any action that moves into a slash-marked grid cell or beyond the border is invalid. Player A scores one point if the two players move to a goal simultaneously (A catches B); conversely, it loses one point if player B moves to a goal without being caught. If neither player scores within 50 steps, the game ends in a tie. We construct 24 deterministic policies for the opponent, and the corresponding pretrained response policies make up the policy library of the Bayes-ToMoP agent.
Unless otherwise mentioned, all experiments use the same parameter settings, and all results are averaged over 100 runs.
Performance against Different Opponents
We evaluate the ability of Bayes-ToMoP$_1$ and deep Bayes-ToMoP$_1$ to detect an opponent's strategies, in comparison with state-of-the-art BPR approaches (BPR+ [Hernandez-Leal et al. 2016] and Bayes-Pepper [Hernandez-Leal et al. 2017b]). Three kinds of opponents are considered: (1) a Bayes-ToMoP$_0$ opponent (O$_1$); (2) an opponent that randomly switches its policy among stationary strategies, each lasting an unknown number of episodes (O$_2$); and (3) an opponent switching between stationary strategies and Bayes-ToMoP$_0$ (O$_3$). Both BPR+ and Bayes-Pepper detect an opponent's strategy based on the predictive power of BPR [Rosman, Hawasly, and Ramamoorthy 2016]. We pretrain the response policies that make up the policy library using Q-learning (DQN for deep Bayes-ToMoP); the library could also be built automatically using existing techniques such as Pepper [Crandall 2012]. Besides, in this section, we assume an opponent switches only among known policies; Bayes-Pepper is therefore equivalent to BPR+ in our setting, and we use BPR to denote both strategies. Due to space limitations, we report deep-version experiments on soccer only.
We first compare Bayes-ToMoP$_1$ with BPR against O$_1$ on the various games. We can see from Figure 4 that only Bayes-ToMoP$_1$ quickly detects the opponent's strategies and achieves the highest average rewards; in contrast, BPR fails against O$_1$. The same comparative results hold for deep Bayes-ToMoP$_1$. This is because Bayes-ToMoP$_1$ explicitly maintains a first-order belief, uses it to predict which policy its opponent will select, and then derives an optimal policy against that prediction. BPR, by contrast, makes decisions based on its expected-improvement heuristic, which is essentially equivalent to Bayes-ToMoP$_0$. Therefore, neither BPR nor Bayes-ToMoP$_0$ can take advantage of the other, and their winning percentages are expected to be approximately 50%. The average winning rates shown in Table 1 confirm this hypothesis.
Next we consider the case of playing against O$_2$; the performance of Bayes-ToMoP$_1$ and BPR on the three games is shown in Figure 4. Both Bayes-ToMoP$_1$ and BPR quickly detect which stationary strategy the opponent is using and derive the optimal policy against it. We observe that Bayes-ToMoP$_1$ requires more time than BPR before achieving the best reward. This happens because Bayes-ToMoP$_1$ needs additional time to determine that the opponent is not using a BPR-based strategy, which is consistent with its slightly lower winning rate compared to BPR in Table 1.
Table 1: Average winning rates of BPR and Bayes-ToMoP$_1$ against the three opponent types.

Opponents    BPR                 Bayes-ToMoP$_1$
O$_1$        49.78% ± 1.71%      99.72% ± 0.16%
O$_2$        99.37% ± 0.72%      97.94% ± 0.37%
O$_3$        33.4% ± 0.57%       98.48% ± 0.54%
Finally, we consider the performance of Bayes-ToMoP$_1$ and BPR against O$_3$ (see Figure 9). Only Bayes-ToMoP$_1$ quickly detects its opponent's strategies and achieves the highest average rewards; BPR fails when its opponent switches to Bayes-ToMoP$_0$. Table 1 shows that Bayes-ToMoP$_1$'s winning rate against O$_3$ is around 98.48%, while BPR's is around 33.4%, consistent with the results in Figure 9 (b). Using our confidence-degree adjustment mechanism, Bayes-ToMoP$_1$ successfully identifies its opponent's type and derives an optimal policy. BPR, however, makes decisions based only on the belief over the opponent's past policies and thus cannot handle BPR-based opponents.
New Opponent Detection and Learning
In this section, we evaluate Bayes-ToMoP against an opponent that may use previously unknown strategies. We manually design a new strategy for the opponent in soccer (i.e., moving along the bottom edge of the grid). The opponent starts with one of the known strategies and switches to the new strategy at the 1000th episode. Figure 9 shows the dynamics of the average rewards of Bayes-ToMoP and deep Bayes-ToMoP, respectively. When the opponent switches to an unknown strategy, both Bayes-ToMoP and deep Bayes-ToMoP quickly detect the change and eventually learn an optimal policy against it.
The Influence of Key Parameters
Figure 9 depicts the impact of the memory length $l$ on the average adjustment time of Bayes-ToMoP$_1$ before it takes advantage of its opponent. We observe a diminishing-returns phenomenon: the average adjustment time decreases quickly with the initial increase in memory length, but additional gains quickly become marginal, and the average adjustment time stabilizes around 2.8 once $l$ is sufficiently large. We hypothesize that, over a relatively small memory, the dynamic changes of the winning rate may be caused by noise, resulting in inaccurate opponent-type detection. As the memory length increases, the judgement of the opponent's type becomes more precise. However, once the memory length exceeds a certain threshold, the winning rate estimate is already accurate enough, and the advantage of further increasing the memory length diminishes.
Finally, the influence of the threshold $P$ on the average adjustment time of Bayes-ToMoP$_1$ is shown in Figure 9. The average adjustment time decreases as $P$ increases, but the decrease gradually levels off once $P$ is larger than 0.7. As the value of $P$ increases, the winning rate crosses the threshold more quickly when the opponent switches its policy, so Bayes-ToMoP$_1$ detects the switch of its opponent's strategies more quickly. Similar to the results in Figure 9, once the value of $P$ exceeds a certain threshold, the advantage of this heuristic diminishes.
Conclusion and Future Work
This paper presents a novel algorithm called Bayes-ToMoP, which leverages the predictive power of BPR and the recursive reasoning ability of ToM to quickly predict the behaviors not only of switching, non-stationary opponents but also of more sophisticated ones (e.g., BPR-based opponents), and to behave optimally against them. Bayes-ToMoP also enables an agent to learn a new optimal policy when encountering a previously unseen strategy. As future work, it is worth investigating how to extend Bayes-ToMoP to multi-opponent scenarios.
References
[Albrecht and Stone 2018] Albrecht, S. V., and Stone, P. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258:66-95.
[Banerjee and Peng 2005] Banerjee, B., and Peng, J. 2005. Efficient learning of multi-step best response. In 4th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2005), July 25-29, 2005, Utrecht, The Netherlands, 60-66.
[Bard et al. 2013] Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2013, 255-262.
[Boutilier 1999] Boutilier, C. 1999. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, 478-485.

[Brafman and Tennenholtz 2002] Brafman, R. I., and Tennenholtz, M. 2002. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213-231.
[Chakraborty and Stone 2010] Chakraborty, D., and Stone, P. 2010. Convergence, targeted optimality, and safety in multiagent learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 191-198.
[Crandall 2012] Crandall, J. W. 2012. Just add Pepper: extending learning algorithms for repeated matrix games to repeated Markov games. In Proceedings of the 2012 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2012, 399-406.
[De Weerd, Verbrugge, and Verheij 2013] De Weerd, H.; Verbrugge, R.; and Verheij, B. 2013. How much does it help to know what she knows you know? An agent-based simulation study. Artificial Intelligence 199:67-92.
[Elidrisi et al. 2014] Elidrisi, M.; Johnson, N.; Gini, M.; and Crandall, J. 2014. Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 2014 International Conference on Autonomous Agents and Multiagent Systems, 1141-1148.
[Fernández and Veloso 2006] Fernández, F., and Veloso, M. 2006. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, 720-727. ACM.

[Fernández, García, and Veloso 2010] Fernández, F.; García, J.; and Veloso, M. 2010. Probabilistic policy reuse for inter-task transfer learning. Robotics and Autonomous Systems 58(7):866-871.
[Gmytrasiewicz and Doshi 2005] Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research (JAIR) 24:49-79.
[Gmytrasiewicz and Durfee 2000] Gmytrasiewicz, P. J., and Durfee, E. H. 2000. Rational coordination in multi-agent environments. Autonomous Agents and Multi-Agent Systems 3(4):319-350.
[Goldman 2012] Goldman, A. I. 2012. Theory of mind. The Oxford Handbook of Philosophy of Cognitive Science 402-424.
[Goodrich, Crandall, and Stimpson 2003] Goodrich, M. A.; Crandall, J. W.; and Stimpson, J. R. 2003. Neglect tolerant teaming: Issues and dilemmas. In AAAI Spring Symposium on Human Interaction with Autonomous Systems in Complex Environments.
[He and Boyd-Graber 2016] He, H., and Boyd-Graber, J. L. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, 1804-1813.
[Hernandez-Leal and Kaisers 2017] Hernandez-Leal, P., and Kaisers, M. 2017. Towards a fast detection of opponents in repeated stochastic games. In Proceedings of the 2017 International Conference on Autonomous Agents and Multiagent Systems, 239-257. Springer.
[Hernandez-Leal et al. 2016] Hernandez-Leal, P.; Taylor, M. E.; Rosman, B.; Sucar, L. E.; and de Cote, E. M. 2016. Identifying and tracking switching, non-stationary opponents: A Bayesian approach. In Multiagent Interaction without Prior Coordination, Papers from the 2016 AAAI Workshop.
[Hernandez-Leal et al. 2017a] Hernandez-Leal, P.; Kaisers, M.; Baarslag, T.; and de Cote, E. M. 2017a. A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR abs/1707.09183.
[Hernandez-Leal et al. 2017b] Hernandez-Leal, P.; Zhan, Y.; Taylor, M. E.; Sucar, L. E.; and de Cote, E. M. 2017b. Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems 31(4):767-789.
[Littman 1994] Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, 157-163.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529-533.
[Powers and Shoham 2005] Powers, R., and Shoham, Y. 2005. Learning against opponents with bounded memory. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, 817-822.
[Rosman, Hawasly, and Ramamoorthy 2016] Rosman, B.; Hawasly, M.; and Ramamoorthy, S. 2016. Bayesian policy reuse. Machine Learning 104(1):99-127.
[Shoham and Leyton-Brown 2009] Shoham, Y., and Leyton-Brown, K. 2009. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
[v. Neumann 1928] v. Neumann, J. 1928. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100(1):295-320.
[Wunder et al. 2011] Wunder, M.; Kaisers, M.; Yaros, J. R.; and Littman, M. 2011. Using iterated reasoning to predict opponent strategies. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 593-600.
[Wunder et al. 2012] Wunder, M.; Yaros, J. R.; Kaisers, M.; and Littman, M. 2012. A framework for modeling population strategies by depth of reasoning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 947-954.