Bayes-ToMoP: A Fast Detection and Best Response Algorithm Towards Sophisticated Opponents

09/12/2018 · by Tianpei Yang et al. · Tsinghua University, Tianjin University

Multiagent algorithms often aim to accurately predict the behaviors of other agents and find a best response accordingly during interactions. Previous works usually assume an opponent uses a stationary strategy or randomly switches among several stationary ones. In practice, however, an opponent may exhibit more sophisticated behaviors by adopting more advanced strategies, e.g., using Bayesian reasoning. This paper presents a novel algorithm called Bayes-ToMoP which can efficiently detect and handle opponents using either stationary or higher-level reasoning strategies. Bayes-ToMoP also supports detecting previously unseen policies and learning a best response policy accordingly. We further propose deep Bayes-ToMoP by extending Bayes-ToMoP with DRL techniques. Experimental results show that both Bayes-ToMoP and deep Bayes-ToMoP outperform state-of-the-art approaches when faced with different types of opponents in two-agent competitive games.


Introduction

In multiagent systems, the ideal behavior of an agent is contingent on the behaviors of coexisting agents. However, agents may exhibit different behaviors adaptively depending on the contexts they encounter. Hence, it is critical for an agent to quickly predict or recognize the behaviors of other agents, and make a best response accordingly [Boutilier1999, Powers and Shoham2005, Gmytrasiewicz and Doshi2005, Banerjee and Peng2005, Fernández and Veloso2006, Chakraborty and Stone2010, Fernández, García, and Veloso2010, Wunder et al.2012, Bard et al.2013, Elidrisi et al.2014, Hernandez-Leal and Kaisers2017, Albrecht and Stone2018, Hernandez-Leal et al.2017a].

One efficient way of recognizing the strategies of other agents is to leverage the idea of Bayesian Policy Reuse (BPR) [Rosman, Hawasly, and Ramamoorthy2016], which was originally proposed for multi-task learning problems to determine the best policy when faced with different tasks. [Hernandez-Leal et al.2016] proposed BPR+ by extending BPR to multiagent learning settings to detect dynamic changes in an opponent’s strategies. BPR+ also extends BPR with the ability to generate new policies online against an opponent using previously unseen policies. However, BPR+ is designed for single-state repeated games only. Recently, [Hernandez-Leal and Kaisers2017] proposed Bayes-Pepper by combining BPR and the Pepper framework [Crandall2012] to handle stochastic games. However, all these approaches assume that an opponent randomly switches its policies among a number of stationary policies. In practice, an opponent can exhibit more sophisticated behaviors by adopting more advanced strategies, e.g., using BPR as well. A BPR-based agent usually cannot behave well against itself, which is also confirmed in our experiments. In such situations, higher-level reasoning and more advanced techniques are required for an agent to predict and play well against such sophisticated opponents.

The above problem can be partially addressed by introducing the concept of Theory of Mind (ToM) [Goldman2012]. ToM is a recursive reasoning technique [Hernandez-Leal et al.2017a, Albrecht and Stone2018] describing a higher cognitive mechanism of explicitly attributing unobservable mental contents such as beliefs, desires, and intentions to other players. Previous methods of recursive reasoning use explicit representations of nested beliefs and “simulate” the reasoning processes of other agents to predict their actions [Gmytrasiewicz and Durfee2000, Gmytrasiewicz and Doshi2005, Wunder et al.2011, Wunder et al.2012]. However, these approaches show no adaptation to non-stationary opponents. One exception is the reasoning framework proposed by [De Weerd, Verbrugge, and Verheij2013], which enables an agent to predict the opponent’s actions explicitly by building an abstract model of the opponent’s behaviors using recursive nested beliefs. Additionally, they use a confidence value to help an agent adapt to different opponents. However, the main drawbacks of this model are: 1) it works only if an agent holds exactly one more layer of belief than its opponent; 2) it is designed for predicting the opponent’s primitive actions instead of high-level strategies, resulting in slow adaptation to non-stationary opponents; 3) it shows poor performance against an opponent using previously unseen strategies.

To address the above challenges, we propose a novel algorithm, Bayesian Theory of Mind on Policy (Bayes-ToMoP), which leverages the predictive power of BPR and ToM to compete against sophisticated opponents. Bayes-ToMoP can quickly and accurately predict an opponent’s behavior and compute a best response accordingly. Besides, Bayes-ToMoP also supports detecting whether an opponent is using a previously unseen policy and learning an optimal response against it. Furthermore, Bayes-ToMoP can be straightforwardly extended to Deep Reinforcement Learning (DRL) environments with a neural network as the value function approximator, termed deep Bayes-ToMoP. Experimental results show that both Bayes-ToMoP and deep Bayes-ToMoP outperform state-of-the-art BPR approaches when faced with different types of opponents in two-agent competitive games.

Background

Bayesian Policy Reuse  BPR was originally proposed in [Rosman, Hawasly, and Ramamoorthy2016] as a framework for an agent to quickly determine the best policy to execute when faced with an unknown task. Given a set of previously-solved tasks T and an unknown task τ*, the agent is required to select the best policy from its policy library Π within as small a number of trials as possible. BPR uses the concept of a belief β, a probability distribution over the set of tasks T, to measure the degree to which τ* matches the known tasks based on a signal σ. A signal can be any information that is correlated with the performance of a policy (e.g., immediate rewards, episodic returns). BPR also relies on performance models of policies on previously-solved tasks, which describe the distribution of returns from each policy on those tasks: a performance model P(U | τ, π) is a probability distribution over the utility U of a policy π on a task τ.

In order to select the best policy, a number of BPR variants with different exploration heuristics are available, e.g., the probability of improvement (BPR-PI) heuristic and the expected improvement (BPR-EI) heuristic. The BPR-PI heuristic utilizes the probability with which a specific policy can achieve a hypothesized increase in performance U⁺ over the current best estimate Ū. Formally, it chooses the policy most likely to achieve the utility U⁺: π = argmax_{π∈Π} Σ_{τ∈T} β(τ) P(U⁺ | τ, π), where U⁺ = Ū + ε for some hypothesized improvement ε > 0. However, it is not straightforward to determine an appropriate value of U⁺; the BPR-EI heuristic avoids this issue by selecting the policy most likely to achieve any possible utility of improvement U⁺ ∈ (Ū, U_max]: π = argmax_{π∈Π} ∫_{Ū}^{U_max} Σ_{τ∈T} β(τ) P(U⁺ | τ, π) dU⁺. [Rosman, Hawasly, and Ramamoorthy2016] showed that the BPR-EI heuristic performs best among all BPR variants; therefore, we choose the BPR-EI heuristic for playing against different opponents.
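For concreteness, the following is a minimal sketch of the BPR-EI selection rule under the common assumption of Gaussian performance models; the function and variable names (bpr_ei_select, perf_models, u_bar, u_max) are illustrative rather than taken from the paper or its code.

import numpy as np
from scipy.stats import norm

def bpr_ei_select(policy_lib, task_lib, belief, perf_models, u_bar, u_max):
    """Pick the policy most likely to achieve any utility in (u_bar, u_max].

    perf_models[(pi, tau)] is assumed to be a (mean, std) pair describing a
    Gaussian performance model P(U | tau, pi); belief[tau] is the current
    probability assigned to task (or opponent strategy) tau.
    """
    best_pi, best_score = None, -np.inf
    for pi in policy_lib:
        score = 0.0
        for tau in task_lib:
            mean, std = perf_models[(pi, tau)]
            # Probability mass of utilities above the current best estimate,
            # weighted by the belief over tasks.
            score += belief[tau] * (norm.cdf(u_max, mean, std) - norm.cdf(u_bar, mean, std))
        if score > best_score:
            best_pi, best_score = pi, score
    return best_pi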

Theory of Mind  The ToM model [De Weerd, Verbrugge, and Verheij2013] predicts an opponent’s action explicitly by building an abstract model of the opponent’s behaviors using recursive nested beliefs. The ToM model is described in the context of a two-player competitive game where an agent and its opponent differ in their abilities to make use of ToM. The notion of ToMₖ indicates an agent that has the ability to use ToM up to the k-th order. [De Weerd, Verbrugge, and Verheij2013] showed that reasoning levels deeper than 2 do not provide significant benefits, so we briefly introduce the first two orders of ToM models.

A zero-order ToM (ToM₀) agent holds a zero-order belief in the form of a probability distribution over the action set of its opponent. The ToM₀ agent then chooses the action that maximizes its expected payoff. A first-order ToM (ToM₁) agent keeps both a zero-order belief and a first-order belief. The first-order belief is a probability distribution that describes what the ToM₁ agent believes its opponent believes about itself. The ToM₁ agent first predicts its opponent’s action under its first-order belief. Then, the ToM₁ agent integrates this first-order prediction with its zero-order belief and uses the integrated belief in its final decision.

Bayes-ToMoP

Motivation

Previous works [Hernandez-Leal et al.2016, Hernandez-Leal and Kaisers2017, Hernandez-Leal et al.2017b] assume that an opponent randomly switches its policies among a number of stationary policies. However, a more sophisticated agent may choose when to change its policy in a more principled way. For instance, it may first predict the policy of its opponent and then best respond to the estimated policy. If the opponent’s policy is estimated by simply counting action frequencies, this reduces to the well-known fictitious play algorithm [Shoham and Leyton-Brown2009]. However, in general, an opponent’s action information may not be observable during interactions. One way of addressing this problem is to use BPR [Rosman, Hawasly, and Ramamoorthy2016], which applies Bayes’ rule to predict the policy of the opponent from received signals (e.g., rewards) and can be regarded as a generalization of fictitious play.

Therefore, a question naturally arises: how can an agent effectively play against both simple opponents with stationary strategies and more advanced ones (e.g., opponents using BPR)? To address this question, we propose a new algorithm called Bayes-ToMoP, which leverages the predictive power of BPR and the recursive reasoning ability of ToM to predict the strategies of such opponents and behave optimally. We also extend Bayes-ToMoP to DRL scenarios with a neural network as the value function approximator, termed deep Bayes-ToMoP. In the following descriptions, we do not distinguish whether a policy is represented in a tabular form or by a neural network unless necessary.

We use the notion of Bayes-ToMoPₖ to indicate an agent with the ability of using theory of mind up to the k-th order. Intuitively, a Bayes-ToMoP agent with a higher-order ToM could take advantage of any Bayes-ToMoP agent with a lower-order ToM. Without loss of generality, we focus on Bayes-ToMoP₀ and Bayes-ToMoP₁; Bayes-ToMoPₖ (k ≥ 2) can be naturally constructed by incorporating a higher-order theory-of-mind idea into our framework. Another reason we focus on the zero- and first-order variants is that the benefit of deeper recursion (higher-level ToM, k ≥ 2) has been found to be marginal [De Weerd, Verbrugge, and Verheij2013]. Bayes-ToMoP is capable of learning a new policy against its opponent if it detects that the opponent is using a previously unknown strategy. In contrast, these two features are not satisfied by the previous ToM model, which works only if an agent holds exactly one more layer of belief than its opponent.

Bayes-ToMoP₀ Algorithm

We start with the simplest case, Bayes-ToMoP₀. Bayes-ToMoP₀ extends ToM₀ by incorporating Bayesian reasoning techniques to predict the strategy of an opponent. Bayes-ToMoP₀ holds a zero-order belief β⁰ over its opponent’s strategy library T, where β⁰(τ) is the probability that the opponent adopts strategy τ ∈ T and Σ_{τ∈T} β⁰(τ) = 1. Given a utility U, a performance model P(U | τ, π) describes the probability of obtaining U when the agent uses a policy π against an opponent strategy τ.

Initialize: Policy library Π, opponent strategy library T, zero-order belief β⁰, performance models P(U | τ, π)

1:  for each episode do
2:     Select the optimal policy π* following the BPR-EI heuristic under β⁰
3:     Play π* and get the episodic reward σ
4:     for each opponent strategy τ ∈ T do
5:        Update zero-order belief β⁰(τ) using Bayes' rule
6:     end for
7:     Detect and learn a new strategy of the opponent (Algorithm 3)
8:  end for
Algorithm 1 Bayes-ToMoP₀ algorithm

A Bayes-ToMoP₀ agent (Algorithm 1) starts with its policy library Π, its opponent’s strategy library T, its performance models and its zero-order belief β⁰. Then, in each episode, given the current belief β⁰, the Bayes-ToMoP₀ agent evaluates the expected improvement utility defined by the BPR-EI heuristic for all policies and selects the optimal one (Line 2). Next, the agent updates its zero-order belief using Bayes’ rule (Lines 4-6). At the end of the episode, Bayes-ToMoP₀ detects whether its opponent is using a previously unseen policy and, if so, learns a new policy against it (Line 7). The new strategy detection and learning algorithm is described in detail in Section New Opponent Detection and Learning.

Finally, note that without the new strategy detection and learning phase (Line 7), the Bayes-ToMoP₀ agent is essentially equivalent to BPR [Rosman, Hawasly, and Ramamoorthy2016], since both first predict the opponent’s strategy (or task type) and then select the optimal policy; each strategy of the opponent here can be regarded as a task in the original BPR. Likewise, the full Bayes-ToMoP₀ is essentially equivalent to BPR+ [Hernandez-Leal et al.2016], since both can handle previously unseen strategies.
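To make the episode loop concrete, here is a minimal sketch of one Bayes-ToMoP₀-style iteration, reusing the bpr_ei_select sketch above; play_episode, run_bayes_tomop0 and the Gaussian-model assumption are illustrative, not the authors' implementation.

from scipy.stats import norm

def bayes_update(belief, played_pi, reward, perf_models):
    """Bayes' rule over opponent strategies, using the episodic reward as the
    signal and assuming Gaussian performance models perf_models[(pi, tau)] = (mean, std)."""
    posterior = {}
    for tau, prior in belief.items():
        mean, std = perf_models[(played_pi, tau)]
        posterior[tau] = norm.pdf(reward, mean, std) * prior
    z = sum(posterior.values())
    return {tau: p / z for tau, p in posterior.items()}

def run_bayes_tomop0(policy_lib, opp_lib, belief, perf_models, u_bar, u_max,
                     play_episode, n_episodes):
    """Select with BPR-EI, play, update the zero-order belief (Lines 2-6 of Algorithm 1).
    play_episode(pi) is assumed to run one game and return our episodic reward."""
    for _ in range(n_episodes):
        pi = bpr_ei_select(policy_lib, opp_lib, belief, perf_models, u_bar, u_max)
        reward = play_episode(pi)
        belief = bayes_update(belief, pi, reward, perf_models)
        # The new-opponent detection and learning step (Algorithm 3) would go here.
    return belief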

Bayes-ToMoP₁ Algorithm

Next, we move to the Bayes-ToMoP₁ algorithm. Apart from its zero-order belief, Bayes-ToMoP₁ also maintains a first-order belief β¹, a probability distribution describing the probability that the agent believes its opponent believes it will choose each policy π ∈ Π. Bayes-ToMoP₁ predicts its opponent’s policy based on this first-order belief. However, the prediction may conflict with its zero-order belief. To address this conflict, Bayes-ToMoP₁ holds a first-order confidence degree c₁ serving as a weighting factor to balance the influence of its first-order prediction and its zero-order belief.

Initialize: Policy library Π and opponent strategy library T, performance models for both players, zero-order belief β⁰, first-order belief β¹

1:  for each episode do
2:     Compute the first-order prediction τ̂ following the BPR-EI heuristic under β¹
3:     Integrate τ̂ with β⁰ into β̂ (see Equation (1))
4:     Select the optimal policy π* following the BPR-EI heuristic under β̂
5:     Play the game, get the episodic reward σ
6:     for each policy π ∈ Π do
7:        Update first-order belief β¹(π) using Bayes' rule
8:     end for
9:     for each opponent strategy τ ∈ T do
10:       Update zero-order belief β⁰(τ) using Bayes' rule
11:    end for
12:    Update c₁ (following Equation (3))
13:    Detect and learn a new strategy of the opponent (see Algorithm 3)
14: end for
Algorithm 2 Bayes-ToMoP₁ Algorithm

The overall strategy of Bayes-ToMoP₁ is shown in Algorithm 2. Given the policy libraries Π and T, the performance models, the zero-order belief β⁰ and the first-order belief β¹, the Bayes-ToMoP₁ agent first predicts the policy τ̂ of its opponent, assuming the opponent maximizes its utility based on the BPR-EI heuristic under the agent’s first-order belief (Line 2). Then, given the first-order predicted policy τ̂ and the zero-order belief β⁰, an integration function computes the final prediction as the linear combination of the first-order prediction and the zero-order belief, weighted by the confidence degree c₁ (Line 3):

β̂(τ) = c₁ · 1[τ = τ̂] + (1 − c₁) · β⁰(τ),    (1)

where 1[·] is the indicator function.

Next, the Bayes-ToMoP₁ agent computes the expected utility improvement based on the performance models and the integrated belief β̂ to derive an optimal policy (Line 4). Finally, the Bayes-ToMoP₁ agent updates its first-order belief and zero-order belief using Bayes’ rule (Lines 5-11).
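The decision step above can be sketched in a few lines on top of the earlier bpr_ei_select and belief dictionaries; integrate_belief, tomop1_select and opp_perf_models are illustrative names under the same Gaussian-model assumption as before, not the authors' code.

def integrate_belief(beta0, predicted_tau, c1):
    """Equation (1)-style integration: mix the first-order point prediction with
    the zero-order belief, weighted by the confidence degree c1."""
    return {tau: c1 * float(tau == predicted_tau) + (1.0 - c1) * p
            for tau, p in beta0.items()}

def tomop1_select(policy_lib, opp_lib, beta0, beta1, c1,
                  perf_models, opp_perf_models, u_bar, u_max):
    """Predict the opponent's policy from the first-order belief, integrate,
    then best respond with BPR-EI (Lines 2-4 of Algorithm 2)."""
    predicted_tau = bpr_ei_select(opp_lib, policy_lib, beta1, opp_perf_models, u_bar, u_max)
    integrated = integrate_belief(beta0, predicted_tau, c1)
    return bpr_ei_select(policy_lib, opp_lib, integrated, perf_models, u_bar, u_max)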

The next issue is how to adaptively update the first-order confidence degree c₁. Initially, we are not sure whether an opponent is making decisions using ToM reasoning or simply switching among stationary strategies. The value of c₁ can be understood as the exploration rate of using the first-order belief to predict the opponent’s strategies. In the previous ToM model [De Weerd, Verbrugge, and Verheij2013], the confidence value is increased when the first-order prediction is correct and decreased otherwise, based on the assumption that an agent can observe the actions and rewards of its opponent. However, in our setting the prediction works on a higher level of behaviors (policies), which are usually not observable (agents are not willing to reveal their policies to others, to avoid being exploited in competitive environments). Therefore, we propose using the game outcome as the signal indicating whether the previous prediction was correct and adjusting the first-order confidence degree accordingly. In a competitive environment, we can distinguish three game outcomes, win, lose or draw, by comparing the two agents’ rewards. Thus, the value of c₁ is increased when the agent’s reward r is larger than its opponent’s reward r′ and decreased otherwise, by an adjustment rate δ:

c₁ ← min(1, c₁ + δ) if r > r′;    c₁ ← max(0, c₁ − δ) otherwise.    (2)

Following this heuristic, Bayes-ToMoP₁ can easily take advantage of Bayes-ToMoP₀, since it can predict in advance which policy Bayes-ToMoP₀ will take. However, this does not work when it plays against less sophisticated agents, e.g., an agent switching among several stationary policies without the ability to use ToM. This is because the value of c₁ oscillates when the agent faces an opponent that is unable to make use of ToM, and the agent thus fails to predict the opponent’s behaviors accurately. To this end, we propose an adaptive and generalized mechanism to adjust the confidence degree and further detect switches of the opponent’s strategies.

We first introduce the winning rate W over a fixed number l of recent episodes to adaptively adjust the confidence degree c₁. Since a Bayes-ToMoP₁ agent assumes its opponent is Bayes-ToMoP₀ at first, the value of l controls the number of episodes before considering that its opponent may have switched to a less sophisticated type. If the current winning rate is no worse than that of the previous episode (W_t ≥ W_{t−1}), we increase the weight of the first-order prediction, i.e., increase the value of c₁ with an adjustment rate δ; if W_t is less than W_{t−1} but still higher than a threshold λ, the performance of the first-order prediction is diminishing, and Bayes-ToMoP₁ decreases the value of c₁ quickly with a decreasing factor μ; if W_t ≤ λ, the rate of exploring the first-order belief is set to 0 and only the zero-order belief is used for prediction. Formally, we have

c₁ ← min(1, c₁ + F·δ)  if W_t ≥ W_{t−1};    c₁ ← max(0, μ·c₁)  if λ < W_t < W_{t−1};    c₁ ← 0  if W_t ≤ λ,    (3)

where λ is the threshold of the winning rate W, which reflects the lower bound of the acceptable difference between the agent’s prediction and its opponent’s actual behaviors, and F ∈ {+1, −1} is an indicator that controls the direction of adjusting the value of c₁. Intuitively, Bayes-ToMoP₁ checks for a switch of its opponent’s strategy at each episode and reverses the value of F whenever its winning rate is no larger than λ:

F ← −F  if W_t ≤ λ;    F ← F  otherwise.    (4)
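A minimal sketch of one way to implement this winning-rate-based adjustment; the exact update rule and the parameter names (delta for the adjustment rate, mu < 1 for the decreasing factor, lam for the winning-rate threshold) are assumptions consistent with the description above rather than the authors' exact formula, and the direction-reversal indicator of Equation (4) is omitted for brevity.

def update_confidence(c1, win_rate_now, win_rate_prev, lam, delta, mu):
    """Adjust the first-order confidence degree from recent winning rates (sketch)."""
    if win_rate_now <= lam:
        return 0.0                      # rely on the zero-order belief only
    if win_rate_now >= win_rate_prev:
        return min(1.0, c1 + delta)     # first-order prediction keeps paying off
    return max(0.0, c1 * mu)            # prediction quality diminishing: decay quickly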

Finally, at the end of each episode, Bayes-ToMoP₁ learns a new optimal policy following Algorithm 3 if it detects a new opponent strategy (detailed in the next section).

New Opponent Detection and Learning

Initialize: Policy library Π, opponent strategy library T, beliefs, performance models

1:  if a new opponent strategy τ_new is detected then
2:     Learn an optimal policy π_new against τ_new
3:     Compute the performance models P(U | τ_new, π) for each π ∈ Π
4:     Compute the performance models P(U | τ, π_new) for each τ ∈ T
5:     Compute the performance model P(U | τ_new, π_new)
6:     Update each order of belief
7:     Π ← Π ∪ {π_new}
8:     T ← T ∪ {τ_new}
9:  end if
Algorithm 3 Detecting and learning a new opponent strategy

The new opponent detection and learning component is the same for all Bayes-ToMoPₖ agents (k ≥ 0) (Algorithm 3). Bayes-ToMoP first detects whether its opponent is using a new strategy (Line 1). This is achieved by recording a window of recent game outcomes and checking at each episode whether the opponent is using one of the known strategies under the performance models. In detail, Bayes-ToMoP keeps a memory of length l recording the game outcome (win, lose or draw) of each episode, and uses the winning rate W over the most recent l episodes as the signal indicating the performance of the current policy library. If the winning rate is lower than a given threshold ρ (W < ρ), all existing policies are performing poorly against the current opponent strategy even with the confidence degree adjustment mechanism; in this case the Bayes-ToMoP agent infers that the opponent is using a previously unseen policy outside the current policy library.

Since we can easily obtain the average winning rate of each policy π against each known opponent strategy τ, the lowest winning rate among the best-response policies can be seen as an upper bound on the value of ρ. The memory length l controls the number of episodes before considering whether the opponent is using a previously unseen strategy. Note that the accuracy of detection increases with the memory length l; however, a larger value of l necessarily increases the detection delay. The optimal memory length is determined empirically through extensive simulations.
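The detection rule can be sketched as a small stateful check over a sliding window of outcomes; make_detector, memory_len and rho are illustrative names for the memory length l and detection threshold ρ described above.

from collections import deque

def make_detector(memory_len, rho):
    """Flag a previously unseen opponent strategy when the winning rate over the
    last memory_len episodes drops below rho (a sketch of Line 1 of Algorithm 3)."""
    outcomes = deque(maxlen=memory_len)

    def record_and_check(won):
        # won: 1 for a win in this episode, 0 for a draw or loss.
        outcomes.append(won)
        if len(outcomes) < memory_len:
            return False                      # not enough evidence yet
        return sum(outcomes) / memory_len < rho

    return record_and_check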

After detecting that the opponent is using a new strategy, the agent begins to learn a new policy against it (Line 2). Following previous work [Hernandez-Leal et al.2016], we adopt the same assumption that the opponent does not change its strategy during the learning phase (a number of rounds); otherwise, the learning process may not converge. We adopt the traditional model-based RL algorithm R-max [Brafman and Tennenholtz2002] to compute the optimal policy. Specifically, once a new strategy is detected, R-max estimates the state transition function and reward function from the observed interactions, computes Q(s, a) for all state-action pairs, and selects the action that maximizes Q(s, a) according to an ε-greedy mechanism. For deep Bayes-ToMoP, we apply DQN [Mnih et al.2015] to perform off-policy learning from the obtained interaction experience. DQN is a deep Q-learning method with experience replay that approximates the Q-function with a neural network: it draws samples from a replay memory and updates the network by minimizing the loss between its predicted Q-values and the bootstrapped targets.
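As a reference point, a minimal sketch of the standard DQN loss on a replay batch; the network layout mirrors the two 20-unit hidden layers described later for deep soccer, while the ReLU activations, discount factor and other details are assumptions.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small Q-network: two hidden layers, one output per action (sizes illustrative)."""
    def __init__(self, state_dim, n_actions, hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Standard DQN loss on a replay batch (s, a, r, s_next, done);
    a is a LongTensor of action indices, done a float mask in {0, 1}."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)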

After obtaining the optimal policy, Bayes-ToMoP generates new performance models as a set of Gaussian probability distributions for both itself and its opponent, based on the assumption that in a competitive environment it knows the rewards of its opponent (Lines 3-5). Next, Bayes-ToMoP updates its beliefs up to the k-th order (Line 6). Finally, it adds the new policies π_new and τ_new to its policy library and its opponent’s strategy library, respectively (Lines 7-8).
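Fitting such a Gaussian performance model is a one-liner over the episodic returns collected while playing a policy against an opponent strategy; fit_performance_model and the small variance floor are illustrative choices, not specified in the paper.

import numpy as np

def fit_performance_model(returns):
    """Fit a Gaussian performance model (mean, std) from episodic returns obtained by
    playing one policy against one opponent strategy (a sketch of Lines 3-5 of Algorithm 3)."""
    returns = np.asarray(returns, dtype=float)
    # A small floor on the std keeps the model usable when returns barely vary.
    return returns.mean(), max(returns.std(), 1e-3)

# Example: returns collected by pi_new against tau_new over a handful of episodes.
perf_models = {("pi_new", "tau_new"): fit_performance_model([1.0, 0.0, 1.0, 1.0, 0.0])}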

Theoretical Analysis

In this section, we provide a theoretical analysis showing that Bayes-ToMoP can accurately detect the opponent’s strategy from a known policy library and derive an optimal response policy accordingly.

Theorem 1

(Optimality on Strategy Detection) If the opponent plays a strategy from the known policy library, Bayes-ToMoP detects the strategy w.p.1 and selects an optimal response policy accordingly.

Proof 1

Suppose the opponent’s strategy is τ* ∈ T. When Bayes-ToMoP receives a signal σ_t at time step t, the belief is updated following

β_{t+1}(τ) = P(σ_t | τ, π_t) β_t(τ) / Σ_{τ′∈T} P(σ_t | τ′, π_t) β_t(τ′).    (5)

Then there exists a policy π_t such that the inequality P(σ_t | τ*, π_t) ≥ P(σ_t | τ′, π_t) holds for all τ′ ∈ T. Besides, since β_t(τ*) is bounded (β_t(τ*) ≤ 1) and monotonically increasing (β_{t+1}(τ*) ≥ β_t(τ*)), the monotone convergence theorem guarantees that the limit of the sequence {β_t(τ*)} exists. Thus, taking the limit on both sides of Equation (5), the following holds:

lim_{t→∞} β_t(τ) = 1    (6)

iff τ = τ*. Hence Bayes-ToMoP detects the strategy w.p.1 and selects an optimal response policy accordingly.
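For completeness, the monotonicity step used above follows directly from the stated inequality; a short derivation (under that assumption, and using Σ_{τ′} β_t(τ′) = 1):

\begin{align*}
\beta_{t+1}(\tau^*)
  &= \frac{P(\sigma_t \mid \tau^*, \pi_t)\,\beta_t(\tau^*)}
          {\sum_{\tau' \in T} P(\sigma_t \mid \tau', \pi_t)\,\beta_t(\tau')} \\
  &\ge \frac{P(\sigma_t \mid \tau^*, \pi_t)\,\beta_t(\tau^*)}
            {P(\sigma_t \mid \tau^*, \pi_t)\sum_{\tau' \in T} \beta_t(\tau')}
   = \beta_t(\tau^*),
\end{align*}

so the sequence β_t(τ*) is non-decreasing and bounded above by 1, and therefore converges.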

As guaranteed by Theorem 1, Bayes-ToMoP behaves optimally when the opponent uses a strategy from the known policy library. If the opponent is using a previously unseen strategy, the new opponent strategy detection heuristic identifies this exactly when the winning rate of Bayes-ToMoP over a fixed-length window falls below the accepted threshold, and Bayes-ToMoP then begins to learn an optimal response policy. The performance of Bayes-ToMoP against opponents using either known or previously unseen strategies is extensively evaluated through empirical simulations in the next section.

Simulations

Figure 1: The soccer game.
Figure 2: Modified packman.

Game Settings

Figure 3: Average rewards of Bayes-ToMoP and BPR against a Bayes-ToMoP₀ opponent on different games: (a) RPS, (b) tabular soccer, (c) modified packman, (d) deep soccer.
Figure 4: Average rewards of Bayes-ToMoP and BPR against a non-stationary opponent on different games: (a) RPS, (b) tabular soccer, (c) modified packman, (d) deep soccer.

We evaluate the performance of Bayes-ToMoP on various testbeds: rock-paper-scissors (RPS) [v. Neumann1928, Shoham and Leyton-Brown2009], soccer [Littman1994, He and Boyd-Graber2016] and modified packman [Goodrich, Crandall, and Stimpson2003, Crandall2012]. We consider two versions of the three games with different state representations. In the first version, we discretize the state space into a small number of discrete states, so that Q-values can be represented in a tabular form. We also consider the complete state space, which consists of different dimensions of information; for example, a state in soccer includes the coordinates of the two agents and the axis limits of the field. In this case, we evaluate the performance of deep Bayes-ToMoP. RPS is a two-player stateless game in which two players simultaneously choose one of three possible actions: ‘rock’ (R), ‘paper’ (P), or ‘scissors’ (S). If both choose the same action, the game ends in a tie. Otherwise, R wins against S, S wins against P, and P wins against R.

Soccer (Figure 1) is a stochastic game on a grid. Two players, A and B, start at one of the starting points on the left and right sides, respectively, and can choose one of the following 5 actions: go left, right, up, down and stay. Any action that moves into a grey grid or beyond the border is invalid. The circle in Figure 1 represents the ball, which is randomly assigned to one of the two players initially. Possession of the ball switches when the two players move to the same grid. A player scores one point if it takes the ball into its opponent’s goal. If neither player scores within 50 steps, the game ends in a tie. Six deterministic policies are generated for the opponent in both soccer games, and the Bayes-ToMoP agent is equipped with a policy library of corresponding pre-trained response policies. For deep Bayes-ToMoP, DQN is used to train the response policy networks. The DQN has two fully-connected hidden layers, both with 20 hidden units, and a fully-connected linear output layer with one output for each of the 5 actions.

Modified packman (Figure 2), derived from the SGPD game [Goodrich, Crandall, and Stimpson2003, Crandall2012], is a grid game with two players A and B, who start at one of the starting points on the left and right sides, respectively, and choose from the following 5 actions: go left, right, up, down and stay. Any action that moves into a black-slashed grid or beyond the border is invalid. Player A scores one point if the two players move to a goal simultaneously (A catches B); it loses one point if player B reaches a goal without being caught. If neither player scores within 50 steps, the game ends in a tie. We construct 24 deterministic policies for the opponent, and the corresponding pre-trained response policies make up the policy library of the Bayes-ToMoP agent.

Unless otherwise mentioned, all experiments use the same parameter settings: . All results are averaged over 100 runs.

Performance against Different Opponents

We evaluate the ability of Bayes-ToMoP and deep Bayes-ToMoP to detect their opponent’s strategies in comparison with state-of-the-art BPR approaches (BPR+ [Hernandez-Leal et al.2016] and Bayes-Pepper [Hernandez-Leal et al.2017b]). Three kinds of opponents are considered: (1) a Bayes-ToMoP₀ opponent; (2) a non-stationary opponent that randomly switches its policy among stationary strategies, each lasting an unknown number of episodes; and (3) an opponent switching its strategy between stationary strategies and Bayes-ToMoP₀. Both BPR+ and Bayes-Pepper detect an opponent’s strategy based on the predictive power of BPR [Rosman, Hawasly, and Ramamoorthy2016]. We pre-train the response policies that make up the policy library using Q-learning (DQN for deep Bayes-ToMoP); such a library can also be built automatically using existing techniques such as Pepper [Crandall2012]. In this section, we assume an opponent only switches policies among known ones; Bayes-Pepper is thus equivalent to BPR+ in our setting, and we use BPR to denote both. For deep Bayes-ToMoP, we only report experiments on deep soccer due to space limits.

We first compare Bayes-ToMoP₁ with BPR against the Bayes-ToMoP₀ opponent on various games. We can see from Figure 3 that only Bayes-ToMoP₁ can quickly detect the opponent’s strategies and achieve the highest average rewards, whereas BPR fails against this opponent. The same comparison results hold for deep Bayes-ToMoP. This is because Bayes-ToMoP₁ explicitly maintains a first-order belief, uses it to predict which policy its opponent will select, and then derives an optimal policy against it. In contrast, BPR makes decisions based on the BPR-EI heuristic, which is essentially equivalent to Bayes-ToMoP₀; therefore, neither BPR nor Bayes-ToMoP₀ can take advantage of the other, and their winning percentages are expected to be approximately 50%. The average winning rates shown in Table 1 confirm this.

Figure 5: Performance against an opponent switching between non-stationary strategies and Bayes-ToMoP₀ on different games: (a) RPS, (b) tabular soccer, (c) modified packman, (d) deep soccer.
Figure 6: Average reward of Bayes-ToMoP in soccer.
Figure 7: Average reward of deep Bayes-ToMoP in soccer.
Figure 8: The impact of the memory length l.
Figure 9: The impact of the threshold.

Next, we consider the case of playing against the non-stationary opponent; the performance of Bayes-ToMoP₁ and BPR on the three games is shown in Figure 4. Both Bayes-ToMoP₁ and BPR can quickly detect which stationary strategy the opponent is using and derive the optimal policy against it. We observe that Bayes-ToMoP₁ requires longer than BPR before achieving the best reward, because Bayes-ToMoP₁ needs additional time to determine that the opponent is not using a BPR-based strategy. This is consistent with the slightly lower winning rate of Bayes-ToMoP₁ compared with BPR reported in Table 1.

Opponents                                   BPR               Bayes-ToMoP
Bayes-ToMoP₀ opponent                       49.78% ± 1.71%    99.72% ± 0.16%
Non-stationary opponent                     99.37% ± 0.72%    97.94% ± 0.37%
Switching opponent (stationary / ToMoP₀)    33.4% ± 0.57%     98.48% ± 0.54%
Table 1: Average winning rates (mean ± std. dev.) in tabular soccer.

Finally, we consider the performance of Bayes-ToMoP₁ and BPR against the opponent switching between stationary strategies and Bayes-ToMoP₀ (see Figure 5). Only Bayes-ToMoP₁ can quickly detect its opponent’s strategies and achieve the highest average rewards; BPR fails when its opponent’s strategy switches to Bayes-ToMoP₀. Table 1 shows that Bayes-ToMoP₁’s winning rate against this opponent is around 98.48% while BPR’s is around 33.4%, consistent with the results in Figure 5(b). Using our confidence degree adjustment mechanism, Bayes-ToMoP₁ can successfully identify its opponent’s type and derive an optimal policy, whereas BPR only makes decisions based on the belief over the opponent’s past policies and thus cannot handle BPR-based opponents.

New Opponent Detection and Learning

In this section, we evaluate Bayes-ToMoP against an opponent who may use previously unknown strategies. We manually design a new strategy for the opponent in soccer (i.e., moving along the bottom edge of the grid). The opponent starts with one of the known strategies and switches to the new strategy at the 1000th episode. Figures 6 and 7 show the dynamics of the average rewards of Bayes-ToMoP and deep Bayes-ToMoP, respectively. When the opponent switches to an unknown strategy, both Bayes-ToMoP and deep Bayes-ToMoP quickly detect this change and eventually learn an optimal policy.

The Influence of Key Parameters

Figure 8 depicts the impact of the memory length l on the average adjustment time of Bayes-ToMoP₁ before it takes advantage of its opponent. We observe diminishing returns: the average adjustment time decreases quickly as the memory length initially increases, but additional gains quickly become marginal, and the average adjustment time stabilizes around 2.8 once l is large enough. We hypothesize that with a relatively small memory, changes in the winning rate may be caused by noise, resulting in inaccurate opponent-type detection. As the memory length increases, the judgement about the opponent’s type becomes more precise. However, once the memory length exceeds a certain threshold, the winning-rate estimate is already accurate enough, and the advantage of further increasing the memory length diminishes.

Finally, the influence of the threshold on the average adjustment time of Bayes-ToMoP₁ against its opponent is shown in Figure 9. The average adjustment time decreases as the threshold increases, but the decrease gradually levels off once the threshold exceeds 0.7. With a larger threshold, the winning rate drops below it more quickly when the opponent switches its policy, so Bayes-ToMoP₁ detects the switch of its opponent’s strategies more quickly. Similar to the results in Figure 8, once the threshold exceeds a certain value, the advantage of this heuristic diminishes.

Conclusion and Future Work

This paper presents a novel algorithm called Bayes-ToMoP, which leverages the predictive power of BPR and the recursive reasoning ability of ToM to quickly predict the behaviors of not only switching, non-stationary opponents but also more sophisticated ones (e.g., BPR-based opponents) and to behave optimally against them. Bayes-ToMoP also enables an agent to learn a new optimal policy when encountering a previously unseen strategy. As future work, it is worth investigating how to extend Bayes-ToMoP to multi-opponent scenarios.

References

  • [Albrecht and Stone2018] Albrecht, S. V., and Stone, P. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258:66–95.
  • [Banerjee and Peng2005] Banerjee, B., and Peng, J. 2005. Efficient learning of multi-step best response. In 4th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2005), July 25-29, 2005, Utrecht, The Netherlands, 60–66.
  • [Bard et al.2013] Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2013, 255–262.
  • [Boutilier1999] Boutilier, C. 1999. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, 478–485.
  • [Brafman and Tennenholtz2002] Brafman, R. I., and Tennenholtz, M. 2002. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.
  • [Chakraborty and Stone2010] Chakraborty, D., and Stone, P. 2010. Convergence, targeted optimality, and safety in multiagent learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 191–198.
  • [Crandall2012] Crandall, J. W. 2012. Just add pepper: extending learning algorithms for repeated matrix games to repeated markov games. In Proceedings of the 2012 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2012, 399–406.
  • [De Weerd, Verbrugge, and Verheij2013] De Weerd, H.; Verbrugge, R.; and Verheij, B. 2013. How much does it help to know what she knows you know? an agent-based simulation study. Artificial Intelligence 199:67–92.
  • [Elidrisi et al.2014] Elidrisi, M.; Johnson, N.; Gini, M.; and Crandall, J. 2014. Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, 1141–1148. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Fernández and Veloso2006] Fernández, F., and Veloso, M. 2006. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, 720–727. ACM.
  • [Fernández, García, and Veloso2010] Fernández, F.; García, J.; and Veloso, M. 2010. Probabilistic policy reuse for inter-task transfer learning. Robotics and Autonomous Systems 58(7):866–871.
  • [Gmytrasiewicz and Doshi2005] Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. J. Artif. Intell. Res.(JAIR) 24:49–79.
  • [Gmytrasiewicz and Durfee2000] Gmytrasiewicz, P. J., and Durfee, E. H. 2000. Rational coordination in multi-agent environments. Autonomous Agents and Multi-Agent Systems 3(4):319–350.
  • [Goldman2012] Goldman, A. I. 2012. Theory of mind. The Oxford handbook of philosophy of cognitive science 402–424.
  • [Goodrich, Crandall, and Stimpson2003] Goodrich, M. A.; Crandall, J. W.; and Stimpson, J. R. 2003. Neglect tolerant teaming: Issues and dilemmas. In AAAI Spring Sympo. on Human Interaction with Autonomous Systems in Complex Environments.
  • [He and Boyd-Graber2016] He, H., and Boyd-Graber, J. L. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, 1804–1813.
  • [Hernandez-Leal and Kaisers2017] Hernandez-Leal, P., and Kaisers, M. 2017. Towards a fast detection of opponents in repeated stochastic games. In Proceedings of the 2017 International Conference on Autonomous Agents and Multiagent Systems, 239–257. Springer.
  • [Hernandez-Leal et al.2016] Hernandez-Leal, P.; Taylor, M. E.; Rosman, B.; Sucar, L. E.; and de Cote, E. M. 2016. Identifying and tracking switching, non-stationary opponents: A bayesian approach. In Multiagent Interaction without Prior Coordination, Papers from the 2016 AAAI Workshop.
  • [Hernandez-Leal et al.2017a] Hernandez-Leal, P.; Kaisers, M.; Baarslag, T.; and de Cote, E. M. 2017a. A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR abs/1707.09183.
  • [Hernandez-Leal et al.2017b] Hernandez-Leal, P.; Zhan, Y.; Taylor, M. E.; Sucar, L. E.; and de Cote, E. M. 2017b. Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems 31(4):767–789.
  • [Littman1994] Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, 157–163.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
  • [Powers and Shoham2005] Powers, R., and Shoham, Y. 2005. Learning against opponents with bounded memory. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, 817–822.
  • [Rosman, Hawasly, and Ramamoorthy2016] Rosman, B.; Hawasly, M.; and Ramamoorthy, S. 2016. Bayesian policy reuse. Machine Learning 104(1):99–127.
  • [Shoham and Leyton-Brown2009] Shoham, Y., and Leyton-Brown, K. 2009. Multiagent Systems - Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
  • [v. Neumann1928] v. Neumann, J. 1928. Zur theorie der gesellschaftsspiele. Mathematische Annalen 100(1):295–320.
  • [Wunder et al.2011] Wunder, M.; Kaisers, M.; Yaros, J. R.; and Littman, M. 2011. Using iterated reasoning to predict opponent strategies. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, 593–600. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Wunder et al.2012] Wunder, M.; Yaros, J. R.; Kaisers, M.; and Littman, M. 2012. A framework for modeling population strategies by depth of reasoning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, 947–954. International Foundation for Autonomous Agents and Multiagent Systems.