Searching with Opponent-Awareness

04/21/2021 ∙ by Timy Phan, et al.

We propose Searching with Opponent-Awareness (SOA), an approach to leverage opponent-aware planning without explicit or a priori opponent models for improving performance and social welfare in multi-agent systems. To this end, we develop an opponent-aware MCTS scheme using multi-armed bandits based on Learning with Opponent-Learning Awareness (LOLA) and compare its effectiveness with other bandits, including UCB1. Our evaluations cover several different settings and show that the benefits of SOA are especially evident with an increasing number of agents.


1. Introduction

In recent years, artificial intelligence methods have led to significant advances in several areas, especially reinforcement learning

(Mnih et al., 2013; Silver et al., 2016, 2017b, 2017a; Schrittwieser et al., 2020; Vinyals et al., 2019). Among these methods, planning is notable for exploiting the option of predicting possible directions into which a scenario can evolve. Planning has played a key role in achieving state-of-the-art performance in challenging domains like Atari, shogi, chess, go and hex Silver et al. (2016, 2017b, 2017a); Schrittwieser et al. (2020); Anthony et al. (2017). Monte-Carlo planning enables scalable decision making via stochastic sampling by using a generative model as a black-box simulator instead of explicit environment models, in which state and reward distributions need to be specified a priori Silver and Veness (2010).

In a multi-agent system, individual agents have to make decisions in environments with higher complexity w.r.t. the number of dynamic elements (i.e. other agents). Agents may need to coordinate or to compete to satisfy their interests which results in a balance of conflict and cooperation. In a real-world use-case of multi-agent systems, like self-driving cars or workers in an industrial factory, agents are required to coordinate to avoid collisions and yet reach their individual goals. These interactions are studied in game theory and usually lead to a Nash equilibrium

Nash (1950); Shoham and Leyton-Brown (2008); Hamilton (1992).

Because the actions of each agent affect the payoffs of all other agents, it makes sense for agents to have an opponent model or to be opponent-aware in order to optimize their individual decisions. A naive agent only considers its own interests which can lead to traffic jams or accidents in the case of autonomous cars when conflicting interests escalate. An opponent model is a representation of another agent which is given or obtained via opponent modelling. Opponent-awareness refers to an agent’s ability to consider opponents in its own updates, either through direct access to opponent parameters or through approximation with opponent modelling.

Although modelling an opponent in planning is not a recent discovery Riley and Veloso (2002); Albrecht and Stone (2018); Ahmadi and Stone (2007), there is currently no approach which actually leverages opponent-awareness to improve social interaction between agents. The difficulty of opponent-aware planning in general-sum games lies in the lack of available and adequate opponent models in addition to the environment model. Special cases like cooperative or competitive games permit solutions which alleviate this problem by additional maximization or minimization operations Littman (1994, 2001); Silver et al. (2016, 2017b), but general-sum games become intractable w.r.t. the number of agents and possible outcomes.

In this paper, we propose Searching with Opponent-Awareness (SOA) as an approach to opponent-aware planning without explicit and a priori opponent models in the context of Monte-Carlo planning. Our contributions are as follows:

  • A novel type of gradient-based multi-armed bandit which leverages opponent-awareness based on Learning with Opponent-Learning Awareness (LOLA) Foerster et al. (2018a).

  • A novel MCTS variant using opponent-aware bandits to consider agent behavior during planning without explicit or a priori opponent models.

  • An evaluation on several general-sum games w.r.t. social behavior, payoffs and scalability, and a comparison with alternative MCTS approaches. We analyze the behavior of planning agents in three iterated matrix games and two gridworld tasks with sequential social dilemmas and find that opponent-aware planners are more likely to cooperate. This leads to higher social welfare, and we show that this benefit is especially evident with an increasing number of agents when compared to naive planning agents.

2. Related Work

Opponent-awareness has been studied extensively in the model-free reinforcement learning literature Carmel and Markovitch (1995); Zhang et al. (2020), notably minimax-Q-learning Littman (1994), Friend-or-Foe Q-learning Littman (2001), policy hill climbing Bowling and Veloso (2001); Xi et al. (2015) and neural replicator dynamics Hennes et al. (2020). In particular, Learning with Opponent-Learning Awareness (LOLA) exploits the update of a naive-learning opponent to achieve higher performance and social welfare Foerster et al. (2018a). Further work enhances LOLA learners through higher-order estimates of learning updates Foerster et al. (2018b) and stabilization of fixed points Letcher et al. (2019). All of these approaches are applied to model-free reinforcement learning, whereas in this paper we study model-based approaches and apply opponent-awareness to planning agents using search.

Model-based state-of-the-art approaches Silver et al. (2016, 2017b, 2017a); Schrittwieser et al. (2020) also model opponents, though their application is restricted to zero-sum games. They are based on UCB1 bandits which select actions deterministically. This is insufficient in general-sum games where Nash equilibria generally consist of mixed strategies that select actions stochastically, as opposed to pure strategies which are deterministic. Furthermore, although these bandits are used to model opponents, the bandits themselves are not opponent-aware and their purpose is not to encourage cooperation or to optimize social welfare in general-sum games. In contrast, our proposed approach models opponents through opponent-aware multi-armed bandits.

Though approaches like GraWoLF Bowling and Veloso (2003) and fictitious play Vrieze and Tijs (1982) are applicable to general-sum games and take opponent behavior into account, note that these are learning algorithms as opposed to our proposed planning method.

Another way to implement opponent-awareness is through inter-agent communication, which has also been studied in deep reinforcement learning Foerster et al. (2016); Lowe et al. (2019) and planning Wu et al. (2009, 2011). This is fundamentally different from opponent modelling, which is done in a decentralized manner by an individual agent and without assuming a separate channel for information exchange between agents Foerster et al. (2018a); Lowe et al. (2017).

3. Notation

We consider Markov games Littman (1994); Sigaud and Buffet (2010) $M = \langle D, S, A, P, R \rangle$, where $D = \{1, \dots, N\}$ is the set of agents, $N$ is the number of agents, $S$ is the set of states, $A = A_1 \times \dots \times A_N$ is the set of joint actions, $P(s_{t+1} | s_t, a_t)$ is the transition probability function and $R = \langle R_1, \dots, R_N \rangle$ with $R_i(s_t, a_t)$ is the joint reward function. In this paper, we assume finite state and action sets, and $t$ denotes a given time step. The joint action at time step $t$ is given by $a_t = \langle a_{t,1}, \dots, a_{t,N} \rangle \in A$ and the individual action of agent $i$ is $a_{t,i} \in A_i$.

The behavior of agent $i$ is given by its policy $\pi_i(a_{t,i} | s_t)$ and the joint policy of all agents is defined by $\pi = \langle \pi_1, \dots, \pi_N \rangle$. The goal is to maximize the expectation of the discounted return $G_{t,i}$ at any given state:

$G_{t,i} = \sum_{k=0}^{h-1} \gamma^k R_i(s_{t+k}, a_{t+k})$ (1)

where $h$ is the future horizon and $\gamma \in [0, 1)$ is the discount factor. The joint discounted return is $G_t = \langle G_{t,1}, \dots, G_{t,N} \rangle$.

We measure social welfare with the collective undiscounted return $R_{\text{total}}$:

$R_{\text{total}} = \sum_{t=0}^{T-1} \sum_{i=1}^{N} R_i(s_t, a_t)$ (2)

where $T$ is the episode length.

A policy $\pi_i$ is evaluated with a state value function $V_i^{\pi}(s_t) = \mathbb{E}[G_{t,i} | s_t, \pi]$, which is defined by the expected discounted return at any state $s_t$ Bellman (2003); Boutilier (1996) under the current joint policy $\pi$. The joint state value function $V^{\pi} = \langle V_1^{\pi}, \dots, V_N^{\pi} \rangle$ evaluates the joint policy $\pi$. If a policy $\pi_i$ has a state value function with $V_i^{\langle \pi_i, \pi_{-i} \rangle}(s_t) \geq V_i^{\langle \pi_i', \pi_{-i} \rangle}(s_t)$ for any $s_t$ and any alternative policy $\pi_i'$, given that the policies $\pi_{-i}$ of all other agents are kept constant, then $\pi_i$ is a best response to $\pi_{-i}$ Shoham and Leyton-Brown (2008); Hamilton (1992). If $\pi_i$ is a best response for all $i$, $\pi$ is a Nash equilibrium Nash (1950); Shoham and Leyton-Brown (2008); Hamilton (1992).

In game theory, the term strategy describes an agent’s choice of actions at any given state Shoham and Leyton-Brown (2008); Hamilton (1992). There is a distinction between mixed strategies and pure strategies, whereby the former are stochastic and the latter are deterministic w.r.t. the state Shoham and Leyton-Brown (2008); Hamilton (1992). In our case, an agent’s mixed strategy is equivalent to its policy.

4. Methods

4.1. Planning

Planning searches for approximate best responses, given a generative model $\hat{M}$ which approximates the actual environment $M$, including $\hat{P} \approx P$ and $\hat{R} \approx R$. We assume $\hat{M} = M$. Global planning searches the whole state space for a global policy, while local planning only considers the current state $s_t$ and possible future states within a horizon $h$ to find a local policy $\hat{\pi}_i$. We focus on local planning for online planning, i.e., actions are planned and executed at each time step $t$ given a fixed computation budget $n_b$.
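For illustration, a minimal sketch of such an online planning loop is given below. The `env` and `plan` interfaces are hypothetical placeholders, not part of the paper; the point is that a fixed budget of simulations is spent at every time step, and only the locally planned action is executed.

```python
# Minimal sketch of online (local) planning with a fixed per-step budget.
# env and plan are hypothetical interfaces used only for illustration.

def run_episode(env, plan, n_b, episode_length):
    """Plan and execute one action per time step (online planning)."""
    state = env.reset()
    total_reward = 0.0
    for t in range(episode_length):
        action = plan(state, budget=n_b)      # local search from the current state only
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```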

4.2. Multi-armed bandits

Multi-armed bandits (MAB) are agents used to solve decision-making problems with a single state. A bandit repeatedly selects one action or arm $a \in A$ and is given a reward $X_a$, where $A$ is the set of available actions and $X_a$ is a stochastic variable with an unknown distribution. The goal is to maximize the bandit’s expected reward by estimating the mean reward $\bar{X}_a$ for all $a \in A$ and selecting the arm $a^* = \arg\max_{a \in A} \bar{X}_a$.

To learn the optimal MAB policy, a balance between selecting different arms to estimate $\bar{X}_a$ and selecting $a^*$ is needed. Finding such a balance is known as the exploration-exploitation dilemma, where MAB algorithms trade off the avoidance of poor local optima against convergence speed.

4.2.1. Upper Confidence Bound (UCB1)

UCB1 bandits are commonly used in state-of-the-art approaches to MAB problems Silver et al. (2016, 2017b); Schrittwieser et al. (2020); Kocsis and Szepesvári (2006); Bubeck and Munos (2010). The UCB1 criterion for selecting an action is given by Auer et al. (2002):

$a^* = \arg\max_{a \in A} \left( \bar{X}_a + C \sqrt{\frac{2 \ln n}{n_a}} \right)$ (3)

where $C$ is an exploration constant, $n$ is the iteration count and $n_a$ is the number of times $a$ has been selected. UCB1 bandits always choose $a^*$ according to Eq. 3, though if there is more than one maximizing arm, the action is chosen among them randomly with uniform probability.
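For illustration, a minimal sketch of the UCB1 selection rule in Eq. 3. The function name, list-based bookkeeping and default exploration constant are our choices; unvisited arms are tried first and ties are broken uniformly at random, as described above.

```python
import math
import random

def ucb1_select(mean_reward, counts, n, C=math.sqrt(2)):
    """UCB1 arm selection (Eq. 3): mean reward plus an exploration bonus.

    mean_reward, counts: per-arm lists; n: total number of selections so far.
    """
    untried = [a for a, c in enumerate(counts) if c == 0]
    if untried:
        return random.choice(untried)          # try every arm at least once
    scores = [mean_reward[a] + C * math.sqrt(2 * math.log(n) / counts[a])
              for a in range(len(counts))]
    best = max(scores)
    return random.choice([a for a, s in enumerate(scores) if s == best])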

4.2.2. Gradient-based (GRAB)

GRAB bandits optimize the expected reward through a variant of stochastic gradient ascent Sutton and Barto (2018). Each arm $a \in A$ is assigned a numerical preference $H_n(a)$, where $n$ is the iteration count, and the probability of selecting $a$ is $\pi_n(a)$ Sutton and Barto (2018):

$\pi_n(a) = \frac{e^{H_n(a)}}{\sum_{b \in A} e^{H_n(b)}}$ (4)

$H_n(a)$ is updated according to the following rule Sutton and Barto (2018):

$H_{n+1}(a) = H_n(a) + \beta \left( X_n - \bar{X}_n \right) \left( \mathbb{1}_{a = a_n} - \pi_n(a) \right)$ (5)

where $a_n$ is the selected arm, $\beta$ is the learning rate, $\bar{X}_n$ is the current mean reward of this bandit, and $\mathbb{1}_{a = a_n}$ is 1 iff $a = a_n$ and 0 otherwise.
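A minimal sketch of a GRAB bandit implementing Eqs. 4 and 5. The class name and the incremental running-mean baseline are our choices, not prescribed by the paper.

```python
import numpy as np

class GRABBandit:
    """Gradient bandit (GRAB) sketch: softmax over numerical preferences (Eq. 4),
    updated by stochastic gradient ascent with a running mean baseline (Eq. 5)."""

    def __init__(self, num_arms, learning_rate=0.1):
        self.H = np.zeros(num_arms)   # preferences H_n(a)
        self.beta = learning_rate
        self.mean_reward = 0.0
        self.n = 0

    def policy(self):
        exp_h = np.exp(self.H - self.H.max())   # numerically stable softmax (Eq. 4)
        return exp_h / exp_h.sum()

    def sample(self):
        return np.random.choice(len(self.H), p=self.policy())

    def update(self, arm, reward):
        self.n += 1
        self.mean_reward += (reward - self.mean_reward) / self.n
        pi = self.policy()
        one_hot = np.zeros(len(self.H))
        one_hot[arm] = 1.0
        # Eq. 5: H <- H + beta * (X_n - mean X) * (1{a = a_n} - pi(a))
        self.H += self.beta * (reward - self.mean_reward) * (one_hot - pi)
```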

4.3. MCTS

Figure 1. Small-scale example of states and actions modelled with MCTS (branching factor of 2). Yellow circles correspond to states and red squares represent the actions chosen by the bandit of agent $i$. The tree root represents the current state. Dashed lines indicate non-determinism from agent $i$’s perspective, because the next state is determined by the joint action of all agents and the environment.

Our study focuses on Monte Carlo Tree Search (MCTS), which selects actions through lookahead search. The search tree is built from states as state nodes and actions as action nodes, starting from the current state $s_t$ as the root (see Fig. 1). Edges from action nodes to state nodes are given by the transition probabilities of $\hat{M}$.

Tree search is done iteratively in four steps:

  • Selection: Starting from the root, an action is chosen according to the selection strategy, which is parameterized by the state and the visit count of the state node. Transitions to the next state node are sampled via $\hat{M}$.

  • Expansion: When a new non-terminal state within the horizon $h$ is reached, a corresponding state node is created and added to the last node’s children. The new state node also creates its action node children for all available actions.

  • Rollout: The last state node is evaluated with a rollout policy until the horizon $h$ or a terminal state is reached. In our study, the rollout policy selects actions randomly with uniform probability.

  • Backpropagation: The discounted return is computed recursively from leaf to root and each node’s value estimate is updated accordingly.

In our case, the selection strategy is set to the agent policy $\hat{\pi}_i$ at all nodes, where the maximum number of simulations $n_{\max}$ is given by the computation budget $n_b$. In this paper, we assume that a constant $n_b$ also leads to a constant $n_{\max}$ regardless of actual runtime; thus, for simplicity reasons, we assume $n_{\max} = n_b$. $\hat{P}$ and $\hat{R}$ are provided by $\hat{M}$.

In our multi-agent setting, there are $N$ multi-armed bandits $B_1, \dots, B_N$ at each state node, representing the joint policy, and bandit $B_i$’s arms correspond to the actions available to agent $i$. Each bandit represents a policy $\hat{\pi}_i$ and optimizes the expected discounted return, based on the sampled returns and node visit counts. Note that each agent runs MCTS locally, meaning that its own bandit selects its own actions while the remaining bandits model the other agents or opponents. All $N$ bandits are used to predict the complete joint action, which is needed for $\hat{P}$ and $\hat{R}$. Bandits are not shared between agents, and thus each agent uses its own bandits to estimate the opponents’ policies and returns.
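For illustration, a minimal sketch of a state node that holds one bandit per agent. The class and method names are ours, not from the paper; `make_bandit` stands for any bandit constructor (e.g. the GRAB sketch above).

```python
# Illustrative state node with one bandit per agent, as described above.

class StateNode:
    def __init__(self, state, num_agents, num_actions, make_bandit):
        self.state = state
        self.visits = 0
        self.bandits = [make_bandit(num_actions) for _ in range(num_agents)]
        self.children = {}   # maps sampled next states to child StateNodes

    def select_joint_action(self):
        """Each agent's bandit proposes its own arm; together they form the
        predicted joint action used to query the generative model."""
        return tuple(b.sample() for b in self.bandits)

    def update(self, joint_action, joint_return):
        """Backpropagation: every bandit is updated with its own return."""
        self.visits += 1
        for i, bandit in enumerate(self.bandits):
            bandit.update(joint_action[i], joint_return[i])
```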

UCB1 is often used to implement in state-of-the-art approaches Silver et al. (2016, 2017b); Schrittwieser et al. (2020); Kocsis and Szepesvári (2006); Bubeck and Munos (2010), leading to the popular Upper Confidence bound applied to Trees (UCT) algorithm Kocsis and Szepesvári (2006); Silver et al. (2016, 2017b); Schrittwieser et al. (2020).

4.4. Searching with Opponent-Awareness

procedure SAMPLE()
    Initialize the action distribution $\pi$
    for all arms $a \in A$ do
        Calculate $\pi(a)$ as in Eq. 4
    Sample an arm $a \sim \pi$
    return $a$

procedure UPDATE_EXPECTATIONS($a$, $G$)
    Update the visit count of the selected joint arm $a$
    Update the mean return estimates of this bandit with the joint return $G$
    return the updated estimates

procedure GRADIENTS(estimates, $\pi$)
    Compute the naive gradient of the own expected return w.r.t. the own preferences (Eq. 6)
    Compute the opponent-aware correction term from the opponents’ estimates (Eq. 7)
    return both gradient terms

procedure UPDATE_PREFERENCES(gradient terms)
    Combine the naive gradient and the correction term as in Eq. 8
    Update the preferences accordingly (cf. Eq. 5)
Algorithm 1. OGA bandit algorithm
procedure SOA($s_t$)
    Create root state node for $s_t$
    $n \leftarrow 0$
    while $n < n_{\max}$ do
        SIMULATE_STATE(root, 0)
        $n \leftarrow n + 1$
    Sample the own action $a_{t,i}$ from bandit $B_i$ of the root
    return $a_{t,i}$

procedure SIMULATE_STATE(state node $v$, depth $d$)
    if $d \geq h$ or the state of $v$ is terminal then
        return 0
    if $v$ has not been expanded yet then
        Create action node children for all agents
        Create bandits $B_1, \dots, B_N$ for $v$
        Perform a rollout with the uniform random policy to sample the joint return $G$
        return $G$
    Sample $a_j$ from bandit $B_j$ for all agents $j$ (SAMPLE, Algorithm 1)
    $G \leftarrow$ SIMULATE_ACTION($v$, joint action $a$, $d$)
    UPDATE_BANDITS($v$, $a$, $G$)
    return $G$

procedure SIMULATE_ACTION(state node $v$, joint action $a$, depth $d$)
    Sample the next state $s'$ and the joint reward $r$ from $\hat{M}$, given the state of $v$ and $a$
    if there is no state node for $s'$ in $v$’s children then
        Make a new state node child for $s'$
    Let $v'$ be the state node child of $v$ for $s'$
    $G \leftarrow r + \gamma \cdot$ SIMULATE_STATE($v'$, $d + 1$)
    return $G$

procedure UPDATE_BANDITS($v$, $a$, $G$)
    UPDATE_EXPECTATIONS of bandit $B_j$ with $a$ and $G_j$ for all agents $j$ (Algorithm 1)
    GRADIENTS of bandit $B_j$ for all agents $j$ (Algorithm 1)
    UPDATE_PREFERENCES of bandit $B_j$ for all agents $j$ (Algorithm 1)
Algorithm 2. SOA

In Searching with Opponent-Awareness (SOA), we leverage the concept of LOLA to improve social interactions between planning agents. LOLA itself is designed for learning agents rather than planning agents. However, the bandits in the search tree effectively act as learning instances, and thus we develop an opponent-aware bandit which can be integrated into planning algorithms like MCTS.

The key idea behind LOLA is to incorporate the learning step of another agent into one’s own update, in contrast to a naive learner, which ignores the updates of other learners Foerster et al. (2018a). For a naive learning agent $i$ whose policy is parameterized by $\theta_i$, the update rule is defined by:

$\theta_i \leftarrow \theta_i + \alpha_i \nabla_{\theta_i} V_i(\theta_1, \dots, \theta_N)$ (6)

$V_i(\theta_1, \dots, \theta_N)$ is the state value function approximation of agent $i$ as a function of all agents’ policy parameters, and $\alpha_i$ is the learning rate of agent $i$. LOLA adds the following correction term $\Delta_i$ to the update Foerster et al. (2018a):

$\Delta_i = \sum_{j \neq i} \alpha_j \left( \nabla_{\theta_j} V_i(\theta_1, \dots, \theta_N) \right)^{\top} \nabla_{\theta_i} \nabla_{\theta_j} V_j(\theta_1, \dots, \theta_N)$ (7)

As such, the total LOLA update is given by Foerster et al. (2018a):

$\theta_i \leftarrow \theta_i + \alpha_i \left( \nabla_{\theta_i} V_i(\theta_1, \dots, \theta_N) + \Delta_i \right)$ (8)

Because the UCB1 update is not gradient-based, it cannot be adjusted as in Eq. 8, which is intended for learning agents using policy gradient updates. However, the GRAB bandit maximizes the expected return through stochastic gradient ascent and can thus directly implement the LOLA update rule. We refer to this variant as the Opponent-Gradient Aware (OGA) bandit, which can be integrated into MCTS to select the own actions and to simulate the opponents at each state node. In this way, we realize opponent-aware planning without an explicit or a priori opponent model. The complete formulation of the OGA bandit algorithm is given in Algorithm 1.
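To make the idea concrete, the sketch below is our illustrative take on an opponent-gradient-aware preference update for two agents; it is not the authors' exact OGA bookkeeping. It assumes the bandit maintains empirical mean joint-action returns Q1 and Q2 (one matrix per agent), so that the value is the bilinear form V_i = pi1^T Q_i pi2 over the softmax policies, and the naive gradient (Eq. 6) plus the LOLA correction (Eq. 7) can be computed in closed form.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def softmax_jacobian(p):
    """d pi / d H for a softmax policy pi = softmax(H)."""
    return np.diag(p) - np.outer(p, p)

def oga_update(H1, H2, Q1, Q2, alpha1, alpha2):
    """One LOLA-style (Eq. 8) preference update for agent 1 in a two-agent
    bandit, assuming V_i(H1, H2) = pi1^T Q_i pi2, where Q_i holds agent i's
    estimated mean returns per joint arm (our bookkeeping assumption)."""
    p1, p2 = softmax(H1), softmax(H2)
    J1, J2 = softmax_jacobian(p1), softmax_jacobian(p2)

    grad1_V1 = J1 @ (Q1 @ p2)       # naive term: d V1 / d H1 (Eq. 6)
    grad2_V1 = J2 @ (Q1.T @ p1)     # d V1 / d H2
    cross = J2 @ Q2.T @ J1          # d (d V2 / d H2) / d H1

    lola_term = alpha2 * grad2_V1 @ cross   # opponent-aware correction (Eq. 7)
    return H1 + alpha1 * (grad1_V1 + lola_term)
```

A symmetric call with the roles of the two agents swapped yields agent 2's update; within SOA, the Q estimates would come from the joint returns backpropagated through the search tree.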

SOA uses OGA bandits in the Selection step of MCTS to sample the actions of all agents at each state node and to obtain the next state and joint reward from $\hat{M}$. During Backpropagation, the joint discounted return is used to recursively update all bandits in the path from leaf to root. The complete formulation of SOA is given in Algorithm 2. Note that the joint discounted returns are accumulated from the sampled joint rewards of all visited nodes. The UPDATE_BANDITS procedure can easily be adapted for other bandits like UCB1 and GRAB by leaving out opponent-awareness. We refer to the latter case as GRAB-MCTS.
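A minimal sketch of this backpropagation step, using the illustrative StateNode interface from above. The function name and the (path, rewards, leaf value) interface are our assumptions; the exact bookkeeping in Algorithm 2 may differ.

```python
import numpy as np

def backpropagate(path, rewards, leaf_value, gamma):
    """path: list of (StateNode, joint_action) pairs from root to leaf;
    rewards: per-step joint reward vectors, one per pair in path;
    leaf_value: joint return estimate from the rollout at the leaf."""
    joint_return = np.asarray(leaf_value, dtype=float)
    for (node, joint_action), joint_reward in zip(reversed(path), reversed(rewards)):
        # discount recursively from leaf to root, per agent
        joint_return = np.asarray(joint_reward, dtype=float) + gamma * joint_return
        node.update(joint_action, joint_return)   # updates all N bandits at this node
```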

5. Experiments

We compare SOA, UCT and GRAB-MCTS in different settings and evaluate their performance. The first setting comprises iterated matrix games: Chicken Drive, Prisoners’ Dilemma and Matching Pennies. In these games, each agent chooses one of two actions and immediately receives a reward based on the joint action. Afterwards, the agents are studied in coin game and predator-prey, which are more complex than iterated matrix games and require a sequence of actions before a reward is obtained.

In all of our experiments, the same hyperparameter settings are used, and only one type of bandit is used by the agents in a given episode. Episodes are sequences of time steps of length $T$, where state transitions are determined by the environment and the actions of all agents; the same episode length is used in all domains.

5.1. Iterated matrix games

The iterated matrix games we study in this paper represent basic challenges in multi-agent systems as agent behavior directly affects the returns of both agents. The payoff matrices for each game are shown in Tables 1, 2 and 3.

C D
C (-1, -1) (-3, 0)
D (0, -3) (-2, -2)
Table 1. Payoff matrix of Prisoners’ Dilemma

The Iterated Prisoners’ Dilemma (IPD) gives an agent the choice between cooperating (C) and defecting (D). Under the assumption of infinite iterations, there are many Nash Equilibria as shown by the folk theorem Hamilton (1992), including mutual defection and Tit-for-Tat (TFT) with expected returns of -2 and -1 respectively. The latter strategy begins by first cooperating and then imitating the opponent’s previous move.
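As a concrete reference, a minimal sketch of the Tit-for-Tat strategy mentioned above (function name and action encoding are ours):

```python
def tit_for_tat(opponent_history, cooperate="C"):
    """Cooperate on the first move, then mirror the opponent's previous action."""
    return cooperate if not opponent_history else opponent_history[-1]
```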

H T
H (+1, -1) (-1, +1)
T (-1, +1) (+1, -1)
Table 2. Payoff matrix of Matching Pennies

Iterated Matching Pennies (IMP) is a zero-sum game whose only mixed strategy Nash Equilibrium is for both agents to play head (H) or tail (T) with a probability of 50%, leading to expected returns of 0.

C D
C (0, 0) (-1, +1)
D (+1, -1) (-10, -10)
Table 3. Payoff matrix of Chicken Drive

In Iterated Chicken Drive (ICD), agents have the option to chicken (C) or to drive (D). This game also has many different Nash Equilibria, most notably two pure-strategy equilibria where one agent always chooses C and the other always chooses D, with expected returns of -1 and +1, respectively, and vice versa.

5.2. Coin game

Figure 2. Example states for coin game with a red coin: three possible outcomes of a state where the agents have the opportunity to pick up a red coin, and the resulting rewards when only one or both agents pick it up. Picking up any coin yields a reward of +1, but the red agent is penalized with -2 if a red coin is picked up by the blue agent and vice versa.

Coin game, proposed by Lerer and Peysakhovich as an alternative to IPD Lerer and Peysakhovich (2017), is a sequential game and more complex than iterated matrix games. In coin game, the objective for both agents is to collect coins in a gridworld by moving to the coin’s position. There is one red and one blue agent, and coins are either red or blue.

Picking up any coin yields a reward of +1 point, but if, e.g., the blue agent picks up a red coin, the red agent is penalized with a reward of -2 (and vice versa). A simple greedy policy yields an expected return of 0 Lerer and Peysakhovich (2017); Foerster et al. (2018a).
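For illustration, a minimal sketch of this reward rule (function name, color encoding and the handling of simultaneous pickups are our assumptions):

```python
def coin_game_rewards(collector_colors, coin_color):
    """Illustrative reward rule: every agent that picks up the coin gets +1,
    and the coin's owner is penalized with -2 whenever the other agent takes it.

    collector_colors: colors ('red'/'blue') of the agents on the coin's cell.
    Returns a dict mapping agent color to reward."""
    rewards = {"red": 0.0, "blue": 0.0}
    for color in collector_colors:
        rewards[color] += 1.0
        if color != coin_color:
            rewards[coin_color] -= 2.0   # penalty for the coin's owner
    return rewards
```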

5.3. Predator-prey

Figure 3. Example states for predator-prey with 3 agents (red) and 2 uncaptured preys (blue): three possible state transitions where a prey is captured, showing the rewards when all agents share the captured prey, when an agent is excluded, and when only one agent captures the prey. Capturing a prey yields a reward of +1 if only one agent is involved, and +0.6 for each agent otherwise. Excluded predators suffer hunger, represented by a -1 penalty.

In predator-prey, agents seek to catch preys in a gridworld. Unlike in coin game, preys are mobile targets and are not respawned when caught. An episode ends when all preys have been captured or after $T$ time steps have elapsed.

Agents receive a reward of +1 when capturing a prey alone. When at least one more agent is within a Chebyshev distance of 1 of the prey, the prey is shared among these agents with a reward of +0.6 for each. Agents who are farther away are penalized with a reward of -1 to simulate hunger.
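A minimal sketch of this capture rule (function name and interface are our assumptions):

```python
def capture_rewards(predator_positions, prey_position, num_agents):
    """Illustrative capture rewards: a lone captor gets +1, captors within
    Chebyshev distance 1 of the prey share it for +0.6 each, and all other
    predators receive -1 (hunger)."""
    def chebyshev(p, q):
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    involved = [i for i, pos in enumerate(predator_positions)
                if chebyshev(pos, prey_position) <= 1]
    rewards = [-1.0] * num_agents            # excluded predators suffer hunger
    share = 1.0 if len(involved) == 1 else 0.6
    for i in involved:
        rewards[i] = share
    return rewards
```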

6. Results

6.1. Iterated matrix games

Figure 4. Distribution of relative action count per episode for both agents across 100 episodes. (a) Probability of cooperating in IPD. (b) Probability of avoiding collision in ICD. (c) Probability of selecting head in IMP.

In iterated matrix games, the policy of an agent can be conditioned on past states, though Press and Dyson have proven that remembering one past state is sufficient Press and Dyson (2012). Thus, we set the horizon $h$ accordingly and model the initial state with the root. The policy for subsequent states is derived from the previous state transition, which is taken as the tree path from the root to the bandit that selects the next action. The computation budget $n_b$ is kept constant at 100.

Fig. 4 illustrates the distribution of chosen actions for each selection strategy. In IPD, the use of SOA results in a higher probability of mutual cooperation compared to UCT and GRAB-MCTS, shown by the relative distances of the three point clouds to the upper right corner in Fig. 4(a).

We analyze the agents in the same way in ICD and find that all agents converge to mostly playing Chicken, as shown by the concentration of points in the upper right quadrant of Fig. 4(b). UCT has the most stable selection strategy in this setting and GRAB-MCTS leads to the highest variance in the probability of playing Chicken. The small groups near the upper middle and the middle right indicate that UCT and SOA agents also have a tendency to play Drive when the opponent is likely to play Chicken.

In IMP, SOA and GRAB-MCTS converge to the mixed-strategy Nash equilibrium, where at least one agent’s policy approximates a uniform distribution over actions, evidenced by the cluster in the center of Fig. 4(c). UCT sometimes develops a preference for playing either heads or tails, shown by the concentrations surrounding the center. Nevertheless, if the opponent uses the mixed-strategy Nash equilibrium, the expected return is still 0.

6.2. Coin game

Figure 5. Performance comparison of all three planning algorithms in coin game across 50 episodes per algorithm type per budget. Shaded areas show standard deviation. (a) Probability for each agent to pick up a coin of their own color. (b) Collective undiscounted return per algorithm type. (c) MCTS runtime per agent per time step.

In coin game, we examine the behavior of agents in a sequential game where the exact impact of single actions of a cooperative or defective strategy on the future return is less obvious than in a reward matrix. In this setting, we also compare the computational scalability of all three selection strategies by measuring the time needed to consume their computation budget. The gridworld size and the horizon $h$ are kept fixed.

Fig. 5(a) shows the probability of an agent picking up coins of the same color as their own. Both GRAB-MCTS and UCT lead to a mean probability of about 50%, meaning that these agents pick up coins regardless of color. SOA agents are more likely to pick up coins of their own color, though this probability decreases with an increasing budget $n_b$. However, as shown in Fig. 5(b), this decline is not reflected in the collective return (see Eq. 2), which actually increases. This is due to the facts that a coin is spawned if and only if a coin has been picked up and that we set a time limit in our episodes. Therefore, the maximum number of coins to be picked up is 50 (if a coin is picked up at every time step) and decreases with every time step in which one agent waits for its opponent to reach its coin. This presents a trade-off between looking for one’s own coin and forcing coin respawns.

As shown in Fig. 5(c), MCTS runtimes with all three bandits are comparable in this setting, even when $n_b$ varies.

6.3. Predator-prey

Figure 6. Performance comparison of all three planning algorithms in predator-prey across 100 episodes per algorithm per number of agents. Shaded areas show standard deviation. The rows (a, b, c), (d, e, f) and (g, h, i) compare performance for increasing numbers of agents $N$. (a, d, g) Probability for each agent to be excluded from the capture of a prey. (b, e, h) Collective return per MCTS type. (c, f, i) MCTS runtime per agent per time step.

In predator-prey, we also test the scalability of the bandit (and by extension planning) algorithms w.r.t. the number of agents $N$, while keeping the number of preys constant at 2. We further fix the horizon $h$ and scale the computation budget $n_b$ and the gridworld size with $N$, so that a larger number of agents also implies a larger budget and a larger gridworld.

Fig. 6 displays the accumulated results of agent behavior, collective returns (see Eq. 2) for each episode and MCTS runtime per agent per time step. UCT and GRAB-MCTS agents simply hunt down preys, leading to a high probability for an agent to be penalized with increasing $N$, because the number of preys is limited to 2. This is also reflected in declining collective returns, because more agents are penalized.

In contrast, SOA leads to a consistently low probability of an agent being penalized and to higher collective returns $R_{\text{total}}$. This is due to an agent’s tendency to avoid preys until other agents are in close proximity. Increasing $N$ makes this more difficult, because it requires coordination between more agents. Additionally, preys are moving targets, which means that staying still risks letting preys escape, and increasing $N$ (which also increases the gridworld size) leads to a larger state space. Note that not catching any prey results in a return of 0 for all agents, which yields a higher $R_{\text{total}}$ than when many agents are penalized due to defective captures.

This observation can also be explained by examining the difference between naive learning in Eq. 6 and the LOLA update rule in Eq. 8, namely the correction term $\Delta_i$ defined in Eq. 7. The naive update term does not sum over opponents and therefore remains constant in size as $N$ increases:

$\alpha_i \nabla_{\theta_i} V_i(\theta_1, \dots, \theta_N)$ (9)

However, increasing $N$ causes the correction term $\Delta_i$, which sums over all $N - 1$ opponents, to grow arbitrarily large. Therefore, with each additional agent in a game, the naive learning component of a LOLA learner (and by extension of our OGA bandit) is outweighed by the naive updates of all other agents: the ratio between the naive update in Eq. 9 and the full LOLA update converges to 0 with infinite $N$. This can be adjusted by assigning weights or by normalizing $\Delta_i$, though we leave this for future work.

Fig. 6(c), 6(f) and 6(i) show that SOA has steeper runtime demands than UCT and GRAB-MCTS with increasing $N$. This is a side effect of its update rule, which requires additional computation per agent.

Note that the results in Fig. 6 remain stable across different budgets, as shown by setting different values for $n_b$.

7. Conclusion

In this paper, we presented SOA, an MCTS variant that uses opponent-aware bandits to improve social interaction and performance in multi-agent planning.

For that, we introduced OGA bandits, which extend gradient-based multi-armed bandits with LOLA. This is done by adapting the GRAB bandit algorithm, which is not possible for UCB1 because UCB1 updates are not gradient-based.

These OGA bandits are used in SOA to implement the tree selection policy of MCTS and to simulate the joint policy at each state node. During Backpropagation, the joint discounted return is used for opponent-aware updates of all bandits in each visited node.

To evaluate SOA, we compared its performance in several environments to other MCTS variants with different bandit algorithms. Our experiments show that planning with opponent-awareness leads to more cooperation at the cost of more expensive computation for each agent. The benefits of SOA are especially noticeable with an increasing number of agents, which is useful for scaling multi-agent systems.

Opponent-aware planning may help to improve social interaction between agents in real-world applications where naive decision-making can lead to undesirable consequences, like accidents involving autonomous cars or wasted resources in an industrial factory.

For the future, we intend to study SOA in hybrid approaches with deep reinforcement learning.

References

  • Ahmadi and Stone (2007) Mazda Ahmadi and Peter Stone. 2007. Instance-Based Action Models for Fast Action Planning. 1–16. https://doi.org/10.1007/978-3-540-68847-1_1
  • Albrecht and Stone (2018) Stefano V. Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258 (2018), 66–95. https://doi.org/10.1016/j.artint.2018.01.002
  • Anthony et al. (2017) Thomas Anthony, Zheng Tian, and David Barber. 2017. Thinking Fast and Slow with Deep Learning and Tree Search. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/d8e1344e27a5b08cdfd5d027d9b8d6de-Paper.pdf
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 (05 2002), 235–256. https://doi.org/10.1023/A:1013689704352
  • Bellman (2003) Richard Ernest Bellman. 2003. Dynamic Programming. Dover Publications, Inc., USA.
  • Boutilier (1996) Craig Boutilier. 1996. Planning, Learning and Coordination in Multiagent Decision Processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge (The Netherlands) (TARK ’96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 195–210.
  • Bowling and Veloso (2001) Michael Bowling and Manuela Veloso. 2001. Rational and Convergent Learning in Stochastic Games. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2 (Seattle, WA, USA) (IJCAI’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1021–1026.
  • Bowling and Veloso (2003) Michael Bowling and Manuela Veloso. 2003. Simultaneous Adversarial Multi-Robot Learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (Acapulco, Mexico) (IJCAI’03). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 699–704.
  • Bubeck and Munos (2010) Sébastien Bubeck and Remi Munos. 2010. Open Loop Optimistic Planning. COLT 2010 - The 23rd Conference on Learning Theory, 477–489.
  • Carmel and Markovitch (1995) David Carmel and Shaul Markovitch. 1995. Opponent Modeling in Multi-Agent Systems. In Proceedings of the Workshop on Adaption and Learning in Multi-Agent Systems (IJCAI ’95). Springer-Verlag, Berlin, Heidelberg, 40–52.
  • Foerster et al. (2016) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc., 2137–2145. https://proceedings.neurips.cc/paper/2016/file/c7635bfd99248a2cdef8249ef7bfbef4-Paper.pdf
  • Foerster et al. (2018a) Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2018a. Learning with Opponent-Learning Awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (Stockholm, Sweden) (AAMAS ’18). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 122–130.
  • Foerster et al. (2018b) Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. 2018b. DiCE: The Infinitely Differentiable Monte Carlo Estimator. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsmässan, Stockholm Sweden, 1529–1538. http://proceedings.mlr.press/v80/foerster18a.html
  • Hamilton (1992) Jonathan Hamilton. 1992. Game theory: Analysis of conflict, by Myerson, R. B., Cambridge: Harvard University Press. Managerial and Decision Economics 13, 4 (1992), 369–369. https://doi.org/10.1002/mde.4090130412 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/mde.4090130412
  • Hennes et al. (2020) Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Rémi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duèñez Guzmán, and Karl Tuyls. 2020. Neural Replicator Dynamics: Multiagent Learning via Hedging Policy Gradients. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 492–501.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit Based Monte-Carlo Planning. Machine Learning: ECML 2006, 282–293. https://doi.org/10.1007/11871842_29
  • Lerer and Peysakhovich (2017) A. Lerer and Alexander Peysakhovich. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. ArXiv abs/1707.01068 (2017).
  • Letcher et al. (2019) Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. 2019. Stable Opponent Shaping in Differentiable Games. In International Conference on Learning Representations. https://openreview.net/forum?id=SyGjjsC5tQ
  • Littman (1994) Michael L. Littman. 1994. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning (New Brunswick, NJ, USA) (ICML’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 157–163.
  • Littman (2001) Michael L. Littman. 2001. Friend-or-Foe Q-Learning in General-Sum Games. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 322–328.
  • Lowe et al. (2019) Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the Pitfalls of Measuring Emergent Communication. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS ’19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 693–701.
  • Lowe et al. (2017) Ryan Lowe, YI WU, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/68a9750337a418a86fe06c1991a1d64c-Paper.pdf
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. (2013). http://arxiv.org/abs/1312.5602 cite arxiv:1312.5602Comment: NIPS Deep Learning Workshop 2013.
  • Nash (1950) J. Nash. 1950. Equilibrium Points in N-Person Games. Proceedings of the National Academy of Sciences of the United States of America 36 1 (1950), 48–9.
  • Press and Dyson (2012) William H. Press and Freeman J. Dyson. 2012. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences 109, 26 (2012), 10409–10413. https://doi.org/10.1073/pnas.1206569109 arXiv:https://www.pnas.org/content/109/26/10409.full.pdf
  • Riley and Veloso (2002) Patrick Riley and Manuela Veloso. 2002. Planning for Distributed Execution through Use of Probabilistic Opponent Models. In Proceedings of the Sixth International Conference on Artificial Intelligence Planning Systems (Toulouse, France) (AIPS’02). AAAI Press, 72–81.
  • Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, Edward Lockhart, Demis Hassabis, T. Graepel, T. Lillicrap, and D. Silver. 2020. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature 588 7839 (2020), 604–609.
  • Shoham and Leyton-Brown (2008) Yoav Shoham and Kevin Leyton-Brown. 2008. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, USA.
  • Sigaud and Buffet (2010) Olivier Sigaud and Olivier Buffet. 2010. Markov Decision Processes in Artificial Intelligence. Wiley-IEEE Press.
  • Silver et al. (2016) David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–503. http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
  • Silver et al. (2017a) D. Silver, T. Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, A. Guez, Marc Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and Demis Hassabis. 2017a. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. ArXiv abs/1712.01815 (2017).
  • Silver et al. (2017b) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George Driessche, Thore Graepel, and Demis Hassabis. 2017b. Mastering the game of Go without human knowledge. Nature 550 (10 2017), 354–359. https://doi.org/10.1038/nature24270
  • Silver and Veness (2010) David Silver and Joel Veness. 2010. Monte-Carlo Planning in Large POMDPs. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2 (Vancouver, British Columbia, Canada) (NIPS’10). Curran Associates Inc., Red Hook, NY, USA, 2164–2172.
  • Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.
  • Vinyals et al. (2019) Oriol Vinyals, I. Babuschkin, W. Czarnecki, Michaël Mathieu, Andrew Dudzik, J. Chung, D. Choi, R. Powell, Timo Ewalds, P. Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, A. S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, D. Budden, Yury Sulsky, James Molloy, T. L. Paine, Caglar Gulcehre, Ziyu Wang, T. Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, Demis Hassabis, Chris Apps, and D. Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature (2019), 1–5.
  • Vrieze and Tijs (1982) O. Vrieze and S. Tijs. 1982. Fictitious play applied to sequences of games and discounted stochastic games. International Journal of Game Theory 11 (1982), 71–85.
  • Wu et al. (2009) Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. 2009. Multi-Agent Online Planning with Communication. In Proceedings of the Nineteenth International Conference on International Conference on Automated Planning and Scheduling (Thessaloniki, Greece) (ICAPS’09). AAAI Press, 321–328.
  • Wu et al. (2011) Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. 2011. Online planning for multi-agent systems with bounded communication. Artificial Intelligence 175, 2 (2011), 487–511. https://doi.org/10.1016/j.artint.2010.09.008
  • Xi et al. (2015) Lei Xi, Tao Yu, Bo Yang, and Xiaoshun Zhang. 2015. A novel multi-agent decentralized win or learn fast policy hill-climbing with eligibility trace algorithm for smart generation control of interconnected complex power grids. Energy Conversion and Management 103 (2015), 82–93. https://doi.org/10.1016/j.enconman.2015.06.030
  • Zhang et al. (2020) Yijie Zhang, Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020. Opponent Modelling for Reinforcement Learning in Multi-Objective Normal Form Games. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2080–2082.