The Partially Observable Asynchronous Multi-Agent Cooperation Challenge

12/07/2021
by   Meng Yao, et al.
Tsinghua University

Multi-agent reinforcement learning (MARL) has received increasing attention for its applications in various domains. Researchers have paid much attention to its partially observable and cooperative settings in order to meet real-world requirements. To test the performance of different algorithms, standardized environments have been designed, such as the StarCraft Multi-Agent Challenge, one of the most successful MARL benchmarks. To the best of our knowledge, most current environments are synchronous, where agents execute actions at the same pace. However, heterogeneous agents usually have their own action spaces, and there is no guarantee that actions from different agents share the same execution cycle, which leads to asynchronous multi-agent cooperation. Inspired by the Wargame, a confrontation game between two armies abstracted from real-world environments, we propose the first Partially Observable Asynchronous multi-agent Cooperation challenge (POAC) for the MARL community. Specifically, POAC supports two teams of heterogeneous agents that fight against each other, where an agent selects actions based on its own observations and cooperates asynchronously with its allies. Moreover, POAC is a lightweight, flexible and easy-to-use environment that can be configured by users to meet different experimental requirements, such as self-play mode, human-AI mode and so on. Along with our benchmark, we offer six game scenarios of varying difficulty with built-in rule-based AIs as opponents. Finally, since most MARL algorithms are designed for synchronous agents, we revise several representative ones to meet the asynchronous setting, and their relatively poor experimental results validate the challenge posed by POAC. Source code is released at <http://turingai.ia.ac.cn/data_center/show>.


1 Introduction

Many real-world problems can be modeled as the cooperation of multiple agents, such as self-driving cars [2], multi-robot control [14], the networking of communication packets [12], and trading on financial markets [13]. Recently, a significant part of the research on multi-agent cooperation has focused on reinforcement learning techniques, leading to the rapid development of multi-agent reinforcement learning (MARL). Plenty of works have been developed, such as [4] [1] [3] [7], and promising performance has been obtained. One of the key factors promoting MARL is game environments, where new MARL algorithms can be quickly tested in a safe and reproducible manner. As one of the most successful environments, the StarCraft Multi-Agent Challenge (SMAC) [18] concentrates on providing benchmarks for partially observable, cooperative multi-agent learning problems, and a number of influential works have been verified on it. Unlike SMAC, where cooperation is synchronous, we propose, to the best of our knowledge, the first partially observable asynchronous multi-agent cooperation challenge (POAC) for the MARL community, which we argue is more general than synchronous cooperation in the real world.

Fig. 1: A game instance of Wargame.

POAC is inspired by Wargame, a confrontation game between two armies, where two teams of heterogeneous agents fight against each other in a partially observable environment and, most importantly, cooperate asynchronously to beat the enemy. Generally, Wargame is a very complex game with plenty of rules, such as moving rules, shooting rules and adjudication rules. A screenshot is shown in Figure 1 (from http://turingai.ia.ac.cn/). Directly adopting a complete Wargame as a benchmark for testing MARL algorithms is impractical, because much effort would be wasted learning skills unrelated to asynchronous cooperation, such as exploring key points on a large map or learning to shoot with the proper weapons under random adjudication. We therefore abstract POAC from Wargame by keeping its main features and dropping specific domain knowledge, making it a universal game environment. Specifically, POAC is an asynchronous, heterogeneous, real-time, imperfect-information game with stochasticity (state transitions are stochastic).

We tried our best to make POAC a lightweight, flexible and easy-to-use benchmark:

  • POAC can be easily configured by users to meet different experimental requirements, such as switching between self-play, human-control and learning-against-built-in-bot modes.

  • POAC offers six game scenarios with various built-in rule-based AIs as opponents, on which researchers can train their asynchronous cooperation strategies.

  • We revise several representative MARL algorithms to meet the asynchronous setting; the experimental results and implementation code provide baselines for the community.

2 Related Work

Several multi-agent game benchmarks have promoted the development of MARL algorithms. The environments released with [11] are a set of simple grid-world-like environments for multi-agent RL with an implementation of MADDPG, covering a mix of competitive and cooperative tasks and focusing on shared communication and low-level continuous control. [10] is a mixed cooperative-competitive Markov environment aimed at testing social dilemmas. [22] focuses on testing emergent behaviour and presents a framework for creating gridworlds that scale from hundreds to millions of agents with relatively simple game rules. [17] proposes a multi-agent environment based on the game Bomberman, consisting of a series of cooperative and adversarial tasks that are more challenging but still have simple grid-world-like action spaces. The SMAC [18], a representative challenging multi-agent game, has been used as a test-bed for various MARL algorithms, as has the recently proposed Google Research Football [9] environment. However, we argue that some of these environments do not provide qualitatively challenging benchmarks for testing MARL algorithms, and almost all of them ignore asynchronous cooperation, which is common in real-world tasks. Although the recently developed Fever Basketball [8] benchmark is an asynchronous cooperative sports game environment for the MARL community, it shares perfect information among all agents, which is usually impractical.

3 POAC

3.1 POAC environment

Generally, in a wargame, operators from the red and blue teams fight against each other on a specific map while obeying pre-defined fighting rules. Similarly, POAC consists of three basic elements: operators, a map, and rules. We use three different types of operators: chariot, tank and infantry, each with unique attributes, which we elaborate on in the following parts. Like Wargame, POAC uses a hexagonal map for more freedom of movement, as shown in Figure 2.

Fig. 2: An example map; the solid grids are special grids with the hidden-terrain property.

As for the rules, several distinct rules make POAC different from previous benchmarks.

  • Some hexagons of the map have a hidden-terrain property: operators located in them are difficult for enemies to observe because their observed distance is shortened, as shown in Figure 3. Thanks to this rule, the partially observable nature of the game is preserved even when operators fight at close range. (A minimal sketch of this visibility check is given after Figure 4 below.)

  • Move, a very important action in POAC, requires a different number of time steps to reach an adjacent hexagon for different operators, which is a key factor making POAC an asynchronous game. This is shown in Figure 4.

  • When an agent attacks another agent, the damage dealt depends on a random number, making POAC a stochastic environment, i.e., state transitions are uncertain even when the state is fully observable.

  • A special cooperation rule, called guide shoot, requires two operators to cooperate in a particular way so as to cause damage that neither could achieve alone, which brings a new challenge for MARL agents to achieve efficient cooperation.

Fig. 3: The solid grids are special grids with the hidden-terrain property. For example, a tank’s observed distance is 10, but when the tank is on special terrain its observed distance drops to 5. The red tank (small red square) is 9 hexagons away from the blue tank (small green square), so the blue tank can see the red tank, but the red tank cannot see the blue tank, because the blue tank is on special terrain.
Fig. 4: Different moving speeds result in asynchronous cooperation.
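To make the hidden-terrain rule concrete, the following minimal Python sketch implements the visibility check; the function and attribute names (hex_distance, observed_distance, position) and the exact shortening rule are assumptions for illustration, following the example in Figure 3 rather than the released implementation.

```python
# Hypothetical sketch of the hidden-terrain visibility rule; attribute and
# function names are illustrative, not taken from the released code.

def hex_distance(a, b):
    """Distance between two hexes in axial coordinates (q, r)."""
    (aq, ar), (bq, br) = a, b
    return (abs(aq - bq) + abs(ar - br) + abs((aq + ar) - (bq + br))) // 2

def can_see(observer, target, special_hexes):
    """A target is visible only within its own observed distance, which is
    shortened on special terrain (e.g. 10 -> 5 for a tank, as in Figure 3)."""
    observed = target["observed_distance"]
    if target["position"] in special_hexes:      # hidden terrain
        observed //= 2                           # assumed shortening rule
    return hex_distance(observer["position"], target["position"]) <= observed

# Reproducing the Figure 3 example: the blue tank sits on special terrain,
# 9 hexes away from the red tank.
red_tank  = {"position": (0, 0), "observed_distance": 10}
blue_tank = {"position": (9, 0), "observed_distance": 10}
special = {(9, 0)}
print(can_see(blue_tank, red_tank, special))   # True: 9 <= 10
print(can_see(red_tank, blue_tank, special))   # False: 9 > 5
```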

3.2 POAC for MARL

POAC is a partially observable asynchronous multi-agent cooperation environment, designed for cooperative multi-agent tasks described as Dec-POMDPs [15]. Formally, a Dec-POMDP is given by a tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$, where $s \in S$ is the true state of the environment. At each time step, each agent $a \in \{1, \dots, n\}$ chooses an action $u^a \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$, which causes a transition of the environment according to the state transition function $P(s' \mid s, \mathbf{u})$. As in a partially observable stochastic game, all agents in a Dec-POMDP share the same team reward $r(s, \mathbf{u})$, and $\gamma \in [0,1)$ is the discount factor. Dec-POMDPs consider partially observable scenarios in which an observation function $O(s, a)$ determines the observation $z^a \in Z$ of each agent. Each agent has an action-observation history $\tau^a \in T \equiv (Z \times U)^*$ on which it conditions a policy $\pi^a(u^a \mid \tau^a)$. The joint policy $\pi$ induces a joint action-value function $Q^{\pi}(s_t, \mathbf{u}_t) = \mathbb{E}[R_t \mid s_t, \mathbf{u}_t]$, where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted return.

In most multi-agent benchmarks, synchrony between actions is an implicit assumption: a system is called synchronous if there is a global clock, agents move in lockstep, and a step in the system corresponds to a tick of the clock. In an asynchronous system, by contrast, there is no global clock, and the agents can run at arbitrary rates relative to each other [6]. In POAC, heterogeneous agents have their own action spaces and execution cycles, which differ from agent to agent. Because of this, the Dec-POMDP model must be modified for asynchronous tasks. Specifically, at each time step $t$ (supposing a global clock exists), only the agents that are ready to act choose actions, forming a joint action $\mathbf{u}_t$ that contains the actions of those agents; this joint action causes a transition of the environment according to $P(s_{t+1} \mid s_t, \mathbf{u}_t)$.
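Using the Dec-POMDP notation above, this modification can be written compactly as follows; the "ready" set $\mathcal{A}_t$ is our own notation and is not defined in the original paper.

```latex
% Asynchronous step at global time t. The ready set A_t is our notation.
\mathcal{A}_t = \{\, a \in \{1,\dots,n\} \mid \text{agent } a \text{ has finished its previous action at time } t \,\}, \qquad
\mathbf{u}_t = \big(u^{a}_t\big)_{a \in \mathcal{A}_t}, \qquad
s_{t+1} \sim P\big(\cdot \mid s_t, \mathbf{u}_t\big).
```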

Next we describe the key factors for designing MARL algorithms on POAC, based on which we implement representative centralised-training-with-decentralised-execution MARL algorithms, i.e., QMIX [16], QTRAN [19], COMA [5], VDN [20] and IQL [21], as baselines to test our environment.

States and Observations. At each time step, agents receive local observations within their field of view. The sight range makes the environment partially observable from the standpoint of each agent: agents can only observe other agents if both are alive and located within the sight range. In POAC, we use the attribute information of the operators as states and observations; a general summary is given in Table I. Different operators have very distinct characteristics, as shown in Tables II, III and IV. From the tables we can see that the tank has the most health, no shoot preparation time and a higher probability of dealing damage, so it is suitable for charging. The chariot deals the most damage to tanks and is also the main operator for guide shoot, but because of its shoot preparation time it cannot shoot while moving, so it should be well protected. Finally, the infantry has the slowest movement speed but can take the most damage, and because its observed distance is so short that it cannot easily be detected by the enemy, it can hide to find enemies and cooperate with the chariot to fight using guide shoot.

Attributes Description
color 0 (red) or 1 (blue)
id decimal number
type 0 for tank, 1 for chariot, and 2 for infantry
blood current blood
position current position
speed move speed
observed distance max distance for an operator to be observed
attacked distance max distance for an operator to be attacked
attack damage against tank or chariot damage dealt when attacking tank or chariot operators
probability of causing damage against tank or chariot probability of causing damage when attacking tank or chariot operators
attack damage against infantry damage dealt when attacking infantry
probability of causing damage against infantry probability of causing damage when attacking infantry
shoot cool-down cool-down time after an attack
shoot preparation time time required to attack enemy operators
move time time from the beginning of move to the present
stop time time from the beginning of stop to the present
can guide shoot true or false
TABLE I: Operator attributes.
Tank
blood 10
speed 1
observed distance 10
attacked distance 7
attack damage against tank or chariot 1.2
probability of causing damage against tank or chariot 0.8
attack damage against infantry 0.6
probability of causing damage against infantry 0.6
shoot cool-down 1
shoot preparation time 0
can guide shoot False
TABLE II: Operator details for the tank.
Chariot
blood 8
speed 1
observed distance 10
attacked distance 7
attack damage against tank or chariot 1.5
probability of causing damage against tank or chariot 0.7
attack damage against infantry 0.8
probability of causing damage against infantry 0.6
shoot cool-down 1
shoot preparation time 2
can guide shoot True
TABLE III: Operator details for the chariot.
Infantry
blood 7
speed 0.2
observed distance 5
attacked distance 3
attack damage against tank or chariot 0.8
probability of causing damage against tank or chariot 0.7
attack damage against infantry 0.8
probability of causing damage against infantry 0.6
shoot cool-down 1
shoot preparation time 2
can guide shoot True
TABLE IV: Operator details for the infantry.

The feature vector observed by each agent contains important attributes of both allied and enemy units within the sight range, such as faction (color), id, type, current position and blood, as summarized in Table V. In addition, time information and the can_see and can_attack features are included: can_see gives the indices of enemies within the agent's sight, and can_attack gives the indices of enemies the agent can attack. Finally, the global state, which is only available to agents during centralised training, contains perfect information about all operators on the map. All features, both in the state and in the observations of individual agents, are normalised by their maximum values (a small illustrative sketch of this normalisation follows Table V). It is also worth mentioning that we expose a map feature through an open interface for more comprehensive information.

allied_feature type placeholder local/global state
color float 1 True/True
id float 1 True/True
type float 1 True/True
cur_hex float 1 True/True
blood float 1 True/True
move_time float 1 True/True
stop_time float 1 True/True
shoot_cooling_time float 1 True/True
can_see float 3 True/True
can_attack float 3 True/True
enemy_feature type placeholder local/global state
color float 1 True/True
id float 1 True/True
type float 1 True/True
cur_hex float 1 True/True
blood float 1 True/True
move_time float 1 False/True
stop_time float 1 False/True
shoot_cooling_time float 1 False/True
can_see float 3 False/True
can_attack float 3 False/True
clock_feature type placeholder local/global state
time_step float 1 True/True
TABLE V: Feature vectors for each agent.
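For concreteness, a minimal hypothetical sketch of assembling such a per-agent feature vector is given below; the field names follow Table V, but the maximum values used for normalisation and the helper function itself are illustrative assumptions, and the masking of enemy-only-global fields (Table V) is omitted for brevity.

```python
import numpy as np

# Illustrative maximum values used for normalisation; the released environment
# defines its own constants.
MAX_VALUES = {"color": 1, "id": 5, "type": 2, "cur_hex": 2000, "blood": 10,
              "move_time": 600, "stop_time": 600, "shoot_cooling_time": 10,
              "time_step": 600}
SCALAR_KEYS = ["color", "id", "type", "cur_hex", "blood",
               "move_time", "stop_time", "shoot_cooling_time"]

def local_observation(allies, enemies, time_step, n_enemies=3):
    """Concatenate allied features, enemy features and the clock feature,
    each scalar normalised by its maximum value (cf. Table V)."""
    parts = []
    for unit in allies + enemies:
        parts += [unit[k] / MAX_VALUES[k] for k in SCALAR_KEYS]
        parts += unit.get("can_see", [0] * n_enemies)     # enemy indices in sight
        parts += unit.get("can_attack", [0] * n_enemies)  # attackable enemy indices
    parts.append(time_step / MAX_VALUES["time_step"])
    return np.asarray(parts, dtype=np.float32)
```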

Asynchronous Action Space. The discrete set of actions agents are allowed to take consists of movement in six directions (one hexagon at a time), shoot[enemy_id], guide shoot[enemy_id] and stop. Moving to the next hexagon costs a specific number of time steps: 1 time step for the tank and the chariot, and 5 time steps for the infantry (shown in Figure 4). This difference causes the asynchrony. After arriving at their destination, the chariot and the infantry must remain stopped for a specific number of time steps before they can take the shoot or guide shoot action, and these actions are only available when there are enemies in sight and within shooting range and the shoot cool-down has finished. The tank, in contrast, can shoot while moving (it also has a shoot cool-down). The tank cannot perform guide shoot, but the ability to shoot while moving makes it very powerful; conversely, although the infantry and the chariot must stop before shooting, they can cooperate with each other to complete a guide shoot.
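One way such availability rules could be expressed is as a per-operator action mask, sketched below; the action encoding, field names and exact conditions are assumptions for illustration rather than the released implementation. Note that the tank's shoot preparation time of 0 makes it always "prepared", which captures its ability to shoot while moving.

```python
# Hypothetical action-mask sketch of the availability rules described above;
# field names and exact conditions are assumptions, not the released code.
NO_OP, STOP, SHOOT, GUIDE_SHOOT = 0, 1, 2, 3
MOVE = [4, 5, 6, 7, 8, 9]                       # the six hexagonal directions

def available_actions(op, enemies_attackable):
    acts = MOVE + [STOP]
    cooled = op["shoot_cooling_time"] >= op["shoot_cool_down"]
    # The tank's shoot preparation time is 0, so it is always "prepared";
    # chariot and infantry must have been stopped long enough.
    prepared = op["stop_time"] >= op["shoot_preparation_time"]
    if enemies_attackable and cooled and prepared:
        acts.append(SHOOT)
        if op["can_guide_shoot"]:               # only chariot and infantry
            acts.append(GUIDE_SHOOT)
    return acts
```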

Rewards. The overall goal is to maximise the win rate of each battle. The default reward is computed as the difference between the loss of allied health and the loss of the enemy's health at each step.
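A minimal sketch of this default reward is shown below; the variable names are ours, and the sign convention (damaging enemies is rewarded, losing allied health is penalised) is an assumption rather than taken from the released code.

```python
def default_reward(prev_ally_blood, cur_ally_blood,
                   prev_enemy_blood, cur_enemy_blood):
    """Default reward: enemy health lost minus allied health lost over the step.
    The sign convention is our assumption (damaging enemies is rewarded)."""
    ally_loss = prev_ally_blood - cur_ally_blood
    enemy_loss = prev_enemy_blood - cur_enemy_blood
    return enemy_loss - ally_loss
```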

3.3 Six scenarios

Along with our published environment, we release six scenarios as basic benchmarks for developing asynchronous MARL algorithms. Operator details of different scenarios are shown in Table VI, and screenshots are displayed in Figure 9.

scenarios map size operators special terrain guide shoot
scenario 0 (13,23) tank, chariot and infantry in red and blue teams False False
scenario 1 (13,23) tank, chariot and infantry in red and blue teams True False
scenario 2 (17,27) tank, chariot and infantry in red and blue teams True True
scenario 3 (27,37) tank, chariot and infantry in red and blue teams True True
scenario 4 (27,37) tank, chariot and infantry in red and blue teams True True
scenario 5 (67,77) tank, chariot and infantry in red and blue teams True True
TABLE VI: Settings of the different scenarios.

Built-in Opponent Bots. POAC offers a built-in bot mode, in which RL agents can fight against built-in rule-based bots with different strategies:

  • KAI0: a rush AI that attacks with the highest priority and goes directly to the center of the battlefield.

  • KAI1: an AI that keeps hiding to ambush near special terrain, a commonly used strategy in Wargame.

  • KAI2: an AI that uses guide shoot, i.e., it lets operators squat on special terrain for a long time and, exploiting the sight advantage, attack enemies with guide shoot; an example is shown in Figure 5.

Fig. 5: KAI2 red fights against KAI2 blue in scenario 3.

Control modes. In addition to the built-in bot mode above, POAC provides two other control modes. The first is a self-play mode for reinforcement learning. The second is a human mode, in which human players can use POAC to fight against the built-in bots, RL agents or other human players.

Environment modification. POAC also provides functions to modify the map and game settings by editing a JSON file in the code. For example, editing ”initHex” changes the position of an operator at the beginning of the game, as shown in Figure 6 (a hedged sketch follows this paragraph). In particular, by modifying other attribute settings such as move speed, POAC can degenerate into a synchronous cooperation game. Moreover, a map can be changed by editing the HDF5 file, as shown in Figure 7. Details can be found in our released code.
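As an illustration, a hypothetical Python snippet along the following lines could adjust a scenario; the file name and the surrounding JSON structure are assumptions, and only the "initHex" key and the speed attribute come from the environment description.

```python
import json

# Assumed file name and JSON layout; only the "initHex" key and the "speed"
# attribute come from the POAC description, the rest is illustrative.
with open("scenario_0.json") as f:
    scenario = json.load(f)

# Move the red tank (color 0, cf. Table I) to a new starting hexagon.
for operator in scenario["operators"]:
    if operator["color"] == 0 and operator["type"] == 0:   # 0 = tank
        operator["initHex"] = 1203                          # hypothetical hex id

# Giving every operator the same move speed would degenerate POAC
# into a synchronous cooperation game.
for operator in scenario["operators"]:
    operator["speed"] = 1

with open("scenario_0.json", "w") as f:
    json.dump(scenario, f, indent=2)
```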

Fig. 6: How changes to the JSON file affect a scenario.
Fig. 7: How changes to the HDF5 file affect the map.
Fig. 8: Feature vectors in all the experiments.
Fig. 9: We show all built-in bots fighting against themselves in all scenarios. The first, second and third rows are KAI0, KAI1 and KAI2 in all scenarios, respectively. We can see that KAI0 rushes to the center of the battlefield, and if there is special terrain on the map, KAI0's operators may pass through it without staying long. KAI1's operators pay more attention to special terrain, and its tank and chariot always cooperate near the special terrain. KAI2's chariot and infantry go near the special terrain and hide for guide shoot, while its tank rushes to the front to attract firepower.
Fig. 10: Results of MARL methods on POAC tasks. Red lines are QMIX, blue lines are IQL, yellow lines are VDN, green lines are QTRAN, and black lines are COMA. "scenario0-KAI0-red" denotes the MARL agent playing as the red team and training against KAI0 as the opponent in scenario 0.

4 Experiments

4.1 Settings

In this section, we present experiments on POAC tasks in the six scenarios. The purpose of these experiments is to provide baselines for future study and to highlight how changes of scenario challenge existing popular MARL methods. All experiments use the default reward throughout all scenarios, as introduced in the previous section. The agent local observations used in the experiments include the features in Table V, organised into a vector as shown in Figure 8. For evaluation, training is paused after every 10000 time steps, during which 32 test episodes are run with agents performing greedy action selection in a decentralised fashion. The percentage of winning episodes is reported, where an episode is counted as a win if the agents' remaining blood is larger than the enemies' within 600 time steps. The architecture of each agent network is a DRQN with one recurrent layer; specifically, the network consists of a GRU with a 64-dimensional hidden state, with a fully connected layer before and after it. Our baselines for all tasks are based on PyMARL with similar experiment settings (https://github.com/oxwhirl/pymarl).
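For reference, a minimal PyTorch sketch of such a DRQN agent (a fully connected layer, a GRUCell with a 64-dimensional hidden state, and an output layer) might look as follows; the observation and action dimensions are placeholders, and this is not the exact PyMARL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRQNAgent(nn.Module):
    """Per-agent network: fully connected layer -> GRUCell(64) -> fully connected layer."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def init_hidden(self, batch_size=1):
        # Zero initial hidden state for the GRU.
        return torch.zeros(batch_size, self.hidden_dim)

    def forward(self, obs, hidden):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, hidden)
        return self.fc2(h), h  # Q-values per action, updated hidden state

# Example usage with placeholder dimensions (120-dim observation, 10 actions).
agent = DRQNAgent(obs_dim=120, n_actions=10)
q_values, h = agent(torch.zeros(1, 120), agent.init_hidden())
```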

4.2 Results and analysis

Built-in bots vs. built-in bots. We design three built-in bots using distinct types of strategies. All built-in bots can be applied in all scenarios and can serve as either red or blue. Table VII shows the results of built-in bot combats over 32 runs. In scenario 0 there is no special terrain, and when KAI0-red fights KAI0-blue the win rate is close to 0.5. In all scenarios, KAI1 performs relatively well, whereas KAI2 performs well only in a few scenarios. We attribute this to the limited hand-crafted use of guide shoot, a limitation that the MARL algorithms alleviate in the later experiments. The strategies of the three built-in bots are briefly displayed in Figure 9.

scenario 0 KAI0-blue
KAI0-red 0.438/284.3
scenario 1 KAI0-blue KAI1-blue
KAI0-red 0.438/359.7 0.063/510.3
KAI1-red 0.625/510.3 0.187/588.1
scenario 2 KAI0-blue KAI1-blue KAI2-blue
KAI0-red 0.375/438.0 0.282/564.8 0.031/591.78
KAI1-red 0.531/501.1 0.562/600.0 0.593/600.0
KAI2-red 0.625/586.2 0.344/600.0 0.094/600.0
scenario 3 KAI0-blue KAI1-blue KAI2-blue
KAI0-red 0.438/433.3 0.031/579.2 0.000/600.0
KAI1-red 0.594/548.6 0.156/600.0 0.781/600.0
KAI2-red 0.750/600.0 0.031/600.0 0.218/600.0
scenario 4 KAI0-blue KAI1-blue KAI2-blue
KAI0-red 0.406/509.9 0.282/538.3 0.063/600.0
KAI1-red 0.594/516.3 0.375/582.8 0.688/600.0
KAI2-red 0.813/534.9 0.125/600.0 0.250/600.0
scenario 5 KAI0-blue KAI1-blue KAI2-blue
KAI0-red 0.344/531.3 0.125/570.6 0.031/597.9
KAI1-red 0.781/579.2 0.063/600.0 0.875/600.0
KAI2-red 0.469/585.7 0.188/600.0 0.375/600.0
TABLE VII: Results of built-in bots fighting each other in all scenarios; each entry x/y denotes win rate / average number of time steps when the game finished.

Multi-agent reinforcement learning bots. For the MARL experiments, i.e., QMIX [16], QTRAN [19], COMA [5], VDN [20] and IQL [21], we use the original algorithms and adapt them to asynchronous cooperation by assigning an empty action whenever no executable action can be obtained for an agent. We report the results of the best seeds, shown in Figure 10. Most agents perform well when fighting against KAI0, but in harder scenarios such as scenario 5 none of the agents win more than 10% of the games. Against KAI1 and KAI2, almost none of the agents perform well. Based on these results, we believe POAC with its built-in bots can serve as a challenging benchmark for MARL algorithms under asynchronous cooperation settings.
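Concretely, this adaptation can be as simple as restricting a busy agent to a single no-op action at each global tick, so that an unmodified synchronous learner still receives a full joint action. A hedged sketch is shown below; the method names (agent_is_busy, get_avail_agent_actions) are illustrative rather than the actual POAC or PyMARL API.

```python
# Hypothetical wrapper sketch: an agent still executing its previous action is
# only offered the empty (no-op) action, so unmodified synchronous MARL
# learners receive a joint action at every global tick.
NO_OP = 0

def padded_avail_actions(env, agent_id):
    """Availability mask handed to a synchronous learner each time step.
    agent_is_busy / get_avail_agent_actions are illustrative method names."""
    if env.agent_is_busy(agent_id):               # previous action not finished yet
        mask = [0] * env.n_actions
        mask[NO_OP] = 1                           # only the empty action is legal
        return mask
    return env.get_avail_agent_actions(agent_id)  # normal availability otherwise
```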

From the training and testing replays, we find that almost all algorithms can learn the guide-shoot strategy on smaller maps, which may explain their improved win rates. In scenarios 4 and 5, all algorithms perform poorly; we review some replays, shown in Figures 11 and 12, illustrating failure and success cases, respectively. We believe researchers can analyse and improve their MARL algorithms through the provided replay function.

Fig. 11: Agents move back and forth at the position of the arrow, showing that no effective strategy has been learned.
Fig. 12: The red agents meet the enemy first, then fight against the enemy, and finally retreat to the origin, focusing fire on a single enemy. Overall, the red agents successfully learn to use vision.

5 Conclusion

In this paper, we have presented POAC as a benchmark for the partially observable asynchronous multi-agent cooperation challenge. Inspired by Wargame, POAC provides six diverse combat scenarios with different types of built-in bots, which strongly challenge current MARL methods. In the near future, we aim to extend POAC with more challenging scenarios, such as larger numbers of operators and rules. We look forward to contributions from the community to POAC and hope it will become a standard benchmark for measuring progress in cooperative MARL.

References

  • [1] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch (2019) Emergent tool use from multi-agent interaction. arXiv preprint arXiv:1909.07528. Cited by: §1.
  • [2] M. Bansal, A. Krizhevsky, and A. Ogale (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §1.
  • [3] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch (2017) Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748. Cited by: §1.
  • [4] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
  • [5] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §3.2, §4.2.
  • [6] J. Y. Halpern (2007) Computer science and game theory: a brief survey. arXiv preprint cs/0703148. Cited by: §3.2.
  • [7] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. (2019) Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865. Cited by: §1.
  • [8] H. Jia, Y. Hu, Y. Chen, C. Ren, T. Lv, C. Fan, and C. Zhang (2020) Fever basketball: a complex, flexible, and asynchronized sports game environment for multi-agent reinforcement learning. arXiv preprint arXiv:2012.03204. Cited by: §2.
  • [9] K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, et al. (2019) Google research football: a novel reinforcement learning environment. arXiv preprint arXiv:1907.11180. Cited by: §2.
  • [10] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel (2017) Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037. Cited by: §2.
  • [11] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275. Cited by: §2.
  • [12] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, and I. K. Dong (2019) Applications of deep reinforcement learning in communications and networking: a survey. IEEE Communications Surveys & Tutorials. Cited by: §1.
  • [13] T. Lux and M. Marchesi (1999) Scaling and criticality in a stochastic multi-agent model of a financial market. Nature 397 (6719), pp. 498–500. Cited by: §1.
  • [14] L. Matignon, L. Jeanpierre, and A. I. Mouaddib (2013) Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. Cited by: §1.
  • [15] F. A. Oliehoek and C. Amato (2016) A concise introduction to decentralized pomdps. Springer. Cited by: §3.2.
  • [16] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. Cited by: §3.2, §4.2.
  • [17] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, and J. Bruna (2018) Pommerman: a multi-agent playground. arXiv preprint arXiv:1809.07124. Cited by: §2.
  • [18] M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C. Hung, P. H. Torr, J. Foerster, and S. Whiteson (2019) The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043. Cited by: §1, §2.
  • [19] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896. Cited by: §3.2, §4.2.
  • [20] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296. Cited by: §3.2, §4.2.
  • [21] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2017) Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4), pp. e0172395. Cited by: §3.2, §4.2.
  • [22] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018) Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5571–5580. Cited by: §2.