Log In Sign Up

Coach-assisted Multi-Agent Reinforcement Learning Framework for Unexpected Crashed Agents

by   Jian Zhao, et al.

Multi-agent reinforcement learning is difficult to be applied in practice, which is partially due to the gap between the simulated and real-world scenarios. One reason for the gap is that the simulated systems always assume that the agents can work normally all the time, while in practice, one or more agents may unexpectedly "crash" during the coordination process due to inevitable hardware or software failures. Such crashes will destroy the cooperation among agents, leading to performance degradation. In this work, we present a formal formulation of a cooperative multi-agent reinforcement learning system with unexpected crashes. To enhance the robustness of the system to crashes, we propose a coach-assisted multi-agent reinforcement learning framework, which introduces a virtual coach agent to adjust the crash rate during training. We design three coaching strategies and the re-sampling strategy for our coach agent. To the best of our knowledge, this work is the first to study the unexpected crashes in the multi-agent system. Extensive experiments on grid-world and StarCraft II micromanagement tasks demonstrate the efficacy of adaptive strategy compared with the fixed crash rate strategy and curriculum learning strategy. The ablation study further illustrates the effectiveness of our re-sampling strategy.


Promoting Coordination Through Electing First-moveAgent in Multi-Agent Reinforcement Learning

Learning to coordinate among multiple agents is an essential problem in ...

Cooperative Assistance in Robotic Surgery through Multi-Agent Reinforcement Learning

Cognitive cooperative assistance in robot-assisted surgery holds the pot...

An Optimal Rewiring Strategy for Reinforcement Social Learning in Cooperative Multiagent Systems

Multiagent coordination in cooperative multiagent systems (MASs) has bee...

Learning to flock through reinforcement

Flocks of birds, schools of fish, insects swarms are examples of coordin...

Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition

Video Recognition has drawn great research interest and great progress h...

Meta-CPR: Generalize to Unseen Large Number of Agents with Communication Pattern Recognition Module

Designing an effective communication mechanism among agents in reinforce...

1 Introduction

Cooperative multi-agent systems widely exist in various domains, where a group of agents need to coordinate with each other to maximize the team reward [1, 2]. Such a setting can be broadly applied in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid [3]. Recently, many researchers have devoted their efforts to leveraging reinforcement learning techniques to multi-agent systems [4, 5, 6, 7]. Despite the remarkable advance in academia, multi-agent reinforcement learning is still difficult to be applied in practice. One non-trivial reason is that there always exists a gap between the simulated and real-world scenarios, which degrades the performance of the policies once the models are transferred into real-world applications [8].

To close this sim-to-real gap and accomplish more efficient policy transfer, multiple research efforts are therefore now being directed towards identifying the causes for such gap and proposing corresponding solutions. One main cause is the difference between the physics engine of the simulator and the real-world scenario. To alleviate the difference, research efforts have been directed towards building up more realistic simulators by mathematical models [9, 10, 11, 12, 13, 14]. Another cause is the mismatch of the data distribution of the simulation environment and the real environment, which has inspired related research on domain adaptation [15, 16, 17], and domain randomization [18].

Generally, the simulated systems always assume that the agents can work normally all the time. However, this assumption is usually not in line with reality. Due to inevitable hardware or software failures in practice, one or more agents may unexpectedly “crash" during the coordination process. If the agents are trained in an environment without crashes, they only master how to cooperate in a crash-free environment. Once some agents “break down" and take abnormal actions, the remaining agents can hardly maintain effective cooperation, which will lead to performance degradation. Take a two-agent system as an example: the two agents are required to finish two tasks in coordination; under the crash-free scenario, the optimal solution is that each agent takes responsibility for one task, respectively; when applying such a policy to the real-world application, the cooperation cannot be accomplished if any agent encounters a crash. This example indicates the necessity of considering unexpected crashes during training in order to obtain well-trained agents with high robustness.

To the best of our knowledge, this work is the first to study the crashes in the multi-agent systems, which is more consistent with real-world scenarios. In this study, we give a formal formulation of a cooperative multi-agent reinforcement learning system with unexpected crashes, where any agent has a certain probability of crashing during operation. We assume that, for each agent, the probability of crashing independently follows a Bernoulli distribution. To enhance the robustness of the system to unexpected crashes, the agents should be trained in an environment with crashes. The key challenge is how to adjust the crash rate during training.

In this work, we propose a coach-assisted multi-agent reinforcement learning framework, which introduces a virtual coach agent into the system. The coach agent is responsible to adjust the crash rate during training. One straightforward coaching strategy for “coach" is to set a fixed crash rate during training. Considering that it may be too difficult for agents to cooperate from scratch [19], increasing the crash rate gradually is another feasible strategy. In addition to these basic strategies, an experienced “coach" can also automatically adjust the crash rate corresponding to the overall performance during training. Specifically, if the performance exceeds the threshold, the crash rate is increased to increase the difficulty of learning; otherwise, the crash rate should be decreased. In this way, the agents can learn the coordination skills progressively in face of the unexpected crashes.

To test the effectiveness of our method, we conduct experiments on grid-world and StarCraft II micromanagement tasks. Compared to the fixed crash rate and curriculum Learning strategies, the results demonstrate that adaptive method achieves relatively stable performances in the case of different crash rates. Furthermore, the ablation study shows the efficacy of our re-sampling strategy.

2 Related Work

In this section, we briefly summarize the works related to cooperative multi-agent reinforcement learning (MARL). With the development of this field, researchers pay more and more attention to the MARL problem that is more consistent with real-world settings.

Early efforts treat the agents in a team independently and regard the team reward as the individual reward [20, 21, 22, 23]. In this way, the MARL task is transformed into multiple single-agent reinforcement learning tasks. While trivially providing a possible solution, these approaches pay insufficient attention to an essential characteristic of MARL—coordination among agents. In other words, it will bring non-stationarity that agents cannot distinguish between the stochasticity of the environment and the exploitative behaviors of other co-learners [24].

Another line of research focuses on centralized learning of joint actions, which can naturally take coordination problems into consideration [25, 26]. Most of the centralized learning approaches require communication during execution. For instance, CommNet [25] designs a centralized network for agents to exchange information. BicNet [26] leverages the bi-directional RNNs for information sharing. Considering the communication constraint in practice, SchedNet [27] is proposed, in which agents learn how to schedule themselves for message passing and how to select actions based on received partial observations. Another challenge of centralized learning is the scalability issue since the joint action space grows exponentially as the number of agents increases. Some works investigate the scalable strategies in the centralized learning [28, 29]. Sparse cooperative Q-learning [29] only allows the necessary coordination between agents by encoding such dependencies. However, these methods require prior knowledge of the dependencies among agents, which is often inaccessible.

To study a more practical scenario with partial observability and communication constraint, an emerging stream is the paradigm of centralized training with decentralized execution (CTDE) [30, 31]. To the best of our knowledge, value decomposition networks (VDN) [4] makes the first attempt to decompose a central state-action value function into a sum of individual Q-values to allow for decentralized execution. VDN simply assumes the equal contributions of agents and does not make use of additional state information during training. Based on VDN, QATTEN [32] utilizes a multi-head attention structure to distinguish the contributions of agents, and linearly integrates the individual Q-values into the central Q-value. Instead of using linear monotonic value functions, QMIX [5] and QTRAN [33] employ a mixing network satisfying Individual-Global-Max (IGM) principle [33] to combine the individual Q-values non-linearly by leveraging state information. QPLEX [6] introduces the duplex dueling structure and decomposes the central Q-value into the sum of individual value functions and a non-positive advantage function.

However, all of the existing works assume that agents can continuously maintain normal operations, which is inconsistent with real-world scenarios. As a matter of fact, it is a quite common phenomenon that some agents encounter unexpected crashes because of hardware or software failure. To this end, our work aims to study a more practical problem by considering unexpected crashed agents in the cooperative MARL task.

3 Problem Formulation

In order to better solve the problem of unexpected crashed agent, we define a Crashed Dec-POMDP model, which is defined by a tuple . Each agent has a probability of crashing and the crash rate is denoted as

. For simplicity, we assume that the crash happens at the beginning of the episode and the status of being crashed or not will not change throughout the episode. We define a binarized vector to denote the

crashed state of agents as , where . When the agent is crashed, is 1; otherwise, is 0. Note that stays the same during an episode but may change throughout the task due to randomness.

At each time step, each agent receives partial observation according to the observation probability function . Each uncrashed agent chooses an action with the normal strategy, while the crashed agents take “no-move" or random actions, forming a joint action . Given the current state , the joint action of the agents transits the environment to the next state according to the state transition function . All of the agents share a team reward . The learning goal of MARL is to optimize every agent’s individual policy , where is an agent’s action-observation history, so as to maximize the team reward accumulation , where is a discount factor.

Figure 1: An overview of the adaptive framework.

4 Method

In this section, we present our coach-assisted multi-agent reinforcement learning framework for Crashed Dec-POMDP problem and explain the rationality of our design.

4.1 Overall Framework

To simulate the crash scenarios during training, we introduce a virtual agent into the system to act as a “coach". The “coach" is responsible for deciding the crash rate during training. At the beginning of each episode , the “coach" sets up a crash rate . We assume the probability of being crashed for each agent following a Bernoulli() distribution. Given the current crash rate , some of the agents become crashed and cannot take rational actions. Then, the multi-agent system with crashed agents are trained for steps to learn coordination. Afterwards, the “coach" can receive the performance of the agents under the current crash rate, denoted as , and reset the crash rate for the next episode.

4.2 Coaching Strategy

The main challenge for the “coach" is how to choose an effective crash rate during training. Here, we introduce three coaching strategies.

Fixed Crash Rate. The “coach" sets a fixed crash rate throughout the training process. The agents, of which some are crashed, are required to learn coordination skills from scratch.

Curriculum Learning. The “coach" linearly increases the crash rate during training. At the very beginning, the agents are trained under a crash-free environment. For the episode, the “coach" sets the crash rate to be , where

is a hyperparameter. In this way, the difficulty of cooperation increases gradually.

Adaptive Crash Rate. For the former two strategies, the “coach" does not take full advantage of the performance of the cooperative agents. An advanced strategy for the “coach" is to adaptively adjust the crash rate corresponding to the performance of the agents under the current crash rate. The basic idea is that if the agents can cooperate well and achieve acceptable performance under the current crash situation, the crash rate should be increased; otherwise, the crash rate should be decreased. The adaptive strategy can be formulated as follows:


where is a mapping function, refers to the performance of the current model and

represents the threshold of the performance of the specific evaluation metric. We can see that the fixed crash rate and the curriculum learning are the two special cases of the adaptive strategy.

For the fixed crash rate strategy,


For the curriculum learning strategy,


where .

In this work, we use the following adaptive function:


It is notable that our method is not limited to the selection of the above function. A more efficient adaptive function can be further investigated.

4.3 Re-sampling Strategy

Randomly sampling from a Bernoulli() distribution may cause the proportion of the crashed agents far beyond or fewer than the current crash ratio . Therefore, we employ a re-sampling strategy to ensure that the number of the crashed agents is no more than the upper bound of . Here, we explain the rationality behind the re-sampling strategy. For the samples with more crashed agents, it may be too difficult for the current model to learn the coordination skills, thus they are discarded. For the samples with fewer crashed agents than the expectation, they can help the agents remember how to deal with the easier scenarios, thus they are utilized during training.

5 Experiments

In this section, we conduct experiments to demonstrate the effectiveness of the methods that we propose. We firstly conduct experiments in a grid-world environment as a toy example. Then we use StarCraft Multi-Agent Challenge (SMAC) environment  [34] as the test-bed to evaluate our methods, which has become a commonly-used benchmark for evaluating state-of-the-art MARL approaches. All experiments are conducted on a Ubuntu 18.04 sever with 4 Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz and GeForce RTX 2080Ti GPU. Our codes are available at

5.1 Grid-world Example

5.1.1 Settings

We utilize the grid-world example to intuitively show the consequence without considering unexpected crashes in real-world scenarios. We set a 1010 grid where two agents need to touch two buttons within limited steps. The game will terminate after 20 steps or both buttons have been touched. The default reward at each step is -1 and if one button is touched, the agents are assigned a reward of 5 at this step. In this way, the agents are encouraged to touch the button as quickly as possible. At each step, each agent has 5 possible actions including up, down, left, right and staying still. If there exists an unexpected crash during the test, the crashed agent will stay still through the whole episode so only one agent may crash during the test. For simplicity, the initial locations of agents and buttons are fixed so that this environment is deterministic. What’s more, the observation of each agent is its own location and the global state contains the locations of the two buttons and agents so that the agents will not know whether their partner is crashed according to its own observation during operation.

We take QMIX [5], a state-of-the-art value-based MARL algorithm, as the base model in this toy example and we adopt the adaptive approach in comparison. Our implementation is based on the Pymarl algorithm library [34] and training schedules such as the optimizer and training hyperparameters are kept the same as the default ones used in Pymarl. Our method includes two additional hyperparameters: one is the performance threshold to decide whether to increase or decrease the crash rate during training; another is the learning rate of the crash rate to control the stepsize of adjustment of . We set as 0.75 and as 0.01 in this experiment. These tasks are trained for 2 million steps separately.

(a) Agent1 is crashed
(b) Agent2 is crashed
Figure 2: The trajectory of agents during test when one of them is crashed. The agents are represented with circles while the crashed one is marked using dotted line and the normal one is marked using solid line. The colored grids symbolize the two buttons. The green arrow line is the trajectory of agents trained using original QMIX and the orange one is achieved by our method.

5.1.2 Performance Evaluation and Discussion

After training for the same steps, agents trained using both these methods manage to complete this task under normal scenarios. However, when an unexpected crash happens, things are different. The results are illustrated in Figure 2. The agent trained using QMIX learns to touch the button near it in the shortest path but after that, it only wanders meaninglessly. Due to partial observation, the normal agent fails to know that its partner is out of control and it also does not try to touch another button. We assume that agents trained by QMIX learn to efficiently cooperate to complete this task, therefore they just need to touch the button near themselves. However, their excessive reliance on cooperation makes the system fragile and they fail to deal with the unexpected crash, which is common in realistic scenes. In contrast, our method takes possible crashes into account during training so that it can still fulfill the task even when one of them “breaks down”. This toy example illustrates the drawback of overreliance on cooperation and the necessity of considering possible crashes when training the multi-agent system.

Figure 3: The illustration of the actions taken by the agents who are trained using original QMIX (on the left) and our adaptive approach (on the right) when they are tested under scenarios with crashes. The agent highlighted with a red rectangle represents the crashed one and the agent highlighted with a yellow rectangle refers to the one that is affected.

5.2 StarCraft Multi-Agent Challenge

5.2.1 Settings

Apart from the experiment of grid-world, we also conduct experiments on StarCraft II decentralized micromanagement tasks to show the effectiveness of our method. In this environment, we assume the crashed agents will take random actions. Here, we also take QMIX [5] as the base model. Then we compare the performance of QMIX and our coach-assisted framework with “fixed crash rate”, “curriculum learning” and “adaptive crash rate” coaching strategy. Our implementation is also based on the Pymarl algorithm library [34] without changing the default training schedules. For the variants of QMIX with the fixed crash rate, we randomly sample the crashed agents with a Bernoulli distribution during each episode, thus the actual number of the crashed agents ranges from 0 to . In curriculum learning coach strategy, the crash rate increases from 0 linearly and the upper limit is set to 0.1 as we test the models in scenarios whose crash rate is at most 0.1. We set the two hyperparameters in {0.60, 0.65, 0.70, 0.75} and in {0.001, 0.003, 0.005, 0.015}, and select their optimal values based on grid search when adopting our adaptive method. We repeat the experiments under each setting over 5 runs with different seeds and report the average results. For all the compared methods, each task is trained for 2 million steps separately. In order to obtain a relatively robust evaluation result, each model is tested for 128 times.

We choose two standard maps and design two different maps in the experiment, which are 3svs5z, 3s5zvs3s5z, 8mvs5z and 8svs3s5z. The two standard maps are well-matched in strength so a crash may result in some imbalance. In order to comprehensively show the performance of our method, we also design two maps that guarantee an appropriate gap in strength between the two sides so that unexpected crashes will not lead to a significant change in difficulty. To know more details about the maps, please refer to [34].

5.2.2 Observation

In this part, we discuss the observations from the scenario with crashed agents on StarCraft II micromanagement tasks and show what to be considered to deal with the crash scenario.

The agents in Figure 3 and 3 play as Marines (ours) that are good at long-range attacks while Zealots (opponents) can only attack in short-range and they have the same moving speed. Because Marines only have half health point of Zealots, the optimal strategy is to alternate fire to attract the enemies. Figure 3 shows the case when one agent (highlighted with a red rectangle) is out of control and starts to take random actions, one of the remaining agents (highlighted with a yellow rectangle) is disrupted so that it cannot take reasonable action as well. This case illustrates that the random crashes of some agents will undermine the coordination among the rest of the agents in the team, which is likely to cause a drop in the win rate. However, it can be observed in Figure 3 that agents trained with our method can avoid such effect of the unexpected crash as they may be familiar with abnormal observations.

Figure 3 and 3 describe another situation where Stalkers (ours) play against Zealots as well as Stalkers (opponents) in the map 8svs3s5z. Stalkers are good at long-range attacks while Zealots are skilled in short-range attacks, and Stalkers move faster than Zealots. Stalkers can win the game by simply attacking when the number of normal agents is sufficient, and they will fail if they just keep the same strategy under crashed scenarios. Figure 3 illustrates that the Stalkers trained by QMIX actually only learn to attack continuously because this simple policy can achieve good performance under normal scenarios. But if they can split into two groups, i.e. some of them attract Zealots and do kitting (, attack and step back) repeatedly while others focus fire to eliminate the Stalkers and then attack remaining enemies together, they are likely to achieve better performance, as shown in Figure 3. This case indicates that once a simple winning strategy exists, the learning algorithm has little incentive to explore other optimal strategies, leading to poor capability in the event of crashed agents. The observation implies that increasing the challenge during training may drive the agents to learn better policies.

Method 3svs5z 3s5zvs3s5z
crash rate-0.01 crash rate-0.05 crash rate-0.10 crash rate-0.01 crash rate-0.05 crash rate-0.10
Baseline 67.0 15.5 61.9 15.7 56.9 15.3 85.3 11.0 64.8 8.3 43.6 11.3
Fix-0.01 84.8 11.8 74.7 14.5 72.0 13.8 86.9 2.1 63.9 1.7 45.8 2.0
Fix-0.05 84.8 7.5 78.0 12.6 72.7 9.6 86.3 3.1 65.8 2.9 46.9 6.2
Fix-0.10 86.9 8.0 81.1 5.4 74.8 8.2 83.6 3.0 64.8 7.0 48.1 4.2
Curriculum 84.4 6.0 81.1 8.1 74.7 5.8 87.8 2.5 66.1 2.9 48.0 1.6
Adaptive- 82.5 8.5 77.8 8.4 71.6 10.3 85.9 4.5 65.2 3.0 46.1 1.9
Adaptive 88.6 3.6 83.3 6.5 79.2 6.7 88.0 3.2 67.0 2.4 51.7 2.2
Method 8mvs5z 8svs3s5z
crash rate-0.01 crash rate-0.05 crash rate-0.10 crash rate-0.01 crash rate-0.05 crash rate-0.10
Baseline 94.1 2.3 82.3 4.5 71.6 2.9 88.6 5.7 75.8 7.0 68.6 4.9
Fix-0.01 86.9 5.4 79.8 5.0 65.6 4.4 87.5 5.8 77.3 6.6 62.0 7.3
Fix-0.05 89.1 2.4 84.2 4.4 68.4 6.4 91.1 6.4 80.0 7.2 66.7 7.2
Fix-0.10 90.0 5.0 83.9 7.1 78.0 4.8 88.9 9.4 79.7 11.3 70.0 9.6
Curriculum 94.1 1.8 82.5 2.8 72.3 2.9 92.0 2.9 79.8 5.4 66.6 5.8
Adaptive- 91.3 4.1 84.8 2.1 78.6 3.1 91.7 3.1 80.0 4.6 69.1 6.0
Adaptive 94.2 2.2 89.4 2.4 81.1 3.1 93.9 4.5 84.5 10.0 71.3 12.2
Table 1:

The performance of the compared methods in terms of win rate (including mean and standard deviation) under different crash rates. Fix-i represents the variants of QMIX which indicates that the crashed rate is fixed to

during training. Adaptive represents the results gained by adopting our adaptive method, but without the re-sampling strategy.

5.2.3 Performance Evaluation and Discussion

We evaluate the performance of the comparison methods by testing the win rate under different crash rates and the results are shown in Table 1. It can be observed that in standard maps, even just using a simple fixed crash rate strategy can help improve the performance. In contrast, in our designed maps, such approach works badly when the crash rate is low. We assume that it is because the maps we designed are relatively simple so that even original MARL algorithms can handle the scenarios with a low crash rate. In this case, fixing a low crash rate may instead introduce noise which affects the learning process. But in scenarios with a high crash rate, such method still has a positive effect. As for curriculum learning strategy, it tends to perform well in scenarios with a low crash rate. To sum up, these two straight methods can help the system be more robust in face of unexpected crash to some degrees but they all have some limits. On the contrary, our adaptive approach can help improve the performance in different maps and crash rates, which demonstrates the effectiveness and generalization of our approach.

When compared with the baseline algorithm, our adaptive method tends to gain a greater margin when the crash rate increases, indicating the superiority of our adaptive strategy in dealing with unexpected crashes. This finding further implies the rationality of our adaptive strategy that allows the agents to learn how to handle the crash scenarios step by step. What’s more, it can be observed that the performance achieved by our method with re-sampling has a consistent superiority, compared to the performance achieved without adopting this strategy. We think it can be attributed to the fact that without re-sampling strategy, there may be samples which contain more crashed agents, thus bringing more difficulty during training. This finding also proves the importance of adopting re-sampling strategy in our coach-assisted framework.

5.2.4 Hyperparameter Analysis

In our adaptive framework, the performance threshold and the learning rate of crash rate , which jointly decide the updating of the adaptive crash rate, are of vital importance to the performance of our method. In this part, we further analyze the influence of these two hyperparameters on the overall performance, with other parameters unchanged.

crash rate-0.01 crash rate-0.05 crash rate-0.10 crash rate-0.01 crash rate-0.05 crash rate-0.10
84.7 9.0 80.5 9.6 74.1 8.3 80.9 9.4 75.6 6.9 65.0 13.0
88.6 3.6 83.3 6.5 79.2 6.7 88.6 3.6 83.3 6.5 79.2 6.7
81.4 7.8 77.7 6.2 72.3 7.1 88.4 4.5 83.6 4.7 75.5 8.6
79.5 10.2 77.8 8.5 67.0 8.2 78.4 7.8 73.4 8.5 68.9 6.0
Table 2: The impact of performance threshold and learning rate . The results show the win rate (including mean and standard deviation) under different settings across five different random seeds. The experiment takes QMIX as the base model on the 3svs5z task after 2 million steps of training.

Here, we take the map 3svs5z as an example. Table 2 reports the results of our method under different values of and . Given the same , a large means that we require the agents to learn quite well under the current crash rate before exploring a harder scenario. We can see that given , the overall win rate first increases and then decreases as increases from 0.6 to 0.75, and the best performance is achieved when . Given the same , we can see that the performance first improves and then degrades as the increases. The reason may be that, if is too small, the crash rate will be adjusted too slow, so that the agents cannot learn well within the limited steps. If is too large, sharply increasing the crash rate may be too difficult for agents to learn coordination and the adjustment of the difficulty will be rough. To sum up, the hyperparameters indeed have some effect on our framework, but our method can achieve a relatively stable performance if the hyperparameters are varied in a small range, which proves the robustness of our method.

6 Conclusion

Considering a common phenomenon that some agents may unexpectedly crash in real-world scenarios, this work is dedicated to a coach-assisted MARL framework that can close this sim-to-real gap. Our method simulates different rates of random crashes during the training process with the help of “coach” so that the agents can master the skills to deal with crashes. We conduct the experiments on grid-world and StarCraft II micromanagement tasks to show the necessity of considering crash during operation and test the effectiveness of our framework using three coaching strategies. The results demonstrate the efficacy and generalization of our method under different crash rates. In the future, we will further investigate the case in which the crashed agents may take other abnormal actions in addition to the random ones and other more efficient coaching strategies.


  • [1] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
  • [2] Karl Tuyls and Gerhard Weiss. Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3):41–41, 2012.
  • [3] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Decentralized multi-agent reinforcement learning with networked agents: recent advances. Frontiers of Information Technology & Electronic Engineering, 22(6):802–814, 2021.
  • [4] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2085–2087, 2018.
  • [5] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In

    Proceedings of the International Conference on Machine Learning (ICML)

    , pages 4295–4304. PMLR, 2018.
  • [6] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. In International Conference on Learning Representations (ICLR), pages 1–16, 2020.
  • [7] Yun-peng Wang, Kun-xian Zheng, Da-xin Tian, Xu-ting Duan, and Jian-shan Zhou. Cooperative channel assignment for vanets based on multiagent reinforcement learning. Frontiers of Information Technology & Electronic Engineering, 21(7):1047–1058, 2020.
  • [8] Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 737–744. IEEE, 2020.
  • [9] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pages 621–635. Springer, 2018.
  • [10] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
  • [11] Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. Rotors—a modular gazebo mav simulator framework. In Robot operating system (ROS), pages 595–625. Springer, 2016.
  • [12] Cassandra McCord, Jorge Pena Queralta, Tuan Nguyen Gia, and Tomi Westerlund. Distributed progressive formation control for multi-agent systems: 2d and 3d deployment of uavs in ros/gazebo with rotors. In European Conference on Mobile Robots (ECMR), pages 1–6. IEEE, 2019.
  • [13] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • [14] Yunpeng Wang, Kunxian Zheng, Daxin Tian, Xuting Duan, and Jianshan Zhou.

    Pre-training with asynchronous supervised learning for reinforcement learning based autonomous driving.

    Frontiers of Information Technology & Electronic Engineering, 22(5):673–686, 2021.
  • [15] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), pages 1480–1490. PMLR, 2017.
  • [16] René Traoré, Hugo Caselles-Dupré, Timothée Lesort, Te Sun, Natalia Díaz-Rodríguez, and David Filliat. Continual reinforcement learning deployed in real-life using policy distillation and sim2real transfer. In ICML Workshop on “Multi-Task and Lifelong Reinforcement", pages 1–11, 2019.
  • [17] Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for sim-to-real domain adaptation. In IEEE International Conference on Robotics and Automation (ICRA), pages 2725–2731. IEEE, 2020.
  • [18] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel.

    Domain randomization for transferring deep neural networks from simulation to the real world.

    In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
  • [19] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21:1–50, 2020.
  • [20] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the International Conference on Machine Learning (ICML), pages 330–337, 1993.
  • [21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [22] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the International Conference on Machine Learning (ICML), pages 2681–2690. PMLR, 2017.
  • [23] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1146–1155. PMLR, 2017.
  • [24] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems (NeurIPS), 30:6379–6390, 2017.
  • [25] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus.

    Learning multiagent communication with backpropagation.

    In Annual Conference on Neural Information Processing Systems (NeurIPS), page 2252–2260, 2016.
  • [26] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play starcraft combat games. In arXiv preprint arXiv:1703.10069, pages 1–10, 2017.
  • [27] Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyunghwan Son, and Yung Yi. Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations (ICLR), pages 1–17, 2018.
  • [28] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored mdps. In Annual Conference on Neural Information Processing Systems (NeurIPS), volume 1, pages 1523–1530, 2001.
  • [29] Jelle R Kok and Nikos Vlassis. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research (JMLR), 7:1789–1828, 2006.
  • [30] Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate q-value functions for decentralized pomdps.

    Journal of Artificial Intelligence Research

    , 32:289–353, 2008.
  • [31] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
  • [32] Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning. In arXiv preprint arXiv:2002.03939, pages 1–14, 2020.
  • [33] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 5887–5896. PMLR, 2019.
  • [34] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philiph H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019.