An Introduction to Multi-Agent Reinforcement Learning and Review of its Application to Autonomous Mobility

by   Lukas M. Schmidt, et al.

Many scenarios in mobility and traffic involve multiple different agents that need to cooperate to find a joint solution. Recent advances in behavioral planning use Reinforcement Learning to find effective and performant behavior strategies. However, as autonomous vehicles and vehicle-to-X communications become more mature, solutions that only utilize single, independent agents leave potential performance gains on the road. Multi-Agent Reinforcement Learning (MARL) is a research field that aims to find optimal solutions for multiple agents that interact with each other. This work aims to give an overview of the field to researchers in autonomous mobility. We first explain MARL and introduce important concepts. Then, we discuss the central paradigms that underlie MARL algorithms, and give an overview of state-of-the-art methods and ideas in each paradigm. With this background, we survey applications of MARL in autonomous mobility scenarios and give an overview of existing scenarios and implementations.




I Introduction

AI is increasingly being deployed in many applications and devices, many of which do not act in isolation but in a connected and cooperative environment. This is particularly relevant in autonomous mobility and traffic scenarios, like the ones shown in Fig. 1, where multiple agents need to interact with each other to find a common solution. As cooperation and communication between agents are important in these problems, single-agent solutions such as RL often fall short of expectations. Instead, MARL specifically deals with multi-agent problems and aims to find policies that help multiple vehicles reach their individual and common objectives.

MARL aims at finding optimal strategies for agents in settings where multiple agents interact in the same environment. This allows a variety of new solutions that build on concepts such as cooperation [37] or competition [35]. However, multi-agent settings also introduce challenges, including potentially missing or inadequate communication, difficulties in credit assignment to agents, and environment non-stationarity. This survey aims at highlighting key principles and paradigms in MARL research and gives an overview of possible applications and challenges for MARL in autonomous mobility.

The rest of the paper is structured as follows. We first cover the basics of RL in Sec. II before explaining central concepts of MARL in Sec. III. Sec. IV summarizes the current state of the art in MARL algorithms and explains their core paradigms. Sec. V then gives an overview of four large application fields of MARL. We give an overview of implementation frameworks in the relevant sections. Sec. VI concludes.

Fig. 1: MARL applications: Intelligent warehouses and logistics, drone fleets for delivery and agriculture, traffic flow control and automated driving (image source: Pexels/Pixabay).

II Basics: Reinforcement Learning

Single-agent RL is designed to solve a class of problems known as Markov Decision Processes (MDPs). In an MDP, a single agent interacts with an environment by taking actions depending on the state of the environment. These actions may change the state of the environment. The agent receives a reward and aims at maximizing the expected return, i.e., the sum of rewards gained in the environment, discounted by a factor γ that weights future rewards, by finding an optimal policy π* (i.e., the behavior; the asterisk denotes optimality) that maps states to actions. A single step in the environment is known as a transition sample (s, a, r, s') [51].

We describe an MDP by a quintuple (S, A, P, R, γ), where S and A refer to state and action space, P(s' | s, a) denotes the transition probability from a state s to a successor state s' given an action a, R(s, a, s') denotes the reward received by the agent for the transition from s to s', and γ ∈ [0, 1) is the discount factor that trades instantaneous over future rewards.

Two major paradigms in RL are value-based and policy gradient methods. Value-based methods, such as DQN [33], estimate the utility Q(s, a) of an action using a parametrized Q-network; the policy is then formed by greedily following the actions with the highest estimated values. Policy gradient methods instead directly optimize a parametrized policy to make actions that lead to a high reward more likely. Recent work in RL, e.g., DDPG [27] or PPO [44], combines both paradigms into an actor-critic architecture that uses a value-based critic to improve updates to the actor (i.e., the policy).
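As a minimal illustration of these definitions, the discounted return and a greedy value-based policy can be sketched in a few lines of Python (the function names are ours, not from any RL framework):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed
    backwards so each reward is discounted by its time step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def greedy_action(q_values):
    """Value-based policy: pick the action with the highest
    estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For example, three unit rewards with γ = 0.5 yield a return of 1 + 0.5 + 0.25 = 1.75.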

III Multi-Agent Reinforcement Learning

In MARL, multiple agents are concurrently optimized to find optimal policies. Compared to the single-agent RL setting, this introduces several important differences and possibilities, which we highlight in this section.

III-1 Observability in MARL

In RL, the concept of observability describes whether an agent can perceive the environment state fully or partially. A classical MDP is defined to be fully observable: the agent directly perceives the environment state s. Partial observability is described by a POMDP and modeled by adding a (potentially lossy) observation function that converts an environment state s into an observation o. POMDPs often represent the more challenging, real-world applications. For applied research, some scenarios (e.g., the movement of robots inside a controlled warehouse environment with sufficient sensor coverage) can be assumed to be fully observable. However, most real-world mobility applications, like traffic scenarios, are partially observable.

Multi-agent RL provides the opportunity to combine partial observations from multiple agents to gather more information. Several paradigms of MARL research, like communication or cooperation, explicitly or implicitly use this opportunity to improve performance. Many other MARL methods use Recurrent Neural Networks (RNNs) to aggregate observations over time into more complete state estimates.
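A lossy observation function of this kind can be sketched as follows; the radius-based visibility model is an illustrative assumption (a common way to mimic limited sensor range in mobility scenarios), not a standard from the POMDP literature:

```python
import math

def partial_observation(state, agent_pos, view_radius):
    """Lossy observation function O(s): the agent only perceives
    entities of the full environment state that lie within
    view_radius of its own position."""
    return [e for e in state
            if math.dist(e["pos"], agent_pos) <= view_radius]
```

A nearby car is observed, a distant one is not, even though both are part of the true state s.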


III-2 Centrality

Different MARL settings can be distinguished by how much agents are able to communicate with each other. In a fully decentralized problem, agents have no means of communication, while a fully centralized setting allows one central entity to perceive and control all agents. This distinction affects whether the state and action spaces of all agents are disjoint (decentralized) or joint (fully centralized). Most MARL algorithms are neither fully decentralized nor fully centralized; we explain different communication and interaction paradigms in Sec. IV.

III-3 Heterogeneous vs. Homogeneous agents

In general, MARL does not require all agents to be similar. Heterogeneous agents can have different observation and action spaces, like cars and traffic lights. However, in many MARL environments, the agents are homogeneous, meaning they have the same state and action space. For example, in an autonomous driving setting, many cars require similar behavior. This can be exploited by introducing parameter sharing during training, effectively learning the policies jointly. This is advantageous as the number of trainable parameters is reduced, allowing for shorter training times, and training becomes more efficient because the experiences of all agents are used. However, different behavior at execution time is still possible because each agent has unique observations.
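Parameter sharing can be sketched as follows: a single weight matrix serves all homogeneous agents, yet behavior can differ per agent because each one feeds in its own observation. The linear policy and agent names are made up for this sketch:

```python
import numpy as np

# One shared parameter set for all homogeneous agents
# (illustrative linear policy, 4-dim observation, 2 actions).
rng = np.random.default_rng(0)
shared_weights = rng.normal(size=(4, 2))

def shared_policy(obs):
    """Every agent evaluates the same shared parameters; actions can
    still differ because observations differ."""
    return int(np.argmax(obs @ shared_weights))

observations = {
    "agent_0": np.array([1.0, 0.0, 0.0, 0.0]),
    "agent_1": np.array([0.0, 0.0, 1.0, 0.0]),
}
actions = {aid: shared_policy(o) for aid, o in observations.items()}
```

Gradients from both agents' experiences would update the same `shared_weights`, which is where the training-efficiency gain comes from.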

III-4 Cooperative vs. Competitive

An important difference between MARL environments is how the goals of each agent relate to each other. This can be divided into fully cooperative, fully competitive and mixed cooperative-competitive.

In cooperative settings, all agents aim to achieve a common goal. For example, all agents want to reach their destination undamaged. Usually, all agents share a common reward function [63]. This work focuses on cooperative MARL, because autonomous mobility scenarios are better characterized by shared goals (e.g., reaching each agent's destination unharmed) than by competitive objectives.

By contrast, competitive settings, e.g., card or board games, are usually modeled as zero-sum Markov games where the return of all agents sums to zero [3]. Mixed settings combine cooperative and competitive agents that use general-sum returns. For instance, in team games, agents have to cooperate with their teammates while competing with the opposing teams [3]. Littman et al. [28] state that optimal play can be guaranteed against an arbitrary opponent in competitive settings, whereas strong assumptions must be made to guarantee convergence in cooperative environments.

III-5 Scalability

While simple MARL problems have a relatively small number of agents, many real-world mobility scenarios involve hundreds or thousands of individual traffic participants. Modern simulators, like SUMO [30] or CityFlow [62], can simulate entire neighborhoods or small cities and model how hundreds of traffic lights affect the traffic flow. Consequently, algorithms have to scale correctly and efficiently to such large numbers of agents. As an example, a simple centralized entity that uses the joint observation and action space of all agents would be impossible to train, while independent agents (although less performant for smaller numbers of agents) could still operate. This problem is known as scalability, and we discuss potential solutions in Sec. IV.
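The scalability problem can be made concrete with a back-of-the-envelope calculation: the joint action space of a fully centralized controller grows exponentially with the number of agents.

```python
def joint_action_space_size(n_actions, n_agents):
    """|A_joint| = |A| ** n for a centralized controller that picks
    one of |A| actions for each of n agents simultaneously."""
    return n_actions ** n_agents

# Illustrative numbers: 5 signal phases per traffic light and
# 100 intersections give 5**100 joint actions (about 70 decimal
# digits), far beyond what a single centralized policy can handle.
```

With only two agents and five actions each, the joint space already has 25 entries; independent or neighborhood-based agents avoid this blow-up entirely.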

(a) Independent
(b) Credit assignment
(c) Communication
(d) CTDE
Fig. 2: Four categories of MARL approaches based on communication and cooperation: (a) Independent agents (colored circles) operate and learn independently; (b) credit assignment techniques use an explicit mechanism that assigns the reward (grey circles) to individual agents; (c) agents operate independently but have access to a communication medium to exchange messages (black arrows); (d) in centralized training with decentralized execution (CTDE), agents use a shared critic (gray background) during training to improve performance. After training, the critic is no longer used and agents operate independently.

III-6 Global vs. local rewards

Single-agent RL environments have a single reward definition that specifies the objective for the agent. Cooperative multi-agent settings can either have a single global reward or multiple local rewards, one per agent [53]. This is an important difference: global rewards only depend on the global state. They are easy to compute and specify, but often lead to inefficient training, because agents are never explicitly rewarded or penalized for their own actions [56]. Local rewards, on the other hand, can reflect each individual agent's contribution to the cooperating group. This makes learning easier and improves convergence, especially in decentralized settings without other explicit cooperation [1]. However, it is usually challenging to design and implement a reward function that correctly rewards individual agents for their contributions [16]. Algorithms that use the credit assignment paradigm described in Sec. IV-B explicitly learn how global rewards can be disentangled into local rewards to solve this.
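The difference can be illustrated with a toy delivery example (hypothetical numbers): under a global reward, an idle agent receives the same reinforcement as a productive one, which is exactly the credit assignment problem.

```python
# Hypothetical episode outcome: agent_1 contributed nothing.
deliveries = {"agent_0": 3, "agent_1": 0}

# Global reward: both agents receive the team total of 3, so the
# idle agent is reinforced just as strongly as the productive one.
shared = sum(deliveries.values())
global_rewards = {aid: shared for aid in deliveries}

# Local rewards: each agent is credited only for its own
# contribution, a much cleaner per-agent learning signal.
local_rewards = dict(deliveries)
```

Credit assignment methods (Sec. IV-B) try to recover something like `local_rewards` automatically when only `shared` is observable.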

IV Algorithms for MARL

We organize algorithms for MARL into four categories based on their underlying paradigm, as illustrated in Fig. 2, ordered from independent to fully connected agents. The four paradigms described in the following sections are: fully decentralized algorithms, credit assignment, communicating agents, and algorithms with centralized training and decentralized execution. All algorithms surveyed in this work are summarized in Table I. In addition to characteristics of paradigm and algorithm type, the table also lists available open-source implementations of these algorithms.

Name Year Communication Type Framework
Independent Agents, Section IV-A
IQL [52] 1993 No V EpyMARL, RLlib, MALib
IA2C [6] 2020 No AC EpyMARL, RLlib, MALib
IPPO [8] 2020 No AC EpyMARL, RLlib, MALib
Credit Assignment, Section IV-B
COMA [11] 2018 No AC EpyMARL
Shapley counterfactual credit assignment [23] 2021 No V -/-
VDN [50] 2018 No V EpyMARL, Mava
QMIX [12] 2018 No V EpyMARL, RLlib, Mava, MALib
MAVEN [32] 2019 No Hybrid Author's implementation
Q-DPP [59] 2020 No V Author's implementation
Communication, Section IV-C
RIAL [10] 2016 Yes V Author's implementation
DIAL [10] 2016 Yes V Mava, NMARL
CommNet [49] 2016 Yes P NMARL
ATOC [18] 2018 Yes AC -/-
IC3Net [47] 2019 Yes P Author's implementation
NeurComm [5] 2020 Yes AC NMARL
Mean-field MARL [58] 2018 No V/AC Author's implementation
Dec-Neural-AC [14] 2021 No AC -/-
Centralized Training, Decentralized Execution, Section IV-D
MADDPG [31] 2017 No AC EpyMARL, RLlib, MALib, Mava
MAPPO [61] 2021 No AC EpyMARL, Mava
Recurrent policy optimization [20] 2021 No AC Author's implementation

TABLE I: We sort algorithms into the paradigms explained in Fig. 2. We also note whether an algorithm uses an explicit communication mechanism between agents. The type is value-based (V), policy-based (P), or actor-critic (AC).

Algorithm Frameworks. Many publications are accompanied by their own codebases. In addition, several frameworks providing implementations of multiple methods are publicly available. RLlib [26] is an open-source reinforcement learning library that aims at providing high-performance implementations of RL algorithms. It implements many single-agent RL algorithms that can be extended to the multi-agent setting, as well as some MARL-specific algorithms and parameter sharing. EpyMARL is another framework comprising a wide range of algorithms and parameter sharing; however, it only supports discrete actions [40]. Mava is designed for scalable MARL training and execution and allows for discrete and continuous actions [41]. MALib specializes in scalable population-based MARL [67]. Networked MARL (NMARL) is a repository of MARL algorithms for networked system control [5]. Table I lists the available frameworks for each algorithm.

IV-A Decentralized: Learning independently

The fully decentralized setting assumes that agents learn entirely independently. Usually, multiple agents run individual single-agent RL algorithms concurrently. To that end, each agent has an independent observation, policy, and algorithm; the only interaction occurs through the environment. Algorithms of this kind are often used as baselines and include IQL [52], IA2C [6], and IPPO [8]. Typically, independent agents perform worse than optimized cooperating agents in centralized settings. However, the performance benefits of more complicated approaches vary wildly between environments; in some cases, independent agents are on par with or even outperform cooperating agents [52, 8].

While it is easy to adapt single-agent RL algorithms to these settings, decentralized environments violate the stationary Markov property: the transition probabilities P, which define the system dynamics, change over time because all other agents are considered part of the environment from an agent's perspective. Therefore, contrary to single-agent RL, convergence is not guaranteed [15]. However, decentralized algorithms do not need to deal with scalability issues.
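A tabular sketch of the independent-learning idea (in the spirit of IQL, not the original implementation): each agent owns a private Q-table and updates it from its own transitions only, treating everything else as environment.

```python
from collections import defaultdict

class IndependentQLearner:
    """Independent Q-learning sketch: this agent never sees the other
    agents' states, actions, or parameters, so from its perspective
    the environment is non-stationary."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Standard one-step Q-learning target from the agent's own sample.
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])

    def act(self, s):
        return max(range(len(self.q[s])), key=lambda a: self.q[s][a])

# One learner per agent; no information is shared between them.
learners = {aid: IndependentQLearner(n_actions=2) for aid in ("a0", "a1")}
```

Each learner's update rule is identical to single-agent Q-learning; the multi-agent coupling enters only through the rewards and next states the environment hands back.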

IV-B Credit assignment

Most MARL environments do not explicitly disentangle reward functions for different agents, and use global rewards instead. This is problematic during training [56]. The idea behind credit assignment methods is to convert this global reward into an estimated, per-agent local reward. They can be grouped into explicit and implicit credit assignment methods.

Explicit methods estimate each agent's contribution against a certain baseline. COMA [11] proposes a counterfactual baseline, where the global reward is compared to the reward received if the agent's action is replaced by a default action. This approach omits correlations between the agents, leading to inefficiency in complex settings. To circumvent this problem, Li et al. [23] introduce an approach called Shapley counterfactual credit assignment that guides training by estimating the individual credit of each agent using Shapley values, a concept from cooperative game theory that quantifies each player's contribution to a coalition.
Implicit credit assignment aims at learning how to decompose the shared reward function into individual value functions [65]. Sunehag et al. introduce value decomposition networks (VDN), which learn to decompose the shared value function into individual value functions [50]. However, VDN assumes that the individual functions combine additively. QMIX [12] builds on VDN and removes this limitation by decomposing the global value into an arbitrary non-linear combination of per-agent value functions. To make the optimization tractable, QMIX only constrains this combination to be monotonic with respect to the contribution of individual agents [12].
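The two decomposition schemes can be contrasted in a simplified, single-layer sketch; real QMIX generates the mixing weights with a hypernetwork conditioned on the global state, which we omit here:

```python
import numpy as np

def vdn_joint_q(per_agent_qs):
    """VDN: the joint Q-value is the plain sum of per-agent
    Q-values (assumes additive individual contributions)."""
    return float(np.sum(per_agent_qs))

def qmix_joint_q(per_agent_qs, mixing_weights, bias=0.0):
    """QMIX-style mixing, reduced to one layer for illustration:
    a weighted combination of per-agent Q-values, constrained to be
    monotonic in each agent's Q-value by forcing the mixing weights
    to be non-negative."""
    w = np.abs(mixing_weights)  # non-negativity => monotonicity
    return float(w @ np.asarray(per_agent_qs) + bias)
```

The monotonicity constraint is what makes decentralized execution possible: each agent can greedily maximize its own Q-value without decreasing the joint value.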

While this monotonicity constraint allows decentralized execution, Mahajan et al. [32] argue that it does not hold for arbitrary Q-functions and makes exploration inefficient. They introduce a hierarchical method, multi-agent variational exploration (MAVEN), that adds a central latent space and diversifies exploration among agents [32]. However, recent work [59] argues that this exploration is still inadequate and proposes a linear-time sampler that helps agents cover orthogonal directions in state space during training.

IV-C Learning Communication

To allow agents to coordinate, a large body of work adds an explicit communication layer to the algorithms. This information exchange allows agents to widen their observation space and coordinate. In the following, we discuss algorithms that use explicit communication between agents.

Foerster et al. [10] proposed two approaches for learning communication between independent agents. In Reinforced Inter-Agent Learning (RIAL), each agent has an additional network that generates a communication message for the other agents. In Differentiable Inter-Agent Learning (DIAL), gradients from the other agents are fed through the communication channel into the communication network. This allows for end-to-end training across agents, whereas RIAL is only trainable within an agent. In contrast to the discrete connections in RIAL and DIAL, CommNet [49] learns a continuous communication channel jointly with the policy. Through this channel, each agent receives the summed transmissions of the other agents; the input to each hidden layer of each agent is then the output of the previous layer plus the communication message.
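A single CommNet-style layer, following the description above (each agent receives the sum of the other agents' messages), can be sketched as follows; the weight shapes and tanh nonlinearity are illustrative choices:

```python
import numpy as np

def commnet_layer(hidden, W_h, W_c):
    """CommNet-style step: each agent's next hidden state combines
    its own hidden state (via W_h) with the summed messages of all
    *other* agents (via W_c)."""
    h = np.asarray(hidden)                 # shape: (n_agents, dim)
    total = h.sum(axis=0, keepdims=True)
    comm = total - h                       # sum over everyone else
    return np.tanh(h @ W_h + comm @ W_c)
```

Because the communication path is a differentiable sum, gradients flow through `comm` across agents, which is what lets the channel be learned jointly with the policy.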

In recent work, communication has been viewed in a more nuanced way. Communication introduces costs, especially as the number of agents grows. Additionally, it massively increases the amount of information available to each agent, which can make it harder to identify crucial information [18]. Recent work has thus focused on mechanisms that decide whether an agent should communicate at all.

Jiang and Lu [18] propose an attentional communication model (ATOC) that uses local information to decide whether an agent should initiate communication. Agents form dynamic local groups of communicating agents that coordinate through a shared communication medium. In a different line of work, the individualized controlled continuous communication model (IC3Net) [47] uses RNN policies and allows agents to share the RNN's hidden state through a designated action. This is particularly useful in competitive scenarios, as agents can block communication.

Another approach to restricting communication is networked MARL (NMARL). Contrary to attentional communication, these networks limit the agents’ communication to their local neighbors. Chu et al. [5] employ this concept and additionally introduce a communication protocol called NeurComm. It is similar to CommNet [49] but instead of summing the received messages, NeurComm concatenates them. Communication is supplemented by policy fingerprints to reduce non-stationarity [5].

Representing each agent as a node in a network enables scaling to large agent populations. Yang et al. [58] propose a mean-field formulation to solve scalability issues: each agent is only affected by the mean effect of its local neighborhood. Compared to considering all agents, this significantly reduces the complexity of interactions for large agent populations. Another mean-field approach, the decentralized neural actor-critic algorithm (Dec-Neural-AC) [14], also aims at scalability. During training, the agents incorporate information about their local neighborhood; during execution, however, the agents only depend on their own current states. Thus, the learned policies are executed in a decentralized manner, similar to the CTDE setting introduced in Sec. IV-D.
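The core mean-field simplification can be sketched in one function: the joint action of all neighbors is replaced by its mean, so interaction complexity no longer grows with neighborhood size.

```python
import numpy as np

def mean_field_action(neighbor_actions):
    """Mean-field MARL sketch: instead of conditioning on every
    neighbor's action individually, an agent only sees the mean
    action of its local neighborhood. A Q-function Q(s, a, mean_a)
    then replaces Q(s, a, a_1, ..., a_n)."""
    return np.mean(np.asarray(neighbor_actions, dtype=float), axis=0)
```

Whether the neighborhood contains two agents or two hundred, the agent's input stays a single mean-action vector.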

IV-D Centralized Training, Decentralized Execution (CTDE)

Modern MARL approaches often make use of centralization during training but produce agents that run independently. This setting is known as CTDE. It maps well to real-world use cases, where training can occur in simulation or with (potentially expensive) real-time communication, but agents should operate independently after training.

Many CTDE algorithms use an actor-critic architecture, which decomposes the policy and value estimates into explicit actor and critic networks. The critic is only used during training, where it improves the gradient updates for the policy. CTDE actor-critic algorithms exploit this by training multiple agents together with a shared critic. During execution, only the independent actors are needed, which allows agents to operate without communication after training.

MADDPG [31] is an example of a CTDE algorithm that uses per-agent actors with policy networks mapping agent-specific observations to agent-specific actions. The critic approximates a centralized action-value function and receives the concatenated actions of all agents as well as additional global state information, which can include the concatenated observations of all agents. The agents' actions are included to make the environment more stationary [31].
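The information asymmetry between the centralized critic and the decentralized actors can be sketched as follows (the function names are ours, for illustration only):

```python
import numpy as np

def centralized_critic_input(observations, actions):
    """MADDPG-style critic input: the concatenation of all agents'
    observations and actions. Only available during training."""
    return np.concatenate([np.ravel(o) for o in observations] +
                          [np.ravel(a) for a in actions])

def actor_input(own_observation):
    """At execution time, each actor sees only its own observation,
    so no communication is needed after training."""
    return np.ravel(own_observation)
```

With two agents, 3-dimensional observations, and 2-dimensional actions, the critic sees a 10-dimensional vector while each actor sees only its own 3 dimensions.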

Off-policy methods such as MADDPG were commonly considered more sample-efficient than on-policy methods. However, Yu et al. [61] propose the on-policy algorithm MAPPO, which performs comparably to off-policy algorithms. MAPPO uses global environment information as input to the critic instead of a concatenation of all observations, which would grow with the number of agents.

Recurrent Policy Optimization [20] is a recent method that uses a recurrent critic and a combinator that merges different agents' trajectories into one meta-trajectory. This meta-trajectory is used to train the critic and allows the approach to better learn from interactions between agents.

V MARL in Autonomous Mobility: Applications and Challenges

Many real-world scenarios in autonomous mobility involve multiple agents. This presents an obvious potential for MARL approaches to optimize how agents cooperate and interact. We give an overview of application domains of MARL in autonomous mobility and discuss potential benefits, challenges, and recommendations for MARL-based methods. Table II lists different environments and their use-cases.

Environment Type Obs. Actions Description
General Purpose
MAgent [64] Comp P D Multiple environments with a large number of agents
Multi-Agent Particle [35] Coop / Comp P D / C Multiple environments focused on communication
Traffic Control
CityFlow [62] Coop / Comp P / F -/- City traffic simulator
SUMO [30] Coop / Comp P / F D / C Traffic simulation
Autonomous Vehicles
SMARTS [66] Coop P C Autonomous driving simulation platform
BARK [2] Coop P / F D / C Autonomous driving scenarios
MACAD [38] Coop / Comp P D / C Multi-Agent Connected Autonomous Driving built on CARLA
highway-env [22] Coop P / F D / C Autonomous driving and tactical decision-making tasks
Unmanned Aerial Vehicles
Gym PyBullet Drones [39] Coop / Comp P C Simulation environment for quadcopters with OpenAI gym API
AirSim [45] Coop / Comp P C High-fidelity simulation environment for UAVs and autonomous vehicles
Resource Optimization
MARO [25] Coop P D / C Resource optimization in industrial domains
Flatland [34] Coop P / F D Vehicle re-scheduling problem

TABLE II: Simulation environments for general-purpose MARL research and the four application fields summarized in Sec. V. Environments are characterized by either partial (P) or full (F) observability. Action spaces can be either discrete (D) or continuous (C). Slashes (A/B) denote that the environment can be configured to have different characteristics.

V-A Traffic Control

Traffic congestion in cities is problematic as it causes pollution and financial loss and increases the risk of accidents [55]. Hence, controlling traffic through traffic signals, lane controls, or routing guidance has great potential to improve living conditions. Traffic involves a great number of participants, including traffic lights, vehicles, and pedestrians. MARL is an obvious fit for the traffic control problem as it does not rely on limiting heuristics. Data from various sources, e.g., road surveillance cameras, location-based mobile services, and vehicle tracking devices, can be used as input to MARL models.
Adaptive traffic signal control is a prominent approach to traffic control. The potentially large number of agents, however, demands solutions to the scalability issue. As centralized and CTDE approaches do not scale well to large agent populations, decentralized approaches are an evident choice; however, convergence is not guaranteed (Sec. IV-A). To make the environment stationary from the agents' perspectives, Wang et al. [54] introduce graph attention networks that learn the dynamics of neighboring intersections' influence. Instead of attention networks, Chu et al. [6] propose communication with the neighboring agents to improve the convergence of decentralized training. In addition, a spatial discount factor is used to weigh the observation and reward signals of neighboring agents.
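A sketch of such spatial discounting; the hop-based weighting and the value of `alpha` are illustrative assumptions, not the exact formulation of [6]:

```python
def spatially_discounted_reward(own_reward, neighbor_rewards, alpha=0.8):
    """Spatial discounting sketch: neighboring agents' rewards are
    down-weighted by their graph distance (hops), so each traffic
    light mostly optimizes its own intersection while still caring
    about nearby ones. `neighbor_rewards` is a list of
    (hops, reward) pairs."""
    total = own_reward
    for hops, r in neighbor_rewards:
        total += (alpha ** hops) * r
    return total
```

With `alpha = 0.5`, a reward one hop away counts half as much as the agent's own, and a reward two hops away a quarter as much.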

In essence, all approaches assume that neighboring agents are most relevant for controlling local traffic. Training MARL models requires large amounts of data, which makes the availability of accurate simulators crucial. SUMO [30] and CityFlow [62] provide city-scale simulations that can efficiently model many different agents and their interactions.

V-B Autonomous Vehicles

A different application directly controls multiple autonomous vehicles through MARL. Compared to independent single-agent RL, this improves the ability of agents to cooperate and achieve their individual objectives. In this domain, performance gains can come from coordinated driving (e.g., by reducing wind resistance and traffic jams and optimizing road usage [5]), from extending an agent's own perception with information from other agents (through vehicle-to-vehicle (V2V) / vehicle-to-X (V2X) communication [13]), or from efficient interactions in intersection navigation scenarios [57].

A large variety of simulation and benchmark environments is available for research on autonomous vehicles. Highway-env [22] allows controlling multiple vehicles and is particularly easy to set up and use. However, we found that performance can drop below acceptable levels for large-scale traffic simulations with many vehicles. BARK [2] is an open-source simulation framework for autonomous vehicle research that focuses on multi-agent scenarios with interactions between agents. BARK simulates various traffic scenarios and has a two-way interface to control cars hosted in CARLA, a widely used AV research simulator. Similarly, SMARTS [66] provides a simulation environment for diverse driving interactions, is compatible with the OpenAI gym interface, and offers multiple different observations.

This line of work presents unique challenges: autonomous vehicles must act in dynamic and unpredictable environments like urban or highway traffic, and safety constraints must be satisfied at all times. This limits the exploration that is crucial for RL. Post-hoc extraction of safe policies trained in a simulator [43] can be used to assure safety, but these methods have not yet been successfully applied to complex, multi-agent control settings. Moreover, most MARL methods assume identical or compatible algorithms across agents. This requires manufacturers to agree on a common architecture and communication standard to facilitate cross-brand cooperation of vehicles.

V-C Unmanned Aerial Vehicles

Swarms of UAVs are another popular application field. UAVs are especially beneficial for communication tasks [7] and for applications where human lives are endangered. Here, a (potentially large) network of UAVs needs to be controlled to fulfill tasks like forest fire surveillance [4], road traffic monitoring [9], and air quality monitoring [60]. Moreover, UAVs can be used to provide internet access in remote areas and to enable wireless communication for applications like navigation and control [36].

Multiple works use MARL to plan paths and assign targets for UAVs, a problem known as Multiple Target Assignment and Path Planning (MUTAPP). Here, MARL-based methods can outperform classical optimization techniques, like mixed-integer linear programming (MILP), because they are able to handle dynamic environments and can operate in a decentralized way [42, 17]. Hu et al. [17] propose a combination of DQN [33] and DDPG [27] that allows the agents to use a combined discrete-continuous action space. They deploy this algorithm in a decentralized way and report performance benefits over hardcoded algorithm implementations and simpler algorithms using only DQN or DDPG. Qie et al. [42] propose a solution based on MADDPG to jointly optimize target assignment and path planning. They introduce a task abstraction layer that combines these tasks into a common reward structure. Their algorithm can effectively minimize the number of conflicts that arise in the MUTAPP problem.

In general, UAVs present a challenging problem for almost all algorithms. Typical UAV environments are highly dynamic [42] and require constant collision avoidance between individual agents, and thus quick reaction times. In addition, UAVs have strict real-time constraints for their operation, which limits complex central path planning and restricts the possible architectures for MARL algorithms [42]. Another major challenge is energy supply and the limited range of UAVs. Jung et al. [19] apply a solution based on CommNet [49] to coordinate the assignment of UAVs to charging towers and to share energy between towers, optimizing power draw from the electrical grid and minimizing operational costs.
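The charging-coordination problem underlying [19] can be illustrated with a simple load-balancing baseline: assign the largest remaining energy demand to the least-loaded tower to keep the peak grid draw low. The greedy heuristic below is a classical non-learning sketch with made-up names and numbers; it does not reflect the CommNet-based method of Jung et al.

```python
# Toy greedy baseline for assigning UAVs to charging towers so that the
# peak power draw per tower stays low. Purely illustrative; the MARL
# solution in [19] additionally shares energy between towers.

def assign_chargers(uav_demands, num_towers):
    """Assign each UAV (by its energy demand) to the least-loaded tower."""
    loads = [0.0] * num_towers
    assignment = {}
    # Largest demands first: the classic LPT rule for load balancing.
    for uav, demand in sorted(uav_demands.items(), key=lambda kv: -kv[1]):
        tower = min(range(num_towers), key=lambda t: loads[t])
        loads[tower] += demand
        assignment[uav] = tower
    return assignment, loads

demands = {"uav0": 5.0, "uav1": 3.0, "uav2": 4.0, "uav3": 2.0}
assignment, loads = assign_chargers(demands, num_towers=2)
print(assignment, loads)
```

A learned, communicating policy can beat such a heuristic when demands arrive dynamically and the electricity price varies over time.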

Most research on UAVs uses custom, low-fidelity environments to simplify development [42, 19]. However, there are multiple high-fidelity simulators available that model UAV dynamics in more detail. AirSim [45] is a simulation platform based on the Unreal Engine with a special focus on realistic simulation and hardware-in-the-loop research, and it supports detailed rendered camera observations. In addition to UAVs, AirSim can also simulate cars. Gym PyBullet Drones [39] is a physics-based simulator for research on UAVs. In contrast to AirSim, PyBullet Drones features more realistic aerodynamic effects like the ground effect and downwash, and it is compatible with OpenAI Gym and RLlib.
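Independent of the concrete simulator, most multi-agent UAV environments expose a dict-keyed interface in the style of RLlib's MultiAgentEnv or PettingZoo's parallel API: one observation, action, and reward per agent ID. The skeleton below shows that interaction pattern with a made-up TinyDroneEnv; it is not the AirSim or Gym PyBullet Drones API.

```python
# Minimal multi-agent environment skeleton in the dict-keyed style used by
# RLlib and PettingZoo's parallel API. TinyDroneEnv is a made-up stand-in
# illustrating the interface only.

class TinyDroneEnv:
    def __init__(self, num_drones=2, horizon=3):
        self.agents = [f"drone_{i}" for i in range(num_drones)]
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.pos = {a: 0.0 for a in self.agents}
        return dict(self.pos)  # one observation per agent

    def step(self, actions):
        """actions: dict agent_id -> velocity command."""
        self.t += 1
        for a, v in actions.items():
            self.pos[a] += v
        obs = dict(self.pos)
        rewards = {a: -abs(1.0 - self.pos[a]) for a in self.agents}  # goal: x = 1
        done = self.t >= self.horizon
        return obs, rewards, done

env = TinyDroneEnv()
obs = env.reset()
done = False
while not done:
    # A trivial decentralized "policy": each drone moves toward x = 1.
    actions = {a: 0.5 if obs[a] < 1.0 else 0.0 for a in env.agents}
    obs, rewards, done = env.step(actions)
print(obs, rewards)
```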

V-D Resource Optimization

In another line of work, MARL is applied to resource optimization and scheduling problems for mobility scenarios. This includes scheduling for trains [34] or ambulances, but can also be applied to taxi repositioning [29], ride-sharing services [24], bike repositioning, or container inventory management [25]. Li et al. [24] propose to solve the assignment of ride requests to specific drivers with MARL, which improves order response rates.
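The order-dispatching problem addressed in [24] can be contrasted with a myopic baseline: greedily match each incoming request to the nearest free driver. The sketch below (1D positions, hypothetical names) only illustrates the matching problem that mean-field MARL improves upon; it is not the method of Li et al.

```python
# A naive nearest-driver dispatch baseline for ride-request assignment.
# All coordinates and names are illustrative.

def dispatch(requests, drivers):
    """Greedily match each request to the closest free driver (1D city)."""
    free = dict(drivers)  # driver_id -> position
    matches = {}
    for req_id, pickup in requests.items():
        if not free:
            break  # unserved request: hurts the order response rate
        driver = min(free, key=lambda d: abs(free[d] - pickup))
        matches[req_id] = driver
        del free[driver]
    return matches

requests = {"r0": 2.0, "r1": 9.0, "r2": 5.0}
drivers = {"d0": 1.0, "d1": 8.0}
print(dispatch(requests, drivers))
```

Such greedy matching ignores future demand; a learned dispatcher can leave a driver idle now to serve a more valuable request later, which is where MARL gains its edge.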

A popular simulation in the train scheduling and routing domain is Flatland, which has been used in several NeurIPS competitions [34]. Flatland provides a 2D gridworld of train tracks. Agents (i.e., trains) have to find policies that are safe, performant, and robust to unforeseen circumstances, since trains can break down. Observations can be global or local, and Flatland also allows custom observation spaces. The Multi-Agent Resource Optimization platform [25] provides simulations, an RL toolkit, and utilities for distributed computing in this domain. It supports diverse scenarios across multiple application domains, including the aforementioned container inventory management and bike repositioning tasks, and includes visualizations of these simulation environments.

VI Conclusion

In conclusion, we would like to note several challenges and open problems in MARL research for mobility scenarios. Chief among them is the lack of explicit safety and interpretability affordances in almost all current algorithms. This introduces real risk in autonomous mobility scenarios, especially when autonomous vehicles or UAVs are controlled. Safe training and verification methods that certify policy safety [43] must be developed before these methods can be deployed in the real world [21].

A second challenge is the integration with existing, manually controlled systems. This is evident in domains like traffic control or resource optimization, where human interaction and intervention (e.g., through emergency vehicles or manually prioritized resources) can come unexpectedly for agents that were trained only in simulation. Effectively exposing agents to such human interactions in simulation remains an unsolved problem due to the low sample efficiency of RL.

Finally, most simulation environments assume perfect, high-bandwidth, latency-free communication. Despite recent advances in communication standards like 5G [13, 48], this is not a valid assumption in real-world scenarios. It makes more centralized solutions, like CTDE, difficult or impossible to train in real-world settings. Recent methods that limit communication could be a possible solution to this problem.

Overall, however, MARL presents an exciting opportunity to learn solutions for complex problems involving multiple agents in an efficient, automated way.


This work was supported by the Bavarian Ministry for Economic Affairs, Infrastructure, Transport and Technology through the Center for Analytics-Data-Applications (ADA-Center) within the framework of “BAYERN DIGITAL II”. B.M.E. gratefully acknowledges support of the German Research Foundation (DFG) within the framework of the Heisenberg professorship program (Grant ES 434/8-1).


  • [1] T. Balch (1999) Reward and diversity in multirobot foraging. Work. Agents Learning About, From and With other Agents, pp. 92–99. Cited by: §III-6.
  • [2] J. Bernhard, K. Esterle, P. Hart, and T. Kessler (2020) BARK: open behavior benchmarking in multi-agent environments. In IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Las Vegas, NV, pp. 6201–6208. Cited by: §V-B, TABLE II.
  • [3] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, and S. Spanò (2021) Multi-agent reinforcement learning: a review of challenges and applications. Appl. Sciences 11 (11). External Links: ISSN 2076-3417, Document Cited by: §III-4.
  • [4] D. W. Casbeer, D. B. Kingston, R. W. Beard, and T. W. McLain (2006) Cooperative forest fire surveillance using a team of small unmanned air vehicles. Int. J. Syst. Sci. 37 (6), pp. 351–360. External Links: Document Cited by: §V-C.
  • [5] T. Chu, S. Chinchali, and S. Katti (2020) Multi-agent reinforcement learning for networked system control. arXiv preprint 2004.01339. Cited by: §IV-C, TABLE I, §IV, §V-B.
  • [6] T. Chu, J. Wang, L. Codeca, and Z. Li (2020-03) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transportation Systems 21 (3), pp. 1086–1095. External Links: Document, 1903.04527, ISSN 15580016 Cited by: §IV-A, TABLE I, §V-A.
  • [7] J. Cui, Y. Liu, and A. Nallanathan (2019-05) The application of multi-agent reinforcement learning in UAV networks. In IEEE Int. Conf. on Communications Workshops, Shanghai, China. External Links: Document Cited by: §V-C.
  • [8] C. S. de Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. S. Torr, M. Sun, and S. Whiteson (2020) Is independent learning all you need in the starcraft multi-agent challenge?. arXiv preprint arXiv:2011.09533. External Links: 2011.09533 Cited by: §IV-A, TABLE I.
  • [9] M. Elloumi, R. Dhaou, B. Escrig, H. Idoudi, and L. A. Saïdane (2018) Monitoring road traffic with a uav-based system. In IEEE Wireless Communications and Networking Conf., Cited by: §V-C.
  • [10] J. N. Foerster, Y. M. Assael, N. De Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In NeurIPS, pp. 2145–2153. External Links: 1605.06676, ISSN 10495258 Cited by: §III-1, §IV-C, TABLE I.
  • [11] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In AAAI Conf. on Artificial Intelligence, pp. 2974–2982. External Links: 1705.08926, ISBN 9781577358008 Cited by: §IV-B, TABLE I.
  • [12] J. Foerster, S. Whiteson, T. Rashid, M. Samvelyan, C. Schroeder De Witt, and G. Farquhar (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485v1. External Links: 1803.11485v1, ISBN 1803.11485v1 Cited by: §IV-B, TABLE I.
  • [13] X. Ge, Z. Li, and S. Li (2017) 5G software defined vehicular networks. arXiv preprint arXiv:1702.03675. External Links: 1702.03675 Cited by: §V-B, §VI.
  • [14] H. Gu, X. Guo, X. Wei, and R. Xu (2021) Mean-field multi-agent reinforcement learning: a decentralized network approach. arXiv preprint arXiv:2108.02731. External Links: Document, 2108.02731 Cited by: §IV-C, TABLE I.
  • [15] J. Hao, D. Huang, Y. Cai, and H. fung Leung (2017) The dynamics of reinforcement social learning in networked cooperative multiagent systems. Eng. Applicat. of Artificial Intellig. 58, pp. 111–122. External Links: Document, ISSN 09521976 Cited by: §IV-A.
  • [16] P. Hernandez-Leal, B. Kartal, and M. E. Taylor (2019-11) A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33 (6), pp. 750–797. External Links: Document, 1810.05587, ISSN 15737454 Cited by: §III-6.
  • [17] J. Hu, H. Zhang, L. Song, R. Schober, and H. V. Poor (2020) Cooperative internet of uavs: distributed trajectory design by multi-agent deep reinforcement learning. IEEE Trans. Commun. 68 (11), pp. 6807–6821. External Links: Document Cited by: §V-C.
  • [18] J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In NeurIPS, pp. 7254–7264. Cited by: §IV-C, TABLE I.
  • [19] S. Jung, W. J. Yun, J. Kim, J. Kim, and F. Falcone (2021) Coordinated multi-agent deep reinforcement learning for energy-aware UAV-based big-data platforms. External Links: Document Cited by: §V-C, §V-C.
  • [20] E. Kargar and V. Kyrki (2021) MACRPO: multi-agent cooperative recurrent policy optimization. arXiv preprint 2109.00882. Cited by: §IV-D, TABLE I.
  • [21] G. Kontes, D. Scherer, T. Nisslbeck, J. Fischer, and C. Mutschler (2020) High-speed collision avoidance using deep reinforcement learning and domain randomization for autonomous vehicles. In IEEE Int. Conf. on Intelligent Transportation Systems (ITSC), Cited by: §VI.
  • [22] E. Leurent (2018) An environment for autonomous driving decision-making. GitHub. Cited by: §V-B, TABLE II.
  • [23] J. Li, K. Kuang, B. Wang, F. Liu, L. Chen, F. Wu, and J. Xiao (2021) Shapley counterfactual credits for multi-agent reinforcement learning. In ACM Intl. Conf. Knowl. Disc. and Data Min., pp. 934–942. External Links: Document, 2106.00285, ISBN 9781450383325 Cited by: §IV-B, TABLE I.
  • [24] M. Li, Z. (. Qin, Y. Jiao, Y. Yang, J. Wang, C. Wang, G. Wu, and J. Ye (2019) Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In World Wide Web Conference, San Francisco, CA, pp. 983–994. Cited by: §V-D.
  • [25] X. Li, J. Zhang, J. Bian, Y. Tong, and T. Liu (2019-03) A cooperative multi-agent reinforcement learning framework for resource balancing in complex logistics network. In AAMAS 2019, Cited by: §V-D, §V-D, TABLE II.
  • [26] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica (2018) RLlib: abstractions for distributed reinforcement learning. In ICML, pp. 3053–3062. Cited by: §IV.
  • [27] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th Intl. Conf. Learning Representations, Cited by: §II, §V-C.
  • [28] M. L. Littman (2001) Value-function reinforcement learning in Markov games. Cognitive Systems Research 2 (1), pp. 55–66. External Links: Document, ISSN 13890417 Cited by: §III-4.
  • [29] C. Liu, C. X. Chen, and C. Chen (2021) META: a city-wide taxi repositioning framework based on multi-agent reinforcement learning. IEEE Trans. on Intelligent Transportation Systems. External Links: Document, ISSN 15580016 Cited by: §V-D.
  • [30] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner (2018) Microscopic traffic simulation using sumo. In The 21st IEEE International Conference on Intelligent Transportation Systems, Cited by: §III-5, §V-A, TABLE II.
  • [31] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6380–6391. External Links: 1706.02275, ISSN 10495258 Cited by: §IV-D, TABLE I.
  • [32] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson (2019) MAVEN: multi-agent variational exploration. In NeurIPS, Cited by: §IV-B, TABLE I.
  • [33] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §II, §V-C.
  • [34] S. Mohanty, E. Nygren, F. Laurent, M. Schneider, C. Scheller, N. Bhattacharya, J. Watson, A. Egli, C. Eichenberger, C. Baumberger, G. Vienken, I. Sturm, G. Sartoretti, and G. Spigler (2020) Flatland-RL: multi-agent reinforcement learning on trains. arXiv preprint arXiv:2012.05893. External Links: 2012.05893 Cited by: §V-D, §V-D, TABLE II.
  • [35] I. Mordatch and P. Abbeel (2017) Emergence of grounded compositional language in multi-agent populations. arXiv:1703.04908. Cited by: §I, TABLE II.
  • [36] M. Mozaffari, W. Saad, M. Bennis, Y. Nam, and M. Debbah (2019) A tutorial on uavs for wireless networks: applications, challenges, and open problems. IEEE Commun. Surv. Tutorials 21 (3), pp. 2334–2360. External Links: Document Cited by: §V-C.
  • [37] A. Oroojlooyjadid and D. Hajinezhad (2019) A review of cooperative multi-agent deep reinforcement learning. arXiv preprint 1908.03963. Cited by: §I.
  • [38] P. Palanisamy (2020) Multi-agent connected autonomous driving using deep reinforcement learning. In Int. Joint Conf. on Neural Networks, External Links: Document, 1911.04175, ISBN 9781728169262 Cited by: TABLE II.
  • [39] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig (2021) Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Cited by: §V-C, TABLE II.
  • [40] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht (2020) Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. arXiv preprint arXiv:2006.07869. External Links: 2006.07869, ISSN 2331-8422 Cited by: §IV.
  • [41] A. Pretorius, K. Tessera, A. P. Smit, C. Formanek, S. J. Grimbly, K. Eloff, S. Danisa, L. Francis, J. Shock, H. Kamper, W. Brink, H. Engelbrecht, A. Laterre, and K. Beguir (2021) Mava: a research framework for distributed multi-agent reinforcement learning. External Links: 2107.01460 Cited by: §IV.
  • [42] H. Qie, D. Shi, T. Shen, X. Xu, Y. Li, and L. Wang (2019) Joint optimization of multi-uav target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 7, pp. 146264–146272. External Links: Document Cited by: §V-C, §V-C, §V-C.
  • [43] L. M. Schmidt, G. D. Kontes, A. Plinge, and C. Mutschler (2021-07) Can you trust your autonomous car? interpretable and verifiably safe reinforcement learning. In IEEE Intelligent Vehicles Symposium, Nagoya, Japan, pp. 171–178. Cited by: §V-B, §VI.
  • [44] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347. Cited by: §II.
  • [45] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2017) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In 11th Intl. Conf. Field and Serv. Robot., Zurich, Switzerland, pp. 621–635. Cited by: §V-C, TABLE II.
  • [46] L. Shapley (1953) A value for n-person games. Ann. Math. Studies 28, Contributions to the Theory of Games, pp. 307–317. Cited by: §IV-B.
  • [47] A. Singh, T. Jain, and S. Sukhbaatar (2019) Learning when to communicate at scale in multiagent cooperative and competitive tasks. In Int. Conf. on Learning Representations, External Links: 1812.09755 Cited by: §IV-C, TABLE I.
  • [48] M. Stahlke, T. Feigl, M. H. Castaneda Garcia, R. S. Gallacher, J. Seitz, and C. Mutschler (2022) Transfer learning to adapt 5g fingerprint-based localization across environments. In 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring), Cited by: §VI.
  • [49] S. Sukhbaatar, A. Szlam, and R. Fergus (2016) Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2252–2260. External Links: 1605.07736, ISSN 10495258 Cited by: §IV-C, §IV-C, TABLE I, §V-C.
  • [50] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Intl. Jnt. Conf. Autonomous Agents and Multiagent Systems, pp. 2085–2087. External Links: 1706.05296v1, ISBN 9781510868083, ISSN 15582914 Cited by: §IV-B, TABLE I.
  • [51] R. S. Sutton and A. G. Barto (1998) Reinforcement learning - an introduction. Adaptive Computation and Machine Learning, MIT Press. External Links: ISBN 978-0-262-19398-6 Cited by: §II.
  • [52] M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. Int. Conf. on Machine Learning, pp. 330–337. Cited by: §IV-A, TABLE I.
  • [53] J. Wang, Y. Zhang, T. K. Kim, and Y. Gu (2020) Shapley Q-value: a local reward approach to solve global reward games. In AAAI Conf. on Artificial Intelligence, pp. 7285–7292. External Links: Document, 1907.05707, ISBN 9781577358350, ISSN 2159-5399 Cited by: §III-6.
  • [54] M. Wang, L. Wu, J. Li, and L. He (2021) Traffic signal control with reinforcement learning based on region-aware cooperative strategy. IEEE Transactions on Intelligent Transportation Systems. Cited by: §V-A.
  • [55] Z. Wang, H. Zhu, M. He, Y. Zhou, X. Luo, and N. Zhang (2021) GAN and multi-agent DRL based decentralized traffic light signal control. IEEE Trans. Veh. Technol.. External Links: Document Cited by: §V-A.
  • [56] D. H. Wolpert and K. Tumer (2001) Optimal payoff functions for members of collectives. Adv. in Compl. Sys. 4 (2/3), pp. 265–279. External Links: Document, ISSN 0219-5259 Cited by: §III-6, §IV-B.
  • [57] Y. Wu, H. Chen, and F. Zhu (2019) DCL-aim: decentralized coordination learning of autonomous intersection management for connected and automated vehicles. Transportation Research Part C: Emerging Technologies 103, pp. 246–260. Cited by: §V-B.
  • [58] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018) Mean field multi-agent reinforcement learning. In Int. Conf. on Machine Learning, Vol. 12, pp. 5571–5580. External Links: 1802.05438, ISBN 9781510867963 Cited by: §IV-C, TABLE I.
  • [59] Y. Yang, Y. Wen, J. Wang, L. Chen, K. Shao, D. Mguni, and W. Zhang (2020) Multi-agent determinantal q-learning. In Int. Conf. on Machine Learning, pp. 10757–10766. Cited by: §IV-B, TABLE I.
  • [60] Y. Yang, Z. Zheng, K. Bian, L. Song, and Z. Han (2018) Real-time profiling of fine-grained air quality index distribution using UAV sensing. IEEE Internet Things J. 5 (1), pp. 186–198. External Links: Document Cited by: §V-C.
  • [61] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu (2021) The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955. External Links: 2103.01955 Cited by: §IV-D, TABLE I.
  • [62] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li (2019) CityFlow: A multi-agent reinforcement learning environment for large scale city traffic scenario. In World Wide Web Conf., San Francisco, CA, pp. 3620–3624. Cited by: §III-5, §V-A, §V-A, TABLE II.
  • [63] K. Zhang, Z. Yang, and T. Başar (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In Studies in Systems, Decision and Control, Vol. 325, pp. 321–384. External Links: Document, 1911.10635, ISSN 21984190 Cited by: §III-4.
  • [64] L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu (2018) MAgent: a many-agent reinforcement learning platform for artificial collective intelligence. In AAAI Conf. Artificial Intellig., pp. 8222–8223. External Links: 1712.00600, ISBN 9781577358008 Cited by: TABLE II.
  • [65] M. Zhou, Z. Liu, P. Sui, Y. Li, and Y. Y. Chung (2020-12) Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, External Links: 2007.02529, ISSN 10495258 Cited by: §IV-B.
  • [66] M. Zhou, J. Luo, J. Villella, Y. Yang, D. Rusu, J. Miao, W. Zhang, M. Alban, I. Fadakar, Z. Chen, A. C. Huang, Y. Wen, K. Hassanzadeh, D. Graves, D. Chen, Z. Zhu, N. Nguyen, M. Elsayed, K. Shao, S. Ahilan, B. Zhang, J. Wu, Z. Fu, K. Rezaee, P. Yadmellat, M. Rohani, N. P. Nieves, Y. Ni, S. Banijamali, A. C. Rivers, Z. Tian, D. Palenicek, H. bou Ammar, H. Zhang, W. Liu, J. Hao, and J. Wang (2020-11) SMARTS: scalable multi-agent reinforcement learning training school for autonomous driving. In Conf. on Robot Learning, Cited by: §V-B, TABLE II.
  • [67] M. Zhou, Z. Wan, H. Wang, M. Wen, R. Wu, Y. Wen, Y. Yang, W. Zhang, and J. Wang (2021) MALib: a parallel framework for population-based multi-agent reinforcement learning. External Links: 2106.07551 Cited by: §IV.