Off-Policy Correction For Multi-Agent Reinforcement Learning

11/22/2021
by Michał Zawalski, et al.

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite its apparent similarity to the single-agent case, multi-agent problems are often harder to train and to analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm that extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace uses importance sampling as an off-policy correction, which allows the computation to be distributed with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all of its tasks and exceeds state-of-the-art results on some of them.
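The off-policy correction referred to in the abstract is the truncated importance sampling used by V-Trace: behavior-policy trajectories are reweighted by clipped ratios between the learner's policy and the worker's (possibly stale) policy. The sketch below is a minimal standalone illustration of V-Trace value targets, not the authors' MA-Trace implementation; the function name, argument layout, and default clipping thresholds are assumptions.

```python
import math

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-Trace value targets for a single trajectory of length T.

    rewards, values, log_rhos: lists of length T, where
      log_rhos[t] = log(pi(a_t | x_t) / mu(a_t | x_t)) is the log
      importance ratio between the learner policy pi and behavior policy mu.
    bootstrap_value: V(x_T), the critic's estimate at the final state.
    """
    T = len(rewards)
    # Truncated importance weights: rho controls the fixed point,
    # c controls how far corrections propagate along the trace.
    rhos = [min(rho_bar, math.exp(lr)) for lr in log_rhos]
    cs = [min(c_bar, math.exp(lr)) for lr in log_rhos]

    targets = [0.0] * T
    next_target = bootstrap_value  # v_{t+1}
    next_value = bootstrap_value   # V(x_{t+1})
    # Backward recursion: v_t = V(x_t) + delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * next_value - values[t])
        targets[t] = values[t] + delta + gamma * cs[t] * (next_target - next_value)
        next_target = targets[t]
        next_value = values[t]
    return targets
```

When the learner and behavior policies coincide (all log-ratios are zero), the clipped weights equal 1 and the targets reduce to ordinary n-step returns, which is why distributing the workers need not degrade training quality.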

Related research

- A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning (03/15/2019): This paper extends off-policy reinforcement learning to the multi-agent ...
- Traffic Signal Control with Communicative Deep Reinforcement Learning Agents: a Case Study (07/03/2021): In this work we analyze Multi-Agent Advantage Actor-Critic (MA2C), a rece...
- Divergence-Regularized Multi-Agent Actor-Critic (10/01/2021): Entropy regularization is a popular method in reinforcement learning (RL...
- Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method (10/31/2021): We discuss the problem of decentralized multi-agent reinforcement learni...
- IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (02/05/2018): In this work we aim to solve a large collection of tasks using a single ...
- Coordinated Proximal Policy Optimization (11/07/2021): We present Coordinated Proximal Policy Optimization (CoPPO), an algorith...
- SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning (11/11/2019): Learning a stable and generalizable centralized value function (CVF) is ...
