Consolidation via Policy Information Regularization in Deep RL for Multi-Agent Games

by   Tyler Malloy, et al.

This paper introduces an information-theoretic constraint on learned policy complexity in the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) reinforcement learning algorithm. Previous research with a related approach in continuous control experiments suggests that this method favors learning policies that are more robust to changing environment dynamics. The multi-agent game setting naturally requires this type of robustness, as other agents' policies change throughout learning, introducing a nonstationary environment. For this reason, recent methods in continual learning are compared to our approach, termed Capacity-Limited MADDPG. Results from experimentation in multi-agent cooperative and competitive tasks demonstrate that the capacity-limited approach is a good candidate for improving learning performance in these environments.


page 7

page 8

page 9


MAGNet: Multi-agent Graph Network for Deep Multi-agent Reinforcement Learning

Over recent years, deep reinforcement learning has shown strong successe...

Smart Containers With Bidding Capacity: A Policy Gradient Algorithm for Semi-Cooperative Learning

Smart modular freight containers – as propagated in the Physical Interne...

Deep reinforcement learning of event-triggered communication and control for multi-agent cooperative transport

In this paper, we explore a multi-agent reinforcement learning approach ...

UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

Recent advances in multi-agent reinforcement learning have been largely ...

APIA: An Architecture for Policy-Aware Intentional Agents

This paper introduces the APIA architecture for policy-aware intentional...

Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

In pursuit of enhanced multi-agent collaboration, we analyze several on-...

Multi-Agent Reinforcement Learning for Persistent Monitoring

The Persistent Monitoring (PM) problem seeks to find a set of trajectori...