Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO

01/31/2022
by   Mingfei Sun, et al.
0

We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, i.e., Independent Proximal Policy Optimization (IPPO) and Multi-Agent PPO (MAPPO), which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Moreover, we show that the surrogate objectives optimized in IPPO and MAPPO are essentially equivalent when their critics converge to a fixed point. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and the good values of the hyperparameters for this enforcement are highly sensitive to the number of agents, as predicted by our theoretical analysis.

READ FULL TEXT

page 8

page 16

page 18

research
11/06/2022

Decentralized Policy Optimization

The study of decentralized learning or independent learning in cooperati...
research
02/15/2023

Trust-Region-Free Policy Optimization for Stochastic Policies

Trust Region Policy Optimization (TRPO) is an iterative method that simu...
research
12/03/2021

An Analytical Update Rule for General Policy Optimization

We present an analytical policy update rule that is independent of param...
research
02/16/2023

Model-Based Decentralized Policy Optimization

Decentralized policy optimization has been commonly used in cooperative ...
research
02/21/2021

Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition

Non-stationarity is one thorny issue in multi-agent reinforcement learni...
research
01/31/2022

You May Not Need Ratio Clipping in PPO

Proximal Policy Optimization (PPO) methods learn a policy by iteratively...
research
11/04/2021

Towards an Understanding of Default Policies in Multitask Policy Optimization

Much of the recent success of deep reinforcement learning has been drive...

Please sign up or login with your details

Forgot password? Click here to reset