Competitive Policy Optimization

06/18/2020
by Manish Prajapat, et al.

A core challenge in policy optimization for competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods use only linear approximations, and hence do not capture interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We study these methods theoretically, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
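To illustrate why a bilinear approximation can matter, the following is a minimal sketch on a toy two-player zero-sum game, f(x, y) = x * y, where x minimizes and y maximizes. It is not the CoPO algorithm from the paper. It only compares a plain simultaneous gradient update (linear approximation) with a competitive-gradient-style update that also uses the mixed derivative between the players. The function names, step size, starting point, and iteration count are all assumptions chosen for illustration.

```python
import math


def gda_step(x, y, lr):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y.

    Each player follows only its own gradient (a linear approximation),
    ignoring how the opponent's move changes the objective.
    """
    grad_x, grad_y = y, x  # df/dx = y, df/dy = x
    return x - lr * grad_x, y + lr * grad_y


def cgd_step(x, y, lr):
    """Competitive-gradient-style step for the scalar game f(x, y) = x * y.

    Assumed update (scalar zero-sum case), using the mixed derivative d_xy:
      dx = -lr * (grad_x + lr * d_xy * grad_y) / (1 + lr^2 * d_xy^2)
      dy = +lr * (grad_y - lr * d_xy * grad_x) / (1 + lr^2 * d_xy^2)
    """
    grad_x, grad_y = y, x
    d_xy = 1.0  # mixed second derivative of x * y
    denom = 1.0 + (lr * d_xy) ** 2
    dx = -lr * (grad_x + lr * d_xy * grad_y) / denom
    dy = lr * (grad_y - lr * d_xy * grad_x) / denom
    return x + dx, y + dy


def run(step, x0=1.0, y0=1.0, lr=0.2, iters=100):
    """Iterate a step rule and return the final distance to the equilibrium (0, 0)."""
    x, y = x0, y0
    for _ in range(iters):
        x, y = step(x, y, lr)
    return math.hypot(x, y)


start_dist = math.hypot(1.0, 1.0)
gda_dist = run(gda_step)
cgd_dist = run(cgd_step)
print(f"start: {start_dist:.3f}  GDA: {gda_dist:.3f}  competitive: {cgd_dist:.3f}")
```

On this toy game the plain gradient iterates spiral away from the equilibrium at the origin, while the update that accounts for the interaction term contracts toward it. This is only an intuition for the role of the bilinear term; the paper's policy-gradient setting involves stochastic policies and sampled trajectories that this sketch does not model.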

research
10/18/2018

Trust Region Policy Optimization of POMDPs

We propose Generalized Trust Region Policy Optimization (GTRPO), a Reinf...
research
12/28/2022

On the Convergence of Discounted Policy Gradient Methods

Many popular policy gradient methods for reinforcement learning follow a...
research
04/17/2019

Off-Policy Policy Gradient with State Distribution Correction

We study the problem of off-policy policy optimization in Markov decisio...
research
11/06/2018

Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?

We study how the behavior of deep policy gradient algorithms reflects th...
research
06/14/2022

How are policy gradient methods affected by the limits of control?

We study stochastic policy gradient methods from the perspective of cont...
research
06/01/2019

Neural Replicator Dynamics

In multiagent learning, agents interact in inherently nonstationary envi...
research
06/10/2020

Robust Detection of Adaptive Spammers by Nash Reinforcement Learning

Online reviews provide product evaluations for customers to make decisio...
