Independent Policy Gradient Methods for Competitive Reinforcement Learning

by Constantinos Daskalakis, et al.

We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.
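The two-timescale rule can be illustrated on the simplest special case, a one-state zero-sum game (a matrix game). The sketch below is an illustrative toy under assumptions of my own, not the paper's algorithm or sample-based setting: both players take projected gradient steps on the bilinear payoff x^T A y of matching pennies, with the max player's step size much smaller than the min player's. The slow player's mixed strategy settles near the min-max equilibrium (1/2, 1/2), while the fast player may keep oscillating.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (standard sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ind = np.arange(1, len(v) + 1)
    cond = u - css / ind > 0
    theta = css[cond][-1] / ind[cond][-1]
    return np.maximum(v - theta, 0.0)

# Matching pennies: the max player receives x^T A y, the min player receives -x^T A y.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
x = np.array([0.7, 0.3])  # max player's mixed strategy
y = np.array([0.3, 0.7])  # min player's mixed strategy

# Two-timescale rule: the slow player's step size is much smaller than the fast player's.
eta_x, eta_y = 1e-3, 0.5
for _ in range(20_000):
    gx, gy = A @ y, A.T @ x           # payoff gradients w.r.t. x and w.r.t. y
    x = proj_simplex(x + eta_x * gx)  # slow ascent step (max player)
    y = proj_simplex(y - eta_y * gy)  # fast descent step (min player)

print(x)  # close to the unique equilibrium [0.5, 0.5]
```

Intuitively, the fast player approximately best-responds to a slowly moving opponent, so the slow player effectively runs gradient ascent on its worst-case payoff; with equal step sizes the same dynamics cycle around the equilibrium instead of approaching it.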



Multi-Agent Reinforcement Learning in Cournot Games

In this work, we study the interaction of strategic agents in continuous...

Policy Gradient Methods Find the Nash Equilibrium in N-player General-sum Linear-quadratic Games

We consider a general-sum N-player linear-quadratic game with stochastic...

Block Policy Mirror Descent

In this paper, we present a new class of policy gradient (PG) methods, n...

Competitive Policy Optimization

A core challenge in policy optimization in competitive Markov decision p...

Convergence of Q-value in case of Gaussian rewards

In this paper, as a study of reinforcement learning, we converge the Q f...

Temporal Induced Self-Play for Stochastic Bayesian Games

One practical requirement in solving dynamic games is to ensure that the...