Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

06/03/2022
by   Wenhao Zhan, et al.

We study decentralized policy learning in Markov games where we control a single agent playing against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions induced by the varying opponent pose a significant challenge. In light of a recent hardness result <cit.>, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves √(K)-regret in the context of general function approximation, where K is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
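To give a feel for the hyperpolicy update, here is a minimal sketch in Python. It assumes a finite base policy set, so the mirror-descent step with entropic regularization reduces to a multiplicative-weights (Hedge) update; the function name `hedge_update` and the toy value estimates are hypothetical, and in DORIS itself the values come from an optimistic variant of least-squares policy evaluation rather than being given directly.

```python
import math

def hedge_update(weights, value_estimates, eta):
    """One entropic mirror-descent (Hedge) step on a hyperpolicy.

    weights: current distribution over a finite set of base policies.
    value_estimates: per-policy (optimistic) value estimates for this episode.
    eta: mirror-descent step size.
    Returns the updated, renormalized hyperpolicy.
    """
    # Exponentiated-gradient step: upweight policies with higher estimated value.
    unnormalized = [w * math.exp(eta * v) for w, v in zip(weights, value_estimates)]
    z = sum(unnormalized)
    return [w / z for w in unnormalized]

# Toy usage: 3 base policies, uniform initial hyperpolicy,
# hypothetical value estimates for one episode.
hyperpolicy = [1 / 3, 1 / 3, 1 / 3]
values = [0.2, 0.5, 0.1]
hyperpolicy = hedge_update(hyperpolicy, values, eta=1.0)
```

Sampling a base policy from the updated hyperpolicy each episode, with values re-estimated against the opponent's revealed policies, is the decentralized loop the abstract describes; the √(K)-regret guarantee extends this idea to general function approximation.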


