A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games

03/17/2023
by   Anna Winnicki, et al.

Many model-based reinforcement learning (RL) algorithms can be viewed as having two phases that are iteratively implemented: a learning phase, where the model is approximately learned, and a planning phase, where the learned model is used to derive a policy. In the case of standard MDPs, the planning problem can be solved using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; e.g., it has been shown in Hansen et al. (2013) that one has to solve Omega(1/(1-alpha)) MDPs, where alpha is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions, and prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and converges exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is often used in RL for games anyway. We further show that lookahead can be implemented efficiently in linear Markov games, the counterpart of linear MDPs, which have received much attention recently. We then consider multi-agent reinforcement learning that uses our algorithm in the planning phase, and provide sample and time complexity bounds for such an algorithm.
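To make the lookahead idea concrete, here is a minimal illustrative sketch (not the paper's exact algorithm or analysis): each iteration applies the Shapley (value) backup H times to the current value estimate before extracting greedy policies. For simplicity the sketch assumes each stage game has a pure saddle point, so the max-min over pure actions suffices and no linear program is needed; all names, shapes, and the toy random game below are assumptions for illustration.

```python
import numpy as np

def shapley_backup(V, r, P, gamma):
    """One application of the (max-min) Shapley operator.

    r[s, a, b]    : reward to the maximizing player
    P[s, a, b, t] : probability of moving to state t
    Assumes each stage game has a pure saddle point, so
    max_a min_b over pure actions gives the stage-game value.
    """
    # Q[s, a, b] = r[s, a, b] + gamma * E[V(next state)]
    Q = r + gamma * np.einsum('sabt,t->sab', P, V)
    return Q.min(axis=2).max(axis=1), Q

def lookahead_naive_pi(r, P, gamma, H=5, iters=100):
    """Naive policy iteration with an H-step lookahead improvement."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        W = V
        for _ in range(H):                      # H-step lookahead
            W, Q = shapley_backup(W, r, P, gamma)
        pi_max = Q.min(axis=2).argmax(axis=1)   # maximizer's greedy pure policy
        pi_min = Q.max(axis=1).argmin(axis=1)   # minimizer's greedy pure policy
        V = W
    return V, pi_max, pi_min

# Toy random game (hypothetical data, only to exercise the code)
rng = np.random.default_rng(0)
S, A, B, gamma = 3, 2, 2, 0.9
r = rng.standard_normal((S, A, B))
P = rng.random((S, A, B, S))
P /= P.sum(axis=-1, keepdims=True)              # normalize transitions

V, pi_max, pi_min = lookahead_naive_pi(r, P, gamma)

# The returned V should be (numerically) a fixed point of the backup.
TV, _ = shapley_backup(V, r, P, gamma)
print(np.max(np.abs(TV - V)) < 1e-6)
```

In the general case, where stage games have only mixed equilibria, the inner max-min would be replaced by solving a small linear program per state; the lookahead structure is unchanged.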


