Gap-Dependent Bounds for Two-Player Markov Games

07/01/2021
by   Zehao Dou, et al.

As one of the most popular methods in reinforcement learning, Q-learning has received increasing attention, and recent theoretical work has established regret bounds for algorithms in the Q-learning class across a variety of settings. In this paper, we analyze the cumulative regret of the Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG) and establish the first gap-dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound up to logarithmic factors. We further extend the result to the infinite-horizon discounted game setting and prove a similar gap-dependent logarithmic regret bound. In addition, under the linear MDP assumption, we obtain another logarithmic regret bound for 2-TBSG, in both the centralized and independent settings.
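To make the setting concrete, here is a minimal illustrative sketch of tabular Q-learning on a 2-player turn-based zero-sum game, where each state is owned by exactly one player: the owner of player 0's states maximizes the return and the owner of player 1's states minimizes it. This is a simplified epsilon-greedy variant for intuition only, not the paper's Nash Q-learning algorithm or its exploration-bonus machinery; all names and the toy game below are assumptions.

```python
import numpy as np

def turn_based_q_learning(P, R, owner, gamma=0.9, episodes=2000,
                          horizon=20, eps=0.1, alpha=0.1, seed=0):
    """Illustrative tabular Q-learning for a turn-based zero-sum game.

    P: transition probabilities, shape (S, A, S).
    R: rewards for the maximizing player, shape (S, A).
    owner: owner[s] in {0, 1}; player 0 maximizes, player 1 minimizes.
    (Simplified sketch, not the paper's Nash Q-learning algorithm.)
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(episodes):
        s = int(rng.integers(S))
        for _ in range(horizon):
            # Epsilon-greedy: the state's owner picks its best response.
            if rng.random() < eps:
                a = int(rng.integers(A))
            elif owner[s] == 0:
                a = int(Q[s].argmax())   # maximizing player
            else:
                a = int(Q[s].argmin())   # minimizing player
            s_next = int(rng.choice(S, p=P[s, a]))
            # Bootstrap with the next state's owner acting optimally.
            v_next = Q[s_next].max() if owner[s_next] == 0 else Q[s_next].min()
            Q[s, a] += alpha * (R[s, a] + gamma * v_next - Q[s, a])
            s = s_next
    return Q
```

A usage example on a tiny random game: `Q = turn_based_q_learning(P, R, owner)` returns a table whose greedy policy alternates between maximizing and minimizing moves according to state ownership; the gap-dependent analysis in the paper concerns how often such a learner plays actions whose Q-value falls short of the optimal one by at least the suboptimality gap.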


Related research

- Q-learning with Logarithmic Regret (06/16/2020): This paper presents the first non-asymptotic result showing that a model...
- Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation (02/15/2021): We study reinforcement learning for two-player zero-sum Markov games wit...
- Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs (05/09/2019): This paper establishes that optimistic algorithms attain gap-dependent a...
- Learning Zero-sum Stochastic Games with Posterior Sampling (09/08/2021): In this paper, we propose Posterior Sampling Reinforcement Learning for ...
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches (03/03/2022): In this paper, we study the problem of regret minimization for episodic ...
- Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning (03/07/2022): In this paper, we study gap-dependent regret guarantees for risk-sensiti...
- Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP (07/22/2022): We present regret minimization algorithms for stochastic contextual MDPs...
