Stochastic matrix games with bandit feedback

06/09/2020
by   Brendan O'Donoghue, et al.
0

We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best response to the learner's mixed strategy. Along the way, we show that Thompson fails catastrophically in this setting and provide empirical comparison to existing algorithms.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/17/2019

Competing Against Equilibria in Zero-Sum Games with Evolving Payoffs

We study the problem of repeated play in a zero-sum game in which the pa...
research
05/15/2022

Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games

Imperfect-Information Extensive-Form Games (IIEFGs) is a prevalent model...
research
11/16/2022

Dueling Bandits: From Two-dueling to Multi-dueling

We study a general multi-dueling bandit problem, where an agent compares...
research
07/28/2021

Efficient Episodic Learning of Nonstationary and Unknown Zero-Sum Games Using Expert Game Ensembles

Game theory provides essential analysis in many applications of strategi...
research
10/03/2018

Bandit learning in concave N-person games

This paper examines the long-run behavior of learning with bandit feedba...
research
05/14/2022

No-regret learning for repeated non-cooperative games with lossy bandits

This paper considers no-regret learning for repeated continuous-kernel g...
research
06/13/2022

No-Regret Learning in Games with Noisy Feedback: Faster Rates and Adaptivity via Learning Rate Separation

We examine the problem of regret minimization when the learner is involv...

Please sign up or login with your details

Forgot password? Click here to reset