No-Regret Learning in Unknown Games with Correlated Payoffs

09/18/2019
by Pier Giuseppe Sessa, et al.

We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single-player online learning algorithms attain strong regret bounds when provided with full-information feedback, which is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing the outcome only for the selected action, yields substantially worse performance. In this paper, we consider a natural model in which, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose GP-MW, a novel confidence-bound-based bandit algorithm that uses the GP model of the reward function and runs a multiplicative weights (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full-information setting, while substantially improving upon the existing bandit results. We experimentally demonstrate the effectiveness of GP-MW in random matrix games, as well as in real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, and its performance is often comparable to that of methods with access to full-information feedback.
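The abstract names the ingredients of GP-MW (a GP model of the reward, a confidence bound, and a multiplicative weights update) without spelling out the loop. The following is a minimal sketch of how these pieces might fit together, not the authors' implementation. Several choices are assumptions made only for illustration: the squared-exponential kernel over raw action indices, the values of `eta`, `beta`, and `noise_std`, rewards scaled to [0, 1], and the helper names `rbf_kernel`, `gp_posterior`, `mw_update`, and `run_gp_mw_sketch`.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel between rows of X and rows of Y (assumed kernel)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, noise_var=0.01, lengthscale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at X_query."""
    if len(X_obs) == 0:
        # No data yet: fall back to the prior (mean 0, unit variance).
        return np.zeros(len(X_query)), np.ones(len(X_query))
    K = rbf_kernel(X_obs, X_obs, lengthscale) + noise_var * np.eye(len(X_obs))
    K_s = rbf_kernel(X_obs, X_query, lengthscale)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def mw_update(weights, optimistic_rewards, eta):
    """One multiplicative weights step using losses 1 - reward, rewards in [0, 1]."""
    new_w = weights * np.exp(-eta * (1.0 - optimistic_rewards))
    return new_w / new_w.sum()

def run_gp_mw_sketch(reward_fn, n_actions, opp_actions,
                     eta=0.5, beta=2.0, noise_std=0.1, seed=0):
    """Play one round per observed opponent action; return the final mixed strategy."""
    rng = np.random.default_rng(seed)
    weights = np.full(n_actions, 1.0 / n_actions)
    X_obs, y_obs = [], []
    for opp in opp_actions:
        # Sample an action from the current mixed strategy.
        a = rng.choice(n_actions, p=weights)
        # Feedback: noisy reward for the played action plus the opponent's action.
        y = reward_fn(a, opp) + noise_std * rng.normal()
        # Optimistic reward estimate for every own action against the observed
        # opponent action, computed from data gathered in earlier rounds.
        X_query = np.array([[b, opp] for b in range(n_actions)], dtype=float)
        mean, std = gp_posterior(np.array(X_obs, dtype=float).reshape(-1, 2),
                                 np.array(y_obs, dtype=float), X_query,
                                 noise_var=noise_std**2)
        ucb = np.clip(mean + beta * std, 0.0, 1.0)
        weights = mw_update(weights, ucb, eta)
        # Add the new observation to the GP data set.
        X_obs.append([a, opp])
        y_obs.append(y)
    return weights
```

The point of the sketch is that the upper confidence bound supplies an optimistic reward estimate for every action, including those not played, so the weight update can treat the round as if full-information feedback were available. A call such as `run_gp_mw_sketch(lambda a, o: float(a == o), 3, [0, 1, 2, 0, 1])` returns a probability vector over the three actions; the reward function there is a made-up toy.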

research
05/25/2023

Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback

We consider a combinatorial multi-armed bandit problem for maximum value...
research
01/25/2016

Time-Varying Gaussian Process Bandit Optimization

We consider the sequential Bayesian optimization problem with bandit fee...
research
05/06/2023

A Novel Reward Shaping Function for Single-Player Mahjong

Mahjong is a complex game with an intractably large state space with ext...
research
07/10/2020

Learning to Play Sequential Games versus Unknown Opponents

We consider a repeated sequential game between a learner, who plays firs...
research
09/03/2010

Gaussian Process Bandits for Tree Search: Theory and Application to Planning in Discounted MDPs

We motivate and analyse a new Tree Search algorithm, GPTS, based on rece...
research
07/13/2021

Contextual Games: Multi-Agent Learning with Side Information

We formulate the novel class of contextual games, a type of repeated gam...
research
02/04/2014

Online Stochastic Optimization under Correlated Bandit Feedback

In this paper we consider the problem of online stochastic optimization ...
