Coordination without communication: optimal regret in two players multi-armed bandits

02/14/2020
by   Sébastien Bubeck, et al.

We consider two agents simultaneously playing the same stochastic three-armed bandit problem. The two agents cooperate but cannot communicate. We propose a strategy that incurs no collisions at all between the players (with very high probability) and achieves near-optimal regret O(√(T log(T))). We also provide evidence that the extra logarithmic factor √(log(T)) is necessary, via a lower bound for a variant of the problem.
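
The abstract does not spell out the strategy itself, so the following is a rough illustration of the problem setting only. It is a minimal Python sketch of a two-player three-armed bandit with collisions, under assumptions that are mine rather than the paper's: Bernoulli arms with hypothetical means, a collision model in which both players receive zero reward on a clash (a common convention in the multiplayer-bandit literature), players that know their own index, and a naive index-based UCB rule as a placeholder baseline, not the authors' algorithm.

# Minimal sketch of the setting (NOT the paper's strategy).
# Assumptions: Bernoulli arms, zero reward for both players on a
# collision, each player knows its own index, no shared information.
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
means = np.array([0.9, 0.5, 0.2])   # hypothetical arm means
K = len(means)

# Per-player empirical statistics; nothing is shared between players.
counts = np.ones((2, K))            # pull counts (init 1 to avoid /0)
sums = np.zeros((2, K))             # observed reward sums

total_reward = 0.0
collisions = 0
for t in range(1, T + 1):
    picks = []
    for p in range(2):
        # UCB1 index from each player's own observations; player p
        # targets the (p+1)-th best arm, a naive rule that stops
        # colliding once estimates are accurate -- purely illustrative.
        ucb = sums[p] / counts[p] + np.sqrt(2 * np.log(t) / counts[p])
        picks.append(int(np.argsort(-ucb)[p]))
    for p, a in enumerate(picks):
        r = float(rng.random() < means[a])
        if picks[0] == picks[1]:    # collision: both players get zero
            r = 0.0
        counts[p][a] += 1
        sums[p][a] += r
        total_reward += r
    collisions += picks[0] == picks[1]

# Benchmark: the two players sitting on the two best arms, no collisions.
best = np.sort(means)[-2:].sum() * T
print(f"regret ≈ {best - total_reward:.1f}, collisions = {collisions}")

Running the sketch shows collisions dying out quickly under this naive rule; the paper's contribution is a strategy whose regret is provably near-optimal, O(√(T log(T))), which the baseline above does not claim to match.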
