Observe Before Play: Multi-armed Bandit with Pre-observations

11/21/2019
by Jinhang Zuo, et al.

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance of playing the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for K arms with Bernoulli rewards, and prove a T-round regret upper bound of O(K^2 log T). In the multi-player setting, collisions occur when multiple players select the same arm in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its T-round regret relative to an offline greedy strategy is upper bounded by O((K^4/M^2) log T) for K arms and M players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, which achieve logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
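The abstract does not spell out the algorithm's internals, but the observe-before-play idea can be illustrated with a minimal sketch: rank arms by a standard UCB1 index, pay a per-arm cost to pre-observe the realized Bernoulli rewards of the top few candidates, then play the best observed one. The cost parameter, the number of pre-observed arms `m`, and the use of plain UCB1 indices are all assumptions here, not the paper's exact OBP-UCB design.

```python
import math
import random

def obp_ucb_sketch(mu, T, cost=0.05, m=2, seed=0):
    """Illustrative observe-before-play bandit (NOT the paper's exact OBP-UCB).

    Each round: rank arms by a UCB1 index, pay `cost` per pre-observation to
    see the realized Bernoulli rewards of the top-m arms, then play the arm
    with the highest observed realization. Returns the total net reward
    (reward earned minus observation costs) over T rounds.
    """
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K      # times each arm was played
    sums = [0.0] * K      # cumulative reward per arm
    total = 0.0

    for t in range(1, T + 1):
        def ucb(i):
            # Unplayed arms get infinite index so each arm is tried early on
            if counts[i] == 0:
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])

        candidates = sorted(range(K), key=ucb, reverse=True)[:m]

        # Pay to pre-observe the realized Bernoulli rewards of the candidates
        realized = {i: (1.0 if rng.random() < mu[i] else 0.0) for i in candidates}
        total -= cost * len(candidates)

        # Play the arm whose observed realization is highest and collect it
        best = max(candidates, key=lambda i: realized[i])
        counts[best] += 1
        sums[best] += realized[best]
        total += realized[best]

    return total
```

With a clearly best arm, the sketch concentrates its pre-observations on that arm and nets close to its mean reward minus the observation cost, capturing the cost-versus-information dilemma the abstract describes.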


