Bandit Learning Through Biased Maximum Likelihood Estimation

07/02/2019
by Xi Liu, et al.

We propose BMLE, a new family of bandit algorithms formulated in a general way based on the Biased Maximum Likelihood Estimation method that originally appeared in the adaptive control literature. We design the cost-bias term to handle the exploration-exploitation tradeoff in stochastic bandit problems. For Bernoulli bandits, we provide an explicit closed-form expression for the index of an arm that is trivial to compute, and we give a general recipe for extending the BMLE algorithm to other families of reward distributions. We prove that for Bernoulli bandits, the BMLE algorithm achieves a logarithmic finite-time regret bound and hence attains order-optimality. Through extensive simulations, we demonstrate that the proposed algorithms achieve regret performance comparable to the best of several state-of-the-art baseline methods, while offering a significant computational advantage over the other best-performing methods. The generality of the proposed approach makes it possible to address more complex models, including general adaptive control of Markovian systems.
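To make the index-based structure of such an algorithm concrete, here is a minimal Python sketch of a reward-biased-index bandit loop for Bernoulli arms. The specific bias term alpha(t) = c * log(t) and the biased-mean index below are illustrative assumptions on our part, not the paper's exact closed-form index, and the names bmle_style_index and run_bernoulli_bandit are hypothetical.

```python
import math
import random

def bmle_style_index(successes, pulls, t, c=1.0):
    """Illustrative reward-biased index for a Bernoulli arm.

    Assumption: the bias grows as alpha(t) = c * log(t) and enters as
    alpha pseudo-successes, inflating the empirical mean of
    under-explored arms. This is one plausible instantiation of the
    reward-biasing idea, not necessarily the paper's exact formula.
    """
    if pulls == 0:
        return float("inf")  # force one initial pull of every arm
    alpha = c * math.log(max(t, 2))
    # Biased mean estimate; clipped defensively to the [0, 1] range.
    return min(1.0, (successes + alpha) / (pulls + alpha))

def run_bernoulli_bandit(true_means, horizon, seed=0):
    """Play the arm with the largest biased index at every round."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    pulls = [0] * k
    total_reward = 0
    for t in range(1, horizon + 1):
        arm = max(range(k),
                  key=lambda i: bmle_style_index(successes[i], pulls[i], t))
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward, pulls

if __name__ == "__main__":
    reward, pulls = run_bernoulli_bandit([0.3, 0.5, 0.7], horizon=10_000)
    print("total reward:", reward, "pulls per arm:", pulls)
```

Note how cheap each decision is: one closed-form index evaluation per arm per round, which is the computational advantage the abstract highlights relative to methods that require sampling or optimization at every step.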


Related research

10/08/2020 · Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits
Modifying the reward-biased maximum likelihood method originally propose...

03/08/2022 · Neural Contextual Bandits via Reward-Biased Maximum Likelihood Estimation
Reward-biased maximum likelihood estimation (RBMLE) is a classic princip...

11/16/2020 · Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
The principle of Reward-Biased Maximum Likelihood Estimate Based Adaptiv...

05/30/2022 · Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets
Multi-arm bandit (MAB) and stochastic linear bandit (SLB) are important ...

09/12/2017 · Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits
In this paper, we propose and study opportunistic bandits - a new varian...

08/12/2015 · No Regret Bound for Extreme Bandits
Algorithms for hyperparameter optimization abound, all of which work wel...

10/30/2021 · Efficient Inference Without Trading-off Regret in Bandits: An Allocation Probability Test for Thompson Sampling
Using bandit algorithms to conduct adaptive randomised experiments can m...
