Fair Multi-Armed Bandits with Guaranteed Rewards per Arm

04/11/2023
by Abhishek Sinha, et al.

Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound (UCB) algorithm, among others, are inherently unfair by design. The unfairness stems from their very objective: playing the most rewarding arm as many times as possible while ignoring the less rewarding ones among the N arms. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both the full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial learning, we propose a new online prediction policy that achieves the target reward rates while incurring a regret and target-rate violation penalty of O(T^{3/4}). In the full-information setting, the regret bound can be further improved to O(√T) when considering the average regret over the entire horizon of length T. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the policy involve a novel use of the potential function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which are of independent interest.
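To make the queueing-theoretic idea concrete, the following is a minimal illustrative sketch, not the paper's actual policy (whose details are not given in the abstract). It shows one standard way to combine a per-arm virtual queue, which accumulates "debt" whenever an arm falls behind its target reward rate, with a UCB index that favors high-reward arms. All names (`lam` for the target rates, the tradeoff weight `V`, the Bernoulli reward model) are assumptions introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 3                      # number of arms
T = 20000                  # horizon
mu = np.array([0.9, 0.6, 0.3])       # true mean rewards (Bernoulli), unknown to the learner
lam = np.array([0.0, 0.15, 0.10])    # hypothetical per-arm target reward rates

q = np.zeros(N)            # virtual queues: debt toward each arm's target rate
counts = np.zeros(N)       # number of pulls per arm
total_reward = np.zeros(N) # cumulative reward per arm
means = np.zeros(N)        # empirical mean reward per arm
V = 50.0                   # tradeoff weight: larger V favors reward over fairness

for t in range(1, T + 1):
    # UCB exploration index, biased by queue length toward under-served arms
    ucb = means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb[counts == 0] = np.inf        # play each arm at least once
    a = int(np.argmax(V * ucb + q))

    r = float(rng.random() < mu[a])  # observe a Bernoulli reward
    counts[a] += 1
    total_reward[a] += r
    means[a] = total_reward[a] / counts[a]

    # Queue dynamics: debt grows by the target rate each round and
    # drains by the reward actually earned on the played arm.
    earned = np.zeros(N)
    earned[a] = r
    q = np.maximum(q + lam - earned, 0.0)

rates = total_reward / T
print("empirical reward rates:", np.round(rates, 3))
print("target rates:         ", lam)
```

Because each queue grows linearly while its arm is under-served, the queue term eventually dominates the bounded UCB term and forces the neglected arm to be played, so keeping the queues bounded is what enforces the rate guarantees; the weight `V` controls how much reward is sacrificed for how quickly the targets are met.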


