BanditQ: Fair Multi-Armed Bandits with Guaranteed Rewards per Arm
Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound (UCB), Hedge, and EXP3 algorithms, are inherently unfair by design. The unfairness stems from their very objective: playing the most rewarding of the N arms as many times as possible while ignoring the less rewarding ones. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy, called BanditQ, that attains the target reward rates while incurring a regret and target-rate violation penalty of at most O(T^{3/4}). In the full-information setting, the regret bound can be further improved to O(√T) when the regret is averaged over the entire horizon of length T. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the policy involve a novel use of the potential-function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which may be of independent interest.
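To make the queue-based black-box reduction concrete, the sketch below is a minimal illustration, not the paper's exact BanditQ policy: it maintains a virtual deficit queue per arm that tracks how far each arm is behind its target reward rate, and feeds queue-weighted surrogate rewards to a standard adversarial MAB learner (EXP3 here). The class name FairBanditSketch, the surrogate rule (V + Q_i)·r_i, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

# Illustrative sketch of a queueing-style black-box reduction from the fair
# prediction problem to a standard adversarial MAB learner (EXP3 here).
# NOT the paper's exact policy: the surrogate-reward rule (V + Q_i) * r_i,
# the parameter names, and all constants are assumptions.

class FairBanditSketch:
    def __init__(self, n_arms, target_rates, eta=0.05, gamma=0.1, V=1.0):
        self.n = n_arms
        self.lam = np.asarray(target_rates, dtype=float)  # per-arm target reward rates
        self.eta = eta                                     # EXP3 learning rate
        self.gamma = gamma                                 # uniform-exploration mix
        self.V = V                                         # regret vs. fairness trade-off
        self.Q = np.zeros(n_arms)                          # virtual deficit queues
        self.w = np.ones(n_arms)                           # EXP3 weights

    def select_arm(self, rng):
        probs = (1 - self.gamma) * self.w / self.w.sum() + self.gamma / self.n
        return rng.choice(self.n, p=probs), probs

    def update(self, arm, reward, probs):
        # Queue-weighted surrogate reward: arms far behind their targets
        # (large Q_i) look more attractive to the underlying learner.
        surrogate = (self.V + self.Q[arm]) * reward / (self.V + self.Q.max())
        est = np.zeros(self.n)
        est[arm] = surrogate / probs[arm]                  # importance-weighted estimate
        self.w *= np.exp(self.eta * est / self.n)          # EXP3 weight update
        self.w /= self.w.max()                             # normalize for numerical stability
        # Deficit-queue dynamics: targets accrue each round, earned reward drains.
        earned = np.zeros(self.n)
        earned[arm] = reward
        self.Q = np.maximum(self.Q + self.lam - earned, 0.0)

# Hypothetical usage with Bernoulli rewards in the stochastic setting.
rng = np.random.default_rng(0)
means = [0.9, 0.5, 0.3]                                    # unknown arm means
policy = FairBanditSketch(n_arms=3, target_rates=[0.10, 0.10, 0.15])
for t in range(10_000):
    arm, probs = policy.select_arm(rng)
    reward = float(rng.binomial(1, means[arm]))
    policy.update(arm, reward, probs)
```

The intent of the queue-weighting is that an arm whose accrued reward lags its target accumulates deficit and is therefore boosted in the surrogate rewards seen by the inner learner, while the weight V controls how aggressively the policy trades regret against meeting the per-arm targets.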