A Regret Bound for Non-stationary Multi-Armed Bandits with Fairness Constraints
The multi-armed bandit framework is the most common platform for studying strategies for sequential decision-making problems. Recently, the notion of fairness has attracted considerable attention in the machine learning community. One can impose the fairness condition that at any point in time, even during the learning phase, a poorly performing candidate should not be preferred over a better candidate. This fairness constraint, known to be one of the most stringent, has been studied in the stochastic multi-armed bandit framework in a stationary setting, for which regret bounds have been established. The main aim of this paper is to study this problem in a non-stationary setting. We present a new algorithm, Fair Upper Confidence Bound with Exploration (Fair-UCBe), for solving a slowly varying stochastic k-armed bandit problem. We present two results: (i) Fair-UCBe indeed satisfies the fairness condition mentioned above, and (ii) it achieves a regret bound of O(k^{3/2} T^{1 - α/2} √(log T)), for some suitable α ∈ (0, 1), where T is the time horizon. To the best of our knowledge, this is the first fair algorithm with a sublinear regret bound applicable to non-stationary bandits. We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.
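To make the fairness condition concrete, the sketch below shows one common way a UCB-style bandit can avoid preferring a worse arm over a better one: it plays uniformly at random among all arms whose confidence intervals are chained (overlapping) with the interval of the empirically best arm, so no arm is favored unless the data separate it from the others. This is only an illustrative sketch under stated assumptions; the abstract does not give the details of Fair-UCBe itself, and the class name, the sliding-window handling of slow variation, and the confidence-radius constant used here are hypothetical choices, not the paper's algorithm.

```python
# Hypothetical fairness-respecting UCB sketch for a slowly varying k-armed bandit.
# NOT the paper's Fair-UCBe (its details are not in the abstract): arms whose
# confidence intervals overlap with the top arm's interval form a "chained" set,
# and the next pull is chosen uniformly at random within that set, so a worse arm
# is never preferred over a better one without statistical justification.
import math
import random
from collections import deque


class SlidingWindowFairUCB:
    def __init__(self, n_arms: int, window: int = 500, conf_scale: float = 2.0):
        self.n_arms = n_arms
        self.window = window           # keep only recent observations per arm (non-stationarity assumption)
        self.conf_scale = conf_scale   # width multiplier for the confidence radius (assumed constant)
        self.history = [deque(maxlen=window) for _ in range(n_arms)]
        self.t = 0

    def _bounds(self, arm: int):
        """Return (lower, upper) confidence bounds on one arm's mean reward in [0, 1]."""
        obs = self.history[arm]
        if not obs:
            return 0.0, 1.0            # unexplored arm: full interval, so it stays chained
        mean = sum(obs) / len(obs)
        radius = math.sqrt(self.conf_scale * math.log(max(self.t, 2)) / len(obs))
        return mean - radius, mean + radius

    def select_arm(self) -> int:
        """Play uniformly among arms chained (via overlapping intervals) to the best arm."""
        self.t += 1
        bounds = [self._bounds(a) for a in range(self.n_arms)]
        top = max(range(self.n_arms), key=lambda a: bounds[a][1])
        active = {top}
        changed = True
        while changed:                 # grow the chained set until no more intervals overlap it
            changed = False
            for a in range(self.n_arms):
                if a in active:
                    continue
                if any(bounds[a][1] >= bounds[b][0] and bounds[b][1] >= bounds[a][0]
                       for b in active):
                    active.add(a)
                    changed = True
        return random.choice(sorted(active))

    def update(self, arm: int, reward: float) -> None:
        self.history[arm].append(reward)


# Minimal usage: select an arm, observe a reward from the environment, update.
bandit = SlidingWindowFairUCB(n_arms=5)
arm = bandit.select_arm()
bandit.update(arm, reward=0.7)
```

The uniform choice within the chained set is what enforces fairness during learning: until the confidence intervals separate, all chained arms are treated identically, and only clearly dominated arms are excluded.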