MNL-Bandit in non-stationary environments
In this paper, we study the MNL-Bandit problem in a non-stationary environment and present an algorithm with worst-case dynamic regret of Õ(min{√(NTL), N^{1/3} (Δ_∞^K)^{1/3} T^{2/3} + √(NT)}). Here N is the number of arms, L is the number of switches, and Δ_∞^K is a variation measure of the unknown parameters. We also show that our algorithm is near-optimal (up to logarithmic factors). Our algorithm builds upon the epoch-based algorithm for the stationary MNL-Bandit problem of Agrawal et al. (2016). However, non-stationarity poses several challenges, and we introduce new techniques and ideas to address them. In particular, we give a tight characterization of the bias that non-stationarity introduces into the estimators, and we derive new concentration bounds.
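To make the two regimes of the stated regret bound concrete, the following minimal sketch evaluates each branch of the min for given problem sizes. It is an illustration only, not part of the paper's algorithm: the function name `dynamic_regret_rate` and the argument name `delta` (standing in for Δ_∞^K) are hypothetical, T is assumed to denote the time horizon, and constants and logarithmic factors hidden by the Õ notation are ignored.

```python
import math


def dynamic_regret_rate(N, T, L, delta):
    """Order of the stated dynamic regret bound, up to constants and log factors.

    N     -- number of arms
    T     -- time horizon (assumed meaning of T in the bound)
    L     -- number of switches
    delta -- variation measure of the unknown parameters (Delta_inf^K)
    """
    # Branch driven by the number of switches: sqrt(N * T * L).
    switching = math.sqrt(N * T * L)
    # Branch driven by total variation: N^(1/3) * delta^(1/3) * T^(2/3) + sqrt(N * T).
    variation = N ** (1 / 3) * delta ** (1 / 3) * T ** (2 / 3) + math.sqrt(N * T)
    # The bound is the smaller of the two branches.
    return min(switching, variation)
```

With few switches but large parameter variation, the switching branch is the smaller one; with many switches but little variation, the variation branch takes over.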