Bandit Convex Optimization in Non-stationary Environments
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let T be the time horizon and P_T be the path-length of the comparator sequence that reflects the non-stationary of environments. We propose a novel algorithm that achieves O(T^3/4(1+P_T)^1/2) and O(T^1/2(1+P_T)^1/2) dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the Ω(T^1/2(1+P_T)^1/2) lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length P_T ahead of time, which is generally unknown.
READ FULL TEXT