Adversarial Combinatorial Bandits with General Non-linear Reward Functions
In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the linear-reward, stochastic-bandit, or semi-bandit-feedback settings. We show that, with N arms and subsets of K arms being chosen at each of T time periods, the minimax optimal regret is Θ_d(√(N^d T)) if the reward function is a degree-d polynomial with d < K, and Θ_K(√(N^K T)) if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound O(√(poly(N,K) T)) for the linear case, which suggests a fundamental gap between linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that, in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the (N choose K) assortments independently.
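For reference, the regret bounds stated above can be collected into a single display. The notation Regret*(T) for the minimax regret is introduced here only for this summary; the bounds themselves are exactly those given in the abstract (a sketch, assuming the amsmath package):

```latex
% Minimax regret over T rounds with N arms and size-K subsets, as stated above
% (constants hidden in \Theta_d and \Theta_K may depend on d and K respectively).
\[
  \mathrm{Regret}^*(T) =
  \begin{cases}
    \Theta_d\!\bigl(\sqrt{N^d\, T}\bigr), & \text{reward is a degree-}d\text{ polynomial},\ d < K,\\[4pt]
    \Theta_K\!\bigl(\sqrt{N^K\, T}\bigr), & \text{reward is not a low-degree polynomial,}
  \end{cases}
\]
% compared with O\bigl(\sqrt{\mathrm{poly}(N,K)\, T}\bigr) achievable in the linear-reward case.
```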