Model selection for contextual bandits

06/03/2019 ∙ by Dylan J. Foster, et al.

We introduce the problem of model selection for contextual bandits, wherein a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(T^{2/3} d_{m^\star}^{1/3})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$. The algorithm also achieves regret $\tilde{O}(T^{3/4} + \sqrt{T d_{m^\star}})$, which is optimal for $d_{m^\star} \geq \sqrt{T}$. This is the first contextual bandit model selection result with non-vacuous regret for all values of $d_{m^\star}$ and, to the best of our knowledge, is the first guarantee of its type in any contextual bandit setting. The core of the algorithm is a new estimator for the gap in best loss achievable by two linear policy classes, which we show admits a convergence rate faster than what is required to learn either class.
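To make the two regret rates concrete, the following minimal Python sketch evaluates both expressions numerically. It is an illustration only: constants and logarithmic factors hidden by the Õ notation are dropped, and the function names `rate_two_thirds` and `rate_three_quarters` are hypothetical, not taken from the paper. Under these simplifications, the first rate is the smaller one when the optimal dimension is small (roughly below T^(1/4)), while for d ≥ √T the second rate is dominated by its √(Td) term.

```python
import math


def rate_two_thirds(T: float, d: float) -> float:
    """T^(2/3) * d^(1/3), with constants and log factors dropped (assumption)."""
    return T ** (2 / 3) * d ** (1 / 3)


def rate_three_quarters(T: float, d: float) -> float:
    """T^(3/4) + sqrt(T * d), with constants and log factors dropped (assumption)."""
    return T ** 0.75 + math.sqrt(T * d)


if __name__ == "__main__":
    T = 10 ** 8  # horizon; here T^(1/4) = 100 and sqrt(T) = 10_000
    for d in (10, 100, 10_000, 10 ** 6):
        a = rate_two_thirds(T, d)
        b = rate_three_quarters(T, d)
        smaller = "T^(2/3) d^(1/3)" if a <= b else "T^(3/4) + sqrt(T d)"
        print(f"d={d:>8}: {a:.3e} vs {b:.3e} -> smaller: {smaller}")
```

Both expressions stay below T whenever d is well below T, which is the sense in which the guarantee is non-vacuous across the range of dimensions; at d = T the first expression equals T exactly.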


