Bandits with Delayed Anonymous Feedback

09/20/2017

We study the bandits with delayed anonymous feedback problem, a variant of the stochastic K-armed bandit problem, in which the reward from each play of an arm is no longer obtained instantaneously but received after some stochastic delay. Furthermore, the learner is not told which arm an observation corresponds to, nor do they observe the delay associated with a play. Instead, at each time step, the learner selects an arm to play and receives a reward which could be from any combination of past plays. This is a very natural problem; however, due to the delay and anonymity of the observations, it is considerably harder than the standard bandit problem. Despite this, we demonstrate it is still possible to achieve logarithmic regret, but with additional lower order terms. In particular, we provide an algorithm with regret O(log(T) + √(g(τ) log(T)) + g(τ)) where g(τ) is some function of the delay distribution. This is of the same order as that achieved in Joulani et al. (2013) for the simpler problem where the observations are not anonymous. We support our theoretical observation equating the two orders of regret with experiments.
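
To make the feedback model concrete, the following is a minimal sketch of an environment in which rewards arrive late and without attribution. It is an illustration only, not the algorithm or experimental setup of the paper: the class name DelayedAnonymousBandit, the Bernoulli rewards, and the uniform delay on {0, ..., max_delay} are all assumptions chosen for simplicity.

```python
import random
from collections import defaultdict


class DelayedAnonymousBandit:
    """Toy environment (hypothetical): Bernoulli arms with uniform random delays.

    Both the reward and the delay distribution are illustrative assumptions.
    """

    def __init__(self, means, max_delay, seed=0):
        self.means = means            # success probability of each arm
        self.max_delay = max_delay    # delays are drawn uniformly from 0..max_delay
        self.rng = random.Random(seed)
        self.t = 0                    # current time step
        self.pending = defaultdict(float)  # arrival time -> summed reward

    def play(self, arm):
        # Generate the reward for this play, but schedule it for a later step.
        reward = 1.0 if self.rng.random() < self.means[arm] else 0.0
        delay = self.rng.randint(0, self.max_delay)
        self.pending[self.t + delay] += reward

        # The learner only sees the aggregate of everything arriving now:
        # no arm labels and no delays are revealed.
        observed = self.pending.pop(self.t, 0.0)
        self.t += 1
        return observed


if __name__ == "__main__":
    env = DelayedAnonymousBandit(means=[0.3, 0.7], max_delay=5, seed=42)
    policy_rng = random.Random(0)
    # A uniformly random policy, used only to exercise the environment.
    observations = [env.play(policy_rng.randrange(2)) for _ in range(20)]
    print(observations)
```

A single observation here can exceed 1 or be 0 even when the chosen arm paid out, which is why per-arm empirical means cannot be formed directly from the observations in this setting.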
