Complete Policy Regret Bounds for Tallying Bandits

04/24/2022
by Dhruv Malik, et al.

Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the complete policy regret, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of which restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the tallying bandit. This is an online learning setting with an m-memory-bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last m timesteps. For tallying bandit problems with K actions and time horizon T, we provide an algorithm that with high probability achieves a complete policy regret guarantee of 𝒪̃(mK√T), where the 𝒪̃ notation hides only logarithmic factors. We additionally prove an Ω̃(√(mKT)) lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near-optimality of our method.
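To make the setting concrete, here is a minimal simulator of a tallying bandit environment, written under stated assumptions: the class name, the particular loss functions, and the noise model are illustrative choices, not taken from the paper. The only structure it encodes is the one the abstract describes: the mean loss of an action depends solely on its tally in the last m plays.

```python
from collections import deque
import random

class TallyingBandit:
    """Toy simulator for the tallying bandit setting (illustrative, not the
    paper's construction). The mean loss of playing action a is h_a(y),
    where y is the number of times a appeared in the last m plays."""

    def __init__(self, num_actions, memory, loss_fns, seed=0):
        self.K = num_actions
        self.m = memory
        self.loss_fns = loss_fns          # list of K functions h_a: tally -> mean loss
        self.history = deque(maxlen=memory)  # sliding window of the last m actions
        self.rng = random.Random(seed)

    def play(self, action):
        # Tally of `action` among the last m plays (the adversary's only memory).
        tally = sum(1 for a in self.history if a == action)
        mean_loss = self.loss_fns[action](tally)
        self.history.append(action)
        # Bounded noise around the mean loss; clamp to [0, 1].
        return min(1.0, max(0.0, mean_loss + self.rng.uniform(-0.05, 0.05)))

# Example: action 0 exhibits satiation (loss grows with its recent tally),
# while action 1 has a constant mean loss.
env = TallyingBandit(num_actions=2, memory=3,
                     loss_fns=[lambda y: 0.1 * y, lambda y: 0.25])
losses = [env.play(0) for _ in range(5)]
```

Playing action 0 repeatedly drives its tally up to m = 3, so its mean loss climbs from 0.0 to 0.3 before plateauing, which is exactly the kind of history-dependent feedback that makes complete policy regret the right benchmark here.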


Related research:

- Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret (06/27/2012)
- Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality (05/04/2023)
- Regret Bounds for Thompson Sampling in Restless Bandit Problems (05/29/2019)
- Online Auctions and Multi-scale Online Learning (05/26/2017)
- Policy Optimization as Online Learning with Mediator Feedback (12/15/2020)
- Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits (02/09/2018)
- Online Learning with Switching Costs and Other Adaptive Adversaries (02/18/2013)
