Nested bandits

06/19/2022
by Matthieu Martin, et al.

In many online decision processes, the optimizing agent is called upon to choose among a large number of alternatives with many inherent similarities; in turn, these similarities imply closely correlated losses that may confound standard discrete choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems where the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. In this setting, optimal algorithms based on the exponential weights blueprint (like Hedge, EXP3, and their variants) may incur significant regret because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To account for this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner's set of alternatives based on a nested, step-by-step selection method. In so doing, we obtain a series of tight bounds for the learner's regret, showing that online learning problems with a high degree of similarity between alternatives can be resolved efficiently, without a red bus / blue bus paradox occurring.
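The layered, step-by-step selection idea from the abstract can be illustrated with a small sketch. Below is a minimal two-level nested exponential-weights bandit in Python: arms are grouped into categories, selection proceeds top-down (a category first, then an arm within it), and the usual importance-weighted loss estimator feeds the weight updates. The class name NestedEXP3, the grouping structure, and the single learning rate eta are illustrative assumptions on our part; the paper's actual NEW algorithm differs in details such as how each layer of the hierarchy is tuned.

    import math
    import random

    class NestedEXP3:
        """Two-level exponential weights with nested (top-down) sampling."""

        def __init__(self, groups, eta=0.1):
            # groups: list of categories, each a list of arm identifiers.
            self.groups = groups
            self.eta = eta
            # Cumulative importance-weighted loss estimate for every arm.
            self.cum_loss = {arm: 0.0 for g in groups for arm in g}

        def _arm_weight(self, arm):
            # Subtract the running minimum for numerical stability.
            base = min(self.cum_loss.values())
            return math.exp(-self.eta * (self.cum_loss[arm] - base))

        def select(self):
            # Layered sampling: pick a category with probability proportional
            # to the total weight of its arms, then an arm within it.
            group_w = [sum(self._arm_weight(a) for a in g) for g in self.groups]
            g_idx = random.choices(range(len(self.groups)), weights=group_w)[0]
            group = self.groups[g_idx]
            arm_w = [self._arm_weight(a) for a in group]
            arm = random.choices(group, weights=arm_w)[0]
            # Marginal probability of the chosen arm, needed for the estimator.
            prob = (group_w[g_idx] / sum(group_w)) * (
                arm_w[group.index(arm)] / sum(arm_w))
            return arm, prob

        def update(self, arm, loss, prob):
            # Importance weighting keeps the cumulative estimate unbiased;
            # explicit uniform-exploration mixing is omitted for brevity.
            self.cum_loss[arm] += loss / prob

    # Toy run: two nearly identical "bus" arms share a category, so their
    # similarity is captured by the hierarchy rather than explored away
    # arm by arm.
    if __name__ == "__main__":
        bandit = NestedEXP3([["red_bus", "blue_bus"], ["car"]], eta=0.05)
        mean_loss = {"red_bus": 0.7, "blue_bus": 0.7, "car": 0.3}
        for _ in range(5000):
            arm, p = bandit.select()
            loss = min(1.0, max(0.0, random.gauss(mean_loss[arm], 0.05)))
            bandit.update(arm, loss, p)
        print(min(bandit.cum_loss, key=bandit.cum_loss.get))  # likely "car"

Note that with a plain sum as the category score, this sampling scheme coincides with flat EXP3; the nested structure becomes meaningful when each layer aggregates its children differently (as in nested logit choice models), which is the regime the paper targets.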


Related research

07/30/2018 · Preference-based Online Learning with Dueling Bandits: A Survey
In machine learning, the notion of multi-armed bandits refers to a class...

02/23/2015 · First-order regret bounds for combinatorial semi-bandits
We consider the problem of online combinatorial optimization under semi-...

10/17/2018 · Simple Regret Minimization for Contextual Bandits
There are two variants of the classical multi-armed bandit (MAB) problem...

03/19/2018 · What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
An online reinforcement learning algorithm is anytime if it does not nee...

09/30/2014 · Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback
We present and study a partial-information model of online learning, whe...

01/29/2019 · Improved Path-length Regret Bounds for Bandits
We study adaptive regret bounds in terms of the variation of the losses ...

03/17/2015 · Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits
We propose a sample-efficient alternative for importance weighting for s...
