Duelling Bandits with Weak Regret in Adversarial Environments

12/10/2018
by Lennard Hilgendorf, et al.

Research on the multi-armed bandit problem has studied the trade-off between exploration and exploitation in depth. However, there are numerous applications where the cardinal, absolute-valued feedback model (e.g. ratings from one to five) is not suitable. This has motivated the formulation of the duelling bandits problem, where the learner picks a pair of actions and observes noisy binary feedback indicating a relative preference between the two. There exists a multitude of different settings and interpretations of the problem, for two reasons. First, in the absence of a total order on the actions, there is no natural definition of the best action; existing work either explicitly assumes a linear order or uses a custom definition of the winner. Second, there are multiple reasonable notions of regret for measuring the learner's performance. Most prior work focusses on the strong regret, which averages the quality of the two actions picked. This work focusses on the weak regret, which is based on the quality of the better of the two actions selected. Weak regret is the more appropriate performance measure when the pair's inferior action has no significant detrimental effect on the pair's overall quality. We study the duelling bandits problem in the adversarial setting. We provide an algorithm with theoretical guarantees in both the utility-based setting, which implies a total order, and the unrestricted setting. For the latter, we work with the Borda winner: the action maximising the probability of winning against an action sampled uniformly at random. The thesis concludes with experimental results on both real-world and synthetic data, showing the algorithm's performance and limitations.
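To make these notions concrete, the following is a minimal formal sketch in standard duelling-bandit notation; the exact normalisation is an assumption on my part and may differ from the definitions used in the thesis. Write P(i,j) for the probability that action i wins a duel against action j among K actions (in the adversarial setting this matrix may change from round to round), and let (a_t, b_t) be the pair played at round t.

% Borda score of action i: the probability of winning against an opponent
% drawn uniformly at random (excluding i itself here; this is a convention).
B(i) = \frac{1}{K-1} \sum_{j \neq i} P(i, j),
\qquad
i^{*} = \operatorname*{arg\,max}_{i} B(i) \quad \text{(Borda winner)}

% Strong regret over T rounds averages the quality of both chosen actions,
% whereas weak regret charges the learner only for the better of the two:
R^{\text{strong}}_{T} = \sum_{t=1}^{T} \Bigl( B(i^{*}) - \tfrac{1}{2}\bigl(B(a_t) + B(b_t)\bigr) \Bigr),
\qquad
R^{\text{weak}}_{T} = \sum_{t=1}^{T} \Bigl( B(i^{*}) - \max\bigl\{ B(a_t), B(b_t) \bigr\} \Bigr)

Since max{B(a_t), B(b_t)} >= (B(a_t) + B(b_t))/2, weak regret never exceeds strong regret: playing one good action alongside one poor one incurs little weak regret, which matches the setting where the inferior action does no real harm.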


