Online Learning with Off-Policy Feedback

07/18/2022
by Germano Gabbianelli, et al.

We study online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision-making problem, the learner cannot observe its own rewards directly; instead, it sees the rewards obtained by another, unknown policy run in parallel (the behavior policy). Instead of the standard exploration-exploitation dilemma, the learner faces a different challenge in this setting: because the observations are limited and outside its control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms whose regret bounds scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is to adapt the notion of pessimistic reward estimators, which has recently become popular in the context of off-policy reinforcement learning.
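To make the idea of a pessimistic reward estimator concrete, the following is a minimal sketch and not the algorithm from the paper. It pairs an exponential-weights learner with an importance-weighted estimate whose denominator is inflated by a bias term, so that rarely observed actions are systematically undervalued. It rests on several assumptions that go beyond the abstract: rewards lie in [0, 1], the behavior policy is a fixed distribution whose probabilities are available to the estimator (a simplification, since the abstract describes the behavior policy as unknown), and the names `pessimistic_estimate`, `run_exp_weights`, `eta`, and `gamma` are hypothetical.

```python
import numpy as np


def pessimistic_estimate(reward, action, behavior_probs, gamma):
    """Biased importance-weighted reward estimate (illustrative assumption).

    Only the action taken by the behavior policy receives a nonzero estimate.
    Adding gamma > 0 to the denominator shrinks the estimate, so for rewards
    in [0, 1] its expectation never exceeds the true reward.
    """
    estimate = np.zeros_like(behavior_probs, dtype=float)
    estimate[action] = reward / (behavior_probs[action] + gamma)
    return estimate


def run_exp_weights(rewards, behavior_probs, eta, gamma, seed=0):
    """Exponential weights driven purely by off-policy observations.

    rewards:        (T, K) array of rewards in [0, 1], fixed in advance.
    behavior_probs: (K,) fixed behavior policy, assumed known here.
    Returns the (T, K) array of policies played by the learner.
    """
    rng = np.random.default_rng(seed)
    num_rounds, num_actions = rewards.shape
    cumulative = np.zeros(num_actions)
    policies = np.empty((num_rounds, num_actions))
    for t in range(num_rounds):
        # The learner's policy depends only on past pessimistic estimates.
        logits = eta * cumulative
        weights = np.exp(logits - logits.max())  # shift for numerical stability
        policies[t] = weights / weights.sum()
        # The behavior policy, not the learner, decides which reward is seen.
        observed_action = rng.choice(num_actions, p=behavior_probs)
        cumulative += pessimistic_estimate(
            rewards[t, observed_action], observed_action, behavior_probs, gamma
        )
    return policies
```

Calling `run_exp_weights(rewards, np.array([0.5, 0.5]), eta=0.1, gamma=0.05)` on a two-action problem returns one policy per round. In this sketch, actions that the behavior policy rarely plays accumulate little estimated reward, which mirrors the abstract's point that guarantees are stronger against comparators that are well covered by the observations.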


