Contextual Bandits and Imitation Learning via Preference-Based Active Queries

07/24/2023
by Ayush Sekhari, et al.

We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively query an expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is twofold: to minimize the regret associated with the executed actions while simultaneously minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions, and we provide an algorithm that leverages an online regression oracle with respect to this function class for choosing its actions and deciding when to query. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as O(min{√T, d/Δ}), where T represents the number of interactions, d represents the eluder dimension of the function class, and Δ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require knowledge of Δ, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandits setting where the learner observes reward signals at each round. Additionally, our algorithm makes only O(min{T, d^2/Δ^2}) queries to the expert. We then extend our algorithm to the imitation learning setting, where the learning agent engages with an unknown environment in episodes of length H each, and provide similar guarantees for regret and query complexity. Interestingly, our algorithm for imitation learning can even learn to outperform the underlying expert when the expert is suboptimal, highlighting a practical benefit of preference-based feedback in imitation learning.
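To make the two rates in the abstract concrete, the following minimal sketch evaluates the leading terms min{√T, d/Δ} and min{T, d^2/Δ^2} for given values of T, d, and Δ. The function name `leading_terms` and the example values are hypothetical and chosen only for illustration; the O(·) notation hides constants and possibly logarithmic factors, so the numbers indicate which regime applies rather than actual regret or query counts.

```python
import math


def leading_terms(T: int, d: float, delta: float) -> tuple[float, float]:
    """Return the leading terms of the regret and query bounds.

    Illustrative only: constants and log factors hidden by O(.) are ignored.
    T     -- number of interactions
    d     -- eluder dimension of the function class
    delta -- minimum preference gap of the optimal action (must be > 0)
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    ratio = d / delta
    regret_term = min(math.sqrt(T), ratio)   # min{sqrt(T), d/delta}
    query_term = min(float(T), ratio ** 2)   # min{T, d^2/delta^2}
    return regret_term, query_term


# Long horizon: the gap-dependent terms d/delta and d^2/delta^2 are smaller.
long_horizon = leading_terms(T=10_000, d=5, delta=0.5)

# Short horizon: the worst-case terms sqrt(T) and T are smaller.
short_horizon = leading_terms(T=25, d=5, delta=0.5)
```

With these hypothetical values, the long-horizon case is governed by the gap-dependent terms (10 and 100), while the short-horizon case falls back to the worst-case terms (5 and 25), which is the "best of both worlds" behavior the abstract describes.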

research
07/11/2023

Selective Sampling and Imitation Learning via Online Regression

We consider the problem of Imitation Learning (IL) by actively querying ...
research
06/22/2023

Context-lumpable stochastic bandits

We consider a contextual bandit problem with S contexts and A actions. I...
research
09/13/2020

Toward the Fundamental Limits of Imitation Learning

Imitation learning (IL) aims to mimic the behavior of an expert policy i...
research
06/06/2019

Stochastic Bandits with Context Distributions

We introduce a novel stochastic contextual bandit model, where at each s...
research
02/17/2021

Fully General Online Imitation Learning

In imitation learning, imitators and demonstrators are policies for pick...
research
06/13/2011

Efficient Optimal Learning for Contextual Bandits

We address the problem of learning in an online setting where the learne...
research
11/02/2020

Stochastic Linear Bandits with Protected Subspace

We study a variant of the stochastic linear bandit problem wherein we op...