Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit

06/29/2021
by Yichi Zhou, et al.

Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandits. One powerful principle for the EE trade-off is Optimism in the Face of Uncertainty (OFU), in which the agent takes the action according to an upper confidence bound (UCB) of the reward. OFU has achieved (near-)optimal regret bounds for linear/kernel contextual bandits. However, it is in general unknown how to derive efficient and effective EE trade-off methods for non-linear, complex tasks, such as contextual bandits with a deep neural network as the reward function. In this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward by a differentiable function and compute the upper confidence bound by solving a regularized optimization problem. We prove that, for multi-armed bandits, kernel contextual bandits and neural tangent kernel bandits, ROFU achieves (near-)optimal regret bounds with a certain uncertainty measure, which theoretically justifies its effectiveness at the EE trade-off. Importantly, ROFU admits a very efficient implementation with a gradient-based optimizer, which easily extends to general deep neural network models beyond the neural tangent kernel, in sharp contrast with previous OFU methods. The empirical evaluation demonstrates that ROFU works extremely well for contextual bandits under various settings.
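The abstract's core idea, computing a UCB by maximizing predicted reward minus a regularizer that penalizes deviation from the fitted model, can be sketched as follows. This is a minimal illustration only, not the paper's algorithm: it assumes a linear reward model, a squared-loss regularizer, and plain gradient ascent; the function name `rofu_ucb` and all parameters (`lam`, `lr`, `steps`) are hypothetical choices for the sketch.

```python
import numpy as np

def rofu_ucb(x, X, y, lam=1.0, lr=0.05, steps=200):
    """Sketch of a ROFU-style UCB for one action with feature vector x.

    Computes  max_theta  <theta, x> - lam * (L(theta) - L(theta_hat)),
    where L is the mean squared loss on observed data (X, y) and
    theta_hat is its (ridge-regularized) minimizer. Starting from
    theta_hat, gradient ascent trades off a higher predicted reward
    <theta, x> against a larger fit penalty L(theta) - L(theta_hat).
    """
    d = X.shape[1]
    # Fit theta_hat; the small ridge term keeps the solve well-posed.
    theta_hat = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d), X.T @ y)
    loss_hat = np.mean((X @ theta_hat - y) ** 2)

    theta = theta_hat.copy()
    for _ in range(steps):
        # Gradient of the objective: x - lam * grad L(theta).
        grad_loss = 2.0 * X.T @ (X @ theta - y) / len(y)
        theta += lr * (x - lam * grad_loss)

    return theta @ x - lam * (np.mean((X @ theta - y) ** 2) - loss_hat)
```

Because the objective equals the greedy estimate `theta_hat @ x` at the starting point and is concave in `theta`, the returned value upper-bounds the greedy prediction, with the gap shrinking as more data makes the loss surface sharper around `theta_hat`; this is the optimism that drives exploration. In the deep-network setting the paper targets, the same recipe would replace the closed-form fit with the trained network and run the inner maximization with any gradient-based optimizer.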


