An Efficient Algorithm for Deep Stochastic Contextual Bandits

by Tan Zhu, et al.

In stochastic contextual bandit (SCB) problems, an agent selects an action based on an observed context so as to maximize the cumulative reward over iterations. Several recent studies use a deep neural network (DNN) to predict the expected reward of an action and train the DNN with a stochastic gradient-based method. However, the convergence of these methods, namely whether they converge and to what, has largely been neglected. In this work, we formulate the SCB with a DNN reward function as a non-convex stochastic optimization problem, and design a stage-wise stochastic gradient descent algorithm that optimizes this problem and determines the action policy. We prove that, with high probability, the action sequence chosen by this algorithm converges to a greedy action policy with respect to a locally optimal reward function. Extensive experiments on multiple real-world datasets demonstrate the effectiveness and efficiency of the proposed algorithm.
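The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: a per-action linear model stands in for the DNN reward function, the stage schedule (constant step size within a stage, halved between stages) is an assumed common form of stage-wise SGD, and actions are chosen purely greedily. The names `stagewise_sgd_bandit`, `greedy_action`, `predict`, and all parameter values are hypothetical.

```python
import random


def predict(weights, context):
    """Linear reward estimate (assumption: stands in for the DNN reward model)."""
    return sum(w * x for w, x in zip(weights, context))


def greedy_action(model, context):
    """Return the index of the action with the largest predicted reward."""
    return max(range(len(model)), key=lambda a: predict(model[a], context))


def stagewise_sgd_bandit(true_weights, n_stages=5, steps_per_stage=400,
                         lr0=0.1, noise=0.1, seed=0):
    """Run a toy contextual bandit trained by stage-wise SGD.

    true_weights[a] defines the (hidden) expected reward of action a as a
    linear function of the context; this simulated environment is an
    assumption made only so the sketch is self-contained.
    """
    rng = random.Random(seed)
    n_actions, dim = len(true_weights), len(true_weights[0])
    model = [[0.0] * dim for _ in range(n_actions)]
    lr = lr0
    for _stage in range(n_stages):
        # Within a stage the step size is held constant.
        for _ in range(steps_per_stage):
            context = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            action = greedy_action(model, context)
            # Only the chosen action's noisy reward is observed.
            reward = predict(true_weights[action], context) + rng.gauss(0.0, noise)
            error = predict(model[action], context) - reward
            # SGD step on the squared prediction error of the chosen action.
            model[action] = [w - lr * error * x
                             for w, x in zip(model[action], context)]
        # Between stages the step size is halved (assumed schedule).
        lr *= 0.5
    return model
```

For example, `stagewise_sgd_bandit([[1.0, 0.0], [0.0, 1.0]])` returns one learned weight vector per action, and `greedy_action(model, context)` then gives the policy's choice for a new context. Because only the chosen action is updated, such a loop can settle on a policy that is greedy with respect to a locally, rather than globally, optimal reward estimate.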



