Dynamic Planning and Learning under Recovering Rewards

06/28/2021 ∙ by David Simchi-Levi, et al.

Motivated by emerging applications such as live-streaming e-commerce, promotions, and recommendations, we introduce a general class of multi-armed bandit problems with the following two features: (i) the decision maker can pull and collect rewards from at most K out of N different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled and then nonparametrically recovers as its idle time increases. With the objective of maximizing the expected cumulative reward over T time periods, we propose, construct, and prove performance guarantees for a class of "Purely Periodic Policies". For the offline problem, when all model parameters are known, our proposed policy achieves an approximation ratio on the order of 1 - O(1/√K), which is asymptotically optimal as K grows to infinity. For the online problem, when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that incurs approximately O(N√T) regret against the offline benchmark. Our framework and policy design may have the potential to be adapted to other offline planning and online learning applications with non-stationary and recovering rewards.
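To make the model concrete, the short Python sketch below simulates a recovering-rewards environment of the kind described in the abstract and evaluates a naive purely periodic baseline. It is only an illustration of the problem setup, not the paper's policy or analysis: the saturating-exponential recovery curves, the Gaussian noise, the staggered schedule, and all names in the code are assumptions introduced here.

```python
# Toy simulator for a recovering-rewards bandit: at most K of N arms may be
# pulled per period, and an arm's expected reward resets after a pull and
# recovers as its idle time grows. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 10, 3, 1_000  # N arms, at most K pulls per period, horizon T

# Assumed recovery curves R_i(d) = c_i * (1 - exp(-d / tau_i)); the paper only
# assumes a nonparametric recovery, this specific shape is for illustration.
c = rng.uniform(0.5, 1.0, size=N)
tau = rng.uniform(1.0, 10.0, size=N)

def expected_reward(i, idle):
    """Mean reward of arm i after being idle for `idle` periods."""
    return c[i] * (1.0 - np.exp(-idle / tau[i]))

def simulate(schedule):
    """Run a schedule: schedule[t] lists the (at most K) arms pulled at period t."""
    idle = np.ones(N)  # periods since each arm was last pulled
    total = 0.0
    for t in range(T):
        for i in schedule[t]:
            total += expected_reward(i, idle[i]) + rng.normal(0.0, 0.05)
        idle += 1
        for i in schedule[t]:
            idle[i] = 1  # the arm's recovery clock restarts after a pull
    return total

# Naive purely periodic baseline: each arm is assigned a fixed phase, and at
# most K arms sharing the current phase are pulled in any single period.
period = max(N // K, 1)
schedule = [[i for i in range(N) if i % period == t % period][:K] for t in range(T)]
print("cumulative reward of the naive periodic baseline:", simulate(schedule))
```

The paper's offline policies choose per-arm pulling frequencies far more carefully than this fixed-phase baseline, and its online UCB-based policy additionally estimates the unknown recovery curves from observed rewards; the sketch above only fixes intuition for the environment dynamics.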
