Smooth Sequential Optimisation with Delayed Feedback

06/21/2021
by Srivas Chennu, et al.

Stochastic delays in feedback lead to unstable sequential learning with multi-armed bandits. Recently, empirical Bayesian shrinkage has been shown to improve reward estimation in bandit learning. Here, we propose a novel adaptation to shrinkage that computes smoothed reward estimates from windowed cumulative inputs, to deal with incomplete knowledge from delayed feedback and non-stationary rewards. Using numerical simulations, we show that this adaptation retains the benefits of shrinkage and improves the stability of reward estimation by more than 50% [...] treatment allocations to the best arm by up to 3.8x, and improves statistical accuracy, with up to 8% [...] in false positive rates. Together, these advantages enable control of the trade-off between speed and stability of adaptation, and facilitate human-in-the-loop sequential optimisation.
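
The abstract does not spell out the estimator, so the following is only a minimal sketch under assumptions of our own, not the authors' method. It assumes Bernoulli rewards, per-arm cumulative (trials, successes) snapshots that contain only the feedback received so far, a fixed window taken as the difference between two snapshots, and a normal-approximation empirical Bayes shrinkage (in the style of Efron and Morris) of each arm's windowed rate toward the pooled rate. The names `windowed_counts` and `shrink_rates` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (trials, successes) observed for one arm.
Counts = Tuple[int, int]


def windowed_counts(snapshots: Sequence[Sequence[Counts]], t: int, window: int) -> List[Counts]:
    """Per-arm (trials, successes) accumulated over steps (t - window, t].

    snapshots[k][i] is the cumulative (trials, successes) for arm i as seen at step k,
    i.e. only the delayed feedback that had arrived by then (an assumption).
    """
    start = t - window
    out = []
    for i, (n_t, s_t) in enumerate(snapshots[t]):
        # Before the first snapshot, treat the cumulative counts as zero.
        n_0, s_0 = snapshots[start][i] if start >= 0 else (0, 0)
        out.append((n_t - n_0, s_t - s_0))
    return out


def shrink_rates(counts: Sequence[Counts]) -> List[float]:
    """Shrink each arm's success rate toward the pooled rate (empirical Bayes sketch)."""
    total_n = sum(n for n, _ in counts)
    total_s = sum(s for _, s in counts)
    if total_n == 0:
        return [0.0] * len(counts)  # no feedback at all in this window
    m = total_s / total_n  # pooled rate across arms
    base_var = m * (1.0 - m)  # Bernoulli variance at the pooled rate
    observed = [(n, s / n) for n, s in counts if n > 0]
    # Between-arm spread of raw rates, minus the spread expected from sampling noise alone.
    spread = sum((p - m) ** 2 for _, p in observed) / len(observed)
    noise = sum(base_var / n for n, _ in observed) / len(observed)
    tau2 = max(spread - noise, 0.0)  # estimated prior variance of true rates
    estimates = []
    for n, s in counts:
        if n == 0:
            estimates.append(m)  # no data for this arm in the window: use pooled rate
            continue
        sampling_var = base_var / n
        denom = sampling_var + tau2
        # Weight on the pooled rate grows as the arm's own estimate gets noisier.
        weight = sampling_var / denom if denom > 0 else 1.0
        estimates.append(weight * m + (1.0 - weight) * (s / n))
    return estimates


if __name__ == "__main__":
    snaps = [
        [(0, 0), (0, 0), (0, 0)],
        [(100, 10), (100, 20), (100, 30)],
        [(200, 20), (200, 40), (200, 60)],
    ]
    print([round(e, 3) for e in shrink_rates(windowed_counts(snaps, t=2, window=1))])
```

In this sketch the window discards feedback older than `window` steps, which lets estimates follow non-stationary rewards, while shrinkage damps the noise of arms with few windowed observations. A shorter window adapts faster but is noisier, which is one way to picture the speed/stability trade-off described above.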


research
02/22/2019

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

A survey is performed of various Multi-Armed Bandit (MAB) strategies in ...
research
01/17/2022

A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

We study the non-stationary stochastic multi-armed bandit problem, where...
research
04/28/2022

Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications

Multi-player multi-armed bandits (MMAB) study how decentralized players ...
research
06/04/2021

Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

We study the stochastic Multi-Armed Bandit (MAB) problem with random del...
research
06/28/2022

Dynamic Memory for Interpretable Sequential Optimisation

Real-world applications of reinforcement learning for recommendation and...
research
12/24/2021

Gaussian Process Bandits with Aggregated Feedback

We consider the continuum-armed bandits problem, under a novel setting o...
research
02/05/2018

Wireless Optimisation via Convex Bandits: Unlicensed LTE/WiFi Coexistence

Bandit Convex Optimisation (BCO) is a powerful framework for sequential ...