Dynamic Memory for Interpretable Sequential Optimisation

06/28/2022
by Srivas Chennu, et al.

Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant for minimising regret. We present a solution to handling non-stationarity that is suitable for deployment at scale, providing business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits at a different point on this trade-off than another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service in which our agent achieves interpretability whilst adapting to changing circumstances.
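The abstract describes the mechanism only at a high level. As a rough illustration of the idea — a bandit agent that grows or shrinks its memory of past observations so that a hypothesis test comparing arm rewards stays near a target statistical power — here is a minimal Python sketch. It is an assumption-laden reconstruction, not the paper's algorithm: the Thompson-sampling inner loop, the sliding-window form of memory, and the constants TARGET_POWER, ALPHA, MDE and the window step sizes are all hypothetical choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

TARGET_POWER = 0.8  # hypothetical set point of statistical power
ALPHA = 0.05        # significance level of the two-sided test
MDE = 0.05          # hypothetical minimum detectable effect on reward


def power_two_proportions(p, n, mde=MDE, alpha=ALPHA):
    """Approximate power of a two-sample z-test for proportions with
    n observations per arm, baseline rate p and effect size mde."""
    p = float(np.clip(p, 1e-6, 1.0 - mde - 1e-6))  # keep p + mde valid
    se0 = np.sqrt(2.0 * p * (1.0 - p) / n)                        # under H0
    se1 = np.sqrt((p * (1 - p) + (p + mde) * (1 - p - mde)) / n)  # under H1
    z_crit = stats.norm.ppf(1.0 - alpha / 2.0)
    return float(stats.norm.cdf((mde - z_crit * se0) / se1))


class DynamicMemoryAgent:
    """Thompson-sampling bandit that forgets old evidence: the size of its
    memory window is tuned so the power of the arm comparison tracks
    TARGET_POWER, rather than being fixed in advance."""

    def __init__(self, n_arms, window=500):
        self.n_arms = n_arms
        self.window = window  # number of most recent observations retained
        self.history = []     # (arm, reward) pairs, newest last

    def _counts(self):
        # Beta(1, 1) priors plus successes/failures inside the memory window.
        s = np.ones(self.n_arms)
        f = np.ones(self.n_arms)
        for arm, r in self.history[-self.window:]:
            s[arm] += r
            f[arm] += 1 - r
        return s, f

    def select(self):
        s, f = self._counts()
        return int(np.argmax(rng.beta(s, f)))  # Thompson sample per arm

    def update(self, arm, reward):
        self.history.append((arm, reward))
        recent = self.history[-self.window:]
        s, _ = self._counts()
        p_hat = (s.sum() - self.n_arms) / max(1, len(recent))  # pooled rate
        n_per_arm = max(2, len(recent) // self.n_arms)
        # Grow the memory when the comparison is under-powered, shrink it
        # when over-powered, so old data is forgotten no faster than the
        # interpretability (power) target allows.
        if power_two_proportions(p_hat, n_per_arm) < TARGET_POWER:
            self.window = min(self.window + 10, 2_000)
        else:
            self.window = max(self.window - 10, 100)


# Demo: two Bernoulli arms whose true rewards swap half-way through.
agent = DynamicMemoryAgent(n_arms=2)
true_p = [0.10, 0.15]
for t in range(4_000):
    if t == 2_000:
        true_p = [0.15, 0.10]  # non-stationarity: the better arm changes
    arm = agent.select()
    agent.update(arm, int(rng.random() < true_p[arm]))
print("final window size:", agent.window)
```

The design choice in this sketch mirrors the trade-off the abstract describes: holding power at a set point means the agent deliberately retains enough data to support an interpretable hypothesis test, at the cost of some extra regret relative to a forgetting schedule tuned purely for reward.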
