A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

01/17/2022
by   Yasin Abbasi-Yadkori, et al.

We study the non-stationary stochastic multi-armed bandit problem, in which the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of its dynamic regret, defined as the difference between the expected cumulative reward of an agent that chooses the optimal arm in every round and the expected cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to count how many times the identity of the optimal arm can change. We propose a method that achieves, in K-armed bandit problems, a near-optimal O(√(K N(S+1))) dynamic regret, where N is the number of rounds and S is the number of times the identity of the optimal arm changes, without prior knowledge of S or N. Previous works on this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of S to achieve similar bounds.
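The dynamic-regret notion in the abstract can be made concrete with a small simulation. The sketch below (not the paper's algorithm; the epsilon-greedy learner, the function name, and the piecewise-stationary setup are illustrative assumptions) runs a naive learner on a Bernoulli bandit whose mean vector changes between segments, and accumulates the per-round gap between the optimal arm's mean and the chosen arm's mean:

```python
import random

def dynamic_regret_sim(means_per_segment, segment_len, seed=0):
    """Illustrative sketch: expected dynamic regret of a naive
    epsilon-greedy learner on a piecewise-stationary Bernoulli bandit.

    means_per_segment: one mean vector per stationary segment; the
    identity of the optimal arm may change between segments (S counts
    such changes). The learner is never told where segments begin."""
    rng = random.Random(seed)
    K = len(means_per_segment[0])
    counts = [0] * K          # pulls per arm
    est = [0.0] * K           # running mean-reward estimates
    regret = 0.0
    for seg in means_per_segment:
        best = max(seg)       # mean reward of the optimal arm this segment
        for _ in range(segment_len):
            # epsilon-greedy: explore with probability 0.1
            if rng.random() < 0.1:
                a = rng.randrange(K)
            else:
                a = max(range(K), key=lambda i: est[i])
            r = 1.0 if rng.random() < seg[a] else 0.0
            counts[a] += 1
            est[a] += (r - est[a]) / counts[a]
            # dynamic regret compares against the per-round optimum,
            # which tracks the changing environment
            regret += best - seg[a]
    return regret

# Example with one change of the optimal arm (S = 1):
# dynamic_regret_sim([[0.9, 0.1], [0.1, 0.9]], 500)
```

Because the benchmark re-evaluates the optimal arm every round, a learner that keeps exploiting the old arm after a change keeps paying regret, which is what makes the non-stationary setting harder than the stationary one.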


Related research

02/10/2021
Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach
We propose a black-box reduction that turns a certain reinforcement lear...

08/20/2019
How to gamble with non-stationary X-armed bandits and have no regrets
In X-armed bandit problem an agent sequentially interacts with environme...

02/22/2019
Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes
A survey is performed of various Multi-Armed Bandit (MAB) strategies in ...

05/29/2022
Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret
The stochastic multi-armed bandit setting has been recently studied in t...

12/08/2017
On Adaptive Estimation for Dynamic Bernoulli Bandits
The multi-armed bandit (MAB) problem is a classic example of the explora...

06/21/2021
Smooth Sequential Optimisation with Delayed Feedback
Stochastic delays in feedback lead to unstable sequential learning using...

09/07/2016
Random Shuffling and Resets for the Non-stationary Stochastic Bandit Problem
We consider a non-stationary formulation of the stochastic multi-armed b...
