Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

07/20/2020
by Denis Denisov, et al.

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic, time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that, if the learning rates are well chosen, the policy gradient algorithm is a transient Markov chain, and the state of the chain converges to the optimal arm with logarithmic or poly-logarithmic regret.
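To make the setting concrete, the following is a minimal sketch (in Python with NumPy) of a softmax policy gradient algorithm on a Bernoulli bandit whose learning rate depends on the algorithm's current state rather than on time. The particular rate alpha = c * (1 - max_a pi_a) is an illustrative assumption, not the schedule analysed in the paper; the point is only that each update moves the policy along the probability simplex, so the successive policies form the Markov chain the abstract refers to.

```python
import numpy as np


def softmax(h):
    """Map a preference vector h to a point on the probability simplex."""
    z = h - h.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def policy_gradient_bandit(true_means, n_steps, c=1.0, seed=None):
    """Softmax policy gradient on a Bernoulli bandit.

    The policy pi lives on the probability simplex and, together with
    the preferences h, is the Markov-chain state of the algorithm.
    The learning rate alpha below is state-dependent (an assumed,
    illustrative form), not a deterministic time schedule.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    h = np.zeros(k)            # arm preferences
    best = max(true_means)
    regret = 0.0
    for _ in range(n_steps):
        pi = softmax(h)
        # State-dependent step size (assumption): shrink the step as
        # the policy concentrates its mass on a single arm.
        alpha = c * (1.0 - pi.max())
        a = rng.choice(k, p=pi)
        r = float(rng.random() < true_means[a])  # Bernoulli reward
        # REINFORCE/score-function update: grad log pi_a = e_a - pi.
        grad = -pi
        grad[a] += 1.0
        h += alpha * r * grad
        regret += best - true_means[a]  # expected (pseudo-)regret
    return softmax(h), regret


if __name__ == "__main__":
    pi, reg = policy_gradient_bandit([0.3, 0.5, 0.7], n_steps=50_000, seed=0)
    print("final policy:", np.round(pi, 3))
    print("cumulative pseudo-regret:", round(reg, 1))
```

Because alpha shrinks as pi concentrates, steps become small near the corners of the simplex; the paper's Foster-Lyapunov analysis makes precise when such state-dependent schedules yield a transient chain that drifts to the optimal corner with logarithmic or poly-logarithmic regret.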


Related research

- 11/05/2020: Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits. "We study the online restless bandit problem, where the state of each arm..."
- 09/14/2020: Hellinger KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings. "In the regret-based formulation of multi-armed bandit (MAB) problems, ex..."
- 04/17/2019: Off-Policy Policy Gradient with State Distribution Correction. "We study the problem of off-policy policy optimization in Markov decisio..."
- 09/12/2012: Regret Bounds for Restless Markov Bandits. "We consider the restless Markov bandit problem, in which the state of ea..."
- 10/13/2022: Policy Gradient With Serial Markov Chain Reasoning. "We introduce a new framework that performs decision-making in reinforcem..."
- 07/20/2020: A Short Note on Soft-max and Policy Gradients in Bandits Problems. "This is a short communication on a Lyapunov function argument for softma..."
- 01/29/2019: An accelerated variant of simulated annealing that converges under fast cooling. "Given a target function U to minimize on a finite state space X, a propo..."
