A Short Note on Soft-max and Policy Gradients in Bandits Problems

07/20/2020
by Neil Walton, et al.

This is a short communication on a Lyapunov function argument for soft-max in bandit problems. A number of excellent recent papers use differential equations to analyse policy gradient algorithms in reinforcement learning <cit.>. We give a short argument that yields a regret bound for the soft-max ordinary differential equation in bandit problems, and we derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, regret bounds can also be proved in the stochastic case <cit.>. At the end, we summarize some ideas and issues around deriving stochastic regret bounds for policy gradients.
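To make the object of study concrete: under a soft-max parameterization, gradient ascent on the expected reward J(theta) = sum_a p_a(theta) r_a gives the dynamics d theta_a / dt = p_a (r_a - sum_b p_b r_b). The minimal sketch below runs an Euler discretization of these dynamics on a toy bandit with known mean rewards; the step size, example rewards, and function names are illustrative assumptions, and the sketch shows the standard soft-max policy-gradient ODE rather than the note's exact argument or regret analysis.

```python
import numpy as np

def softmax(theta):
    # Numerically stable soft-max over arm preferences.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_pg_ode(r, eta=0.01, steps=10_000):
    """Euler discretization of the soft-max policy-gradient dynamics
        d theta_a / dt = p_a(t) * (r_a - sum_b p_b(t) r_b),
    with p(t) = softmax(theta(t)) and r_a the mean reward of arm a
    (known here, i.e. the deterministic/ODE setting, not the stochastic
    bandit). Returns the instantaneous regret along the trajectory."""
    r = np.asarray(r, dtype=float)
    theta = np.zeros_like(r)          # uniform initial policy
    best = r.max()
    regret = []
    for _ in range(steps):
        p = softmax(theta)
        avg = p @ r                   # expected reward of current policy
        theta += eta * p * (r - avg)  # soft-max policy-gradient step
        regret.append(best - avg)
    return np.array(regret)

if __name__ == "__main__":
    eta = 0.01
    reg = softmax_pg_ode([0.3, 0.5, 0.9], eta=eta)
    print("final instantaneous regret:", reg[-1])
    # Riemann-sum approximation of cumulative regret up to time eta * steps.
    print("approximate cumulative regret:", reg.sum() * eta)
```

Running this, the policy concentrates on the best arm and the instantaneous regret decays, which is the behaviour the note's Lyapunov argument quantifies in the ODE setting.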


