Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

09/20/2020
by Mengfan Xu, et al.

EXP-based algorithms are often used for exploration in multi-armed bandit problems. We revisit the EXP3.P algorithm and establish both lower and upper regret bounds in the Gaussian multi-armed bandit setting, as well as under a more general class of reward distributions. Unlike classical regret analyses, ours does not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games, and it shows improved exploration compared to the state-of-the-art.
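For reference, below is a minimal sketch of the classical EXP3.P update that the paper revisits, in its standard textbook form for rewards in [0, 1] (not the paper's Gaussian, unbounded-reward variant). The function name `exp3p`, the `reward_fn` callback, and the parameter defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def exp3p(reward_fn, K, T, eta=0.05, gamma=0.05, beta=0.01, seed=0):
    """Classical EXP3.P for a K-armed bandit with rewards in [0, 1].

    reward_fn(arm, t) returns the observed reward of `arm` at round t.
    eta: learning rate; gamma: uniform-exploration mixing; beta: optimism bias.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    chosen, rewards = [], []
    for t in range(T):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        x = reward_fn(arm, t)
        # Biased-upward importance-weighted reward estimates (the "P" in EXP3.P).
        est = np.full(K, beta) / probs
        est[arm] += x / probs[arm]
        weights = weights * np.exp(eta * est)
        weights /= weights.max()  # rescale to avoid overflow; probs are unchanged
        chosen.append(arm)
        rewards.append(x)
    return np.array(chosen), np.array(rewards)
```

As a usage sketch, `exp3p(lambda arm, t: float(np.random.random() < [0.2, 0.5, 0.8][arm]), K=3, T=10000)` runs the algorithm on three hypothetical Bernoulli arms; the paper's contribution is analyzing this style of algorithm when rewards are Gaussian and therefore unbounded.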


