Almost Boltzmann Exploration

01/25/2019
by Harsh Gupta, et al.

Boltzmann exploration is widely used in reinforcement learning to provide a trade-off between exploration and exploitation. Recently, in (Cesa-Bianchi et al., 2017) it has been shown that pure Boltzmann exploration does not perform well from a regret perspective, even in the simplest setting of stochastic multi-armed bandit (MAB) problems. In this paper, we show that a simple modification to Boltzmann exploration, motivated by a variation of the standard doubling trick, achieves O(K^(1+α) T) regret for a stochastic MAB problem with K arms, where α>0 is a parameter of the algorithm. This improves on the result in (Cesa-Bianchi et al., 2017), where an algorithm inspired by the Gumbel-softmax trick achieves O(K^2 T) regret. We also show that our algorithm achieves O(β(G)^(1+α) T) regret in stochastic MAB problems with graph-structured feedback, without knowledge of the graph structure, where β(G) is the independence number of the feedback graph. Additionally, we present extensive experimental results on real datasets and applications for multi-armed bandits with both traditional bandit feedback and graph-structured feedback. In all cases, our algorithm performs as well or better than the state-of-the-art.
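For context, classic Boltzmann (softmax) exploration, the baseline the abstract refers to, can be sketched as below. This is a minimal illustration, not the paper's modified algorithm (which adds a doubling-trick-style schedule); the function names, fixed temperature, and Bernoulli reward model are assumptions made for the example.

```python
import math
import random

def boltzmann_action(mean_estimates, temperature, rng):
    """Sample an arm with probability proportional to exp(estimate / temperature)."""
    # Subtract the max estimate before exponentiating for numerical stability.
    m = max(mean_estimates)
    weights = [math.exp((mu - m) / temperature) for mu in mean_estimates]
    r = rng.random() * sum(weights)
    cum = 0.0
    for arm, w in enumerate(weights):
        cum += w
        if r < cum:
            return arm
    return len(weights) - 1  # guard against floating-point round-off

def run_boltzmann_bandit(true_means, horizon, temperature=0.1, seed=0):
    """Play a Bernoulli K-armed bandit with fixed-temperature Boltzmann exploration."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(horizon):
        arm = boltzmann_action(estimates, temperature, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean reward of the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates
```

With a low fixed temperature the policy quickly concentrates on the empirically best arm; the regret problems analyzed in (Cesa-Bianchi et al., 2017) arise precisely from how such fixed or naively annealed temperature schedules balance the two terms.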


research
09/20/2020

Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

EXP-based algorithms are often used for exploration in multi-armed bandi...
research
07/22/2011

Robustness of Anytime Bandit Policies

This paper studies the deviations of the regret in a stochastic multi-ar...
research
10/18/2018

Exploiting Correlation in Finite-Armed Structured Bandits

We consider a correlated multi-armed bandit problem in which rewards of ...
research
09/20/2022

Multi-armed Bandit Learning on a Graph

The multi-armed bandit (MAB) problem is a simple yet powerful framework t...
research
05/29/2017

Boltzmann Exploration Done Right

Boltzmann exploration is a classic strategy for sequential decision-maki...
research
02/03/2019

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

We propose the first contextual bandit algorithm that is parameter-free,...
research
07/07/2020

Optimal Strategies for Graph-Structured Bandits

We study a structured variant of the multi-armed bandit problem specifie...
