Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning

06/15/2023
by   Amin Karbasi, et al.

Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions about the reward distributions, such as membership in conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either because of the inherent nature of the problem or to reduce communication and computation costs. In this work, we jointly study these problems in two popular settings, namely stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors while requiring only a logarithmic number of batches, and hence logarithmic communication cost. Our algorithms are computationally efficient and retain the order-optimal regret guarantees of 𝒪(log T) for stochastic MABs and 𝒪(√T) for RL. We complement our theoretical findings with experimental results.
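
To make the bandit setting above concrete, the following is a minimal sketch of what a batched Langevin Thompson Sampling loop could look like. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not specify the reward model, the MCMC scheme, or the batching rule. Here we assume unit-variance Gaussian rewards with a N(0, 1) prior, use the unadjusted Langevin algorithm as the sampler, and close a batch whenever the pull count of the chosen arm has doubled since the last synchronization. The names `langevin_samples` and `batched_langevin_ts`, and the parameters `n_steps` and `c`, are hypothetical.

```python
import numpy as np


def langevin_samples(sums, counts, rng, n_steps=30, c=0.1):
    """Approximate posterior samples for all arms via unadjusted Langevin steps.

    Assumption (not from the source): unit-variance Gaussian rewards and a
    N(0, 1) prior, so grad log posterior(theta) = sums - (counts + 1) * theta.
    """
    precision = counts + 1.0          # posterior precision per arm
    step = c / precision              # step size scaled for stability
    theta = sums / precision          # start each chain at the posterior mode
    for _ in range(n_steps):
        grad = sums - precision * theta
        noise = rng.standard_normal(theta.shape)
        theta = theta + step * grad + np.sqrt(2.0 * step) * noise
    return theta


def batched_langevin_ts(true_means, horizon, seed=0):
    """Run the sketch and return (pseudo-regret, number of batches, pull counts)."""
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    k = len(true_means)
    best = true_means.max()

    # Statistics collected so far (not yet communicated to the learner).
    sums = np.zeros(k)
    pulls = np.zeros(k, dtype=int)
    # Statistics the learner may use; refreshed only at batch boundaries.
    seen_sums = np.zeros(k)
    seen_pulls = np.zeros(k, dtype=int)

    regret, n_batches = 0.0, 0
    for _ in range(horizon):
        # Sample from the (possibly stale) approximate posterior and act greedily.
        theta = langevin_samples(seen_sums, seen_pulls, rng)
        arm = int(np.argmax(theta))
        reward = true_means[arm] + rng.standard_normal()
        sums[arm] += reward
        pulls[arm] += 1
        regret += best - true_means[arm]

        # Hypothetical doubling rule: end the batch once the chosen arm's pull
        # count has doubled relative to the last synchronized count.
        if pulls[arm] >= max(1, 2 * seen_pulls[arm]):
            seen_sums, seen_pulls = sums.copy(), pulls.copy()
            n_batches += 1
    return regret, n_batches, pulls


regret, n_batches, pulls = batched_langevin_ts([0.2, 0.5, 0.9], horizon=2000)
print(f"pseudo-regret={regret:.1f}, batches={n_batches}, pulls={pulls}")
```

Under this doubling rule each arm can close at most about log₂ T + 1 batches, so the learner's statistics are synchronized only 𝒪(K log T) times over T rounds for K arms, which is the sense in which communication is logarithmic in this sketch.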

research
10/23/2021

Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits

Contextual multi-armed bandits are classical models in reinforcement lea...
research
08/17/2022

Sampling Through the Lens of Sequential Decision Making

Sampling is ubiquitous in machine learning methodologies. Due to the gro...
research
05/10/2020

Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Artificial behavioral agents are often evaluated based on their consiste...
research
02/16/2021

The Randomized Elliptical Potential Lemma with an Application to Linear Thompson Sampling

In this note, we introduce a randomized version of the well-known ellipt...
research
06/21/2020

On Optimism in Model-Based Reinforcement Learning

The principle of optimism in the face of uncertainty is prevalent throug...
research
10/12/2019

Thompson Sampling in Non-Episodic Restless Bandits

Restless bandit problems assume time-varying reward distributions of the...
research
04/13/2020

Distributed Learning: Sequential Decision Making in Resource-Constrained Environments

We study cost-effective communication strategies that can be used to imp...
