Randomised Bayesian Least-Squares Policy Iteration

04/06/2019
by Nikolaos Tziortziotis, et al.

We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. We also propose an online variant, randomised BLSPI (RBLSPI), which improves its policy after an incomplete policy evaluation step. In the online setting, the exploration-exploitation dilemma must be addressed, because the agent tries to discover the optimal policy from samples it collects itself. RBLSPI exploits the ability of BLSTD to quantify uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions and then selects actions greedily with respect to the sampled value function. The effectiveness and exploration abilities of RBLSPI are demonstrated experimentally in several environments.
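The Thompson-sampling-style action selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear value function Q(s, a) = phi(s, a)^T w with a Gaussian posterior over the weights w, and the function name `sample_greedy_action`, the feature matrix, and the posterior parameters are all hypothetical. How BLSTD actually computes the posterior is not covered here.

```python
import numpy as np


def sample_greedy_action(mean, cov, features, rng):
    """Pick an action greedily under one sampled value function.

    Assumptions (not taken from the paper):
      mean, cov : Gaussian posterior over weights w, shapes (d,) and (d, d)
      features  : array of shape (n_actions, d); row a is phi(s, a)
      rng       : a numpy.random.Generator
    """
    # Draw one weight vector, i.e. one value function, from the posterior.
    w = rng.multivariate_normal(mean, cov)
    # Evaluate the sampled Q(s, a) for every action in the current state.
    q_values = features @ w
    # Act greedily with respect to the sampled value function.
    return int(np.argmax(q_values)), q_values


# Hypothetical two-action example: action 0 looks better on average,
# but posterior uncertainty lets action 1 be tried occasionally.
rng = np.random.default_rng(0)
mean = np.array([1.0, 0.0])
cov = 0.5 * np.eye(2)
features = np.array([[1.0, 0.0], [0.0, 1.0]])
action, q_values = sample_greedy_action(mean, cov, features, rng)
```

Because the weights are resampled at every call, actions whose values are uncertain are still selected from time to time, and this becomes rarer as the posterior covariance shrinks.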


research
04/17/2017

O^2TD: (Near)-Optimal Off-Policy TD Learning

Temporal difference learning and Residual Gradient methods are the most ...
research
01/06/2016

Angrier Birds: Bayesian reinforcement learning

We train a reinforcement learner to play a simplified version of the gam...
research
06/15/2021

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

We propose a model-free reinforcement learning algorithm inspired by the...
research
09/13/2019

ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off

Traditionally, off-policy learning algorithms (such as Q-learning) and e...
research
01/30/2013

Solving POMDPs by Searching in Policy Space

Most algorithms for solving POMDPs iteratively improve a value function ...
research
04/08/2022

Approximate discounting-free policy evaluation from transient and recurrent states

In order to distinguish policies that prescribe good from bad actions in...
research
12/09/2017

Bayesian Q-learning with Assumed Density Filtering

While off-policy temporal difference methods have been broadly used in r...
