Posterior Sampling for Reinforcement Learning Without Episodes

08/09/2016
by   Ian Osband, et al.

This is a brief technical note to clarify some of the issues that arise when applying the algorithm posterior sampling for reinforcement learning (PSRL) in environments without fixed episodes. In particular, this paper aims to:

- Review some of the results which have been proven for finite horizon MDPs (Osband et al 2013, 2014a, 2014b, 2016) and also for MDPs with finite ergodic structure (Gopalan et al 2014).
- Review similar results for optimistic algorithms in infinite horizon problems (Jaksch et al 2010, Bartlett and Tewari 2009, Abbasi-Yadkori and Szepesvari 2011), with particular attention to dynamic episode growth.
- Highlight the delicate technical issue which has led to a fault in the proof of the lazy-PSRL algorithm (Abbasi-Yadkori and Szepesvari 2015). We present an explicit counterexample to this style of argument. Therefore, we suggest that Theorem 2 in (Abbasi-Yadkori and Szepesvari 2015) be instead considered a conjecture, as it has no rigorous proof.
- Present pragmatic approaches to apply PSRL in infinite horizon problems. We conjecture that, under some additional assumptions, it will be possible to obtain bounds O(√T) even without episodic reset. A sketch of one such pragmatic approach is given below.

We hope that this note serves to clarify existing results in the field of reinforcement learning and provides interesting motivation for future work.
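The idea of running PSRL with artificial, dynamically growing episodes rather than an environment-imposed reset lends itself to a short illustration. Below is a minimal sketch (not code from the paper) of tabular PSRL with a doubling artificial-episode schedule, in the spirit of the doubling schemes used in the optimistic-algorithm analyses the note reviews. The environment interface `env.reset()` / `env.step()`, the conjugate priors, and the discounted value-iteration planner are all illustrative assumptions, not the note's prescribed method.

```python
import numpy as np

def solve_sampled_mdp(P, R, gamma=0.99, tol=1e-6):
    """Discounted value iteration on a sampled MDP; returns a greedy policy.
    (Stand-in planner; the note's analysis concerns undiscounted regret.)"""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)          # Q has shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def psrl_without_episodes(env, n_states, n_actions, total_steps):
    """Tabular PSRL with artificial episodes that double in length,
    rather than relying on an environment-imposed episodic reset."""
    # Conjugate posteriors: Dirichlet over transitions, Normal (known variance) over mean rewards.
    trans_counts = np.ones((n_states, n_actions, n_states))   # Dirichlet(1, ..., 1) prior
    reward_sum = np.zeros((n_states, n_actions))
    reward_n = np.zeros((n_states, n_actions))

    state = env.reset()                  # hypothetical interface: reset() -> state
    t, episode_len = 0, 1
    while t < total_steps:
        # Resample one MDP from the posterior at the start of each artificial episode.
        P = np.array([[np.random.dirichlet(trans_counts[s, a])
                       for a in range(n_actions)] for s in range(n_states)])
        R = np.random.normal(reward_sum / np.maximum(reward_n, 1.0),
                             1.0 / np.sqrt(reward_n + 1.0))
        policy = solve_sampled_mdp(P, R)

        # Follow the sampled MDP's optimal policy for the current artificial episode.
        for _ in range(episode_len):
            if t >= total_steps:
                break
            action = policy[state]
            next_state, reward = env.step(action)   # hypothetical: step(a) -> (state, reward)
            trans_counts[state, action, next_state] += 1
            reward_sum[state, action] += reward
            reward_n[state, action] += 1
            state, t = next_state, t + 1

        episode_len *= 2   # grow the artificial episode, analogous to UCRL-style doubling schedules
```

The doubling schedule here is only one heuristic for deciding when to resample from the posterior; the choice of such a schedule, and the assumptions under which it yields O(√T) regret without episodic reset, is exactly the open question the note raises.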


