Safe Policy Improvement with Baseline Bootstrapping

12/19/2017 ∙ by Romain Laroche, et al. ∙ 0

A common goal in Reinforcement Learning is to derive a good strategy given a limited batch of data. In this paper, we adopt the safe policy improvement (SPI) approach: we compute a target policy guaranteed to perform at least as well as a given baseline policy. Our SPI strategy, inspired by the knows-what-it-knows paradigms, consists in bootstrapping the target policy with the baseline policy when it does not know. We develop two computationally efficient bootstrapping algorithms, a value-based and a policy-based, both accompanied with theoretical SPI bounds. Three algorithm variants are proposed. We empirically show the literature algorithms limits on a small stochastic gridworld problem, and then demonstrate that our five algorithms not only improve the worst case scenarios, but also the mean performance.



There are no comments yet.


page 16

page 17

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.