Exploration in Model-based Reinforcement Learning with Randomized Reward

01/09/2023
by Lingxiao Wang, et al.

Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analyses typically require optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such a gap between theory and practice leads us to ask whether randomized model ensembles guarantee optimism, and hence near-optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored approach to exploration in MBRL. We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees partial optimism, which in turn yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to general function approximation and identify conditions under which reward randomization attains provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our work establishes the first worst-case regret analysis of randomized MBRL with function approximation.
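
The abstract does not spell out the algorithm, so the following is only a minimal, hypothetical sketch of the general recipe of exploration via reward randomization in a model-based loop: fit a model from data, perturb the estimated reward with random noise, plan against the perturbed model, and roll out the greedy policy. The tabular setting, the Gaussian perturbation, and the scale sigma are illustrative assumptions and are not taken from the paper.

# Minimal, hypothetical sketch of exploration via reward randomization in MBRL.
# This is NOT the paper's algorithm; it only illustrates the general recipe:
# fit a model, randomize the estimated reward, plan against the perturbed model.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_episodes = 5, 3, 10, 50
sigma = 0.1  # assumed scale of the reward perturbation (illustrative)

# Ground-truth tabular MDP, standing in for the unknown environment.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r_true = rng.uniform(size=(n_states, n_actions))

counts = np.ones((n_states, n_actions, n_states))  # transition pseudo-counts
r_sum = np.zeros((n_states, n_actions))
r_cnt = np.ones((n_states, n_actions))

def plan(P, r):
    """Finite-horizon value iteration; returns a greedy time-dependent policy."""
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = r + P @ V                      # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for ep in range(n_episodes):
    # 1. Fit the model from the data gathered so far.
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    r_hat = r_sum / r_cnt
    # 2. Randomize the reward: add Gaussian noise to encourage exploration.
    r_tilde = r_hat + sigma * rng.standard_normal(r_hat.shape)
    # 3. Plan greedily against the perturbed model.
    policy = plan(P_hat, r_tilde)
    # 4. Roll out the policy in the true environment and update the dataset.
    s = 0
    for h in range(horizon):
        a = policy[h, s]
        s_next = rng.choice(n_states, p=P_true[s, a])
        r_sum[s, a] += r_true[s, a]
        r_cnt[s, a] += 1
        counts[s, a, s_next] += 1
        s = s_next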

research · 06/07/2019
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
This paper studies a recent proposal to use randomized value functions t...

research · 10/23/2020
Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration
This paper studies regret minimization with randomized value functions i...

research · 06/15/2021
Randomized Exploration for Reinforcement Learning with General Value Function Approximation
We propose a model-free reinforcement learning algorithm inspired by the...

research · 02/19/2021
Randomized Exploration is Near-Optimal for Tabular MDP
We study exploration using randomized value functions in Thompson Sampli...

research · 07/16/2020
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
Direct policy gradient methods for reinforcement learning are a successf...

research · 07/26/2016
The Price of Anarchy in Auctions
This survey outlines a general and modular theory for proving approximat...

research · 11/18/2019
TaskShuffler++: Real-Time Schedule Randomization for Reducing Worst-Case Vulnerability to Timing Inference Attacks
This paper presents a schedule randomization algorithm that reduces the ...