RL for Latent MDPs: Regret Guarantees and a Lower Bound

02/09/2021
by   Jeongyeol Kwon, et al.
0

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)^M) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with local guarantee, i.e., providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., Boots et al.) and a reachability assumption, we show that the need for initialization can be removed.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/15/2014

Near-optimal Reinforcement Learning in Factored MDPs

Any reinforcement learning algorithm that applies to all Markov decision...
research
06/24/2021

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

We derive a novel asymptotic problem-dependent lower-bound for regret mi...
research
07/18/2021

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely-used method in reinforcement learning. D...
research
10/27/2021

Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection

We study the role of the representation of state-action value functions ...
research
04/25/2023

Provable benefits of general coverage conditions in efficient online RL with function approximation

In online reinforcement learning (RL), instead of employing standard str...
research
07/22/2022

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

We present regret minimization algorithms for stochastic contextual MDPs...
research
03/22/2023

Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

Although deep reinforcement learning (DRL) has many success stories, the...

Please sign up or login with your details

Forgot password? Click here to reset