Reward-Mixing MDPs with a Few Latent Contexts are Learnable

10/05/2022
by Jeongyeol Kwon, et al.

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among M candidates, and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy, that is, one that nearly maximizes the H-time-step cumulative reward in such a model. Previous work established an upper bound for RMMDPs for M=2. In this work, we resolve several open questions that remained for the RMMDP model. For an arbitrary M≥2, we provide a sample-efficient algorithm, EM^2, that outputs an ϵ-optimal policy using Õ(ϵ^{-2} · S^d A^d · poly(H, Z)^d) episodes, where S and A are the numbers of states and actions respectively, H is the time horizon, Z is the support size of the reward distributions, and d=min(2M-1, H). Our technique is a higher-order extension of the method-of-moments-based approach; nevertheless, the design and analysis of the algorithm require several new ideas beyond existing techniques. We also provide a lower bound of (SA)^Ω(√M) / ϵ^2 for a general instance of an RMMDP, which supports the claim that sample complexity super-polynomial in M is necessary.
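To make the setting concrete, the following is a minimal sketch of the interaction protocol described above, not the paper's algorithm. It assumes transitions shared across contexts, Bernoulli rewards (so Z = 2), and explicit mixing weights; the names `moment_order`, `sample_episode`, `P`, `R`, and `weights` are hypothetical and chosen only for illustration.

```python
import random

def moment_order(M, H):
    """Exponent d = min(2M - 1, H) from the sample-complexity bound."""
    return min(2 * M - 1, H)

def sample_episode(P, R, weights, policy, H, s0=0, rng=None):
    """Roll out one episode of a reward-mixing MDP (illustrative assumptions).

    P[s][a]    -- list of next-state probabilities, shared by all contexts
    R[m][s][a] -- probability of reward 1 under latent reward model m
    weights    -- mixing weights over the M latent reward models
    policy     -- function (state, time step) -> action
    """
    rng = rng or random.Random()
    # Nature picks the latent reward model once; the agent never observes m.
    m = rng.choices(range(len(R)), weights=weights)[0]
    s = s0
    trajectory = []
    for t in range(H):
        a = policy(s, t)
        # Bernoulli reward drawn from the hidden reward model m.
        r = 1 if rng.random() < R[m][s][a] else 0
        trajectory.append((s, a, r))
        # The next state does not depend on the latent context.
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
    return trajectory
```

For example, with M = 2 and any horizon H ≥ 3, `moment_order` returns 3, so the bound above reads Õ(ϵ^{-2} · S^3 A^3 · poly(H, Z)^3); for short horizons the exponent is capped at H.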
