Large Scale Markov Decision Processes with Changing Rewards

05/25/2019
by Adrian Rivera Cardoso, et al.

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of O(√(τ (ln|S| + ln|A|) T) ln(T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A| per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm with computational complexity polynomial in d. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first Õ(√(T)) regret bound for large scale MDPs with changing rewards.
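To make the notion of a state-action occupancy measure concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's algorithm: the toy two-state MDP, the fixed uniform policy, and the helper names (`stationary_distribution`, `occupancy_measure`, `average_reward`) are all assumptions introduced for this example. It shows that, for an ergodic chain, the long-run average reward of a policy is linear in its occupancy measure μ(s, a) = ν(s) π(a|s), where ν is the stationary state distribution.

```python
def stationary_distribution(P_pi, iters=1000):
    """Power iteration for the stationary distribution of a row-stochastic matrix."""
    n = len(P_pi)
    nu = [1.0 / n] * n
    for _ in range(iters):
        nu = [sum(nu[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return nu


def occupancy_measure(P, policy):
    """Return mu[s][a] = nu(s) * policy[s][a], with nu stationary under the policy.

    P[s][a][s2] is a transition probability; policy[s][a] is an action probability.
    """
    n_states, n_actions = len(policy), len(policy[0])
    # State-to-state transition matrix induced by the policy.
    P_pi = [
        [sum(policy[s][a] * P[s][a][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    nu = stationary_distribution(P_pi)
    return [[nu[s] * policy[s][a] for a in range(n_actions)]
            for s in range(n_states)]


def average_reward(mu, r):
    """Long-run average reward as the inner product <mu, r>."""
    return sum(mu[s][a] * r[s][a]
               for s in range(len(mu)) for a in range(len(mu[0])))


# Hypothetical toy MDP: 2 states, 2 actions (values chosen for illustration).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.7, 0.3], [0.1, 0.9]],  # transitions from state 1 under actions 0 and 1
]
policy = [[0.5, 0.5], [0.5, 0.5]]    # uniform random policy
rewards = [[1.0, 0.0], [0.0, 1.0]]   # one reward function for a single period

mu = occupancy_measure(P, policy)
avg = average_reward(mu, rewards)
```

Because the average reward is linear in μ, a sequence of changing reward functions can be treated as a sequence of linear objectives over the set of valid occupancy measures. With a linear architecture, one would instead work with a d-dimensional parameter θ and a feature matrix Φ such that μ ≈ Φθ; the choice of Φ is not specified here.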

