Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

12/03/2019

∙

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|^2√(|A|T)) regret with high probability, where L is the horizon, |X| is the number of states, |A| is the number of actions, and T is the number of episodes. To the best of our knowledge, our algorithm is the first one to ensure sub-linear regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an upper occupancy bound.

READ FULL TEXT

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Sign in with Google

Consider DeepAI Pro