Online Learning in Adversarial MDPs: Is the Communicating Case Harder than Ergodic?

11/03/2021
by Gautam Chandrasekaran, et al.

We study online learning in adversarial communicating Markov decision processes (MDPs) with full information. We give an algorithm that achieves O(√T) regret with respect to the best fixed deterministic policy in hindsight when the transitions are deterministic. We also prove a regret lower bound for this setting that is tight up to polynomial factors in the MDP parameters. Finally, we give an inefficient algorithm that achieves O(√T) regret in general communicating MDPs (under an additional mild restriction on the transition dynamics).
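For context, the regret measures the gap between the learner's cumulative loss and that of the best fixed policy in hindsight. A standard formulation (our notation, not necessarily the paper's) with adversarially chosen per-round losses ℓ_t, learner policy π_t, and comparator class Π of fixed deterministic policies is:

$$\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi).$$

An O(√T) bound on this quantity means the learner's average per-round loss approaches that of the best fixed policy at rate O(1/√T).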
