Online Learning in Adversarial MDPs: Is the Communicating Case Harder than Ergodic?
We study online learning in adversarial communicating Markov Decision Processes with full information. When the transitions are deterministic, we give an algorithm that achieves O(√T) regret with respect to the best fixed deterministic policy in hindsight. We complement this with a regret lower bound for this setting that matches the upper bound up to polynomial factors in the MDP parameters. Finally, we give an inefficient algorithm that achieves O(√T) regret in general communicating MDPs, under an additional mild restriction on the transition dynamics.
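To make the setting concrete, here is a minimal Python sketch of the full-information protocol and the comparator benchmark on a toy deterministic-transition MDP. Everything in it is an illustrative assumption: the instance sizes, the random loss tables, and the Hedge-over-policies learner are placeholders, not the paper's algorithm (naively switching policies mid-trajectory does not by itself yield the claimed O(√T) bound).

```python
import itertools
import numpy as np

# Toy instance (all sizes and names here are illustrative assumptions,
# not taken from the paper).
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 2000

# Deterministic transitions: next_state[s, a] is the unique successor.
# A random table is not guaranteed to be communicating; a real
# experiment would construct or verify that property.
next_state = rng.integers(n_states, size=(n_states, n_actions))

# The comparator class: all deterministic stationary policies,
# represented as tuples pi with pi[s] = action taken in state s.
policies = list(itertools.product(range(n_actions), repeat=n_states))

# Generic full-information learner: Hedge (exponential weights) over
# the finite policy class -- a placeholder to make the protocol
# concrete, NOT the paper's algorithm.
eta = np.sqrt(np.log(len(policies)) / T)

# Because transitions are deterministic, each fixed policy follows a
# single trajectory, so its hindsight loss can be tracked incrementally.
policy_state = np.zeros(len(policies), dtype=int)
policy_loss = np.zeros(len(policies))

learner_loss, s = 0.0, 0
for t in range(T):
    # Learner samples a policy and acts with it at the current state.
    w = np.exp(-eta * policy_loss)
    pi = policies[rng.choice(len(policies), p=w / w.sum())]

    # Adversary reveals the full loss table for this round
    # (full-information feedback).
    loss_t = rng.random((n_states, n_actions))

    learner_loss += loss_t[s, pi[s]]
    s = next_state[s, pi[s]]

    # Full information lets us advance every comparator policy's own
    # trajectory and hindsight loss.
    for i, p in enumerate(policies):
        a = p[policy_state[i]]
        policy_loss[i] += loss_t[policy_state[i], a]
        policy_state[i] = next_state[policy_state[i], a]

# Regret against the best fixed deterministic policy in hindsight.
print(f"regret: {learner_loss - policy_loss.min():.1f}")
```

Deterministic transitions are what make this benchmark easy to evaluate: each comparator policy's trajectory, and hence its cumulative loss, is uniquely determined in hindsight.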