Large Scale Markov Decision Processes with Changing Rewards

05/25/2019

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of O(√(τ (ln|S| + ln|A|) T) ln(T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A| per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm with computational complexity polynomial in d. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first Õ(√T) regret bound for large scale MDPs with changing rewards.
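To make the term "state-action occupancy measure" concrete, here is a minimal sketch, not the paper's algorithm. It assumes a hypothetical 2-state, 2-action MDP with made-up transition probabilities and a fixed randomized policy; the names `P`, `policy`, `occupancy_measure`, and `average_reward` are illustrative only. It computes the stationary occupancy measure mu(s, a) by power iteration and shows that the long-run average reward is the inner product of mu with the reward table, i.e. linear in mu.

```python
# Hypothetical toy MDP: 2 states, 2 actions. All numbers are illustrative.
# P[s][a][s2] = probability of moving to state s2 after action a in state s.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
# policy[s][a] = probability of choosing action a in state s.
policy = [[0.5, 0.5], [0.5, 0.5]]


def occupancy_measure(P, policy, iters=1000):
    """Stationary state-action occupancy measure mu[s][a] of a fixed policy."""
    n_s, n_a = len(P), len(P[0])
    nu = [1.0 / n_s] * n_s  # state distribution, initialized to uniform
    for _ in range(iters):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                for s2 in range(n_s):
                    nxt[s2] += nu[s] * policy[s][a] * P[s][a][s2]
        nu = nxt  # one step of the Markov chain induced by the policy
    # mu(s, a) = stationary probability of s times probability of a given s.
    return [[nu[s] * policy[s][a] for a in range(n_a)] for s in range(n_s)]


def average_reward(mu, r):
    """Long-run average reward <mu, r>, linear in the occupancy measure."""
    return sum(mu[s][a] * r[s][a]
               for s in range(len(mu)) for a in range(len(mu[0])))


mu = occupancy_measure(P, policy)
r = [[1.0, 0.0], [0.0, 1.0]]  # an arbitrary reward table for one period
print(mu, average_reward(mu, r))
```

The table `mu` has |S|·|A| entries, which is what becomes infeasible when the state space is large. The linear architecture mentioned above replaces this table with a d-parameter approximation; that step is not shown here.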
