Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes

02/22/2023
by Emmeran Johnson, et al.

The classical algorithms used in tabular reinforcement learning (Value Iteration and Policy Iteration) have been shown to converge linearly with a rate given by the discount factor γ of a discounted Markov Decision Process. Recently, there has been increased interest in the study of gradient-based methods. In this work, we show that the dimension-free linear γ-rate of classical reinforcement learning algorithms can be achieved by a general family of unregularised Policy Mirror Descent (PMD) algorithms under an adaptive step size. We also provide a matching worst-case lower bound, demonstrating that the γ-rate is optimal for PMD methods. Our work offers a novel perspective on the convergence of PMD. We avoid the use of the performance difference lemma beyond establishing the monotonic improvement of the iterates, which leads to a simple analysis that may be of independent interest. We also extend our analysis to the inexact setting and establish the first dimension-free ε-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
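For concreteness, the snippet below is a minimal sketch of exact tabular PMD with a KL mirror map (multiplicative, softmax-style updates) on a small randomly generated MDP. It is not the paper's implementation: the geometrically growing step size `eta0 / gamma**k`, the random MDP, and all helper names are illustrative assumptions standing in for the adaptive step size analysed in the paper.

```python
import numpy as np

def exact_q_values(P, r, pi, gamma):
    """Exact Q^pi for a tabular MDP via a linear solve.
    P: (S, A, S) transition kernel, r: (S, A) rewards, pi: (S, A) policy."""
    S, A, _ = P.shape
    # State-action transition matrix under pi: M[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
    M = np.einsum("sap,pb->sapb", P, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * M, r.reshape(S * A))
    return q.reshape(S, A)

def pmd_kl(P, r, gamma, num_iters=30, eta0=10.0):
    """Exact Policy Mirror Descent with a KL mirror map.
    The step-size schedule eta0 / gamma**k is an illustrative stand-in for the
    adaptive step size used in the paper, not the paper's exact choice."""
    S, A, _ = P.shape
    h = np.zeros((S, A))                      # unnormalised log-policy (uniform start)
    for k in range(num_iters):
        pi = np.exp(h - h.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)   # current policy pi_k
        Q = exact_q_values(P, r, pi, gamma)   # exact evaluation of pi_k
        h = h + (eta0 / gamma ** k) * Q       # KL-PMD update: pi_{k+1} ∝ pi_k * exp(eta_k Q)
    pi = np.exp(h - h.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel
    r = rng.uniform(size=(S, A))                # random rewards in [0, 1]
    pi = pmd_kl(P, r, gamma)
    print("Greedy actions of the PMD policy:", pi.argmax(axis=1))
```

As the step size grows, the update concentrates mass on the greedy actions of the current Q-function, which is the mechanism behind the linear γ-rate in the exact setting.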
