Policy Mirror Descent Inherently Explores Action Space

03/08/2023
by Yan Li, et al.

Designing computationally efficient exploration strategies for on-policy first-order methods that attain the optimal 𝒪(1/ϵ^2) sample complexity remains an open question for solving Markov decision processes (MDPs). This manuscript answers this question from a perspective of simplicity, by showing that whenever exploration over the state space is implied by the MDP structure, there seems to be little need for sophisticated exploration strategies. We revisit a stochastic policy gradient method, named stochastic policy mirror descent (SPMD), applied to the infinite-horizon, discounted MDP with finite state and action spaces. Accompanying SPMD, we present two on-policy evaluation operators, both of which simply follow the current policy for trajectory collection, with no explicit exploration or any other form of intervention. SPMD with the first evaluation operator, named value-based estimation, is tailored to the Kullback-Leibler (KL) divergence. Provided the Markov chains on the state space induced by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, an 𝒪̃(1/ϵ^2) sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, named truncated on-policy Monte Carlo, attains an 𝒪̃(ℋ_𝒟/ϵ^2) sample complexity under the same assumption on the state chains of the generated policies. We characterize ℋ_𝒟 as a divergence-dependent function of the effective horizon and the size of the action space, which leads to an exponential dependence on the latter two quantities for the KL divergence, and a polynomial dependence for the divergence induced by the negative Tsallis entropy. These sample complexities appear to be new among on-policy stochastic policy gradient methods without explicit exploration.
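As a rough illustration of the kind of update the abstract refers to, the sketch below implements one tabular policy mirror descent step with the KL divergence (which admits the familiar closed-form multiplicative update), paired with a crude truncated on-policy Monte Carlo value estimate. The environment callbacks `env_step` and `reset`, the step size `eta`, and the estimator details are illustrative assumptions, not the paper's SPMD method or its evaluation operators.

```python
import numpy as np

def pmd_kl_step(pi, Q, eta):
    """One KL-based policy mirror descent update of a tabular policy.

    With the KL divergence as the Bregman distance, the proximal step has
    the closed form pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q_k(s, a)).

    pi  : (S, A) array, current policy, rows sum to 1
    Q   : (S, A) array, estimated state-action values of pi
    eta : step size (an assumed tuning parameter, not the paper's schedule)
    """
    logits = np.log(pi + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def mc_q_estimate(env_step, reset, pi, gamma, horizon, n_traj, S, A, rng):
    """Crude on-policy Monte Carlo estimate of Q^pi from truncated rollouts.

    env_step(s, a) -> (reward, next_state) and reset() -> initial state are
    hypothetical environment callbacks used only for illustration.
    """
    Q_sum = np.zeros((S, A))
    counts = np.zeros((S, A))
    for _ in range(n_traj):
        s = reset()
        traj = []
        for _ in range(horizon):
            a = rng.choice(A, p=pi[s])          # follow the policy; no explicit exploration
            r, s_next = env_step(s, a)
            traj.append((s, a, r))
            s = s_next
        G = 0.0
        for (s_t, a_t, r_t) in reversed(traj):  # discounted return-to-go
            G = r_t + gamma * G
            Q_sum[s_t, a_t] += G
            counts[s_t, a_t] += 1
    return Q_sum / np.maximum(counts, 1)
```

Because the KL-based update is multiplicative in the current policy, no action's probability is ever driven exactly to zero by a single step, which is one concrete sense in which the mirror descent update itself keeps exploring the action space while trajectories are collected purely on-policy.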
