Policy Mirror Descent Inherently Explores Action Space

03/08/2023
by Yan Li et al.

Designing computationally efficient exploration strategies for on-policy first-order methods that attain the optimal 𝒪(1/ϵ^2) sample complexity remains an open problem for solving Markov decision processes (MDPs). This manuscript answers the question from a perspective of simplicity, by showing that whenever exploration over the state space is implied by the MDP structure, there appears to be little need for sophisticated exploration strategies. We revisit a stochastic policy gradient method, named stochastic policy mirror descent (SPMD), applied to the infinite-horizon, discounted MDP with finite state and action spaces. Accompanying SPMD, we present two on-policy evaluation operators, both of which simply follow the current policy for trajectory collection, with no explicit exploration or any other form of intervention. SPMD with the first evaluation operator, named value-based estimation, is tailored to the Kullback-Leibler (KL) divergence. Provided the Markov chains on the state space induced by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, it attains an 𝒪̃(1/ϵ^2) sample complexity with a linear dependence on the size of the action space. SPMD with the second evaluation operator, named truncated on-policy Monte Carlo, attains an 𝒪̃(ℋ_𝒟/ϵ^2) sample complexity under the same assumption on the state chains of the generated policies. We characterize ℋ_𝒟 as a divergence-dependent function of the effective horizon and the size of the action space, which leads to an exponential dependence on these two quantities for the KL divergence and a polynomial dependence for the divergence induced by the negative Tsallis entropy. These sample complexities appear to be new among on-policy stochastic policy gradient methods without explicit exploration.
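
To make the policy update concrete: with the KL divergence as the Bregman distance, each policy mirror descent step admits a closed-form multiplicative (softmax-style) update of the current policy using estimated action values. The sketch below is a minimal tabular illustration under a reward-maximization convention; the function name, toy dimensions, and stand-in Q-estimates are hypothetical, and it does not reproduce the paper's evaluation operators or step-size schedule.

```python
import numpy as np

def kl_pmd_step(policy, q_values, step_size):
    """One tabular policy mirror descent step with the KL divergence as the
    Bregman distance, which reduces to the multiplicative update
        pi_{k+1}(a|s)  ∝  pi_k(a|s) * exp(step_size * Q(s, a)).

    policy    : (num_states, num_actions) array, each row a distribution over actions
    q_values  : estimated action values of the current policy, same shape
                (stand-ins here; the paper obtains them from on-policy evaluation)
    step_size : positive scalar
    """
    # Work in log space and subtract the per-state max for numerical stability.
    logits = np.log(policy + 1e-12) + step_size * q_values
    logits -= logits.max(axis=1, keepdims=True)
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Hypothetical usage on a toy problem with 3 states and 2 actions.
rng = np.random.default_rng(0)
pi = np.full((3, 2), 0.5)            # start from the uniform policy
q_hat = rng.normal(size=(3, 2))      # stand-in for an on-policy Q-estimate
pi = kl_pmd_step(pi, q_hat, step_size=0.1)
print(pi)                            # rows remain valid probability distributions
```

With a divergence other than KL (e.g., the one induced by the negative Tsallis entropy, as in the paper's second result), the per-state subproblem no longer reduces to this multiplicative form and generally requires solving a small per-state optimization instead.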
