On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

06/15/2021
by Amrit Singh Bedi, et al.

Reinforcement learning is a framework for interactive decision-making in which incentives are revealed sequentially over time, without a model of the system dynamics. Because it scales to continuous spaces, we focus on policy search, where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), global optimality can be obtained under persistent exploration and a suitable parameterization. By contrast, in continuous spaces, non-convexity poses a pathological challenge, as evidenced by existing convergence results that are mostly limited to stationarity or arbitrary local extrema. To close this gap, we take a step towards persistent exploration in continuous spaces through policy parameterizations built from heavier-tailed distributions with tail-index parameter alpha, which increase the likelihood of large jumps in state space. Doing so invalidates the smoothness conditions on the score function that are standard in PG analyses. We therefore establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit- and transition-time analysis of a suitably defined Markov chain, identifying that policies associated with heavier-tailed Lévy processes converge to wider peaks. This phenomenon is known to yield improved stability to perturbations in supervised learning, and we corroborate that it also manifests as improved performance in policy search, especially when myopic and farsighted incentives are misaligned.
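
To make the heavy-tailed parameterization concrete, below is a minimal sketch of a continuous-control policy whose actions are drawn from a Student-t distribution, one common heavy-tailed family with an explicit tail index (its degrees of freedom). The class name HeavyTailedPolicy, the single-layer architecture, and the choice of Student-t rather than a general alpha-stable law are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class HeavyTailedPolicy(nn.Module):
    """Policy with Student-t action noise; `alpha` is the tail index
    (degrees of freedom). Smaller alpha -> heavier tails -> more frequent
    large exploratory jumps; alpha -> infinity recovers a Gaussian policy."""

    def __init__(self, obs_dim: int, act_dim: int, alpha: float = 1.5):
        super().__init__()
        self.loc = nn.Linear(obs_dim, act_dim)               # state-dependent location
        self.log_scale = nn.Parameter(torch.zeros(act_dim))  # learned global scale
        self.alpha = alpha                                   # tail index (fixed here)

    def dist(self, obs: torch.Tensor) -> torch.distributions.StudentT:
        return torch.distributions.StudentT(
            df=self.alpha, loc=self.loc(obs), scale=self.log_scale.exp()
        )

    def act(self, obs: torch.Tensor):
        d = self.dist(obs)
        a = d.sample()
        # The log-probability feeds the score-function (policy gradient) estimator.
        return a, d.log_prob(a).sum(-1)

# REINFORCE-style score-function update on placeholder data:
policy = HeavyTailedPolicy(obs_dim=4, act_dim=2, alpha=1.5)
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)
obs = torch.randn(32, 4)        # batch of observations (placeholder)
actions, logp = policy.act(obs)
returns = torch.randn(32)       # placeholder Monte Carlo returns
loss = -(returns * logp).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

In practice, heavy tails occasionally produce very large actions, so implementations typically clip or squash samples to the feasible action range; the tail index then trades exploration reach against the variance of the gradient estimate.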

Related research

01/28/2022 · On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
We focus on parameterized policy search for reinforcement learning over ...

06/04/2020 · Policy Learning of MDPs with Mixed Continuous/Discrete Variables: A Case Study on Model-Free Control of Markovian Jump Systems
Markovian jump linear systems (MJLS) are an important class of dynamical...

06/13/2022 · Relative Policy-Transition Optimization for Fast Policy Transfer
We consider the problem of policy transfer between two Markov Decision P...

01/24/2022 · Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity
We propose the homotopic policy mirror descent (HPMD) method for solving...

08/01/2018 · Robbins-Monro conditions for persistent exploration learning strategies
We formulate simple assumptions, implying the Robbins-Monro conditions f...

02/22/2021 · Softmax Policy Gradient Methods Can Take Exponential Time to Converge
The softmax policy gradient (PG) method, which performs gradient ascent ...

11/24/2016 · Multiscale Inverse Reinforcement Learning using Diffusion Wavelets
This work presents a multiscale framework to solve an inverse reinforcem...
