Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

12/10/2022
by Hsin-En Su, et al.

We revisit off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly used approach is to leverage the off-policy policy gradient to optimize a surrogate objective: the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from a distribution mismatch issue, and significant effort is therefore required to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step. We establish the global convergence of CAPO with general coordinate selection and further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and randomized variants. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
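To make the coordinate-ascent view concrete, here is a minimal sketch of a CAPO-style update for a tabular softmax policy. This is an illustration under stated assumptions, not the paper's exact algorithm: the state/action counts, the fixed step size ALPHA, and the advantage estimator passed in as `advantage` are all placeholders, and the paper allows more general step-size schedules and coordinate-selection rules.

```python
import numpy as np

# Sketch of a CAPO-style coordinate-ascent update on the logits of a
# tabular softmax policy. ALPHA and the `advantage` callback are assumed
# for illustration; they are not taken from the paper.

N_STATES, N_ACTIONS = 10, 4
ALPHA = 0.5  # assumed fixed step size

theta = np.zeros((N_STATES, N_ACTIONS))  # softmax logits

def policy(s):
    """Softmax policy over actions at state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def capo_update(s, a, adv):
    """Update a single (state, action) coordinate.

    The chosen logit moves in the direction of the sign of the estimated
    advantage, so the improvement step does not reweight by the behavior
    policy's state distribution and needs no importance sampling.
    """
    theta[s, a] += ALPHA * np.sign(adv)

def cyclic_sweep(advantage):
    """Cyclic coordinate selection: sweep all (s, a) pairs in fixed order."""
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            capo_update(s, a, advantage(s, a))

def random_step(advantage, rng=np.random.default_rng(0)):
    """Randomized coordinate selection: sample one (s, a) uniformly."""
    s = rng.integers(N_STATES)
    a = rng.integers(N_ACTIONS)
    capo_update(s, a, advantage(s, a))
```

The point of the sketch is that each step touches only one (s, a) logit and uses only the sign of the advantage estimate, which is why the policy improvement step can be decoupled from the behavior policy's state distribution.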


Related research

- Off-Policy Policy Gradient with State Distribution Correction (04/17/2019): We study the problem of off-policy policy optimization in Markov decisio...
- Generalized Off-Policy Actor-Critic (03/27/2019): We propose a new objective, the counterfactual objective, unifying exist...
- AlgaeDICE: Policy Gradient from Arbitrary Experience (12/04/2019): In many real-world applications of reinforcement learning (RL), interact...
- Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift (11/16/2019): Off-policy deep reinforcement learning (RL) algorithms are incapable of ...
- Correcting discount-factor mismatch in on-policy policy gradient methods (06/23/2023): The policy gradient theorem gives a convenient form of the policy gradie...
- Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates (09/29/2021): The policy gradient theorem states that the policy should only be update...
- Steady State Analysis of Episodic Reinforcement Learning (11/12/2020): This paper proves that the episodic learning environment of every finite...
