Actor-Critic based Improper Reinforcement Learning

07/19/2022
by   Mohammadi Zaki, et al.
0

We consider an improper reinforcement learning setting where a learner is given M base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/16/2021

Improper Learning with Gradient-based Policy Optimization

We consider an improper reinforcement learning setting where the learner...
research
09/25/2021

Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms

The hierarchical interaction between the actor and critic in actor-criti...
research
07/04/2012

A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies

We consider the estimation of the policy gradient in partially observabl...
research
02/20/2022

Learning to Control Partially Observed Systems with Finite Memory

We consider the reinforcement learning problem for partially observed Ma...
research
06/02/2021

On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

In this paper, we study the convergence properties of off-policy policy ...
research
08/04/2023

Synthesizing Programmatic Policies with Actor-Critic Algorithms and ReLU Networks

Programmatically Interpretable Reinforcement Learning (PIRL) encodes pol...
research
10/28/2021

Bayesian Sequential Optimal Experimental Design for Nonlinear Models Using Policy Gradient Reinforcement Learning

We present a mathematical framework and computational methods to optimal...

Please sign up or login with your details

Forgot password? Click here to reset