Improper Learning with Gradient-based Policy Optimization

02/16/2021
by Mohammadi Zaki, et al.

We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. We derive convergence and convergence rate guarantees for the approach assuming access to a gradient oracle. Numerical results on a challenging constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when each constituent policy at its disposal is unstable.
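The abstract does not spell out the optimization loop, but the ingredients it names (an improper mixture over the M base controllers, rollout-based value estimates, and SPSA in place of an exact gradient oracle) suggest an update along the following lines. This is a minimal sketch, not the paper's algorithm: the softmax parameterization, the step sizes, and the helper `estimate_value` (a rollout-based estimate of the mixture's value) are all illustrative assumptions.

```python
import numpy as np

def softmax(theta):
    """Map unconstrained parameters to mixture weights over the M controllers."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def spsa_step(theta, estimate_value, a=0.05, c=0.1, rng=None):
    """One SPSA update of the mixture parameters (illustrative sketch).

    estimate_value(weights) is assumed to return a rollout-based estimate of
    the value of the improper mixture with the given weights (hypothetical helper).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Rademacher perturbation: every coordinate is perturbed simultaneously,
    # so only two value estimates are needed regardless of M.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    v_plus = estimate_value(softmax(theta + c * delta))
    v_minus = estimate_value(softmax(theta - c * delta))
    # Two-sided SPSA estimate of the gradient, one pair of rollouts for all coordinates.
    g_hat = (v_plus - v_minus) / (2.0 * c * delta)
    # Step in the direction that improves the estimated value of the mixture.
    return theta + a * g_hat
```

One reason SPSA fits this setting is cost: a coordinate-wise finite-difference gradient over M controllers would need on the order of 2M rollout batches per step, while SPSA needs only two, whatever M is.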


Related research

07/19/2022 · Actor-Critic based Improper Reinforcement Learning
We consider an improper reinforcement learning setting where a learner i...

10/31/2021 · Fast Global Convergence of Policy Optimization for Constrained MDPs
We address the issue of safety in reinforcement learning. We pose the pr...

08/17/2023 · Controlling Federated Learning for Covertness
A learner aims to minimize a function f by repeatedly querying a distrib...

02/22/2023 · Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes
The classical algorithms used in tabular reinforcement learning (Value I...

03/17/2021 · Near Optimal Policy Optimization via REPS
Since its introduction a decade ago, relative entropy policy search (REP...

12/27/2017 · On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
We consider off-policy temporal-difference (TD) learning methods for pol...

04/19/2021 · Distributed Derivative-free Learning Method for Stochastic Optimization over a Network with Sparse Activity
This paper addresses a distributed optimization problem in a communicati...
