DeepAI
Log In Sign Up

Improper Learning with Gradient-based Policy Optimization

02/16/2021
by   Mohammadi Zaki, et al.
0

We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. We derive convergence and convergence rate guarantees for the approach assuming access to a gradient oracle. Numerical results on a challenging constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when each constituent policy at its disposal is unstable.

READ FULL TEXT

page 1

page 2

page 3

page 4

07/19/2022

Actor-Critic based Improper Reinforcement Learning

We consider an improper reinforcement learning setting where a learner i...
10/31/2021

Fast Global Convergence of Policy Optimization for Constrained MDPs

We address the issue of safety in reinforcement learning. We pose the pr...
06/06/2018

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

Temporal difference learning (TD) is a simple iterative algorithm used t...
03/17/2021

Near Optimal Policy Optimization via REPS

Since its introduction a decade ago, relative entropy policy search (REP...
12/27/2017

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for pol...
04/15/2013

Off-policy Learning with Eligibility Traces: A Survey

In the framework of Markov Decision Processes, off-policy learning, that...
06/10/2020

Robust Detection of Adaptive Spammers by Nash Reinforcement Learning

Online reviews provide product evaluations for customers to make decisio...