Mirror Learning: A Unifying Framework of Policy Optimisation

01/07/2022
by Jakub Grudzien Kuba, et al.

General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), serving as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form they are sensitive to modifications, and thus the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose mirror learning – a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms, which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.
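The abstract does not spell out the update rule itself, but for orientation, a mirror-learning-style iteration can be sketched as follows. The notation here (the drift functional \mathfrak{D}, the neighbourhood operator \mathcal{N}, and the sampling distributions \beta and \nu) is our gloss on the framework rather than a quotation from the paper:

\pi_{n+1} \in \arg\max_{\pi \in \mathcal{N}(\pi_n)} \; \mathbb{E}_{s \sim \beta_{\pi_n}}\!\big[ A_{\pi_n}(s, \pi) \big] \;-\; \mathbb{E}_{s \sim \nu^{\pi}_{\pi_n}}\!\big[ \mathfrak{D}_{\pi_n}(\pi \mid s) \big],

where A_{\pi_n}(s, \pi) is the advantage of the candidate policy \pi under the current policy \pi_n at state s, \mathfrak{D}_{\pi_n}(\pi \mid s) \ge 0 is a drift penalty that vanishes at \pi = \pi_n, and \mathcal{N}(\pi_n) is a neighbourhood of the current policy. Read this way, the abstract's claims fit together: a trivial drift with an unrestricted neighbourhood corresponds to GPI, a KL-based drift or KL-ball neighbourhood corresponds to trust-region methods, and the monotonic-improvement and convergence guarantees are properties of the general update rather than of any one instantiation.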


