Towards Combining On-Off-Policy Methods for Real-World Applications

04/24/2019
by   Kai-Chun Hu, et al.

In this paper, we point out a fundamental property of the reinforcement learning objective that lets us reformulate the policy gradient objective as a perceptron-like loss function, removing the need to distinguish between on-policy and off-policy training. Namely, we posit that it is sufficient to update a policy π only for cases that satisfy the condition A(π/μ − 1) ≤ 0, where A is the advantage and μ is another policy. Furthermore, we show via theoretical derivation that a perceptron-like loss function matches the clipped surrogate objective of PPO. With our new formulation, the policies π and μ can in theory be arbitrarily far apart, effectively enabling off-policy training. To test our derivations, we combine the on-policy PPO clipped surrogate (which we show to be equivalent to one instance of the new reformulation) with the off-policy IMPALA method. We first verify the combined method on the OpenAI Gym pendulum toy problem. Next, we use our method to train a quadrotor position controller in a simulator. Our trained policy is efficient and lightweight enough to run on a low-cost microcontroller at an update rate of at least 500 Hz. For the quadrotor, we show two experiments to verify our method and demonstrate its performance: 1) hovering at a fixed position, and 2) tracking a specific trajectory. In preliminary trials, we are also able to apply the method to a real-world quadrotor.
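The update condition in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so the form max(0, −A(π/μ − 1)) used here is an assumption chosen by analogy with the classic perceptron loss, and the function names `should_update` and `perceptron_like_loss` are hypothetical. It is not the authors' implementation.

```python
def should_update(advantage, pi_prob, mu_prob):
    """Return True when A * (pi/mu - 1) <= 0, the condition stated in the abstract.

    pi_prob and mu_prob are the probabilities that the policies pi and mu
    assign to the sampled action; mu_prob must be positive.
    """
    return advantage * (pi_prob / mu_prob - 1.0) <= 0.0


def perceptron_like_loss(advantage, pi_prob, mu_prob):
    """Assumed perceptron-style loss: max(0, -A * (pi/mu - 1)).

    The loss is zero, and so contributes no gradient, once the ratio pi/mu
    has already moved in the direction that the sign of the advantage favors.
    """
    return max(0.0, -advantage * (pi_prob / mu_prob - 1.0))


if __name__ == "__main__":
    # (advantage, pi(a|s), mu(a|s)) for four illustrative samples.
    samples = [
        (1.0, 0.25, 0.5),   # good action, pi below mu: still needs an update
        (1.0, 0.75, 0.5),   # good action, pi already above mu: skipped
        (-1.0, 0.25, 0.5),  # bad action, pi already below mu: skipped
        (-1.0, 0.75, 0.5),  # bad action, pi above mu: still needs an update
    ]
    for advantage, pi_prob, mu_prob in samples:
        print(
            advantage,
            pi_prob,
            mu_prob,
            should_update(advantage, pi_prob, mu_prob),
            perceptron_like_loss(advantage, pi_prob, mu_prob),
        )
```

Under this assumed form, minimizing the loss raises π for actions with positive advantage and lowers it for actions with negative advantage. It stops as soon as π has crossed μ in the favored direction, which is comparable to how the PPO clipped surrogate stops providing gradient once the ratio leaves its clipping range on the favored side.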


research
07/16/2023

Feedback is All You Need: Real-World Reinforcement Learning with Approximate Physics-Based Models

We focus on developing efficient and reliable policy optimization strate...
research
12/12/2018

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) has been applied successfully to many ...
research
09/22/2021

MEPG: A Minimalist Ensemble Policy Gradient Framework for Deep Reinforcement Learning

Ensemble reinforcement learning (RL) aims to mitigate instability in Q-l...
research
10/26/2021

Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO

Policy optimization is a fundamental principle for designing reinforceme...
research
12/04/2019

AlgaeDICE: Policy Gradient from Arbitrary Experience

In many real-world applications of reinforcement learning (RL), interact...
research
07/02/2020

On the Outsized Importance of Learning Rates in Local Update Methods

We study a family of algorithms, which we refer to as local update metho...
