Potential-Based Advice for Stochastic Policy Learning

07/20/2019
by   Baicen Xiao, et al.
6

This paper augments the reward received by a reinforcement learning agent with potential functions in order to help the agent learn (possibly stochastic) optimal policies. We show that a potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and demonstrate that the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. We propose a method to impart potential based advice schemes to policy gradient algorithms. An algorithm that considers an advantage actor-critic architecture augmented with this scheme is proposed, and we give guarantees on its convergence. Finally, we evaluate our approach on a puddle-jump grid world with indistinguishable states, and the continuous state and action mountain car environment from classical control. Our results indicate that these schemes allow the agent to learn a stochastic optimal policy faster and obtain a higher average reward.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

page 6

page 7

page 8

research
05/20/2023

Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

The average reward criterion is relatively less studied as most existing...
research
02/27/2017

Reinforcement Learning with Deep Energy-Based Policies

We propose a method for learning expressive energy-based policies for co...
research
11/02/2020

Useful Policy Invariant Shaping from Arbitrary Advice

Reinforcement learning is a powerful learning paradigm in which agents c...
research
04/06/2020

Uniform State Abstraction For Reinforcement Learning

Potential Based Reward Shaping combined with a potential function based ...
research
02/07/2020

Off-policy Maximum Entropy Reinforcement Learning : Soft Actor-Critic with Advantage Weighted Mixture Policy(SAC-AWMP)

The optimal policy of a reinforcement learning problem is often disconti...
research
09/18/2022

DeepTOP: Deep Threshold-Optimal Policy for MDPs and RMABs

We consider the problem of learning the optimal threshold policy for con...
research
05/28/2021

A nearly Blackwell-optimal policy gradient method

For continuing environments, reinforcement learning methods commonly max...

Please sign up or login with your details

Forgot password? Click here to reset