1 Introduction
Model-free reinforcement learning algorithms often alternate between two concurrent but interacting processes: (1) policy evaluation, where an action value function (i.e., a Q-value) is updated to obtain a better estimate of the return associated with taking a specific action, and (2) policy improvement, where the policy is updated to maximize the current value function. In the past, different notions of Q-value have led to distinct but important families of RL methods. For example, SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009) uses the expected Q-value, defined as the expected return of following the current policy. Q-learning (Watkins, 1989) exploits a hard-max notion of Q-value, defined as the expected return of following an optimal policy. Soft Q-learning (Haarnoja et al., 2017) and PCL (Nachum et al., 2017) both use a soft-max form of Q-value, defined as the future return of following an optimal entropy-regularized policy. Clearly, the choice of Q-value function has a considerable effect on the resulting algorithm; for example, restricting the types of policies that can be expressed, and determining the type of exploration that can be naturally applied. In each case, the Q-value at a state s and action a answers the question, "What would my future value from s be if I were to take an initial action a?"
Such information about a hypothetical action is helpful when learning a policy; we want to nudge the policy distribution to favor actions with potentially higher Q-values.
In this work, we investigate the practicality and benefits of answering a more difficult, but more relevant, question:
"What would my future value from s be if I were to sample my initial action from a distribution centered at a?"
We focus our efforts on Gaussian policies, and thus the counterfactual posited by this Q-value inquires about the expected future return of following the policy when changing the mean of the initial Gaussian distribution. Thus, our new notion of Q-values maps a state-action pair (s, a) to the expected return of first taking an action sampled from a normal distribution centered at a, and following actions sampled from the current policy thereafter. In this way, the Q-values we introduce may be interpreted as a Gaussian-smoothed version of the expected Q-value; hence we term them smoothed Q-values. We show that smoothed Q-values possess a number of important properties that make them attractive for use in RL algorithms. It is clear from the definition of smoothed Q-values that, if known, their structure is highly beneficial for learning the mean of a Gaussian policy. We are able to show more than this: although the smoothed Q-values are not a direct function of the covariance, one can surprisingly use knowledge of the smoothed Q-values to derive updates to the covariance of a Gaussian policy. Specifically, the gradient of the standard expected return objective with respect to the mean and covariance of a Gaussian policy is equivalent to the gradient and Hessian of the smoothed Q-value function, respectively. Moreover, we show that the smoothed Q-values satisfy a single-step Bellman consistency, which allows bootstrapping to be used to train them via function approximation.
These results lead us to propose an algorithm, Smoothie, which, in the spirit of (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016), trains a policy using the derivatives of a trained (smoothed) Q-value function to learn a Gaussian policy. Crucially, unlike DDPG, which is restricted to deterministic policies and is well-known to have poor exploratory behavior (Haarnoja et al., 2017), the approach we develop is able to utilize a non-deterministic Gaussian policy parameterized by both a mean and a covariance, thus allowing the policy to be exploratory by default and alleviating the need for excessive hyperparameter tuning. On the other hand, compared to standard policy gradient algorithms (Williams & Peng, 1991; Konda & Tsitsiklis, 2000), Smoothie's utilization of the derivatives of a Q-value function to train a policy avoids the high variance and sample inefficiency of stochastic updates. Furthermore, we show that Smoothie can be easily adapted to incorporate proximal policy optimization techniques by augmenting the objective with a penalty on the KL-divergence from a previous version of the policy. The inclusion of a KL-penalty is not feasible in the standard DDPG algorithm, but we show that it is possible with our formulation, and it significantly improves stability and overall performance. On standard continuous control benchmarks, our results are competitive with or exceed state-of-the-art, especially for more difficult tasks in the low-data regime.
For most RL algorithms, including SARSA, expected SARSA, and Q-learning, the policy improvement step is straightforward: an improved policy is obtained by taking the greedy action that maximizes the action value function, or alternatively, the greedy action is taken with probability 1 − ε and a random action is taken with probability ε. Using such greedy schemes may improve exploration and facilitate formulating convergence guarantees. However, policy improvement by finding the locally optimal policy for an arbitrary action value function is not tractable, especially in continuous and high-dimensional action spaces. For example, if one considers the family of multivariate Gaussian policies in a continuous action space, then policy optimization within the feasible policies is not straightforward. Prior work on deterministic policy gradient (DPG) and its extensions substitutes the mean of a Gaussian policy into the action value function and makes use of the gradient of the action value function to update the mean so as to improve action value estimates. Such formulations are interesting because they can handle off-policy data, and they do not require Monte Carlo sampling to compute expected Q-values under the current policy, resulting in a learning algorithm with lower variance. However, DPG and its variants make a strong assumption about the policy: they assume that the policy is deterministic, i.e., the limit in which the variance of the Gaussian goes to zero. It was previously thought that one needs to assume policy determinism to achieve the benefits of the DPG formulation, but in this paper we show the policy determinism assumption is not necessary. In other words, one can keep the benefits of DPG in making use of action value gradients and off-policy data, but also allow for a general family of stochastic Gaussian policies.
Model-free reinforcement learning (RL) aims to optimize an agent's behavior policy through trial-and-error interaction with a black box environment. An agent alternates between observing a state provided by the environment (e.g., joint positions and velocities), applying an action to the environment (e.g., force or torque), and receiving a reward from the environment (e.g., velocity in a specific desired direction). The agent's objective is to maximize the long-term sum of rewards received from the environment.
Within continuous control, the agent's policy is traditionally parameterized by a unimodal Gaussian. The mean and covariance defining the Gaussian are then trained using policy gradient methods (Konda & Tsitsiklis, 2000; Williams & Peng, 1991) to maximize expected total reward. Policy gradient uses experience sampled stochastically from its policy as a cue to nudge the mean and covariance to put more probability mass on experience that yielded a higher long-term reward than expected, and vice versa for experience that yielded a lower long-term reward. The stochastic nature of this training paradigm makes policy gradient methods unstable and high-variance. To mitigate this problem, large batch sizes or trust region methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017) are utilized, although both of these remedies can require collecting a large amount of experience, making them infeasible for real-world applications.
As an attempt to circumvent the issue of high variance due to highly stochastic updates, recent years have seen the introduction of policy gradient algorithms which update policy parameters based only on the surface of a learned Q-value function. The most widely used such algorithm is (Deep) Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2016). In DDPG, the policy is deterministic, parameterized only by a mean. A Q-value function is trained to take in a state and action and return the future discounted sum of rewards of first taking the specified action and subsequently following the deterministic policy. Thus, the surface of the learned Q-value function dictates how the policy should be updated: along the gradient of the Q-value with respect to the input action.
While DDPG has successfully avoided the highly stochastic policy updates associated with traditional policy gradient algorithms, its deterministic policy naturally leads to poor exploration. A deterministic policy gives no indication regarding which directions in action space to explore. Thus, in practice, the trained policy differs from the behavior policy, which is augmented with Gaussian noise whose variance is treated as a hyperparameter to optimize. Even so, DDPG is well-known to have poor exploratory behavior (Haarnoja et al., 2017).
In this paper, we present a method which applies the same technique of updating a policy based only on a learned Q-value function to a stochastic policy parameterized by both a mean and covariance. We show that given access to the proposed Q-value function, or a sufficiently accurate approximation, it is possible to derive unbiased updates to both the policy mean and covariance. Unlike recent attempts at this (e.g., Ciosek & Whiteson (2018)), our updates require neither approximate integrals nor low-order assumptions on the form of the true Q-value. Crucially, providing a method to update the covariance allows the policy to be exploratory by default and alleviates the need for excessive hyperparameter tuning. Moreover, we show that our technique can be easily adapted to incorporate trust region methods by augmenting the objective with a penalty on the KL-divergence from a previous version of the policy. The inclusion of a trust region is not possible in standard DDPG, and we show that it improves stability and overall performance significantly when our algorithm is evaluated on standard continuous control benchmarks.
2 Formulation
We consider the standard model-free RL problem, represented as a Markov decision process (MDP) consisting of a state space S and an action space A. At iteration t the agent encounters a state s_t and emits an action a_t, after which the environment returns a scalar reward r_t and places the agent in a new state s_{t+1}. We focus on continuous control tasks, where the actions are real-valued, i.e., A = R^{d_a}. We parameterize the behavior of the agent using a stochastic policy π(a|s), which takes the form of a Gaussian density at each state s. The Gaussian policy is parameterized by a mean function μ(s) and a covariance function Σ(s), so that π(a|s) = N(a | μ(s), Σ(s)), where

(1)  N(a | μ, Σ) = |2πΣ|^{−1/2} exp( −(1/2) ‖a − μ‖²_Σ ),

here using the notation ‖v‖²_Σ = v^⊤ Σ^{−1} v.
Below we develop new RL training methods for this family of parametric policies, but some of the ideas presented may generalize to other families of policies as well. We begin the formulation by reviewing some prior work on learning Gaussian policies.
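As a point of reference for the density in (1), the following minimal NumPy sketch evaluates a multivariate Gaussian policy density; the function name and the sanity check against the standard normal are our own illustration, not code from the paper.

```python
import numpy as np

def gaussian_density(a, mu, sigma):
    """Evaluate N(a | mu, Sigma) = |2*pi*Sigma|^{-1/2} exp(-0.5 ||a - mu||^2_Sigma),
    with ||v||^2_Sigma = v^T Sigma^{-1} v, as in Eq. (1)."""
    diff = a - mu
    maha = diff @ np.linalg.solve(sigma, diff)          # ||a - mu||^2_Sigma
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * sigma))  # |2*pi*Sigma|^{1/2}
    return np.exp(-0.5 * maha) / norm

# Sanity check: in 1-D with mu = 0 and Sigma = 1 this is the standard normal
# density at 0, i.e., 1/sqrt(2*pi) ~ 0.3989.
val = gaussian_density(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```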
2.1 Policy Gradient for Generic Stochastic Policies
The optimization objective (expected discounted return), as a function of a generic stochastic policy π, is expressed in terms of the expected action value function Q^π by,

(2)  O_ER(π) = E_{s∼ρ^π(s)} E_{a∼π(a|s)} [ Q^π(s, a) ],

where ρ^π(s) is the state visitation distribution under π, and Q^π(s, a) is recursively defined using the Bellman equation,

(3)  Q^π(s, a) = E [ r(s, a) + γ E_{a′∼π(a′|s′)} [ Q^π(s′, a′) ] ],

where γ ∈ [0, 1) is the discount factor. For brevity, we suppress explicit notation for the distribution over immediate rewards and over state transitions.
The policy gradient theorem (Sutton et al., 2000) expresses the gradient of O_ER w.r.t. θ, the tunable parameters of a policy π_θ, as,

(4)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} E_{a∼π_θ(a|s)} [ ∇_θ log π_θ(a|s) Q^π(s, a) ].

In order to approximate the expectation on the RHS of (4), one often resorts to an empirical average over on-policy samples from π_θ. This sampling scheme results in a gradient estimate with high variance, especially when π_θ is not concentrated. Many policy gradient algorithms, including actor-critic variants, trade off variance and bias, e.g., by attempting to estimate Q^π accurately using function approximation and the Bellman equation.
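To make the variance issue concrete, here is a small Monte Carlo sketch of the score-function estimator in (4) for a one-dimensional Gaussian policy in a single-state (bandit) setting; the quadratic reward and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(mu, sigma, reward_fn, n_samples=10000):
    """Score-function (REINFORCE) estimate of d/d_mu E_{a ~ N(mu, sigma^2)}[R(a)],
    using grad_mu log pi(a) = (a - mu) / sigma^2 for a 1-D Gaussian."""
    a = rng.normal(mu, sigma, size=n_samples)
    score = (a - mu) / sigma ** 2
    return np.mean(reward_fn(a) * score)

# Assumed quadratic reward R(a) = -(a - 2)^2; the exact gradient at mu is
# -2 * (mu - 2), i.e., 4 at mu = 0, but even 10000 samples leave visible noise.
g = reinforce_grad(mu=0.0, sigma=1.0, reward_fn=lambda a: -(a - 2.0) ** 2)
```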
In the simplest scenario, an unbiased estimate of Q^π(s, a) is formed by accumulating discounted rewards from each state forward using a single Monte Carlo sample. The objective of the policy is to maximize expected future discounted reward at each state until reaching some terminal time T:

(5)  O(π, s) = E [ Σ_{t=0}^{T} γ^t r(s_t, a_t) | s_0 = s ].

The objective may also be expressed in terms of expected Q-values as

(6)  O(π, s) = E_{a∼π(a|s)} [ Q^π(s, a) ],

where Q^π(s, a) is defined recursively as

(7)  Q^π(s, a) = E [ r(s, a) + γ E_{a′∼π(a′|s′)} [ Q^π(s′, a′) ] ],

to represent the expected future discounted reward of taking action a at state s and subsequently following the policy π.

The state-agnostic objective is then

(8)  O(π) = E_{s∼ρ^π(s)} [ O(π, s) ],

where ρ^π(s) is the state distribution induced by the policy π.
2.2 Deterministic Policy Gradient
Silver et al. (2014) study the policy gradient for the specific class of Gaussian policies in the limit where the policy's covariance approaches zero. In this scenario, the policy becomes deterministic and samples from the policy approach the Gaussian mean. Under a deterministic policy π ≡ μ(s), one can estimate the expected future return from a state as,

(9)  O_ER(π) = E_{s∼ρ^π(s)} [ Q^π(s, μ(s)) ].

Accordingly, Silver et al. (2014) express the gradient of the expected discounted return objective for π_θ ≡ μ_θ as,

(10)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} [ ∇_θ μ_θ(s) ∇_a Q^π(s, a) |_{a=μ_θ(s)} ].

This characterization of the policy gradient theorem for deterministic policies is called the deterministic policy gradient (DPG). Since no Monte Carlo sampling is required for estimating the gradient, the variance of the estimate is reduced. On the other hand, the deterministic nature of the policy can lead to poor exploration and training instability in practice.
In the limit of Σ(s) → 0, one can also re-express the Bellman equation (3) as,

(11)  Q^π(s, μ(s)) = E [ r(s, μ(s)) + γ Q^π(s′, μ(s′)) ].

Therefore, a value function approximator Q_w can be optimized by minimizing the Bellman error,

(12)  E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, μ_θ(s′)) )² ],

for transitions (s, a, r, s′) sampled from a dataset of interactions of the agent with the environment. The deep variant of DPG known as DDPG (Lillicrap et al., 2016) alternates between improving the action value estimate by gradient descent on (12) and improving the policy based on (10).
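The alternation between the Bellman-error fit (12) and the policy update (10) can be sketched in a one-state bandit with a quadratic critic; the reward function, least-squares critic, and constants below are illustrative assumptions (real DDPG uses neural networks, minibatch SGD, and target networks).

```python
import numpy as np

rng = np.random.default_rng(1)

def true_q(a):                 # environment reward; plays the role of Q^pi here
    return -(a - 1.5) ** 2

mu = -1.0                      # deterministic policy: a single scalar mean
for _ in range(200):
    a = mu + rng.normal(0.0, 0.5, size=64)             # noisy behavior actions
    X = np.stack([np.ones_like(a), a, a * a], axis=1)  # quadratic critic features
    w = np.linalg.lstsq(X, true_q(a), rcond=None)[0]   # critic fit, cf. Eq. (12)
    mu += 0.05 * (w[1] + 2.0 * w[2] * mu)              # ascend dQ_w/da at a = mu, Eq. (10)

# mu should approach a* = 1.5, the maximizer of the learned critic surface.
```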
To improve sample efficiency, Degris et al. (2012) and Silver et al. (2014) replace the state visitation distribution ρ^π(s) in (10) with an off-policy visitation distribution based on a replay buffer. This substitution introduces some bias in the gradient estimate (10), but previous work has found that it works well in practice and improves the sample efficiency of policy gradient algorithms. We also adopt a similar heuristic in our method to make use of off-policy data.
In practice, DDPG exhibits improved sample efficiency over standard policy gradient algorithms: using off-policy data to train Q-values while basing policy updates on their gradients significantly improves on the stochastic policy updates dictated by (4), which require a large number of samples to reduce noise. On the other hand, the deterministic nature of the policy learned by DDPG leads to poor exploration and instability in training. In this paper, we propose an algorithm which, like DDPG, utilizes derivative information of learned Q-values for better sample efficiency, but which, unlike DDPG, is able to learn a Gaussian policy and impose a KL-penalty for better exploration and stability.
The objective for Q_w becomes

(13)  J(w) = E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, μ_θ(s′)) )² ].

A parameterized Q_w can thus be trained according to

(14)  w ← w + η E_{(s, a, r, s′)} [ ( r + γ Q_w(s′, μ_θ(s′)) − Q_w(s, a) ) ∇_w Q_w(s, a) ],

holding the bootstrapped target fixed.
2.3 Stochastic Value Gradients for Gaussian Policies
Inspired by deterministic policy gradients, Heess et al. (2015) propose to reparameterize the expectation w.r.t. a Gaussian policy in (4) with an expectation over a noise variable drawn from a standard normal distribution. Note that a stochastic action drawn from a Gaussian policy can be reparameterized as a = μ_θ(s) + Σ_θ(s)^{1/2} ε, for ε ∼ N(0, I). Accordingly, the policy gradients take the form of,

(15)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} E_{ε∼N(0, I)} [ ∇_a Q^π(s, a) |_{a=μ+Σ^{1/2}ε} ∇_θ ( μ_θ(s) + Σ_θ(s)^{1/2} ε ) ],

where for brevity, we drop the dependence of μ and Σ on s and θ inside the evaluation point. Similarly, one can re-express the Bellman equations using an expectation over the noise variable.
The key advantage of this formulation by Heess et al. (2015), called stochastic value gradients (SVG), over generic policy gradients (4) is that, similar to DPG (10), SVG makes direct use of the gradients of the mean and covariance functions with respect to the model parameters. The benefit over DPG is that SVG keeps the policy stochastic and enables learning the covariance function at every state, but the key disadvantage is that SVG requires sampling a noise variable to estimate the gradients, resulting in higher variance. In this paper, we show how one can combine the benefits of DPG and SVG to formulate policy gradients for Gaussian policies without requiring samples from a noise variable to estimate the gradients, hence achieving lower variance.
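A minimal sketch of the SVG-style reparameterized estimator in (15) for a one-dimensional bandit, using the assumed reward R(a) = -(a - 2)^2; the exact gradients at (mu, sigma) = (0, 1) are 4 and -2, and the Monte Carlo estimates scatter around them.

```python
import numpy as np

rng = np.random.default_rng(0)

def svg_grads(mu, sigma, dr, n=5000):
    """Reparameterized (SVG-style) estimates of the gradients of
    E_{a ~ N(mu, sigma^2)}[R(a)] w.r.t. mu and sigma, via a = mu + sigma * eps."""
    eps = rng.standard_normal(n)
    g = dr(mu + sigma * eps)           # dR/da at the sampled actions
    return g.mean(), (g * eps).mean()  # chain rule: da/dmu = 1, da/dsigma = eps

# Assumed reward R(a) = -(a - 2)^2, so dR/da = -2 * (a - 2).
g_mu, g_sigma = svg_grads(0.0, 1.0, lambda a: -2.0 * (a - 2.0))
```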
3 Idea
Before giving a full exposition, we use a simplified scenario to illustrate the key intuitions behind the proposed approach and how it differs fundamentally from previous methods.
Consider a one-shot decision-making problem over a one-dimensional action space with a single state. Here the expected reward is given by a function Q(a) over the real line, which also corresponds to the optimal Q-value function; Figure 1 gives a concrete example. We assume the policy is specified by a Gaussian distribution N(μ, σ²) parameterized by a scalar mean μ and standard deviation σ. The goal is to optimize the policy parameters to maximize expected reward. A naive policy gradient method updates the parameters by sampling a ∼ N(μ, σ²), observing reward r, then adjusting μ and σ in the directions r ∇_μ log N(a | μ, σ²) and r ∇_σ log N(a | μ, σ²). Note that such updates suffer from large variance, particularly when σ is small.

To reduce the variance of direct policy gradient, deterministic policy gradient methods leverage a value function approximator Q_w(a), parameterized by w, to approximate Q. For example, in this scenario, vanilla DPG would sample an action a = μ + ε with exploration noise ε ∼ N(0, σ²), then update w using ∇_w (Q_w(a) − r)² and μ using ∇_a Q_w(a) |_{a=μ}. Clearly, this update exhibits reduced variance, but requires Q_w to approximate Q (the green curve in Figure 1) to control bias. Unfortunately, DPG is not able to learn the exploration variance σ². Variants of DPG such as SVG (Heess et al., 2015) and EPG (Ciosek & Whiteson, 2018) have been proposed to work with stochastic policies. However, they either have restrictive assumptions on the form of the true Q-value, introduce noise into the policy updates, or require an approximate integral, thus losing the advantage of deterministic gradient updates. By contrast, SVG is able to work with a stochastic policy and learn the variance, but loses the advantage of deterministic gradient updates. In particular, vanilla SVG would sample ε ∼ N(0, 1), set a = μ + σε, then update w using ∇_w (Q_w(a) − r)², μ using ∇_a Q_w(a), and σ using ε ∇_a Q_w(a), all of which re-introduce significant variance in the updates.
Note, however, that the expected value at any given mean μ is actually given by a convolution of the Gaussian policy with the underlying expected reward function. Such a process inherently smooths the reward landscape, as shown by the magenta curve in Figure 1. Unfortunately, DPG completely ignores this smoothing effect by trying to approximate Q, while policy gradient methods only benefit from it indirectly through sampling. A key insight is that this smoothing effect can be captured directly in the value function approximator itself, bypassing any need for sampling or approximating Q. That is, instead of using an approximator to model Q, one can directly approximate the smoothed version given by Q̃(a) = E_{x∼N(a, σ²)}[Q(x)] (the magenta curve in Figure 1), which, crucially, satisfies E_{a∼N(μ, σ²)}[Q(a)] = Q̃(μ).
Based on this observation, we propose a novel actor-critic strategy below that uses a function approximator Q̃_w to model Q̃. Although approximating Q̃ instead of Q might appear to be a subtle change, it is a major alteration to existing actor-critic approaches. Not only is approximating the magenta curve in Figure 1 far easier than the green curve, modeling Q̃ allows the policy parameters to be updated deterministically for any given action. In particular, in the simple scenario above, if one sampled an action from the current policy, a ∼ N(μ, σ²), and observed reward r, then w could be updated using ∇_w (Q̃_w(μ) − r)², μ using ∇_a Q̃_w(a) |_{a=μ}, and σ² using (1/2) ∇²_a Q̃_w(a) |_{a=μ} (a key result we establish below).
Such a strategy combines the best aspects of DPG and policy gradient while conferring additional advantages: (1) the smoothed value function Q̃ cannot add, but can only remove, local minima relative to Q; (2) Q̃ is smoother than Q and hence easier to approximate; (3) approximating Q̃ allows deterministic gradient updates for the policy parameters; (4) approximating Q̃ allows gradients to be computed for both the mean and variance parameters. Among these advantages, DPG shares only (3) and policy gradient only (1). We will see below that the new strategy we propose significantly outperforms existing approaches, not only in the toy scenario depicted in Figure 1, but also in challenging benchmark problems.
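The smoothing effect can be checked numerically. Below, an assumed two-bump reward (a stand-in for the landscape in Figure 1) has a near-zero slope at its poor local optimum, while the Gaussian-smoothed landscape still slopes noticeably toward the global optimum; all constants are illustrative.

```python
import numpy as np

def q(a):
    """Assumed bimodal reward: a weak local optimum near a = -1 and a broader
    global optimum at a = 4."""
    return 0.6 * np.exp(-(a + 1.0) ** 2) + np.exp(-0.25 * (a - 4.0) ** 2)

def q_smoothed(a, s=2.0, n=200000, seed=0):
    """Monte Carlo Gaussian convolution: Q~(a) = E_{x ~ N(a, s^2)}[Q(x)].
    A fixed seed gives common random numbers across evaluation points."""
    rng = np.random.default_rng(seed)
    return q(a + s * rng.standard_normal(n)).mean()

# Central differences at the local optimum a = -1: the raw slope is ~0, so a
# DPG-style ascent on Q stalls there, while the smoothed landscape retains a
# clear slope toward the global optimum.
eps = 0.05
grad_q = (q(-1 + eps) - q(-1 - eps)) / (2 * eps)
grad_qs = (q_smoothed(-1 + eps) - q_smoothed(-1 - eps)) / (2 * eps)
```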
4 Smoothed Action Value Functions
Moving beyond a simple illustrative scenario, the key contribution of this paper is to introduce the general notion of a smoothed action value function, the gradients of which provide an effective signal for optimizing the parameters of a Gaussian policy. Smoothed Q-values, which we denote Q̃^π(s, a), differ from ordinary Q-values by not assuming the first action of the agent is fully specified; instead, they assume only that a Gaussian centered at the action is known. Thus, to compute Q̃^π(s, a), one has to take an expectation of Q^π(s, ã) for actions ã drawn in the vicinity of a. More formally, smoothed action values are defined as,

(16)  Q̃^π(s, a) = E_{ã∼N(a, Σ(s))} [ Q^π(s, ã) ].
With this definition of Q̃^π, one can re-express the gradient of the expected reward objective (Equation (4)) for a Gaussian policy as,

(17)  ∇_θ O_ER(π_θ) = E_{s∼ρ^π(s)} [ ∇_a Q̃^π(s, a) |_{a=μ_θ(s)} ∇_θ μ_θ(s) + ∇_Σ Q̃^π(s, a) |_{a=μ_θ(s)} ∇_θ Σ_θ(s) ].

The insight that differentiates this approach from prior work (Heess et al., 2015; Ciosek & Whiteson, 2018) is that instead of learning a function approximator for Q^π and then drawing samples to approximate the expectation in (16) and its derivative, we directly learn a function approximator for Q̃^π.
One of the key observations that enables learning a function approximator for Q̃^π is that smoothed Q-values satisfy a notion of Bellman consistency. First, note that for Gaussian policies we have the following relation between the expected and smoothed Q-values:

(18)  E_{a∼π(a|s)} [ Q^π(s, a) ] = Q̃^π(s, μ(s)).

Then, combining (16) and (18), one can derive the following one-step Bellman equation for smoothed Q-values,

(19)  Q̃^π(s, a) = E_{ã∼N(a, Σ(s))} [ r(s, ã) + γ Q̃^π(s′, μ(s′)) ],

where ã and s′ are sampled from N(a, Σ(s)) and the environment transition given (s, ã), respectively. Below, we elaborate on how one can make use of the derivatives of Q̃^π to learn μ and Σ, and how the Bellman equation in (19) enables direct optimization of Q̃^π.
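Definition (16) can be sanity-checked by Monte Carlo in a bandit setting, where for a quadratic Q the Gaussian integral is analytic; the particular Q and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(a):                       # assumed quadratic Q^pi (single-state bandit case)
    return 3.0 - (a - 1.0) ** 2

def q_tilde_mc(a, sigma, n=200000):
    """Monte Carlo estimate of Eq. (16): Q~(a) = E_{x ~ N(a, sigma^2)}[Q(x)]."""
    return q(a + sigma * rng.standard_normal(n)).mean()

# For a quadratic Q, the Gaussian integral is analytic:
# Q~(a) = 3 - (a - 1)^2 - sigma^2, so Q~(0) with sigma = 0.5 is 3 - 1 - 0.25 = 1.75.
est = q_tilde_mc(0.0, 0.5)
```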
4.1 Policy Improvement
We assume a Gaussian policy parameterized by φ and ψ for the mean and the covariance, respectively, i.e., μ_φ(s) and Σ_ψ(s). The gradient of the objective w.r.t. the mean parameters follows from the policy gradient theorem in conjunction with (17) and is almost identical to (10),

(20)  ∇_φ O_ER(π) = E_{s∼ρ^π(s)} [ ∇_φ μ_φ(s) ∇_a Q̃^π(s, a) |_{a=μ_φ(s)} ].

Estimating the derivative of the objective w.r.t. the covariance parameters is not as straightforward, since Q̃^π is not a direct function of Σ. However, a key result is that the second derivative of Q̃^π w.r.t. actions is sufficient to exactly compute the derivative of Q̃^π w.r.t. Σ:

(21)  ∇_Σ Q̃^π(s, a) = (1/2) ∇²_a Q̃^π(s, a).

A proof of this identity is provided in the Appendix. The full derivative w.r.t. ψ can then be shown to take the form,

(22)  ∇_ψ O_ER(π) = E_{s∼ρ^π(s)} [ (1/2) ∇²_a Q̃^π(s, a) |_{a=μ_φ(s)} ∇_ψ Σ_ψ(s) ].
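The identity (21) can be verified numerically in one dimension, where it reduces to d/d(sigma²) Q̃ = (1/2) d²Q̃/da². We use Q(a) = sin(a), whose smoothed version is analytic; this is only an illustrative check, not part of the paper's algorithm.

```python
import numpy as np

# With Q(a) = sin(a) and smoothing variance `var`, the smoothed value is
# analytic: Q~(a) = sin(a) * exp(-var / 2).
def q_tilde(a, var):
    return np.sin(a) * np.exp(-var / 2.0)

mu, var, h = 0.7, 0.3, 1e-3
# Left side of (21): derivative w.r.t. the (co)variance, by central difference.
lhs = (q_tilde(mu, var + h) - q_tilde(mu, var - h)) / (2 * h)
# Right side of (21): half the second derivative w.r.t. the action.
rhs = 0.5 * (q_tilde(mu + h, var) - 2 * q_tilde(mu, var) + q_tilde(mu - h, var)) / h ** 2
```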
4.2 Policy Evaluation
There are two natural ways to optimize a function approximator Q̃_w̃ for Q̃^π. The first approach leverages (16) to update Q̃_w̃ based on an expectation of standard Q-values. In this case, one first trains a parameterized model Q_w to approximate the standard Q^π function using conventional methods (Rummery & Niranjan, 1994; Sutton & Barto, 1998; Van Seijen et al., 2009), then fits Q̃_w̃ to Q_w based on (16). In particular, given transitions (s, a, r, s′) sampled from interactions with the environment, one can train Q_w to minimize the Bellman error

(23)  E_{(s, a, r, s′)} [ ( Q_w(s, a) − r − γ Q_w(s′, a′) )² ],

where a′ ∼ π(a′|s′). Then, Q̃_w̃ can be optimized to minimize the squared error

(24)  E_{(s, a)} [ ( Q̃_w̃(s, a) − E_{ã∼N(a, Σ(s))} [ Q_w(s, ã) ] )² ],

where the inner expectation is approximated using several samples ã ∼ N(a, Σ(s)). When the target values in the residuals are treated as fixed (i.e., using a target network), these updates will reach a fixed point when Q̃_w̃ satisfies the recursion in the Bellman equation (16).
The second approach requires only a single function approximator Q̃_w, resulting in a simpler implementation; hence we use this approach in our experimental evaluation. Suppose one has access to a tuple (s, ã, r̃, s′) sampled from a replay buffer, along with the probability q(ã|s) (possibly unnormalized) of sampling the action ã, where q has full support. Then we draw a phantom action a ∼ N(ã, Σ(s)) and optimize Q̃_w by minimizing a weighted Bellman error

(25)  E [ c · ( Q̃_w(s, a) − r̃ − γ Q̃_w(s′, μ(s′)) )² ],

with weight c = N(ã | a, Σ(s)) / q(ã | s). In this way, for a specific pair of state and action (s, a) the expected objective value is,

(26)  E_{ã∼q(·|s)} [ c · ( Q̃_w(s, a) − r̃ − γ Q̃_w(s′, μ(s′)) )² ].

Note that the denominator of c counteracts the expectation over ã ∼ q(·|s) in (26), and that the numerator of c is the Gaussian density N(ã | a, Σ(s)) appearing in (19). Therefore, when the target value is treated as fixed (i.e., using a target network), this training procedure reaches an optimum when Q̃_w(s, a) takes on the target value provided in the Bellman equation (19).
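In the bandit case (no bootstrap term), the weighted objective above reduces to a weighted regression whose minimizer is exactly the smoothed value. The sketch below, with a near-uniform behavior distribution (so q is constant and cancels) and an assumed reward, illustrates this; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma, a = 0.5, 0.0                            # smoothing std and phantom action a
r = lambda x: 1.0 - x ** 2                     # assumed reward; smoothed value at
                                               # a = 0 is 1 - sigma^2 = 0.75
x = rng.uniform(-2.0, 2.0, size=200000)        # replay-buffer actions, q constant
wts = np.exp(-0.5 * ((x - a) / sigma) ** 2)    # Gaussian weights ~ N(x | a, sigma^2)
q_tilde_est = (wts * r(x)).sum() / wts.sum()   # minimizer of the weighted error
```

The weighted average recovers E_{x∼N(a, σ²)}[r(x)], i.e., the smoothed value at the phantom action, which is what the weighted Bellman error drives Q̃_w(s, a) toward.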
In practice, we find that it is unnecessary to keep track of the probabilities q(ã|s), and assume the replay buffer provides a near-uniform distribution of actions conditioned on states. Other recent work has also benefited from ignoring or heavily damping importance weights (Munos et al., 2016; Wang et al., 2017; Schulman et al., 2017). However, it is possible when interacting with the environment to save the probability of sampled actions along with their transitions, and thus have access to q(ã|s).

4.3 Proximal Policy Optimization
Policy gradient algorithms are notoriously unstable, particularly in continuous control problems. Such instability has motivated the development of trust region methods that constrain each gradient step to lie within a trust region (Schulman et al., 2015), or augment the expected reward objective with a penalty on the KL-divergence from a previous policy (Nachum et al., 2018; Schulman et al., 2017). These stabilizing techniques have thus far not been applicable to algorithms like DDPG, since the policy is deterministic. The formulation we propose above, however, is easily amenable to trust region optimization. Specifically, we may augment the objective (17) with a penalty

(27)  − λ E_{s∼ρ^π(s)} [ KL( π_θ(·|s) ‖ π_prev(·|s) ) ],

where π_prev is a previous version of the policy. The optimization remains straightforward, since the KL-divergence of two Gaussians can be expressed analytically.
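For diagonal-covariance Gaussian policies, the penalty can use the closed-form KL; the helper below is the standard formula, written here as an illustration rather than the paper's implementation.

```python
import numpy as np

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ):
    0.5 * sum( log(var1 / var0) + (var0 + (mu0 - mu1)^2) / var1 - 1 )."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

# KL between a policy and itself is 0; it grows as the policies diverge.
kl_same = kl_diag_gaussians(np.array([0.0]), np.array([1.0]),
                            np.array([0.0]), np.array([1.0]))
kl = kl_diag_gaussians(np.array([1.0]), np.array([2.0]),
                       np.array([0.0]), np.array([1.0]))
```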
This concludes the technical presentation of the proposed algorithm Smoothie: pseudocode for the full training procedure, including policy improvement, policy evaluation, and proximal policy improvement is given in Algorithm 1. The reader may also refer to the appendix for additional implementation details.
4.4 Compatible Function Approximation
The approximator Q̃_w for Q̃^π should be sufficiently accurate so that the updates for φ and ψ are not affected by substituting ∇_a Q̃_w and ∇²_a Q̃_w for ∇_a Q̃^π and ∇²_a Q̃^π, respectively. Define ε_k(s; w) as the difference between the true k-th derivative of Q̃^π and the k-th derivative of the approximated Q̃_w at a = μ(s):

(28)  ε_k(s; w) = ∇^k_a Q̃^π(s, a) |_{a=μ(s)} − ∇^k_a Q̃_w(s, a) |_{a=μ(s)}.

We claim that a Q̃_w is compatible with respect to μ_φ if

1. ∇_a Q̃_w(s, a) |_{a=μ(s)} is linear in the features ∇_φ μ_φ(s), and

2. w minimizes E_{s∼ρ^π(s)} [ ‖ε_1(s; w)‖² ] (i.e., minimizes the expected squared error of the gradients).

Additionally, Q̃_w is compatible with respect to Σ_ψ if

1. ∇²_a Q̃_w(s, a) |_{a=μ(s)} is linear in the features ∇_ψ Σ_ψ(s), and

2. w minimizes E_{s∼ρ^π(s)} [ ‖ε_2(s; w)‖² ] (i.e., minimizes the expected squared error of the Hessians).
We provide a proof of these claims in the Appendix. One possible parameterization of Q̃_w satisfying these conditions may be achieved by taking δ = a − μ_φ(s) and parameterizing

(29)  Q̃_w(s, a) = V(s) + δ^⊤ G(s) + (1/2) δ^⊤ H(s) δ,

where V(s) is a state-value estimate and G(s) and H(s) are linear in the compatible features above. Similar conditions and parameterizations exist for DDPG (Silver et al., 2014), in terms of Q^π. While it is reassuring to know that there exists a class of function approximators which are compatible, this fact has largely been ignored in practice. At first glance, it seems impossible to satisfy the second set of conditions without access to derivative information of the true Q̃^π (for DDPG, the true Q^π). Indeed, the methods for training Q-value approximators (equation (12) and Section 4.2) only train the approximator to minimize an error between raw scalar values. For DDPG, we are unaware of any method that allows one to train Q_w to minimize an error with respect to the derivatives of the true Q^π. However, the case is different for the smoothed Q-values Q̃^π. In fact, it is possible to train Q̃_w to minimize an error with respect to the derivatives of the true Q̃^π. We provide an elaboration in the Appendix. In brief, it is possible to use (19) to derive Bellman-like equations which relate a derivative of Q̃^π of any degree to an integral over the raw Q-values at the next time step. Thus, it is possible to devise a training scheme in the spirit of the one outlined in Section 4.2, which optimizes Q̃_w to minimize the squared error with the derivatives of the true Q̃^π. This theoretical property of the smoothed Q-values is unique and provides added benefits to their use over the standard Q-values.
5 Related Work
This paper follows a long line of work that uses Q-value functions to stably learn a policy, which in the past has been used to either approximate expected (Rummery & Niranjan, 1994; Van Seijen et al., 2009; Gu et al., 2017) or optimal (Watkins, 1989; Silver et al., 2014; Nachum et al., 2017; Haarnoja et al., 2017; Metz et al., 2017) future value.
Work most similar to what we present are methods that exploit gradient information from the Q-value function to train a policy. Deterministic policy gradient (Silver et al., 2014) is perhaps the best known of these. The method we propose can be interpreted as a generalization of the deterministic policy gradient. Indeed, if one takes the limit of the policy covariance as it goes to 0, the proposed Q-value function becomes the deterministic value function of DDPG, and the updates for training the Q-value approximator and the policy mean are identical.
Stochastic Value Gradients (SVG) (Heess et al., 2015) also trains stochastic policies using an update that is similar to DDPG (i.e., SVG(0) with replay). The key differences with our approach are that SVG does not provide an update for the covariance, and the mean update in SVG estimates the gradient with a noisy Monte Carlo sample, which we avoid by estimating the smoothed Q-value function. Although a covariance update could be derived using the same reparameterization trick as in the mean update, it would also require a noisy Monte Carlo estimate. Methods for updating the covariance along the gradient of expected reward are essential for applying the subsequent trust region and proximal policy techniques.
More recently, Ciosek & Whiteson (2018) introduced expected policy gradients (EPG), a generalization of DDPG that provides updates for the mean and covariance of a stochastic Gaussian policy using gradients of an estimated Q-value function. In that work, the expected Q-value used in standard policy gradient algorithms such as SARSA (Sutton & Barto, 1998; Rummery & Niranjan, 1994; Van Seijen et al., 2009) is estimated. The updates in EPG therefore require approximating an integral of the expected Q-value function, or assuming the Q-value has a simple form that allows for analytic computation. Our analogous process directly estimates the integral (via the smoothed Q-value function) and avoids integral approximations, thereby making the updates simpler and generally applicable. Moreover, while Ciosek & Whiteson (2018) rely on a quadratic Taylor expansion of the estimated Q-value function, we instead rely on the strength of neural network function approximators to directly estimate the smoothed Q-value function.
The novel training scheme we propose for learning the covariance of a Gaussian policy relies on properties of Gaussian integrals (Bonnet, 1964; Price, 1958). Similar identities have been used in the past to derive updates for variational autoencoders (Kingma & Welling, 2014) and Gaussian backpropagation (Rezende et al., 2014).
Finally, the perspective presented in this paper, where Q-values represent the averaged return of a distribution of actions rather than a single action, is distinct from recent advances in distributional RL (Bellemare et al., 2017). Those approaches focus on the distribution of returns of a single action, whereas we consider the single average return of a distribution of actions. Although we restrict our attention in this paper to Gaussian policies, an interesting topic for further investigation is to study the applicability of this new perspective to a wider class of policy distributions.
6 Experiments
We utilize the insights from Section 4 to introduce a new RL algorithm, Smoothie. Smoothie maintains a parameterized approximation of the smoothed Q-value function, trained via the procedure described in Section 4.2. It then uses the gradient and Hessian of this approximation to train a Gaussian policy using the updates stated in (20) and (22). See Algorithm 1 for simplified pseudocode of the algorithm.
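As a rough illustration of the update structure (this is not the paper's Algorithm 1: the smoothed Q-value approximator is replaced by an analytic toy quadratic over a 1-D action, and all names are ours):

```python
import numpy as np

def q_tilde(a):
    return -(a - 2.0) ** 2        # toy smoothed Q-value, maximized at a = 2

def q_tilde_grad(a):
    return -2.0 * (a - 2.0)

def q_tilde_hess(a):
    return -2.0

mu, log_var = 0.0, 0.0            # Gaussian policy mean and log-variance
lr = 0.05
for _ in range(200):
    # mean update: ascend the gradient of the smoothed Q at the current mean
    mu += lr * q_tilde_grad(mu)
    # covariance update: ascend half the Hessian of the smoothed Q
    var = np.exp(log_var)
    log_var += lr * (0.5 * q_tilde_hess(mu)) * var  # chain rule for log-variance
```

Here the mean converges to the optimum while the variance shrinks in the concave region, mirroring the qualitative behavior described below for the synthetic task.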
We perform a number of evaluations of Smoothie, comparing it to DDPG. We choose DDPG as a baseline because it (1) utilizes gradient information of a Q-value approximator, much like the proposed algorithm; and (2) is a standard algorithm well known to achieve good, sample-efficient performance on continuous control benchmarks.
6.1 Synthetic Task
Before investigating benchmark problems, we first briefly revisit the simple task introduced in Section 3 and reproduced in Figure 2 (Right). Here, the reward function is a mixture of two Gaussians, one better than the other, and we initialize the policy mean to be centered on the worse of the two. We plot the policy mean and standard deviation during training for Smoothie and DDPG in Figure 2 (Left). Smoothie learns both the mean and the variance; DDPG learns only the mean, and the variance plotted for it is the exploratory noise, whose scale is kept fixed during training.
[Figure: training curves on HalfCheetah, Swimmer, Hopper, Walker2d, Ant, and Humanoid]
As expected, we observe that DDPG cannot escape the local optimum. At the beginning of training it exhibits some movement away from the local optimum (likely due to the noise in the initial Q-value approximation), but it is unable to progress far from the initial mean. Note that this is not an issue of exploration: the exploration scale is high enough that the Q-value approximator is aware of the better Gaussian. The issue is in the update for the policy mean, which considers only the derivative of the Q-value at the current mean.
On the other hand, we find that Smoothie easily solves the task. This is because the smoothed reward function it approximates has a derivative that clearly points toward the better Gaussian. We also observe that Smoothie suitably adjusts the covariance during training. Initially, the standard deviation decreases due to the concavity of the smoothed reward function. As a region of convexity is entered, it begins to increase, before again decreasing to near zero as the mean approaches the global optimum. This example clearly shows the ability of Smoothie to exploit the smoothed action value landscape.
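A minimal sketch of this synthetic setup, with mode locations and scales of our own choosing (not the paper's values), shows why the smoothed landscape helps: at the worse mode the raw reward gradient vanishes, while the smoothed reward gradient points toward the better mode.

```python
import numpy as np

def reward(a):
    good = 1.0 * np.exp(-0.5 * ((a - 2.0) / 0.4) ** 2)  # better Gaussian bump
    bad = 0.6 * np.exp(-0.5 * ((a + 1.0) / 0.4) ** 2)   # worse Gaussian bump
    return good + bad

def smoothed_reward(mu, sigma, n=200_000, seed=0):
    # Monte Carlo estimate of E_{a ~ N(mu, sigma^2)}[reward(a)]
    eps = np.random.default_rng(seed).standard_normal(n)
    return reward(mu + sigma * eps).mean()

mu0, sigma = -1.0, 1.5  # mean initialized at the worse mode, wide smoothing
h = 1e-2
raw_grad = (reward(mu0 + h) - reward(mu0 - h)) / (2 * h)
smooth_grad = (smoothed_reward(mu0 + h, sigma)
               - smoothed_reward(mu0 - h, sigma)) / (2 * h)
# raw_grad ~ 0 (mu0 sits on a local optimum), while smooth_grad > 0,
# pointing toward the better mode.
```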
6.2 Continuous Control
Next, we consider standard continuous control benchmarks available on OpenAI Gym (Brockman et al., 2016) utilizing the MuJoCo environment (Todorov et al., 2012).
Our implementations utilize feed-forward neural networks for the policy and Q-values. We parameterize the covariance as a diagonal matrix. The exploration for DDPG is determined by an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930; Lillicrap et al., 2016). Additional implementation details are provided in the Appendix.
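For reference, a minimal sketch of the Ornstein-Uhlenbeck exploration noise used by the DDPG baseline (the theta, sigma, and dt values below are common defaults, not necessarily the ones tuned in our experiments):

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, dt=1e-2, x0=0.0, seed=0):
    # Euler discretization of dx = -theta * x dt + sigma dW:
    # temporally correlated exploration noise, unlike i.i.d. Gaussian noise.
    rng = np.random.default_rng(seed)
    out = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        out[t] = x
    return out

noise = ou_noise(5000)
```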
We compare the results of Smoothie and DDPG in Figure 4. For each task we performed a hyperparameter search over actor learning rate, critic learning rate, and reward scale, and plot the average of six runs for the best hyperparameters. For DDPG we extended the hyperparameter search to also consider the scale and damping of exploratory noise provided by the Ornstein-Uhlenbeck process. Smoothie, on the other hand, contains an additional hyperparameter to determine the weight on the KL-penalty.
Despite DDPG having the advantage of its exploration decided by a hyperparameter search, while Smoothie must learn its exploration without supervision, we find that Smoothie performs competitively or better across all tasks, exhibiting a slight advantage in Swimmer and Ant and more dramatic improvements in Hopper, Walker2d, and Humanoid.
Nevertheless, on the more difficult continuous control tasks (Hopper, Walker2d, Ant, Humanoid), both DDPG and Smoothie suffer from instability, showing high variance across runs and performance far behind that of trust region and proximal policy methods (Schulman et al., 2015; Nachum et al., 2018; Schulman et al., 2017). Thus, we also evaluated Smoothie with a KL-penalty and present these results in Figure 5.
Empirically, we found that introducing a KL-penalty improves the performance of Smoothie, especially on the harder tasks; Figure 5 compares Smoothie with and without the KL-penalty on the four harder tasks. A KL-penalty to encourage stability is not possible in DDPG, whose policy is deterministic. Thus, Smoothie provides a much-needed remedy for the inherent instability of DDPG training.
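The KL-penalty relies on the KL divergence between successive Gaussian policies, which for diagonal covariances has the standard closed form sketched below (illustrative code, not the paper's implementation):

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    # KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), summed over dimensions
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1
                        - 1.0)
```

Penalizing this divergence between the current and previous policy discourages destructively large updates; a deterministic policy, as in DDPG, admits no such penalty.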
We observe that Smoothie augmented with a KL-penalty consistently improves performance. The improvement is especially dramatic for Hopper, where the average reward is doubled. We also highlight the results for Humanoid, which exhibits a significant improvement. As far as we know, these Humanoid results are the best published results for a method that trains on only the order of millions of environment steps. In contrast, TRPO, which to the best of our knowledge is the only other algorithm that can achieve better performance, requires on the order of tens of millions of environment steps. This gives added evidence to the benefits of using a learnable covariance rather than restricting the policy to be deterministic.
[Figure: Smoothie with and without KL-penalty on Hopper, Walker2d, Ant, and Humanoid]
7 Conclusion
We have presented a new notion of Q-values, the smoothed Q-value function, which is a Gaussian-smoothed version of the standard expected Q-value. The advantage of the smoothed Q-value is that its gradient and Hessian possess an intimate relationship with the gradient of expected reward with respect to the mean and covariance of a Gaussian policy. The resulting algorithm, Smoothie, successfully learns both the mean and covariance during training, leading to performance that surpasses that of DDPG, especially when incorporating a penalty on divergence from a previous policy.
Gu et al. (2017) showed that DDPG sits at one end of a spectrum of methods that interpolate between off-policy updates and on-policy policy gradient updates. Future work could determine whether Smoothie can also benefit from the improved stability of interpolated methods.
The success of the smoothed Q-value function is encouraging. Intuitively, it appears advantageous to learn smoothed Q-values rather than standard expected Q-values: by definition, smoothing makes the true reward surface smoother, and thus possibly easier to learn; moreover, smoothed Q-values have a more direct relationship with the expected discounted return objective. We encourage further investigation of these claims, and of techniques for applying the underlying motivations behind smoothed Q-values to other classes of policy distributions.
8 Acknowledgments
We thank Luke Metz, Sergey Levine, and the Google Brain team for insightful comments and discussions.
References
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, pp. 449–458, 2017.
 Bonnet (1964) Bonnet, G. Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220, 1964.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540, 2016.
 Ciosek & Whiteson (2018) Ciosek, K. and Whiteson, S. Expected policy gradients. AAAI, 2018.
 Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. ICML, 2012.
 Gu et al. (2017) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging onpolicy and offpolicy gradient estimation for deep reinforcement learning. NIPS, 2017.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. ICML, 2017.
 Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.
 Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In NIPS, 2000.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR, 2016.
 Metz et al. (2017) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep RL. CoRR, abs/1705.05035, 2017. URL http://arxiv.org/abs/1705.05035.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient offpolicy reinforcement learning. In NIPS, 2016.
 Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. NIPS, 2017.
 Nachum et al. (2018) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. ICLR, 2018.
 Price (1958) Price, R. A useful theorem for nonlinear devices having gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
 Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. 1994.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Uhlenbeck & Ornstein (1930) Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.
 Van Seijen et al. (2009) Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected Sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on, pp. 177–184. IEEE, 2009.
 Wang et al. (2017) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. ICLR, 2017.
 Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
 Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
Appendix A Proof of Theorem 1
We want to show that, for any state-action pair (s, a),

\nabla_\Sigma \tilde{Q}^\pi(s, a) = \frac{1}{2} \nabla_a \nabla_a \tilde{Q}^\pi(s, a). \qquad (30)
We note that similar identities for Gaussian integrals exist in the literature (Price, 1958; Rezende et al., 2014) and point the reader to these works for further information.
Proof. The specific identity we state may be derived using standard matrix calculus. We make use of the fact that

\nabla_\mu \mathcal{N}(a \mid \mu, \Sigma) = \mathcal{N}(a \mid \mu, \Sigma)\, \Sigma^{-1} (a - \mu), \qquad (31)

and, for symmetric \Sigma,

\nabla_\Sigma \mathcal{N}(a \mid \mu, \Sigma) = \frac{1}{2}\, \mathcal{N}(a \mid \mu, \Sigma) \left( \Sigma^{-1} (a - \mu)(a - \mu)^\top \Sigma^{-1} - \Sigma^{-1} \right). \qquad (32)
We omit the state argument s from \tilde{Q}^\pi in the following equations for succinctness. The LHS of (30) is
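As a numerical sanity check of the scalar versions of the Gaussian density derivative identities (31) and (32) (illustration only, not part of the proof):

```python
import numpy as np

def normal_pdf(a, mu, var):
    # scalar Gaussian density N(a | mu, var)
    return np.exp(-0.5 * (a - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, mu, var = 0.7, 0.2, 0.9
h = 1e-5
p = normal_pdf(a, mu, var)

# (31), scalar case: d/dmu N(a | mu, var) = N * (a - mu) / var
dmu_fd = (normal_pdf(a, mu + h, var) - normal_pdf(a, mu - h, var)) / (2 * h)
dmu_id = p * (a - mu) / var

# (32), scalar case: d/dvar N(a | mu, var) = (N/2) * ((a - mu)^2 / var^2 - 1 / var)
dvar_fd = (normal_pdf(a, mu, var + h) - normal_pdf(a, mu, var - h)) / (2 * h)
dvar_id = 0.5 * p * ((a - mu) ** 2 / var ** 2 - 1.0 / var)
```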
Appendix B Compatible Function Approximation
A function approximator of the smoothed Q-value should be sufficiently accurate that updates to the policy parameters are unaffected by substituting the gradient and Hessian of the approximation for those of the true smoothed Q-value. We claim that such an approximator is compatible with respect to the policy if:

,
