Policy Evaluation Networks

02/26/2020
by   Jean Harb, et al.
6

Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (learned Policy Evaluation Network, policy fingerprints, gradient ascent) can produce policies that outperform those that generated the training data, in zero-shot manner.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/04/2022

General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

Learning to evaluate and improve policies is a core problem of Reinforce...
research
04/21/2022

PG3: Policy-Guided Planning for Generalized Policy Generation

A longstanding objective in classical planning is to synthesize policies...
research
05/21/2014

Off-Policy Shaping Ensembles in Reinforcement Learning

Recent advances of gradient temporal-difference methods allow to learn o...
research
04/20/2002

Learning from Scarce Experience

Searching the space of policies directly for the optimal policy has been...
research
01/17/2022

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforce...
research
05/07/2020

Cascade Attribute Network: Decomposing Reinforcement Learning Control Policies using Hierarchical Neural Networks

Reinforcement learning methods have been developed to achieve great succ...
research
04/08/2022

Approximate discounting-free policy evaluation from transient and recurrent states

In order to distinguish policies that prescribe good from bad actions in...

Please sign up or login with your details

Forgot password? Click here to reset