Off-Policy Evaluation via Off-Policy Classification

06/04/2019
by Alex Irpan, et al.

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications such as robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment or use importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations such as images, models of the environment can be difficult to fit, and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We show experimentally that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.
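To make the PU-classification framing concrete, here is a minimal sketch of a "soft" OPC-style score under the setup the abstract describes: the Q-function serves as the classifier's decision function, (state, action) pairs drawn from successful episodes are treated as positives, and all remaining pairs are unlabeled rather than negative. The function name soft_opc_score and the synthetic checkpoint data are hypothetical illustrations, not the authors' code; only the positives-versus-all structure of the score follows the paper's framing.

import numpy as np

def soft_opc_score(q_values: np.ndarray, from_success: np.ndarray) -> float:
    """Score a candidate Q-function on logged off-policy data.

    q_values:     Q(s, a) evaluated on logged (state, action) pairs.
    from_success: True where the pair comes from an episode that ended in
                  success (the "positive" set in PU terms); every other
                  pair is treated as unlabeled, not as a negative.

    Returns mean Q over positives minus mean Q over all pairs, so a
    Q-function that ranks successful experience above the rest of the
    data receives a higher score.
    """
    q = np.asarray(q_values, dtype=np.float64)
    pos = np.asarray(from_success, dtype=bool)
    if not pos.any():
        raise ValueError("need at least one pair from a successful episode")
    return float(q[pos].mean() - q.mean())

# Hypothetical model selection: score several checkpoints on the same
# held-out off-policy dataset and keep the highest-scoring one.
rng = np.random.default_rng(0)
labels = rng.random(1000) < 0.3                      # which pairs led to success
checkpoints = {
    "ckpt_a": rng.normal(size=1000) + 0.5 * labels,  # Q correlated with success
    "ckpt_b": rng.normal(size=1000),                 # Q uninformative
}
scores = {name: soft_opc_score(q, labels) for name, q in checkpoints.items()}
print(max(scores, key=scores.get))                   # expected: "ckpt_a"

Because the score only requires Q-values on logged transitions and episode outcomes, it needs neither an environment model nor importance weights, which is what makes it usable for early stopping and hyperparameter selection from off-policy data alone.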


Related research

07/17/2020 · Hyperparameter Selection for Offline Reinforcement Learning
Offline reinforcement learning (RL purely from logged data) is an import...

09/27/2020 · Predicting Sim-to-Real Transfer with Probabilistic Dynamics Models
We propose a method to predict the sim-to-real transfer performance of R...

05/20/2018 · Learning Real-World Robot Policies by Dreaming
Learning to control robots directly based on images is a primary challen...

09/04/2023 · Marginalized Importance Sampling for Off-Environment Policy Evaluation
Reinforcement Learning (RL) methods are typically sample-inefficient, ma...

06/02/2023 · Deep Q-Learning versus Proximal Policy Optimization: Performance Comparison in a Material Sorting Task
This paper presents a comparison between two well-known deep Reinforceme...

04/20/2002 · Learning from Scarce Experience
Searching the space of policies directly for the optimal policy has been...

09/17/2021 · Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications
Sentence-level Quality estimation (QE) of machine translation is traditi...
