Off-Policy Evaluation via Off-Policy Classification

06/04/2019 ∙ by Alex Irpan, et al. ∙ Google 5

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment, or the use of importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real-world of policies trained in simulation for an image-based robotic manipulation task.




1 Introduction

Supervised learning has seen significant advances in recent years, in part due to the use of large, standardized datasets (Deng et al., 2009). When researchers can evaluate the real performance of their methods on the same data via a standardized offline metric, the progress of the field can be rapid. Unfortunately, such metrics have been lacking in reinforcement learning (RL). Model selection and performance evaluation in RL are typically done by estimating the average on-policy return of a method in the target environment. Although this is possible in most simulated environments (Todorov et al., 2012; Bellemare et al., 2013; Brockman et al., 2016), real-world environments, such as those in robotics, make this difficult and expensive (Thomas et al., 2015). Off-policy evaluation (OPE) has the potential to change that: a robust off-policy metric could be used together with realistic and complex data to evaluate the expected performance of off-policy RL methods, which would enable rapid progress on important real-world RL problems. Furthermore, it would greatly simplify transfer learning in RL, where OPE would enable model selection and algorithm design in simple domains (e.g., simulation) while evaluating the performance of these models and algorithms on complex domains (e.g., using previously collected real-world data).

Previous approaches to off-policy evaluation (Precup et al., 2000; Dudik et al., 2011; Jiang & Li, 2015; Thomas & Brunskill, 2016) generally use importance sampling (IS) or learned dynamics models. However, this makes them difficult to use with many modern deep RL algorithms. First, OPE is most useful in the off-policy RL setting, where we expect to use real-world data as the “validation set”, but many of the most commonly used off-policy RL methods are based on value function estimation, produce deterministic policies (Lillicrap et al., 2015; van Hasselt et al., 2016), and do not require any knowledge of the policy that generated the real-world training data. This makes them difficult to use with IS. Furthermore, many of these methods might be used with high-dimensional observations, such as images. Although there has been considerable progress in predicting future images (Babaeizadeh et al., 2018; Lee et al., 2018), learning sufficiently accurate models in image space for effective evaluation is still an open research problem. We therefore aim to develop an OPE method that requires neither IS nor models.

We observe that for model selection, it is sufficient to predict some statistic correlated with policy return, rather than directly predict policy return. We address the specific case of binary-reward MDPs: tasks where the agent receives a non-zero reward only once during an episode, at the final timestep (Sect. 2). These can be interpreted as tasks where the agent can either “succeed” or “fail” in each trial. Although they form a subset of all possible MDPs, this subset is representative of many real-world tasks and is actively used, e.g., in robotic manipulation (Kalashnikov et al., 2018; Riedmiller et al., 2018). The novel contribution of our method (Sect. 3) is to frame OPE as a positive-unlabeled (PU) classification problem (Kiryo et al., 2017), which provides a way to derive OPE metrics that (a) are fundamentally different from prior methods based on IS and model learning, and (b) perform well in practice on both simulated and real-world tasks. Additionally, we identify and present (in the section on generalization scenarios) a list of generalization scenarios in RL that we would want our metrics to be robust against. We experimentally show (Sect. 4) that our suggested OPE metrics outperform a variety of baseline methods across all of the evaluation scenarios, including a simulation-to-reality transfer scenario for a vision-based robotic grasping task (see Fig. 1(b)).

(a) Visual summary of off-policy metrics
(b) Robotics grasping simulation-to-reality gap.
Figure 1: (a) Visual illustration of our method: We propose using classification-based approaches to do off-policy evaluation. Solid curves represent $Q(s,a)$ over a positive and negative trajectory, with the dashed curve representing $\max_{a'} Q(s,a')$ along states the positive trajectory visits (the corresponding negative curve is omitted for simplicity). Baseline approaches (blue, red) measure Q-function fit between $Q(s,a)$ and $\max_{a'} Q(s,a')$. Our approach (purple) directly measures separation of $Q(s,a)$ between positive and negative trajectories. (b) The visual “reality gap” of our most challenging task: off-policy evaluation of the generalization of image-based robotic agents trained solely in simulation (left) using historical data from the target real-world environment (right).

2 Preliminaries

We focus on finite-horizon, deterministic Markov decision processes (MDPs). We define an MDP as $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{S}_0, r, \gamma)$. $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, and both can be continuous. $\mathcal{P}$ defines transitions to next states given the current state and action, $\mathcal{S}_0$ defines the initial state distribution, $r$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor. Episodes are of finite length $T$: at a given time-step $t$ the agent is at state $s_t \in \mathcal{S}$, samples an action $a_t \in \mathcal{A}$ from a policy $\pi$, receives a reward $r_t = r(s_t, a_t)$, and observes the next state $s_{t+1}$ as determined by $\mathcal{P}$.

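The goal in RL is to learn a policy $\pi(a_t | s_t)$ that maximizes the expected episode return $R(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)\right]$. The value of a policy for a given state $s_t$ is defined as $V^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a^\pi_{t'})\right]$, where $a^\pi_t$ is the action $\pi$ takes at state $s_t$ and $\mathbb{E}_\pi$ implies an expectation over trajectories $\tau = (s_1, a_1, \ldots, s_T, a_T)$ sampled from $\pi$. Given a policy $\pi$, the expected value of its action $a_t$ at a state $s_t$ is called the Q-value and is defined as $Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[r(s_t, a_t) + \gamma V^\pi(s_{t+1})\right]$.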

We assume the MDP is a binary reward MDP, which satisfies the following properties: transitions are deterministic, $\gamma = 1$, the reward is $r_t = 0$ at all intermediate steps, and the final reward $r_T$ is in $\{0, 1\}$, indicating whether the final state is a failure or a success. We learn Q-functions $Q(s,a)$ and aim to evaluate policies $\pi(s) = \arg\max_a Q(s,a)$.

2.1 Positive-unlabeled learning

Positive-unlabeled (PU) learning is a set of techniques for learning binary classification from partially labeled data, where we have many unlabeled points and some positively labeled points (Kiryo et al., 2017). We will make use of these ideas in developing our OPE metric. Positive-unlabeled data is sufficient to learn a binary classifier if the positive class prior $p(y=1)$ is known.

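Let $(X, Y)$ be a labeled binary classification problem, where $Y \in \{0, 1\}$. Let $g: X \to \mathbb{R}$ be some decision function, and let $\ell: \mathbb{R} \times \{0, 1\} \to \mathbb{R}$ be our loss function. Suppose we want to evaluate the loss $\ell(g(x), y)$ over negative examples $(x, y=0)$, but we only have unlabeled points $x$ and positively labeled points $(x, y=1)$. The key insight of PU learning is that the loss over negatives can be indirectly estimated from $p(y=1)$. For any $x \in X$,

$$p(x) = p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0) \qquad (1)$$

It follows that for any $f(x)$, $\mathbb{E}_{X}[f(x)] = p(y=1)\,\mathbb{E}_{X|Y=1}[f(x)] + p(y=0)\,\mathbb{E}_{X|Y=0}[f(x)]$, since by definition $\mathbb{E}_X[f(x)] = \int p(x) f(x)\,dx$. Letting $f(x) = \ell(g(x), 0)$ and rearranging gives

$$p(y=0)\,\mathbb{E}_{X|Y=0}[\ell(g(x), 0)] = \mathbb{E}_{X}[\ell(g(x), 0)] - p(y=1)\,\mathbb{E}_{X|Y=1}[\ell(g(x), 0)] \qquad (2)$$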
In Sect. 3, we reduce off-policy evaluation of a policy $\pi$ to a classification problem, provide reasoning for how to estimate $p(y=1)$, use PU learning to estimate classification error with Eqn. 2, then use the error to estimate a lower bound on return with Theorem 1.
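As a numerical illustration of Eqn. 2, the sketch below estimates the negative-class risk term $p(y=0)\,\mathbb{E}_{X|Y=0}[\ell(g(x),0)]$ as "mean loss over unlabeled points minus prior times mean loss over positive points", and compares it with an oracle that sees the true labels. The function name `pu_negative_risk`, the synthetic Gaussian data, the identity decision function and the sigmoid loss are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def pu_negative_risk(loss_unlabeled, loss_positive, prior):
    """Estimate p(y=0) * E[l(g(x), 0) | y=0] without any negative labels.

    loss_unlabeled: l(g(x), 0) on samples drawn from p(x).
    loss_positive:  l(g(x), 0) on samples drawn from p(x | y=1).
    prior:          positive class prior p(y=1), assumed known.
    """
    return float(np.mean(loss_unlabeled) - prior * np.mean(loss_positive))

def sigmoid_loss_for_negative(scores):
    """Hypothetical soft loss l(g, 0): large when the score g is high."""
    return 1.0 / (1.0 + np.exp(-scores))

rng = np.random.default_rng(0)
prior = 0.3
n = 200_000

# Synthetic mixture: positives ~ N(+1, 1), negatives ~ N(-1, 1), g(x) = x.
y = rng.random(n) < prior
x = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
x_pos = rng.normal(1.0, 1.0, 50_000)  # separately drawn labeled positives

estimate = pu_negative_risk(
    sigmoid_loss_for_negative(x), sigmoid_loss_for_negative(x_pos), prior
)
# Oracle uses the true labels, which a PU learner would not have.
oracle = float(np.mean(sigmoid_loss_for_negative(x) * ~y))
print(f"PU estimate: {estimate:.4f}, oracle: {oracle:.4f}")
```

The two numbers agree up to sampling noise, which is the point of the identity: the negative examples never need to be identified individually.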

3 Off-policy evaluation via state-action pair classification

A Q-function $Q(s,a)$ predicts the expected return of each action $a$ given state $s$. The policy $\pi(s) = \arg\max_a Q(s,a)$ can be viewed as a classifier that predicts the best action. We propose an off-policy evaluation method connecting off-policy evaluation to estimating validation error for a positive-unlabeled (PU) classification problem (Kiryo et al., 2017). Our metric can be used with Q-function estimation methods without requiring importance sampling, and can be readily applied in a scalable way to image-based deep RL tasks.

We present an analysis for binary reward MDPs, defined in Sect. 2. In a binary reward MDP, the deterministic dynamics mean that each $(s,a)$ is either potentially effective, or guaranteed to lead to failure.

Definition 1.

In a binary reward MDP, $(s,a)$ is effective if an optimal policy $\pi^*$ can achieve success, i.e., an episode return of 1, after taking $a$ in $s$. Equivalently, there exists a sequence of future actions that reaches a success state. $(s,a)$ is catastrophic if no such sequence exists.

Under this definition, the return of a trajectory $\tau$ is 1 if and only if all $(s_t, a_t)$ in $\tau$ are effective (see Appendix A.1), the label for $(s,a)$ does not depend on the policy we are evaluating, and the classification error of $\pi$ at each time $t$ can be used to bound the return $R(\pi)$.

Theorem 1.

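Given a binary reward MDP and a policy $\pi$, let $\rho^+_{t,\pi}$ denote the state distribution at time $t$, given that $\pi$ was followed and all its previous actions $(a_1, \ldots, a_{t-1})$ were effective. Let $\mathcal{A}_-(s)$ denote the set of catastrophic actions at state $s$, and let $\epsilon_t = \mathbb{E}_{s_t \sim \rho^+_{t,\pi}}\left[\sum_{a \in \mathcal{A}_-(s_t)} \pi(a \mid s_t)\right]$ be the per-step expectation of $\pi$ making its first mistake at time $t$, with $\epsilon = \frac{1}{T}\sum_{t=1}^{T} \epsilon_t$ being the average error over all $(s_t, a_t)$. Then $R(\pi) \ge 1 - T\epsilon$, and this lower bound is tight.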


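Proof. The failure rate $1 - R(\pi)$ is the total probability that $\pi$ makes its first mistake at time $t$, summed over all $t$. For each $t$, the probability that $\pi$ follows $t-1$ effective actions and then a catastrophic action is at most $\epsilon_t$, giving $1 - R(\pi) \le \sum_{t=1}^{T} \epsilon_t = T\epsilon$. This gives $R(\pi) \ge 1 - T\epsilon$, which is tight when $\pi$ can only make a mistake at a single time step. ∎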

An alternative proof in Appendix A.3 is based on imitation learning behavioral cloning bounds from Ross & Bagnell (2010).

A smaller $\epsilon$ gives a higher lower bound on return, which implies a better $\pi$. This leaves estimating $\epsilon$. The primary challenge with this approach is that we do not have negative labels: for trials that receive a return of 0 in the validation set, we do not know which $(s,a)$ were in fact catastrophic, and which were recoverable. We discuss how we address this problem next.
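The bound in Theorem 1 can be checked numerically on a toy error profile. The sketch below assumes that the task is feasible from the initial state and that each per-step error is the probability of a catastrophic action conditioned on all earlier actions being effective, so the exact success probability is the product of the per-step survival probabilities. The function names and the example error values are hypothetical.

```python
import numpy as np

def exact_return(step_errors):
    """Success probability under the assumption stated above:
    the policy succeeds only if it avoids a catastrophic action at every step."""
    return float(np.prod(1.0 - np.asarray(step_errors, dtype=float)))

def theorem1_lower_bound(step_errors):
    """R(pi) >= 1 - T * eps, where eps is the mean per-step error."""
    step_errors = np.asarray(step_errors, dtype=float)
    horizon = len(step_errors)
    return 1.0 - horizon * float(np.mean(step_errors))

spread_errors = [0.02, 0.01, 0.03, 0.0, 0.04]   # mistakes possible at several steps
single_error = [0.1, 0.0, 0.0]                  # mistakes possible at one step only

for errors in (spread_errors, single_error):
    print(errors, exact_return(errors), theorem1_lower_bound(errors))
```

When the error is spread over several steps the bound is slightly loose, and when all of it sits on a single step the bound and the exact return coincide.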

3.1 Missing negative labels

Recall that $(s_t, a_t)$ is effective if $\pi^*$ can succeed from $s_{t+1}$. Since $\pi^*$ is at least as good as $\pi_b$, whenever $\pi_b$ succeeds, all tuples $(s_t, a_t)$ in the trajectory $\tau$ must be effective. However, the converse is not true, since $\pi^*$ could succeed from $(s_t, a_t)$ where $\pi_b$ failed. This is an instance of the positive-unlabeled (PU) learning problem from Sect. 2.1, where $\pi_b$ positively labels some $(s,a)$ and the remaining are unlabeled. In the RL setting, inputs are $(s,a) \in \mathcal{S} \times \mathcal{A}$, labels $\{0, 1\}$ correspond to {catastrophic, effective}, and a natural choice for the decision function $g$ is $g(s,a) = Q(s,a)$, since $Q(s,a)$ should be high for effective $(s,a)$ and low for catastrophic $(s,a)$.
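The labeling rule described above can be sketched as a small data-preparation step: every state-action pair from a successful episode is marked positive, and everything else stays unlabeled rather than being marked negative. The `Episode` container and the `label_transitions` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Episode:
    """Hypothetical container for one validation episode."""
    transitions: List[Tuple[Any, Any]]  # (state, action) pairs in time order
    success: bool                       # final binary reward of the episode

def label_transitions(episodes):
    """Return (all_pairs, is_positive) for a list of episodes.

    is_positive[i] is True when all_pairs[i] comes from a successful episode,
    which makes it a known-effective pair. False means "unlabeled", not
    "catastrophic": a failed episode may still contain effective actions.
    """
    all_pairs, is_positive = [], []
    for episode in episodes:
        for pair in episode.transitions:
            all_pairs.append(pair)
            is_positive.append(episode.success)
    return all_pairs, is_positive

episodes = [
    Episode(transitions=[("s0", "a0"), ("s1", "a1")], success=True),
    Episode(transitions=[("s0", "a2"), ("s2", "a0"), ("s3", "a1")], success=False),
]
pairs, positive = label_transitions(episodes)
print(sum(positive), "positive pairs out of", len(pairs))
```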

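We aim to estimate $\epsilon$, the probability that $\pi$ takes a catastrophic action, i.e., that $(s, \pi(s))$ is a false positive. Note that if $(s, \pi(s))$ is predicted to be catastrophic but is actually effective, this false negative does not impact future reward: since the action is effective, there is still a path to reach success. We want just the false-positive risk, $\epsilon = p(y=0)\,\mathbb{E}_{X|Y=0}[\ell(g(x), 0)]$. This is the same as Eqn. 2, and using $g(s,a) = Q(s,a)$ gives

$$\epsilon = \mathbb{E}_{(s,a)}[\ell(Q(s,a), 0)] - p(y=1)\,\mathbb{E}_{(s,a),\,y=1}[\ell(Q(s,a), 0)] \qquad (3)$$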
Eqn. 3 is the core of all our proposed metrics. While it might at first seem that the class prior $p(y=1)$ should be task-dependent, recall that the error $\epsilon_t$ is the expectation over the state distribution $\rho^+_{t,\pi}$, where the actions $(a_1, \ldots, a_{t-1})$ were all effective. This is equivalent to following an optimal “expert” policy $\pi^*$, and although we are estimating $\epsilon_t$ from data generated by the behavior policy $\pi_b$, we should match the positive class prior $p(y=1)$ we would observe from the expert $\pi^*$. Assuming the task is feasible, meaning that the policy has effective actions available from the start, we have $R(\pi^*) = 1$. Therefore, although the validation dataset will likely have both successes and failures, a prior of $p(y=1) = 1$ is the ideal prior, and this holds independently of the environment. We illustrate this further with a didactic example in Sect. 4.1.

Theorem 1 relies on estimating $\epsilon$ over the distribution $\rho^+_{t,\pi}$, but our dataset $\mathcal{D}$ is generated by an unknown behavior policy $\pi_b$. A natural approach here would be importance sampling (IS) (Dudik et al., 2011), but: (a) we assume no knowledge of $\pi_b$, and (b) IS is not well-defined for deterministic policies $\pi(s) = \arg\max_a Q(s,a)$. Another approach is to subsample $\mathcal{D}$ to transitions $(s,a)$ where $a = \pi(s)$ (Liu et al., 2018). This ensures an on-policy evaluation, but can encounter finite sample issues if $\pi_b$ does not sample $\pi(s)$ frequently enough. Therefore, we assume classification error over $\mathcal{D}$ is a good enough proxy that correlates well with classification error over $\rho^+_{t,\pi}$. This is admittedly a strong assumption, but empirical results in Sect. 4 show surprising robustness to distributional mismatch. This assumption is reasonable if $\mathcal{D}$ is broad (e.g., generated by a sufficiently random policy), but may produce pessimistic estimates when potential effective actions in $\mathcal{D}$ are unlabeled.

3.2 Off-policy classification for OPE

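Based on the derivation in Sect. 3.1, our proposed off-policy classification (OPC) score is defined by the negative loss when $\ell$ in Eqn. 3 is the 0-1 loss. Let $b$ be a threshold, with $\ell(Q(s,a), Y) = \frac{1}{2} + \left(\frac{1}{2} - Y\right)\mathrm{sign}(Q(s,a) - b)$. This gives

$$\mathrm{OPC}(Q) = p(y=1)\,\mathbb{E}_{(s,a),\,y=1}\left[1_{Q(s,a) > b}\right] - \mathbb{E}_{(s,a)}\left[1_{Q(s,a) > b}\right] \qquad (4)$$

To be fair to each $Q(s,a)$, the threshold $b$ is set separately for each Q-function to maximize $\mathrm{OPC}(Q)$. Given $N$ transitions and $Q(s,a)$ for all $(s,a) \in \mathcal{D}$, the best $b$ for each $Q$ can be computed by sorting all Q-values, then scanning every threshold $b$, which takes $O(N \log N)$ time per Q-function (see Appendix B). This avoids favoring Q-functions that systematically overestimate or underestimate the true value.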
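The sort-and-scan threshold search can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `opc_score`, the unweighted averaging over transitions, and the default prior of 1 are assumptions made here. The score at each threshold is the fraction of positive-labeled pairs with Q above the threshold (scaled by the prior) minus the fraction of all pairs with Q above the threshold.

```python
import numpy as np

def opc_score(q_values, is_positive, prior=1.0):
    """Best thresholded score over all cut points between sorted Q-values."""
    q = np.asarray(q_values, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    n, n_pos = len(q), int(pos.sum())
    if n_pos == 0:
        raise ValueError("need at least one positive-labeled transition")

    # Sort by descending Q so that "the top k items" are those above a threshold.
    order = np.argsort(-q, kind="stable")
    q_sorted, pos_sorted = q[order], pos[order]

    k = np.arange(1, n + 1)              # number of items above the threshold
    pos_above = np.cumsum(pos_sorted)    # positives among those items
    scores = prior * pos_above / n_pos - k / n

    # A cut after position k is only realizable if the next Q-value differs.
    realizable = np.append(q_sorted[:-1] != q_sorted[1:], True)

    # A threshold above every Q-value classifies nothing as effective (score 0).
    return float(max(0.0, scores[realizable].max()))

labels = [True, True, False, False]
print(opc_score([0.9, 0.8, 0.2, 0.1], labels))  # Q separates positives well
print(opc_score([0.1, 0.2, 0.8, 0.9], labels))  # Q ranks positives last
```

The sort dominates the cost, and the scan itself is a single cumulative sum, which matches the per-Q-function cost quoted in the text.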

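Alternatively, $\ell$ can be a soft loss function. We experimented with $\ell(Q(s,a), Y) = (1 - 2Y)\,Q(s,a)$, which is minimized when $Q(s,a)$ is large for $Y = 1$ and small for $Y = 0$. The negative of this loss is called the SoftOPC.

$$\mathrm{SoftOPC}(Q) = p(y=1)\,\mathbb{E}_{(s,a),\,y=1}\left[Q(s,a)\right] - \mathbb{E}_{(s,a)}\left[Q(s,a)\right] \qquad (5)$$

If episodes have different lengths, to avoid focusing on long episodes, transitions $(s,a)$ from an episode of length $T$ are weighted by $\frac{1}{T}$ when estimating OPC and its variants.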
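The soft variant and the episode-length weighting can be sketched together: the score is the prior times the weighted mean Q-value over positive-labeled pairs, minus the weighted mean Q-value over all pairs, where each transition is weighted by the inverse of its episode's length. The function name `soft_opc_score` and its argument layout are assumptions made for this illustration.

```python
import numpy as np

def soft_opc_score(q_values, is_positive, episode_lengths=None, prior=1.0):
    """prior * mean(Q | positive) - mean(Q), with optional 1/T weighting.

    episode_lengths[i] is the length T of the episode that transition i
    belongs to; if omitted, all transitions are weighted equally.
    """
    q = np.asarray(q_values, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    if episode_lengths is None:
        weights = np.ones_like(q)
    else:
        weights = 1.0 / np.asarray(episode_lengths, dtype=float)
    mean_positive = np.average(q[pos], weights=weights[pos])
    mean_all = np.average(q, weights=weights)
    return float(prior * mean_positive - mean_all)

# One successful episode of length 2 and one failed episode of length 4.
q = [0.9, 0.8, 0.2, 0.1, 0.3, 0.0]
positive = [True, True, False, False, False, False]
lengths = [2, 2, 4, 4, 4, 4]

print(soft_opc_score(q, positive))           # long episode dominates the mean
print(soft_opc_score(q, positive, lengths))  # each episode counts equally
```

With the weighting, both episodes contribute the same total weight to the overall mean, so the long failed episode no longer dominates it.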

3.3 Evaluating OPE metrics

The standard evaluation method for OPE is to report MSE to the true episode return (Thomas & Brunskill, 2016; Liu et al., 2018). However, our metrics do not estimate episode return directly. The OPC's estimate of $\epsilon$ will differ from the true value, since it is estimated over our dataset $\mathcal{D}$ instead of over the distribution $\rho^+_{t,\pi}$. Meanwhile, SoftOPC does not estimate $\epsilon$ directly due to using a soft loss function. Despite this, the OPC and SoftOPC are still useful OPE metrics if they correlate well with $\epsilon$ or episode return $R(\pi)$.

We propose an alternative evaluation method. Instead of reporting MSE, we train a large suite of Q-functions $Q(s,a)$ with different learning algorithms, evaluate the true return of the equivalent argmax policy for each $Q(s,a)$, then compare the correlation of the metric to true return. We report two correlations: the coefficient of determination $R^2$ of the line of best fit, and the Spearman rank correlation (Spearman, 1904). (We slightly abuse notation here: $R^2$ denotes the coefficient of determination and should not be confused with $R(\pi)$, the average return of a policy $\pi$.) $R^2$ measures confidence in how well our linear best fit will predict returns of new models, whereas the Spearman rank correlation measures confidence that the metric ranks different policies correctly, without assuming a linear best fit.
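Both correlations can be computed with NumPy alone, as in the sketch below. It relies on two standard facts: for a least-squares line with an intercept, the coefficient of determination equals the squared Pearson correlation, and the Spearman correlation is the Pearson correlation of the ranks. The helper names and the example numbers are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def r_squared(metric_scores, true_returns):
    """R^2 of the least-squares line predicting return from the metric."""
    return pearson(metric_scores, true_returns) ** 2

def spearman(metric_scores, true_returns):
    """Rank correlation: 1.0 whenever the metric orders the policies correctly."""
    return pearson(average_ranks(metric_scores), average_ranks(true_returns))

# Hypothetical scores for four policies whose return grows nonlinearly
# with the metric: the ranking is perfect but the linear fit is not.
metric_scores = [1.0, 2.0, 3.0, 4.0]
true_returns = [1.0, 4.0, 9.0, 16.0]
print(r_squared(metric_scores, true_returns), spearman(metric_scores, true_returns))
```

The example shows why both numbers are reported: a metric can rank policies perfectly while still being a poor linear predictor of return.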

4 Experiments

In this section, we investigate the correlation of OPC and SoftOPC with true average return, and how they may be used for model selection with off-policy data. We compare the correlation of these metrics with the correlation of the baselines, namely the TD Error, Sum of Advantages, and the MCC Error (see the section on baselines), in a number of environments and generalization failure scenarios. For each experiment, a validation dataset $\mathcal{D}$ is collected with a behavior policy $\pi_b$, and state-action pairs $(s,a)$ are labeled as effective whenever they appear in a successful trajectory. In line with Sect. 3.3, several Q-functions $Q(s,a)$ are trained for each task. For each $Q(s,a)$, we evaluate each metric over $\mathcal{D}$ and the true return of the equivalent argmax policy. We report both the coefficient of determination $R^2$ of the line of best fit and the Spearman rank correlation coefficient (Spearman, 1904). Our results are summarized in Table 1. Our OPC/SoftOPC metrics are implemented using $p(y=1) = 1$, as explained in Sect. 3 and Appendix D.

4.1 Simple Environments

Binary tree.

As a didactic toy example, we used a binary tree MDP with depth equal to the episode length $T$. (Code for the binary tree environment is available online.) In this environment, each node is a state $s_t$ with $r_t = 0$, unless it is a leaf/terminal state with reward $r_T \in \{0, 1\}$. Actions are {left, right}, and transitions are deterministic. We experimented with two extreme versions of this environment: (a) 1-Failure, where the agent is successful unless it reaches the single failure leaf with $r_T = 0$; and (b) 1-Success, where the agent fails unless it reaches the single success leaf with $r_T = 1$. In our experiments we used a full binary tree of depth $T$. The initial state distribution was uniform over all non-leaf nodes, which means that the initial state could sometimes be one from which failure is inevitable. The validation dataset $\mathcal{D}$ was collected by generating 1,000 episodes from a uniformly random policy. For the policies we wanted to evaluate, we generated 1,000 random Q-functions by sampling a value $Q(s,a)$ for every $(s,a)$, defining the policy as $\pi(s) = \arg\max_a Q(s,a)$. We compared the correlation of the actual on-policy performance of the policies with the scores given by the OPC, SoftOPC and the baseline metrics using $\mathcal{D}$, as shown in Table 1. SoftOPC correlates best and OPC correlates second best.
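A binary tree MDP of this kind is small enough to label every state-action pair exactly. The sketch below is an independent illustration, not the released environment code: it assumes heap-style node indexing (children of node `i` are `2i+1` and `2i+2`), integer actions 0 for left and 1 for right, and rollouts that start at the root, whereas the experiments sample the initial state from all non-leaf nodes.

```python
import numpy as np

def build_tree(depth, success_leaves):
    """Return (n_internal, reachable) for a full binary tree of the given depth.

    reachable[node] is True when some success leaf lies at or below node.
    """
    n_internal = 2 ** depth - 1
    n_nodes = 2 ** (depth + 1) - 1
    reachable = np.zeros(n_nodes, dtype=bool)
    reachable[list(success_leaves)] = True
    for node in range(n_internal - 1, -1, -1):   # bottom-up over internal nodes
        reachable[node] = reachable[2 * node + 1] or reachable[2 * node + 2]
    return n_internal, reachable

def is_effective(reachable, state, action):
    """(state, action) is effective if success is still reachable afterwards."""
    return bool(reachable[2 * state + 1 + action])

def rollout_return(q_table, reachable, n_internal, state=0):
    """Follow the argmax policy of q_table to a leaf and return its reward."""
    while state < n_internal:
        action = int(np.argmax(q_table[state]))
        state = 2 * state + 1 + action
    return int(reachable[state])

depth = 3
first_leaf, last_leaf = 2 ** depth - 1, 2 ** (depth + 1) - 2

# 1-Success variant: only the right-most leaf gives reward 1.
n_internal, reachable = build_tree(depth, [last_leaf])
always_right = np.tile([0.0, 1.0], (n_internal, 1))
always_left = np.tile([1.0, 0.0], (n_internal, 1))
print(rollout_return(always_right, reachable, n_internal))  # reaches the success leaf
print(rollout_return(always_left, reachable, n_internal))   # reaches a failure leaf

# 1-Failure variant: every leaf succeeds except the left-most one.
_, reachable_fail = build_tree(depth, range(first_leaf + 1, last_leaf + 1))
```

In the 1-Success tree exactly one action per state on the success path is effective, while in the 1-Failure tree almost every action is effective, which is what makes the two variants useful extremes.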


Pong.

As we are specifically motivated by image-based tasks with binary rewards, the Atari (Bellemare et al., 2013) Pong game was a good choice for a simple environment that can have these characteristics. The visual input is of low complexity, and the game can be easily converted into a binary reward task by truncating the episode after the first point is scored. We learned Q-functions using DQN (Mnih et al., 2015) and DDQN (van Hasselt et al., 2016), varying hyperparameters such as the learning rate, the discount factor $\gamma$, and the batch size, as discussed in detail in Appendix E.2. A total of 175 model checkpoints were chosen from the various models for evaluation, and true average performance was evaluated over 3,000 episodes for each model checkpoint. For the validation dataset we used 38 Q-functions that were partially trained with DDQN and generated 30 episodes from each, for a total of 1,140 episodes. As with the Binary Tree environments, we compare the correlations of our metrics and the baselines to the true average performance over a number of on-policy episodes. As we show in Table 1, both our metrics outperform the baselines; OPC performs better than SoftOPC in terms of $R^2$ but is similar in terms of Spearman correlation.

| Metric | Tree (1 Fail) $R^2$ | Tree (1 Fail) Spearman | Tree (1 Succ) $R^2$ | Tree (1 Succ) Spearman | Pong $R^2$ | Pong Spearman | Sim Train $R^2$ | Sim Train Spearman | Sim Test $R^2$ | Sim Test Spearman | Real-World $R^2$ | Real-World Spearman |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TD Err | 0.01 | -0.13 | 0.02 | -0.15 | 0.05 | -0.18 | 0.02 | -0.37 | 0.10 | -0.51 | 0.17 | 0.48 |
| Sum of Advantages | 0.00 | 0.07 | 0.00 | 0.00 | 0.09 | -0.32 | 0.74 | 0.81 | 0.74 | 0.78 | 0.12 | 0.50 |
| MCC Err | 0.02 | -0.17 | 0.06 | -0.26 | 0.04 | -0.36 | 0.00 | 0.33 | 0.06 | -0.44 | 0.01 | -0.15 |
| OPC | 0.21 | 0.48 | 0.21 | 0.50 | 0.50 | 0.72 | 0.49 | 0.86 | 0.35 | 0.66 | 0.81 | 0.87 |
| SoftOPC | 0.23 | 0.53 | 0.19 | 0.51 | 0.36 | 0.75 | 0.55 | 0.76 | 0.48 | 0.77 | 0.91 | 0.94 |

Table 1: Summarized results of the Experiments section. For each metric (leftmost column), we report $R^2$ of the line of best fit and the Spearman rank correlation coefficient for each environment (top row). These are: the binary tree and Pong tasks from Sect. 4.1, simulated grasping with train or test objects, and real-world grasping from Sect. 4.2. Baseline metrics are discussed in the related work section, and our metrics (OPC, SoftOPC) are discussed in Sect. 3. Occasionally, some baselines correlate well, but our proposed metrics (last two rows) are consistently among the best metrics for each environment.

4.2 Vision-based Robotic Grasping

Our main experimental results were on simulated and real versions of a robotic environment and a vision-based grasping task, following the setup from Kalashnikov et al. (2018), the details of which we briefly summarize. The observation at each time-step is an RGB image of the robot and a bin of objects, taken from a camera placed over the shoulder of a robotic arm, as shown in Fig. 1(b). At the start of an episode, objects are randomly dropped in a bin in front of the robot. The goal is to grasp any of the objects in that bin. Actions include continuous Cartesian displacements of the gripper, and the rotation of the gripper around the z-axis. The action space also includes three discrete commands: “open gripper”, “close gripper”, and “terminate episode”. Rewards are sparse, with a reward of 1 if any object is grasped and 0 otherwise. All models are trained with the fully off-policy QT-Opt algorithm as described in Kalashnikov et al. (2018).

In simulation we define a training and a test environment by generating two distinct sets of 5 objects, one for each environment (shown in the figure of train and test objects). To capture the different possible generalization failure scenarios (discussed in the section on generalization scenarios), we trained Q-functions in a fully off-policy fashion on data collected by a hand-crafted policy with a 60% grasp success rate and $\epsilon$-greedy exploration (with $\epsilon = 0.1$), using two datasets of different sizes, both from the training environment. The first dataset is small, so that there is insufficient off-policy training data to perform well even in the training environment. The second is large enough to perform well in the training environment, but because of mismatched off-policy training data the policies do not generalize to the test environment (see Appendix E.3 for the analysis). We saved policies at different stages of training, which resulted in 452 policies for the former case and 391 for the latter. We evaluated the true return of these policies on 700 episodes in each environment and calculated the correlation with the scores assigned by the OPE metrics, based on held-out validation sets of episodes from the training environment and from the test environment; the results are shown in Table 1.

The real-world version of the environment has objects that were never seen during training (see Fig. 1(b) and 7). We evaluated 15 different models, trained to have varying degrees of robustness to the training and test domain gap, based on domain randomization and randomized-to-canonical adaptation networks (James et al., 2019); full details of each model are in Appendix E.4. Out of these, 7 were trained on-policy purely in simulation. True average return was evaluated over 714 episodes with 7 different sets of objects, and true policy real-world performance ranged from 17% to 91%. The validation dataset consisted of real-world episodes, 40% of which were successful grasps. The objects used for it were separate from the ones used in the final evaluation that produced the results in Table 1.

(a) SoftOPC and return in sim
(b) Scatterplot for real-world grasping
Figure 2: (a): SoftOPC in simulated grasping. Overlay of SoftOPC (red) and return (blue) in simulation for a model trained with 100k grasps. SoftOPC tracks episode return. (b): Scatterplots for OPE metrics and real-world grasp success. Scatterplots for the sum of advantages (left) and SoftOPC (right) for the Real-World grasping task. Each point is a different grasping model. Shaded regions are a 95% confidence interval. The sum of advantages works in simulation but fails on real data, whereas SoftOPC works well in both.

4.3 Discussion

Table 1 shows $R^2$ and the Spearman correlation for each metric for the different environments we considered. Our proposed SoftOPC and OPC consistently outperformed the baselines, with the exception of the simulated robotic test environment, on which the SoftOPC performed almost as well as the discounted sum of advantages on the Spearman correlation (but worse on $R^2$). However, we show that SoftOPC more reliably ranks policies than the baselines for real-world performance without any real-world interaction, as one can also see in Fig. 2(b). The same figure shows that the sum of advantages metric, which works well in simulation, performs poorly in the real-world setting we care about. Appendix F includes additional experiments showing that the correlation is mostly unchanged on different validation datasets.

Furthermore, we demonstrate in Fig. 2(a) that SoftOPC can track the performance of a policy acting in the simulated grasping environment as it is training, which could potentially be useful for early stopping. Finally, SoftOPC seems to perform slightly better than OPC in most of the experiments. We believe this occurs because the Q-functions compared in each experiment tend to have similar magnitudes. Preliminary results in Appendix H suggest that when Q-functions have different magnitudes, OPC might outperform SoftOPC.

5 Conclusion and future work

We proposed OPC and SoftOPC, classification-based off-policy evaluation metrics that can be used together with Q-learning algorithms. Our metrics can be used with binary reward tasks: tasks where each episode results in either a failure (zero return) or success (a return of one). While this class of tasks is a substantial restriction, many practical tasks fall into this category, including the real-world robotics tasks in our experiments. The analysis of these metrics shows that they can approximate the expected return in deterministic binary reward MDPs. Empirically, we find that OPC and the SoftOPC variant correlate well with performance across several environments, and predict generalization performance across several scenarios, including the simulation-to-reality scenario, a critical setting for robotics. Effective off-policy evaluation is critical for real-world reinforcement learning, where it provides an alternative to expensive real-world evaluations during algorithm development. Promising directions for future work include developing a variant of our method that is not restricted to binary reward tasks, and extending the analysis to stochastic tasks. However, even in the binary setting, we believe that methods such as ours can provide a substantially more practical pipeline for evaluating transfer learning and off-policy reinforcement learning algorithms.


We would like to thank Razvan Pascanu, Dale Schuurmans, George Tucker, and Paul Wohlhart for valuable discussions.


  • Babaeizadeh et al. (2018) Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • Dudik et al. (2011) Dudik, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In ICML, March 2011.
  • Dudík et al. (2014) Dudík, M., Erhan, D., Langford, J., Li, L., et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Farahmand & Szepesvári (2011) Farahmand, A.-M. and Szepesvári, C. Model selection in reinforcement learning. Mach. Learn., 85(3):299–332, December 2011.
  • Hanna et al. (2017) Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for Off-Policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’17, pp. 538–546, Richland, SC, 2017. International Foundation for Autonomous Agents and Multiagent Systems.
  • Horvitz & Thompson (1952) Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
  • James et al. (2019) James, S., Wohlhart, P., Kalakrishnan, M., Kalashnikov, D., Irpan, A., Ibarz, J., Levine, S., Hadsell, R., and Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In IEEE Conference on Computer Vision and Pattern Recognition, March 2019.
  • Jiang & Li (2015) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. November 2015.
  • Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.
  • Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Kiryo et al. (2017) Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675–1685, 2017.
  • Koos et al. (2010) Koos, S., Mouret, J.-B., and Doncieux, S. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 119–126. ACM, 2010.
  • Koos et al. (2012) Koos, S., Mouret, J.-B., and Doncieux, S. The transferability approach: Crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation, 17(1):122–145, 2012.
  • Lee et al. (2018) Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Liu et al. (2018) Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A., Doshi-Velez, F., and Brunskill, E. Representation balancing MDPs for off-policy policy evaluation. In NeurIPS, 2018.
  • Mahmood et al. (2014) Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022, 2014.
  • Mannor et al. (2007) Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Murphy (2005) Murphy, S. A. A generalization error for Q-Learning. J. Mach. Learn. Res., 6:1073–1097, July 2005.
  • Nichol et al. (2018) Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.
  • Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann, 2000.
  • Quillen et al. (2018) Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for Vision-Based robotic grasping: A simulated comparative evaluation of Off-Policy methods. February 2018.
  • Raghu et al. (2018) Raghu, M., Irpan, A., Andreas, J., Kleinberg, R., Le, Q., and Kleinberg, J. Can deep reinforcement learning solve erdos-selfridge-spencer games? In International Conference on Machine Learning, pp. 4235–4243, 2018.
  • Riedmiller et al. (2018) Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing-solving sparse reward tasks from scratch. In International Conference on Machine Learning, 2018.
  • Ross & Bagnell (2010) Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In AISTATS, pp. 661–668, 2010.
  • Spearman (1904) Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology, 15:72–101, 1904. doi: 10.2307/1412159.
  • Theocharous et al. (2015) Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pp. 1806–1812, 2015.
  • Thomas & Brunskill (2016) Thomas, P. and Brunskill, E. Data-Efficient Off-Policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, June 2016.
  • Thomas et al. (2015) Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-Confidence Off-Policy evaluation. AAAI, 2015.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
  • van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Zhang et al. (2018a) Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. June 2018a.
  • Zhang et al. (2018b) Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. April 2018b.

Appendix for Off-Policy Evaluation of Generalization for Deep Q-Learning in Binary Reward Tasks

Appendix A Classification error bound

A.1 Trajectory $\tau$ return 1 $\Leftrightarrow$ all $a_t$ effective

For the forward direction, because $\tau$ ends in a success state, from any $s_t$ the optimal policy could reach a success state, so all $a_t$ must be effective.

For the reverse direction, if all $a_t$ are effective, then $a_T$ must be effective. Since $s_{T+1}$ is a terminal state with no further actions, for $a_T$ to be effective, we must have return 1.

A.2 Proof of Theorem 1

By definition, succeeds if and only if at every , it selects an effective . Since is defined as the state distribution conditioned on being effective, the failure rate can be written as


This gives as desired. The bound is tight when .

A.3 Alternate proof connecting to behavioral cloning

Since every policy that picks only effective actions achieves the optimal reward of 1, and $\epsilon$ is defined as the 0-1 loss over states conditioned on not selecting a catastrophic action, we can view $\epsilon$ as the 0-1 behavior cloning loss to an expert policy $\pi^*$. In this section, we present an alternate proof based on behavioral cloning bounds from Ross & Bagnell (2010).

Theorem 2.1 of Ross & Bagnell (2010) proves an $O(T^2\epsilon)$ cost bound for general MDPs. This differs from the $O(T\epsilon)$ cost derived above. The difference arises because Ross & Bagnell (2010) derive their proof in a general MDP, whose cost is upper bounded by 1 at every timestep. If $\pi$ deviates from the expert, it receives cost 1 several times, once for every future timestep. In binary reward MDPs, we only receive this cost once, at the final timestep. Transforming the proof to incorporate our binary reward MDP assumptions lets us recover the upper bound from Appendix A.2. We briefly explain the updated proof, using notation from (Ross & Bagnell, 2010) to make the connection more explicit.

Define as the expected 0-1 loss at time for under the state distribution of . Since corresponds to states visits conditioned on never picking a catastrophic action, this is the same as our definition of . The MDP is defined by cost instead of reward: cost of state is for all timesteps except the final one, where . Let be the probability hasn’t made a mistake (w.r.t ) in the first steps, be the state distribution conditioned on no mistakes in the first steps, and be the state distribution conditioned on making at least 1 mistake. In a general MDP with , total cost is bounded by , where the 1st term is cost while following the expert and the 2nd term is an upper bound of cost if outside of the expert distribution. In a binary reward MDP, since for all except , we can ignore every term in the summation except the final one, giving


Note , and as shown in the original proof, . Since is a probability, we have , which recovers the bound, and again this is tight when .


Appendix B Efficiently computing the OPC


Suppose we have $N$ transitions, of which $N^+$ have positive labels. Imagine placing each $Q(s,a)$ on the number line. Each $Q(s,a)$ is annotated with a score: $-\frac{1}{N}$ for unlabeled transitions and $\frac{p(y=1)}{N^+} - \frac{1}{N}$ for positive labeled transitions. Imagine sliding a line from $b = -\infty$ to $b = \infty$. At $b = -\infty$, the OPC score is $p(y=1) - 1$. The OPC score only updates when this line passes some $Q(s,a)$ in our dataset, and is updated based on the score annotated at $Q(s,a)$. After sorting by Q-value, we can find all possible OPC scores in $O(N)$ time by moving $b$ from $-\infty$ to $\infty$, noting the updated score after each $Q(s,a)$ we pass. Given $\mathcal{D}$, we sort the transitions in $O(N \log N)$, annotate them appropriately, then compute the maximum over all OPC scores.
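The sort-and-sweep procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the OPC score at threshold $b$ is the prior times the fraction of positive transitions with Q-value above $b$, minus the fraction of all transitions with Q-value above $b$, so that each transition contributes a fixed annotation. The function name `opc_score` and its arguments are hypothetical.

```python
import numpy as np

def opc_score(q_values, is_positive, prior=1.0):
    """Maximum over thresholds b of the (assumed) OPC objective.

    Assumed objective: prior * mean_{positives}[Q > b] - mean_{all}[Q > b].
    """
    q = np.asarray(q_values, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    n, n_pos = q.size, int(pos.sum())
    if n == 0 or n_pos == 0:
        raise ValueError("need at least one transition and one positive label")

    # Per-transition annotation: -1/N for every transition,
    # plus prior/N_pos for the positively labeled ones.
    annotation = np.full(n, -1.0 / n)
    annotation[pos] += prior / n_pos

    # Sort by Q-value, highest first. A prefix of this order is exactly
    # the set of transitions with Q > b for some threshold b.
    order = np.argsort(-q, kind="stable")
    q_sorted = q[order]
    running = np.cumsum(annotation[order])

    # A threshold cannot separate equal Q-values, so only prefixes that
    # end at the last element of a run of ties are valid.
    valid = np.append(q_sorted[1:] != q_sorted[:-1], True)

    # The empty set (b = +infinity) is always available and scores 0.
    return max(0.0, float(running[valid].max()))
```

The sort dominates the cost; the sweep itself is a single cumulative sum over the sorted annotations.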

Appendix C Baseline metrics

We elaborate on the exact expressions used for the baseline metrics. In all baselines, $a'$ is the on-policy action $\arg\max_{a} Q(s', a)$.

Temporal-difference error

The TD error is the squared error between $Q(s, a)$ and the 1-step return estimate of the action’s value.
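As a rough sketch of this baseline, the snippet below averages the squared 1-step TD error over a batch of transitions. The function name, the array-based interface, the default discount, and the masking of the bootstrap term at terminal transitions are all assumptions made for illustration; the source only states that the error is squared and uses the on-policy next action.

```python
import numpy as np

def td_error(q_sa, reward, q_next_greedy, done, gamma=0.9):
    """Mean squared 1-step TD error over a batch of transitions.

    q_sa:          Q(s, a) for the logged action.
    reward:        observed reward r.
    q_next_greedy: Q(s', a') with a' = argmax_a Q(s', a).
    done:          1 where s' is terminal (assumed: no bootstrapping there).
    """
    q_sa = np.asarray(q_sa, dtype=float)
    reward = np.asarray(reward, dtype=float)
    q_next_greedy = np.asarray(q_next_greedy, dtype=float)
    done = np.asarray(done, dtype=float)

    # 1-step return estimate of the action's value.
    target = reward + gamma * (1.0 - done) * q_next_greedy
    return float(np.mean((q_sa - target) ** 2))
```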

Discounted sum of advantages

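The difference of the value functions of two policies $\pi$ and $\pi_b$ at state $s_t$ is given by the discounted sum of advantages (Kakade & Langford, 2002; Murphy, 2005) of $\pi$ on episodes induced by $\pi_b$:

$$V^{\pi_b}(s_t) - V^{\pi}(s_t) = \mathbb{E}_{\pi_b}\left[\sum_{t'=t}^{T} \gamma^{t'-t} A^{\pi}(s_{t'}, a_{t'})\right] \tag{13}$$

where $\gamma$ is the discount factor and $A^{\pi}$ the advantage function for policy $\pi$, defined as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$. Since $V^{\pi_b}$ is fixed, estimating (13) is sufficient to compare $\pi_1$ and $\pi_2$. The $\pi$ with smaller score is better.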

Monte-Carlo estimate corrected with the discounted sum of advantages

Estimating with the Monte-Carlo return, substituting into Eqn. (13), and rearranging gives


With , we can obtain an approximate estimate depending on the whole episode:


The MCC Error is the squared error to this estimate.


Note that (17) was proposed before by Quillen et al. (2018) as a training loss for a Q-learning variant, but not for the purpose of off-policy evaluation.

Eqn. (12) and Eqn. (16) share the same optimal Q-function, so assuming a perfect learning algorithm, there is no difference in information between these metrics. In practice, the Q-function will not be perfect due to errors in function approximation and optimization. Eqn. (16) is designed to rely on all future rewards from time , rather than just . We theorized that using more of the “ground truth” from could improve the metric’s performance in imperfect learning scenarios.
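The two advantage-based baselines can be sketched for a single logged episode as below. This is one reading of the description above, not the authors' code: the advantage is taken as $Q(s_t, a_t) - \max_a Q(s_t, a)$, and the Monte-Carlo corrected target for step $t$ is taken as the discounted return from $t$ minus the discounted advantages of all later steps. Function and argument names are hypothetical.

```python
import numpy as np

def discounted_sum_of_advantages(q_taken, v_greedy, gamma):
    """Sum_t gamma^t * A(s_t, a_t) over one episode, starting at t = 0."""
    adv = np.asarray(q_taken, dtype=float) - np.asarray(v_greedy, dtype=float)
    discounts = gamma ** np.arange(adv.size)
    return float(np.sum(discounts * adv))

def mcc_error(q_taken, v_greedy, rewards, gamma):
    """Mean squared error between Q(s_t, a_t) and the assumed MCC target."""
    q_taken = np.asarray(q_taken, dtype=float)
    adv = q_taken - np.asarray(v_greedy, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    ret = 0.0   # discounted return from step t onward
    corr = 0.0  # discounted advantages of steps strictly after t
    sq_errors = []
    for t in reversed(range(rewards.size)):
        ret = rewards[t] + gamma * ret
        target = ret - corr
        sq_errors.append((q_taken[t] - target) ** 2)
        # Fold step t's advantage into the correction used at step t - 1.
        corr = gamma * (adv[t] + corr)
    return float(np.mean(sq_errors))
```

Here `q_taken[t]` stands for $Q(s_t, a_t)$ on the logged action and `v_greedy[t]` for $\max_a Q(s_t, a)$; both are assumed to be precomputed from the Q-function being evaluated.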

Appendix D Argument for choosing $p(y=1) = 1$

The positive class prior $p(y=1)$ should intuitively depend on the environment, since some environments will have many more effective $(s, a)$ than others. However, recall how the error $\epsilon$ is defined. Each $\epsilon_t$ is defined as:


where the state distribution is defined such that all previous actions were effective. This is equivalent to following an optimal “expert” policy $\pi^*$, and although we are estimating $\epsilon_t$ from data generated by the behavior policy $\pi_b$, we should match the positive class prior we would observe from the expert $\pi^*$. Assuming the task is feasible, meaning the policy has effective actions available from the start, the expert always succeeds, $R(\pi^*) = 1$. Therefore, although the validation dataset will likely have both successes and failures, a prior of $p(y=1) = 1$ is the ideal prior, and this holds independently of the environment.

As a didactic toy example, we show this holds for a binary tree domain. In this domain, each node is a state, each action selects the left or right child, and leaf nodes are terminal with reward 0 or 1. We try a range of priors $p(y=1)$ in two extremes: only 1 leaf fails, or only 1 leaf succeeds. Validation data is from the uniformly random policy. The frequency of effective $(s, a)$ varies a lot between the two extremes, but in both, Spearman correlation monotonically increases with $p(y=1)$ and was best with $p(y=1) = 1$. Fig. 3 shows Spearman correlation of OPC and SoftOPC with respect to $p(y=1)$, when the tree is mostly successes or mostly failures. In both settings, $p(y=1) = 1$ has the best correlation.

Figure 3: Spearman correlation of SoftOPC, OPC, and baselines with varying $p(y=1)$. Baselines do not depend on $p(y=1)$. Correlations further from 0 are better.
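For reference, the Spearman correlation used to compare a metric against episode return is the Pearson correlation of the two rankings. The sketch below is a plain NumPy illustration with average ranks for ties; the function names are hypothetical, and in practice a library routine such as SciPy's `spearmanr` would serve the same purpose.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    for v in np.unique(x):
        tie = x == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(metric_scores, episode_returns):
    """Spearman rank correlation between metric scores and true returns."""
    rm = average_ranks(metric_scores)
    rr = average_ranks(episode_returns)
    return float(np.corrcoef(rm, rr)[0, 1])
```

A metric that orders the candidate Q-functions exactly as their returns do scores 1, and one that orders them in reverse scores -1, which is why correlations further from 0 are more useful for model selection.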

From an implementation perspective, $p(y=1) = 1$ is also the only choice that can be applied across arbitrary validation datasets. Suppose $\pi_b$, the policy collecting our validation set, succeeds with probability $p$. In the practical computation of OPC presented in Appendix B, we have $N$ transitions, and $pN$ of them have positive labels. Each is annotated with a score: $-\frac{1}{N}$ for unlabeled transitions and $\frac{p(y=1)}{pN} - \frac{1}{N}$ for positive labeled transitions. The maximal OPC score will be the sum of all annotations within the interval $[b, \infty)$, and we maximize over $b$.

For unlabeled transitions, the annotation is $-\frac{1}{N}$, which is negative. Suppose the annotation for positive transitions was negative as well. This occurs when $\frac{p(y=1)}{pN} - \frac{1}{N} < 0$. If every annotation is negative, then the optimal choice for $b$ is $b = \infty$, since the empty set has total 0 and every non-empty subset has a negative total. This gives $\text{OPC}(Q) = 0$, no matter what $Q(s,a)$ we are evaluating, which makes the OPC score entirely independent of episode return.

This degenerate case is undesirable, and happens when $\frac{p(y=1)}{pN} < \frac{1}{N}$, or equivalently $p(y=1) < p$. To avoid this, we should have $p(y=1) \ge p$. If we wish to pick a single $p(y=1)$ that can be applied to data from arbitrary behavior policies $\pi_b$, then we should pick $p(y=1) \ge R(\pi^*)$. In binary reward MDPs where $\pi^*$ can always succeed, this gives $p(y=1) \ge 1$, and since the prior is a probability, it should satisfy $p(y=1) \le 1$, leaving $p(y=1) = 1$ as the only option.

To complete the argument, we must handle the case where we have a binary reward MDP where $R(\pi^*) < 1$. In a binary reward MDP, the only way to have $R(\pi^*) < 1$ is if the sampled initial state $s_1$ is one where $(s_1, a)$ is catastrophic for all $a$. From these $s_1$, and all future $s_t$ reachable from $s_1$, the actions $\pi$ chooses do not matter: the final return will always be 0. Since $\epsilon_t$ is defined conditioned on only executing effective actions so far, it is reasonable to assume we only wish to compute the expectation over states where our actions can impact reward. If we computed optimal policy return over just the initial states where effective actions exist, we have $R(\pi^*) = 1$, giving $p(y=1) = 1$ once again.

Appendix E Experiment details

E.1 Binary tree environment details

The binary tree is a full binary tree with levels. The initial state distribution is uniform over all non-leaf nodes. Initial state may sometimes be initialized to one where failure is inevitable. The validation dataset is collected by generating 1,000 episodes from the uniformly random policy. For Q-functions, we generate 1,000 random Q-functions by sampling for every , defining the policy as . We try priors . Code for this environment is available at

E.2 Pong details

Fig. 4 is a scatterplot of our Pong results. Each color represents a different hyperparameter setting, as explained in the legend. From top to bottom, the abbreviations in the legend mean:

  • DQN: trained with DQN

  • DDQN: trained with Double DQN

  • DQN_gamma9: trained with DQN, (default is ).

  • DQN2: trained with DQN, using a different random seed

  • DDQN2: trained with Double DQN, using a different random seed

  • DQN_lr1e4: trained with DQN, learning rate (default is ).

  • DQN_b64: trained with DQN, batch size 64 (default is 32).

  • DQN_fixranddata: The replay buffer is filled entirely by a random policy, then a DQN model is trained against that buffer, without pushing any new experience into the buffer.

  • DDQN_fixranddata: The replay buffer is filled entirely by a random policy, then a Double DQN model is trained against that buffer, without pushing new experience into the buffer.

In Fig. 4, models trained with are highlighted. We noticed that SoftOPC was worse at separating these models than OPC, suggesting the 0-1 loss is preferable in some cases. This is discussed further in Appendix H.

In our implementation, all models were trained in the full version of the Pong game, where the maximum return possible is 21 points. However, to test our approach, we created a binary version for evaluation. Episodes in the validation set were truncated after the first point was scored. Return of the policy was estimated similarly: the policy was executed until the first point was scored, and the average return was computed over these episodes. Although the train-time environment is slightly different from the evaluation environment, this procedure is fine for our method, since our method can handle environment shift and we treat $Q(s,a)$ as a black-box scoring function. OPC can be applied as long as the validation dataset matches the semantics of the test environment where we estimate the final return.

Figure 4: Scatterplot of episode return (x-axis) of Pong models against metric (y-axis), for SoftOPC (left) and OPC (right). Each color is a set of model checkpoints from a different hyperparameter setting, with the legend explaining the mapping from color to hyperparameters. In each plot, points trained with DQN, are boxed with a red rectangle. We observed that the hard 0-1 loss in OPC does a better job separating these models than the soft loss in SoftOPC.

E.3 Simulated grasping details

The objects we use were generated randomly through procedural generation. The resulting objects are highly irregular and are often non-convex. Some example objects are shown in Fig. 5(a).

(a) Procedurally generated objects for simulation
(b) Test objects for real-world evaluation
Figure 5: (a): Example objects from the procedural generation process used during training in simulation. (b): Real test objects used during evaluation on real robots.

Fig. 6 demonstrates two generalization problems discussed in the main text: insufficient off-policy training data and mismatched off-policy training data. We trained two models, one with a limited dataset of 100k grasps and one with a large dataset of 900k grasps, then evaluated grasp success. The model with limited data fails to achieve stable grasp success due to overfitting to its limited dataset. Meanwhile, the model with abundant data learns to model the train objects, but fails to model the test objects, since they are unobserved at training time.

Figure 6: Left: Grasp success curve of model trained with 900k or 100k grasps. The 100k grasps model oscillates in performance. Middle: We see why: holdout TD error (blue) of the 100k grasps model is increasing. Right: The TD Error for the 900k grasp model is the same for train and holdout, but is still larger for test data on unseen test objects.

E.4 Real-world grasping

Figure 7: Visual differences for the robot grasping task between simulation (left) and reality (right). In simulation, models are trained to grasp procedurally generated shapes while in reality objects have more complex shapes. The simulated robot arm does not have the same colors and textures as the real robot and lacks the visible real cables. The tray in reality has a greater variation in appearance than in the simulation.

Several visual differences between simulation and reality limit the real performance of models trained in simulation (see Fig. 7) and motivate simulation-to-reality methods such as Randomized-to-Canonical Adaptation Networks (RCANs), proposed by James et al. (2019). The 15 real-world grasping models evaluated were trained using variants of RCAN. These networks train a generator to transform randomized simulation images to a canonical simulated image. A policy is learned over this canonical simulated image. At test time, the generator transforms real images to the same canonical simulated image, facilitating zero-shot transfer. Optionally, the policy can be fine-tuned with real-world data; in this case, the real-world training objects are distinct from the evaluation objects. The SoftOPC and real-world grasp success of each model are listed in Table 2. From top to bottom, the abbreviations mean:

  • Sim: A model trained only in simulation.

  • Randomized Sim: A model trained only in simulation with the mild randomization scheme from James et al. (2019): random tray texture, object texture and color, robot arm color, lighting direction and brightness, and one of 6 background images taken from the view of the real-world camera.

  • Heavy Randomized Sim: A model trained only in simulation with the heavy randomization scheme from James et al. (2019): every randomization from Randomized Sim, as well as slight randomization of the position of the robot arm and tray, randomized position of the divider within the tray (see Figure 1b in main text for a visual of the divider), and a more diverse set of background images.

  • Randomized Sim + Real (2k): The Randomized Sim Model, after training on an additional 2k grasps collected on-policy on the real robot.

  • Randomized Sim + Real (3k): The Randomized Sim Model, after training on an additional 3k grasps collected on-policy on the real robot.

  • Randomized Sim + Real (4k): The Randomized Sim Model, after training on an additional 4k grasps collected on-policy on the real robot.

  • Randomized Sim + Real (5k): The Randomized Sim Model, after training on an additional 5k grasps collected on-policy on the real robot.

  • RCAN: The RCAN model, as described in James et al. (2019), trained in simulation with a pixel level adaptation model.

  • RCAN + Real (2k): The RCAN model, after training on an additional 2k grasps collected on-policy on the real robot.

  • RCAN + Real (3k): The RCAN model, after training on an additional 3k grasps collected on-policy on the real robot.

  • RCAN + Real (4k): The RCAN model, after training on an additional 4k grasps collected on-policy on the real robot.

  • RCAN + Real (5k): The RCAN model, after training on an additional 5k grasps collected on-policy on the real robot.

  • RCAN + Dropout: The RCAN model with dropout applied in the policy.

  • RCAN + InputDropout: The RCAN model with dropout applied in the policy and RCAN generator.

  • RCAN + GradsToGenerator: The RCAN model where the policy and RCAN generator are trained simultaneously, rather than training RCAN first and the policy second.

Model                        SoftOPC   Real Grasp Success (%)
Sim                          0.056     16.67
Randomized Sim               0.072     36.92
Heavy Randomized Sim         0.040     34.90
Randomized Sim + Real (2k)   0.129     72.14
Randomized Sim + Real (3k)   0.141     73.65
Randomized Sim + Real (4k)   0.149     82.92
Randomized Sim + Real (5k)   0.152     84.38
RCAN                         0.113     65.69
RCAN + Real (2k)             0.156     86.46
RCAN + Real (3k)             0.166     88.34
RCAN + Real (4k)             0.152     87.08
RCAN + Real (5k)             0.159     90.71
RCAN + Dropout               0.112     51.04
RCAN + InputDropout          0.089     57.71
RCAN + GradsToGenerator      0.094     58.75
Table 2: Real-world grasping models used for Sect. 4.2 simulation-to-reality experiments.

Appendix F SoftOPC performance on different validation datasets

For real grasp success, we use 7 KUKA LBR IIWA robot arms to each make 102 grasp attempts from 7 different bins with test objects (see Fig. 5(b)). Each grasp attempt is allowed up to 20 steps, and any grasped object is dropped back in the bin. A grasp is successful if any of the test objects is held in the gripper at the end of the episode.

For estimating SoftOPC, we use a validation dataset collected from two policies: a poor policy with a success rate of 28%, and a better policy with a success rate of 51%. We divided the validation dataset based on the policy used, then evaluated SoftOPC on data from only the poor or only the better policy. Fig. 8 shows the correlation on these subsets of the validation dataset. The correlation is slightly worse on the poor dataset, but the relationship between SoftOPC and episode reward is still clear.

Figure 8: SoftOPC versus the real grasp success over different validation datasets for real-world grasping. Left: SoftOPC over the entire validation dataset. Middle: SoftOPC over validation data from only the poor policy (28% success rate). Right: SoftOPC over validation data from only the better policy (51% success rate). In each, a fitted regression line with its $R^2$ and 95% confidence interval is also shown.

As an extreme test of robustness, we go back to the simulated grasping environment. We collect a new validation dataset, using the same human-designed policy with $\epsilon$-greedy exploration instead. The resulting dataset is almost all failures, with only 1% of grasps succeeding. However, this dataset also covers a broad range of states, because the collection policy is very random. Fig. 9 shows that OPC and SoftOPC still perform reasonably well, despite having very few positive labels. From a practical standpoint, this suggests that OPC and SoftOPC have some robustness to the choice of generation process for the validation dataset.

Figure 9: SoftOPC and OPC over almost random validation data on test objects in simulated grasping. We generate a validation dataset from an -greedy policy where , leading to a validation dataset where only 1% of episodes succeed. Left: SoftOPC over the poor validation dataset. . Right: OPC over the poor validation dataset. .

Appendix G Plots of Q-value distributions

In Fig. 10, we plot the Q-values of two real-world grasping models. The first is trained only in simulation and has poor real-world grasp success. The second is trained with a mix of simulated and real-world data. We plot a histogram of the average Q-value over each episode of validation set . The better model has a wider separation between successful Q-values and failed Q-values.

Figure 10: Q-value distributions for successful and failed episodes. Left: Q-value distributions over successful and failed episodes in an off-policy dataset according to a learned policy with a poor grasp success rate of 36%. Right: The same distributions after the learned policy is improved by 5,000 grasps of real robot data, achieving an 84% grasp success rate.
Figure 11: Spearman correlation in binary tree with one success state for different Q-function generation methods. Varying magnitudes between Q-functions causes the SoftOPC to perform worse.

Appendix H Comparison of OPC and SoftOPC

We elaborate on the argument presented in the main paper, that OPC performs better when have different magnitudes, and otherwise SoftOPC does better. To do so, it is important to consider how the Q-functions were trained. In the tree environments, was sampled uniformly from , so by construction. In the grasping environments, the network architecture ends in , so . In these experiments, SoftOPC did better. In Pong, was not constrained in any way, and these were the only experiments where discount factor was varied between models. Here, OPC did better.

The hypothesis that Q-functions of varying magnitudes favor OPC can be validated in the tree environment. Again, we evaluate 1,000 Q-functions, but instead of sampling , the th Q-function is sampled from . This produces 1,000 different magnitudes between the compared Q-functions. Fig. 10(a) demonstrates that when magnitudes are deliberately changed for each Q-function, the SoftOPC performs worse, whereas the non-parametric OPC is unchanged. To demonstrate this is caused by a difference in magnitude, rather than large absolute magnitude, OPC and SoftOPC are also evaluated over . Every Q-function has high magnitude, but their magnitudes are consistently high. As seen in Fig. 10(b), in this setting the SoftOPC goes back to outperforming OPC.
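The sensitivity to magnitude can be illustrated with a small sketch. It assumes the SoftOPC form from the main text, the prior times the mean Q-value over positively labeled transitions minus the mean Q-value over all transitions; the function name and the toy data are hypothetical.

```python
import numpy as np

def soft_opc(q_values, is_positive, prior=1.0):
    """Assumed SoftOPC: prior * mean Q over positives - mean Q over all."""
    q = np.asarray(q_values, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    return float(prior * q[pos].mean() - q.mean())

# Toy validation set: two positive transitions, two unlabeled ones.
q = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([True, True, False, False])

base = soft_opc(q, labels)
scaled = soft_opc(10.0 * q, labels)  # same ranking of transitions, 10x magnitude

# Scaling Q by a positive constant leaves the ordering of transitions
# unchanged, so any threshold-based (rank-based) score is unaffected...
same_order = bool(np.array_equal(np.argsort(q), np.argsort(10.0 * q)))
# ...but the soft score grows by the same factor.
ratio = scaled / base
```

Under this form, two Q-functions that rank transitions identically can receive very different SoftOPC values purely because of their output scale, which is consistent with SoftOPC degrading when magnitudes vary across the compared models.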

Appendix I Scatterplots of Each Metric

We present scatterplots of each of the metrics in the simulated grasping environment from Sect. 4.2. We trained two Q-functions in a fully off-policy fashion, one with a dataset of 100k episodes, and the other with a dataset of 900k episodes. For every metric, we generate a scatterplot of all the model checkpoints. Each model checkpoint is color coded by whether it was trained with 100k episodes or 900k episodes.

Figure 12: Scatterplots of each metric in simulated grasping over train objects. From left-to-right, top-to-bottom, we present scatterplots for: the TD error, discounted sum of advantages, MCC error, OPC, and SoftOPC.