1 Introduction
Off-policy evaluation (OPE), which aims to estimate the online/true performance (we use the two terms interchangeably in this paper) of a policy using only pre-collected historical data generated by other policies, is critical to many real-world applications, in which evaluating a poorly performing policy online might be prohibitively expensive (e.g., in trading, advertising, and traffic control) or even dangerous (e.g., in robotics, autonomous vehicles, and drug trials).
Existing OPE methods can be roughly categorized into three classes: distribution correction based Thomas et al. (2015); Liu et al. (2018); Hanna et al. (2019); Xie et al. (2019); Nachum et al. (2019); Zhang et al. (2020); Nachum and Dai (2020); Yang et al. (2020); Kostrikov and Nachum (2020), model estimation based Mannor et al. (2004); Thomas and Brunskill (2016); Hanna et al. (2017); Zhang et al. (2021), and Q-estimation based Le et al. (2019); Munos et al. (2016); Harutyunyan et al. (2016); Precup (2000); Farajtabar et al. (2018) methods. While these methods are based on different assumptions and formulations, most of them (1) focus on precisely estimating the expected return of a target policy using pre-collected historical data, and (2) perform unsupervised estimation without directly leveraging the online performance of previously deployed policies.
We notice that there are some mismatches between the settings of those OPE methods and the OPE problem in real-world applications. First, in many applications, we do not need to estimate the exact true performance of a target policy; instead, what we need is to compare the true performance of a set of candidate policies and identify the best one, which will then be deployed into a real-world system. That is, correct ranking of policies, rather than precise return estimation, is the goal of off-policy evaluation. Second, in real-world applications, we usually know the true performance of policies that have been deployed into real-world systems and have interacted with real-world users. Such information is not well exploited in today's OPE methods.
Based on these observations, in this work, we define a new problem, supervised off-policy ranking (SOPR for short), which differs from previous OPE in two aspects. First, in SOPR, the true performance of a set of pre-collected policies is available in addition to off-policy trajectory data. Second, SOPR aims to correctly rank new policies, rather than accurately estimate their expected returns.
A straightforward way to rank policies is to first use a scoring model to score each policy and then rank them by their scores. Following this idea, we propose a supervised learning based method to train a scoring model. The scoring model takes the logged states in the historical data and the actions a policy takes on them offline as raw features. Considering that there may be a large number of states in the pre-collected data, we first adopt k-means clustering to cluster similar states, use a Transformer encoder to encode the state-action pairs in each cluster, then use another Transformer encoder to encode the information of those clusters, and finally map the outputs of the second encoder to a score with a multi-layer perceptron (MLP). That is, our scoring model consists of two Transformer encoders and an MLP, which are jointly trained to correctly rank the pre-collected policies with known performance. Our method is named SOPR-T, where "T" stands for the Transformer based model.
Experiments on multiple MuJoCo games and different types of D4RL datasets Fu et al. (2020) show that SOPR-T outperforms strong baseline OPE methods in terms of both effectiveness (including rank correlation and the performance gap between the ground-truth top 3 policies and the identified top 3 policies) and stability.
The main contributions of this work are summarized as follows:

- We point out that OPE should focus on ranking policies rather than exactly estimating their returns, which simplifies the task of OPE.

- To the best of our knowledge, we are the first to introduce supervised learning into OPE. We take only a preliminary step towards supervised OPE in this work and define the problem of supervised off-policy ranking. We hope that our work can inspire more solid work along this direction.

- We propose a Transformer encoder based hierarchical model for policy scoring. Experiments demonstrate the effectiveness and stability of our method. Our code and data have been released (https://github.com/SOPRT/SOPRT).
2 Related Work
Distribution correction based methods aim to correct the distribution shift from the behavior data distribution to the target data distribution. These methods leverage importance sampling (IS) to reweight the rewards used to calculate the average return. Early methods mainly focus on the distribution correction of a trajectory or truncated trajectory, such as trajectory-wise IS, step-wise IS, and normalized (weighted) step-wise IS Thomas et al. (2015); Voloshin et al. (2019). Because they involve the multiplication of IS weights over time steps, these methods suffer from large variance; even with weight normalization, the estimation variance remains large Levine et al. (2020). Recently, more studies focus on correcting the stationary state distribution Liu et al. (2018); Xie et al. (2019); Nachum et al. (2019); Zhang et al. (2020); Nachum and Dai (2020); Yang et al. (2020). However, no matter which kind of distribution correction is used, if the dataset covers only a small part of the target distribution, a large estimation error is inevitable.

Instead of only leveraging data in the given dataset, model-based methods estimate the environment model, including a state transition distribution and a reward function, and generate data using the simulated environment and the target policy Mannor et al. (2004); Hanna et al. (2017); Zhang et al. (2021). Then, the expected return of the target policy can be estimated using the returns of Monte-Carlo (MC) rollouts in the simulated environment. Recently, model-based reinforcement learning (RL) has made much progress in terms of data efficiency and model estimation accuracy Janner et al. (2019); Voloshin et al. (2021); Kaiser et al. (2019). However, these works focus on online RL, where the estimation bias can be reduced by collecting more data from the environment. In contrast, the dataset is static in OPE. Thus, for state-action pairs that are within the distribution of the target policy but outside the distribution of the given dataset, traditional model-based methods may produce large estimation errors in both reward and state transition. Moreover, for model-based methods that resort to MC rollouts to estimate the return, a single erroneous transition estimate may cause a large estimation error.

Q-estimation based OPE methods estimate a Q-function by iteratively applying the Bellman expectation backup operator Sutton and Barto (2018) to the off-policy data. A representative method is Fitted Q Evaluation (FQE) Le et al. (2019), which learns a neural network to approximate the Q-function of the target policy. However, FQE suffers from large uncertainty about the values of unseen state-action pairs. Moreover, evaluating each target policy requires a separate training procedure, which is time-consuming. Another important branch of OPE research is doubly robust (DR) OPE Thomas and Brunskill (2016); Dudík et al. (2011); Jiang and Li (2016), which combines reward function or Q-function estimation with IS. Theoretical analysis shows that if either the behavior policy estimate or the reward/Q estimate is accurate, the DR estimate will be close to the true value. However, IS-based, model-based, and Q-estimation based methods all have drawbacks that cannot be offset by each other.

Off-Policy Classification (OPC) based OPE Irpan et al. (2019) aims at the same goal as our work, i.e., to rank several candidate policies. OPC does not estimate the expected return of a policy but instead uses a statistic that measures the success rate of the actions generated by the policy. However, OPC needs to leverage a Q-function, and thus its performance is limited by the quality of Q-function learning. In addition, a recent work Paine et al. (2020) demonstrated that OPC, with its Q-function learned either by offline RL algorithms or by FQE, performs worse than FQE in most cases.

Offline RL also involves the OPE problem. Good policy evaluation can tell how good the currently learned policy is and in which direction to improve it, which is important for offline RL. Indeed, the critic in the actor-critic learning architecture plays the role of policy evaluation. Some works Fujimoto et al. (2019a, b); Kumar et al. (2019); Boehmke et al. (2020) focus on mitigating the extrapolation error caused by the distribution mismatch between the bootstrapping actions (used in critic learning) and the actions in the dataset. A common remedy is to constrain the distribution of data visited by the learned policy from deviating too far from the dataset. However, the performance of the resulting policy is also limited by this kind of constraint. In particular, when the distribution of the dataset is distant from that of the optimal policy, such a constraint will severely hinder policy optimization.
3 The Problem of Supervised Off-policy Ranking
In this section, we first introduce notation and then formally describe the problem of supervised off-policy ranking.
We consider OPE in a Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P$ and $r$ are the state transition distribution and the reward function, and $\gamma$ is a discount factor. The expected return of a policy $\pi$ is defined as $J(\pi) = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\big]$, where $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, and $T$ is the time horizon of the MDP.

The goal of OPE is to evaluate a policy without running it in the environment. Traditional OPE methods estimate the expected return of a policy by leveraging a pre-collected dataset $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$ composed of $N$ trajectories $\tau_i = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, which are generated by some other policies (usually called behavior policies).
In many real-world applications, where we need to evaluate a policy offline before deploying it into an online system, in addition to the pre-collected dataset, we can also collect a set of previously deployed policies and their true performance after deployment. Clearly, this information is helpful for off-policy evaluation, but unfortunately it has been ignored by previous OPE methods. In this work, we define a new kind of OPE problem, the supervised OPE problem, as follows.
Definition 1
Supervised Off-policy Evaluation: given a pre-collected dataset $\mathcal{D}$ of trajectories and a set of pre-collected policies with known performance, estimate the performance of a target policy or a set of target policies without interacting with the environment.
Compared with previously studied OPE problems, our new problem has more information available: the set of pre-collected policies with known performance, which can serve as supervision while we build an OPE model/method. This is why we call this problem "supervised" OPE: we have label information (i.e., the true performance) available for some policies during training. In contrast, previous OPE methods can be viewed as unsupervised learning, since they do not directly learn from the online performance of pre-collected policies. In real-world applications such as advertising and recommendation systems, we can obtain the true performance of previously deployed policies by observing and counting user click behaviors.
In real-world applications, we often need to compare a set of candidate policies and choose a good one from them; what we really need is the correct ordering of the policies, rather than the exact value/performance of each policy. Thus, we define a variant of the supervised OPE problem, called supervised off-policy ranking.
Definition 2
Supervised Off-policy Ranking: given a pre-collected dataset $\mathcal{D}$ of trajectories and a set of pre-collected policies together with their relative goodness, rank a set of target policies without interacting with the environment.
It is not difficult to see that policy ranking is easier than policy evaluation: accurate evaluation of policies inherently implies their correct ranking, but correct ranking does not require accurate evaluation. In the remainder of this paper, we focus on supervised off-policy ranking (SOPR), considering its practical value and simplicity.
4 Our Method
We introduce our method for SOPR in this section, which trains a scoring model to score and rank target policies. We start by introducing the policy representation (Section 4.1), and then the loss function and training pipeline (Section 4.2).

4.1 Policy Representation
Policies can have very different formats: a policy can be a set of expert-designed rules, a linear function of states, a decision tree over states, or a deep neural network, e.g., a convolutional neural network, a recurrent neural network, or an attention based network. To map a policy to a score, the first question to answer is how to represent policies with such diverse formats.
To handle all kinds of policies, we should leverage what they share rather than how they differ. Obviously, we cannot use the internal parameters of policies, because different policies may have different numbers of parameters, and some policies may not even have parameters. Fortunately, all the policies for the same task or game share the same input space, the state space $\mathcal{S}$, and the same output space, the action space $\mathcal{A}$. Therefore, we propose to build the representation of a policy upon state-action pairs.
Let $S$ denote the set of all the states in the pre-collected data $\mathcal{D}$. For a policy $\pi$ to rank, we let it take actions for all the states in $S$ and obtain a dataset $D_{\pi} = \{(s, \pi(s)) \mid s \in S\}$. Now, any policy can be represented by a dataset in the same format. (For simplicity of description, we consider deterministic policies here. For stochastic policies, we can use the distribution over actions for a state and get a dataset $D_{\pi} = \{(s, \pi(\cdot|s)) \mid s \in S\}$.)
Now the question becomes how to map a set of points/pairs (each state-action pair can be viewed as a point in a high-dimensional space) to a score that indicates the goodness of the policy $\pi$. Following Kool et al. (2018); Nazari et al. (2018); Bello et al. (2016); Xin et al. (2020); Vinyals et al. (2015), we use a Transformer encoder Vaswani et al. (2017) to encode all those points. In particular, we first project each point into an embedding vector, then use a few self-attention layers to get high-level representations of those points, aggregate them by average pooling to get the representation vector of the dataset (and thus the policy), and finally linearly project the vector to a score.
A computational challenge is that there can be millions of states in the pre-collected historical data; it is impossible for a Transformer to handle data at such a scale. Our solution is to down-sample the states and encode the state-action pairs hierarchically as follows:

1. Randomly sample a subset of states $S'$ from $S$ and cluster them into $m$ clusters by k-means.

2. Let a policy take actions on the states in $S'$ and obtain a set of state-action pairs.

3. Use a low-level Transformer encoder to encode all the state-action pairs in each cluster, and obtain a vector representation for each cluster by average pooling.

4. Use a high-level Transformer encoder to encode all the clusters, and obtain a vector representation for the policy by average pooling.

5. Linearly project the vector from the previous step to a score.
Figure 1 illustrates our scoring function. Since our scoring function is based on Transformers, we name our method SOPR-T. Let $f_{\theta}$ denote our scoring function with parameters $\theta$, which maps a policy $\pi$ together with a state set $S'$ to a score.
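To make the hierarchy concrete, below is a minimal PyTorch sketch of the scoring model, assuming the layer sizes of Table 1 in the appendix and scikit-learn's k-means. The helper names (SOPRTScorer, assign_clusters), the default cluster count, and the single shared embedding width are our illustrative assumptions, not the released implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def assign_clusters(states_np, n_sample=512, n_clusters=16, seed=0):
    # Steps 1-2: randomly sub-sample states and cluster them with k-means.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(states_np), size=n_sample, replace=False)
    sub_states = states_np[idx]
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(sub_states)
    return sub_states, cluster_ids


class SOPRTScorer(nn.Module):
    def __init__(self, dim_s, dim_a, d_model=64, n_clusters=16):
        super().__init__()
        self.n_clusters = n_clusters
        # Project each (state, action) pair into an embedding vector.
        self.input_proj = nn.Linear(dim_s + dim_a, d_model)
        # Low-level encoder: self-attention over the pairs inside one cluster.
        self.low_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, dim_feedforward=128,
                                       dropout=0.1, batch_first=True),
            num_layers=2)
        # High-level encoder: self-attention over the cluster representations.
        self.high_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=512,
                                       dropout=0.1, batch_first=True),
            num_layers=6)
        # Linear projection of the pooled policy representation to a score.
        self.output_proj = nn.Linear(d_model, 1)

    def forward(self, states, actions, cluster_ids):
        # states: (n, dim_s); actions: (n, dim_a), the policy's actions on
        # these states; cluster_ids: (n,) LongTensor of k-means assignments.
        x = self.input_proj(torch.cat([states, actions], dim=-1))  # (n, d)
        cluster_vecs = []
        for c in range(self.n_clusters):
            members = x[cluster_ids == c]
            if members.shape[0] == 0:
                continue
            h = self.low_encoder(members.unsqueeze(0))  # (1, n_c, d)
            cluster_vecs.append(h.mean(dim=1))          # average pooling
        clusters = torch.stack(cluster_vecs, dim=1)     # (1, m, d)
        policy_vec = self.high_encoder(clusters).mean(dim=1)  # (1, d)
        return self.output_proj(policy_vec).squeeze(-1)       # (1,) score
```

The design mirrors the five steps above: the low-level encoder sees only the pairs within one cluster, so its attention cost stays small, while the high-level encoder operates on at most $m$ cluster vectors.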
4.2 Pairwise Loss and Learning Algorithm
As aforementioned, the goal of SOPR is to rank a set of policies correctly. We adopt a pairwise ranking loss following Burges et al. (2005):

$$\mathcal{L}(\theta) = \sum_{(\pi_i, \pi_j)} \Big[ -\bar{y}_{ij} \log P_{ij} - \big(1 - \bar{y}_{ij}\big) \log \big(1 - P_{ij}\big) \Big], \qquad P_{ij} = \sigma\big(f_{\theta}(\pi_j, S') - f_{\theta}(\pi_i, S')\big), \tag{1}$$

where $\sigma(\cdot)$ is the sigmoid function, $\theta$ is the parameter of our scoring function to learn, and $\pi_i$ and $\pi_j$ are two policies with order label $\bar{y}_{ij}$: $\bar{y}_{ij} = 1$ means $\pi_i$ is worse than $\pi_j$, $\bar{y}_{ij} = 0$ means $\pi_i$ is better than $\pi_j$, and $\bar{y}_{ij} = 1/2$ means that the two policies perform similarly.
It is not difficult to see that minimizing this loss requires the scoring function to rank any two policies consistently with the ranking of their true performance.
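For concreteness, here is a sketch of this pairwise loss in PyTorch, following the cross-entropy form of Burges et al. (2005) reconstructed in Eq. (1); the assumption that scores and labels are 1-element float tensors is ours.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_i, score_j, y):
    # y = 1 if pi_j is better than pi_i, 0 if worse, 0.5 if similar,
    # as a 1-element float tensor matching the score shapes.
    # P_ij = sigmoid(score_j - score_i) is the predicted probability that
    # pi_j outperforms pi_i; the loss is the cross-entropy between y and P_ij.
    logits = score_j - score_i
    return F.binary_cross_entropy_with_logits(logits, y)
```

Note that binary cross-entropy with a soft target of 0.5 pushes the two scores together, which matches the "perform similarly" label.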
The complete training algorithm is shown in Algorithm 1. Note that, in training, we sample multiple subsets of states in a manner similar to mini-batch training. In inference, we also sample multiple subsets, compute a score over each subset for a test policy using the trained scoring function, and use the average score over those subsets as the final score of the policy.
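A condensed sketch of these training and inference procedures follows, assuming the SOPRTScorer and pairwise_ranking_loss sketches above; sample_sub_dataset and policy_actions are hypothetical helpers (state sub-sampling with k-means assignments, and querying a policy's actions on the sampled states), and the loop structure is an assumption rather than the exact Algorithm 1.

```python
import itertools
import torch

def train(scorer, policies, labels, states, n_iters=1000, lr=1e-3):
    # labels[i][j]: 1-element float tensor in {0., 0.5, 1.} encoding the
    # known ranking of the pair (pi_i, pi_j), with 1 meaning pi_j is better.
    opt = torch.optim.Adam(scorer.parameters(), lr=lr)
    for _ in range(n_iters):
        # Mini-batch: a fresh sub-dataset of states with cluster assignments.
        sub_states, cluster_ids = sample_sub_dataset(states)
        scores = [scorer(sub_states, policy_actions(pi, sub_states), cluster_ids)
                  for pi in policies]
        loss = sum(pairwise_ranking_loss(scores[i], scores[j], labels[i][j])
                   for i, j in itertools.combinations(range(len(policies)), 2))
        opt.zero_grad()
        loss.backward()
        opt.step()

@torch.no_grad()
def score_policy(scorer, pi, states, n_sub=5):
    # Inference: average the score over several sampled sub-datasets.
    total = 0.0
    for k in range(n_sub):
        sub_states, cluster_ids = sample_sub_dataset(states, seed=k)
        total += scorer(sub_states, policy_actions(pi, sub_states),
                        cluster_ids).item()
    return total / n_sub
```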
4.3 Discussions
We discuss several aspects of our algorithm in this subsection.
One may notice that we use only the states in the pre-collected historical data $\mathcal{D}$, while most previous OPE methods use both the states and the immediate rewards in $\mathcal{D}$. There are several reasons for using only states. First, a new policy to be evaluated usually takes actions different from those of the behavior policies that generated the historical data, and the reward $r(s, a')$ is not available for a state $s$ with a new action $a'$. Thus, we do not directly use immediate rewards in this work. Second, as we study the problem of SOPR, we have a set of training policies with known performance available. The performance information of those policies is a more reliable and direct signal for supervised learning than immediate rewards, since it comes from online interactions with real-world systems and directly indicates the goodness of a policy. Third, consider the following formulation of the expected return of a policy,
$$J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi}}\big[r(s,a)\big], \tag{2}$$

where $d^{\pi}$ denotes the stationary state-action distribution under $\pi$. Note that $d^{\pi}$ depends only on the policy $\pi$, while $r(s,a)$ is the same across all policies for the same task/game. Therefore, to rank different policies for a task, the more important part is $d^{\pi}$, rather than the immediate rewards.
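For completeness, here is a sketch of the standard derivation behind Eq. (2), assuming the infinite-horizon discounted setting (normalization conventions vary across papers):

```latex
\begin{aligned}
J(\pi) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
        = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}_{(s,a) \sim d_t^{\pi}}\big[r(s,a)\big] \\
       &= \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi}}\big[r(s,a)\big],
\qquad d^{\pi}(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, d_t^{\pi}(s,a),
\end{aligned}
```

where $d_t^{\pi}$ is the distribution of $(s_t, a_t)$ at step $t$ under $\pi$. The reward function factors out of the policy-dependent part, which is why the state-action distribution carries the ranking signal.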
While we focus on off-policy ranking in this work, our proposed scoring model can easily be applied to OPE. For this purpose, we need the true performance of the training policies; their performance ranking alone is not enough. Given the true performance of the training policies, we can train the scoring model by minimizing the gap between the true and predicted performance of each policy.
5 Experiments
We compare SOPR-T with different kinds of baseline OPE methods on various tasks, including evaluating policies learned by different algorithms, in different games, and with different kinds of datasets.
Metrics
We evaluate SOPR-T and baseline OPE methods with two metrics, Spearman's rank correlation coefficient and normalized Regret@k, which reflect how well they rank candidate policies; this is aligned with related work Paine et al. (2020). Specifically, Spearman's rank correlation is the Pearson correlation between the ground-truth rank sequence and the evaluated rank sequence of the candidate policies. Normalized Regret@k is the normalized performance gap between the truly best policy among all candidates and the truly best policy among the ranked top-k policies, i.e., $(V_{\max} - V_{\text{top-}k}) / (V_{\max} - V_{\min})$, where $V_{\max}$ and $V_{\min}$ are the ground-truth performance of the truly best and the truly worst policy among all candidates, respectively, and $V_{\text{top-}k}$ is the ground-truth performance of the truly best policy among the ranked top-k policies. We use $k = 3$ in our experiments.
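Both metrics are straightforward to compute; a sketch in Python follows (the array-based interface is an assumption):

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(true_perf, scores):
    # Spearman's rank correlation between the ground-truth performance and
    # the estimated scores of the candidate policies.
    return spearmanr(true_perf, scores).correlation

def normalized_regret_at_k(true_perf, scores, k=3):
    true_perf, scores = np.asarray(true_perf), np.asarray(scores)
    top_k = np.argsort(scores)[::-1][:k]   # indices of the ranked top-k policies
    v_best, v_worst = true_perf.max(), true_perf.min()
    v_top_k = true_perf[top_k].max()       # best true performance among top-k
    return (v_best - v_top_k) / (v_best - v_worst)
```

A regret of 0 means the truly best policy is inside the ranked top-k, regardless of the rest of the ordering, so the two metrics capture complementary aspects of ranking quality.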
Baselines
We compare SOPR-T with four representative baseline OPE methods: Fitted Q-Evaluation (FQE) Le et al. (2019), DualDICE Nachum et al. (2019), model-based estimation (MB), and weighted importance sampling (IW). We leveraged a popular implementation of these baseline OPE algorithms (https://github.com/google-research/google-research/tree/master/policy_eval).
Tasks and Datasets
We evaluate SOPR-T and baseline OPE methods on the D4RL datasets (https://github.com/rail-berkeley/d4rl) Fu et al. (2020), which are commonly used in offline RL studies. Specifically, we use 12 tasks over three MuJoCo games, i.e., Hopper-v2, HalfCheetah-v2, and Walker2d-v2, with four types of datasets for each game: expert, medium, medium-replay (m-replay for short), and full-replay (f-replay for short). The expert and medium datasets are collected by a single expert policy and a single medium policy, respectively. The two policies are trained online with the Soft Actor-Critic (SAC) algorithm Haarnoja et al. (2018), and the medium policy achieves only about 1/3 of the performance of the expert policy. The full-replay and medium-replay datasets contain all the data collected during the training of the aforementioned expert and medium policies, respectively.
Training Set and Validation Set
The training set and validation set consist of policies and their rank labels. The policies are collected during online SAC training. For each game, we collect 50 policy snapshots during the training process and obtain their performance by online evaluation using 100 Monte-Carlo rollouts in the real environment. After the training process finishes, we randomly select 30 policies to form the training policy set and another 10 policies to form the validation policy set. The remaining 10 policies form a test policy set, which we describe in detail later. Note that, in the training phase of SOPR-T, only the ranks of the policies are used as labels.
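As a sketch, ground-truth labels of this kind could be obtained as follows, mirroring the 100-rollout Monte-Carlo evaluation described above; the gym-style environment API and the deterministic policy callable are assumptions.

```python
import numpy as np

def mc_performance(env, policy, n_rollouts=100):
    # Average undiscounted return of `policy` over n_rollouts episodes.
    returns = []
    for _ in range(n_rollouts):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```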
Test Set
In each task, we use two kinds of test policy sets to simulate two kinds of OPE cases.
In Test Set I, we investigate the capability of SOPR-T and baseline OPE methods to rank and select good policies in offline RL settings. Specifically, Test Set I is composed of policies collected by running three popular offline RL algorithms: BEAR Kumar et al. (2019), CQL Kumar et al. (2020), and CRR Wang et al. (2020). The implementation of the three algorithms is based on a public codebase (https://github.com/takuseno/d3rlpy). For each task, we run each algorithm until convergence and collect policy snapshots during training. To obtain ground-truth rank labels, we evaluate each policy using 100 Monte-Carlo rollouts in the real environment and use the rank of the resulting performance as the label. Then, we mix these policies and select 10 policies whose performance is evenly spaced over the performance range of all policies. In this way, the selected policies have diverse performance. In addition, mixing policies generated by different algorithms to form a test policy set is aligned with practical cases, where the sources of policies are varied and unknown.
In Test Set II, we investigate the capability of SOPR-T to rank policies that are approximately within the distribution of the training policies. In practice, such as during product development and updates, the updated policies and the previously evaluated policies usually share many properties; it can therefore be assumed that the policies to be evaluated and the policies that have already been evaluated follow two similar distributions. To simulate this case in our experiments, we use the 10 remaining policies mentioned in the description of the training policy set to form Test Set II. Because these 10 policies and the training policies are uniformly sampled from the same policy set, they are approximately within the same distribution.
5.1 Performance on Offline Learned Policies
We first evaluate SOPR-T and baseline OPE methods on Test Set I, using 3 random seeds for each experiment. Due to space limitations, we only present the results on the Hopper game here (Fig. 2 (a) and (b)); other results can be found in Appendix A.3. As we can see from Fig. 2 (a) and (b), SOPR-T achieves a higher rank correlation coefficient and a smaller regret value than the baselines, which means SOPR-T ranks different policies more accurately and also identifies good policies among the candidates. In addition, SOPR-T performs the most stably: it does not produce negative rank correlations in any task, whereas each baseline OPE method has one or more negative rank correlation results. Though SOPR-T does not hold consistent superiority in Walker2d and HalfCheetah (shown in Appendix A.3), it still performs the most stably. Fig. 2 (c) and (d) show the overall performance of all five methods (SOPR-T and the four OPE baselines) over the 12 tasks. SOPR-T achieves the top rank correlation in 5 tasks and the top regret value in 6 tasks, both of which are the highest among all methods.
5.2 Performance on InDistribution Policies
We then evaluate SOPR-T and the OPE baselines on Test Set II. As aforementioned, policies in Test Set II follow a distribution similar to that of the training policies. In addition, because these policies are collected during online training, they are unrelated to the datasets, which is aligned with many practical cases where the policies to be evaluated are not designed or learned based on the dataset. We again run SOPR-T and each baseline OPE algorithm with 3 different seeds. Results are shown in Fig. 2 (second row) and Appendix A.4. The results demonstrate that SOPR-T outperforms the baseline OPE methods dramatically in almost all tasks. The overall results over the 12 tasks are shown in Fig. 2 (g) and (f): SOPR-T achieves the top rank correlation in 10 out of 12 tasks and the top regret value in all 12 tasks.
Compared with ranking offline learned policies, the performance of SOPR-T improves when ranking in-distribution policies. To verify that the performance of SOPR-T is affected by the distribution difference between the training policy set and the test policy set, we measure the distance between the two policy sets by

$$D(\Pi_{\text{train}}, \Pi_{\text{test}}) = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \frac{1}{N} \sum_{k=1}^{N} \big\| \pi_i(s_k) - \pi'_j(s_k) \big\|,$$

where $N$ is the number of states in the dataset, $\pi_i \in \Pi_{\text{train}}$, $\pi'_j \in \Pi_{\text{test}}$, and $n_1$ and $n_2$ are the numbers of policies in the training set and the test set, respectively. Distance results are shown in Fig. 3 and Appendix A.4. As can be seen, the distance between Test Set I and the training policy set is much larger than the distance between Test Set II and the training policy set.
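A sketch of this distance in NumPy, assuming each policy is represented by its actions on a common set of states and that the per-pair term is the average Euclidean action distance of the reconstructed formula above:

```python
import numpy as np

def policy_set_distance(train_actions, test_actions):
    # train_actions: (n1, N, dim_a); test_actions: (n2, N, dim_a) --
    # actions of each policy on the same N states.
    diff = train_actions[:, None] - test_actions[None, :]    # (n1, n2, N, dim_a)
    per_pair = np.linalg.norm(diff, axis=-1).mean(axis=-1)   # (n1, n2)
    return per_pair.mean()                                   # scalar distance
```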
5.3 Further Studies
We further conduct a set of experiments to better understand our algorithm.
Effect of Data Size
We investigate the sensitivity of SOPR-T and baseline OPE methods to the size of the dataset. For each task, we sample different amounts of data from the original dataset in the form of trajectories. For the expert and medium datasets (each composed of data collected by a single policy), trajectories are sampled randomly. For the medium-replay dataset (composed of all data collected, in sequence, while training SAC), we fetch data sequentially. Note that, in this experiment, we use three datasets, i.e., expert, medium, and medium-replay, as we suppose the first part of the full-replay dataset is similar to the medium-replay dataset. We set the number of data points (states for SOPR-T, tuples for baseline OPE methods) to 4k, 8k, 16k, and 32k. Note that the original datasets differ in size across tasks, but all of them contain more than 200k data points. Due to space limitations, we only show the overall performance of all methods over the 9 tasks with 16k data points in Fig. 4. We observe that, even with much less data, SOPR-T still outperforms the baseline OPE methods. In Appendix A.5, Fig. 12 shows the average performance of each method with different data sizes. The results indicate that SOPR-T is robust to data size: even with a very small data size such as 4k, SOPR-T still performs well and consistently outperforms the baseline OPE methods.
Effect of Transformer Encoder
We investigate the effect of using a Transformer encoder to encode state-action pairs. To this end, we replace the Transformer encoder with an MLP encoder and name the resulting MLP-based SOPR model SOPR-MLP. The numbers of hidden layers and hidden units in the MLP encoder are aligned with those of the Transformer encoder used in SOPR-T. The training and inference procedures are also the same as for SOPR-T.
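A sketch of this encoder swap follows, assuming the hyperparameters of Table 1 in the appendix; matching one Transformer layer with a two-linear-layer MLP block (and dropping attention and residual connections) is our approximation of "aligned" depth and width.

```python
import torch.nn as nn

def mlp_encoder(d_model, dim_feedforward, n_layers):
    # Applied independently to each (state, action) embedding, i.e., to the
    # last dimension of a (batch, n, d_model) tensor; pooling is unchanged.
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(d_model, dim_feedforward), nn.ReLU(),
                   nn.Linear(dim_feedforward, d_model)]
    return nn.Sequential(*layers)

# e.g., replacing the low-level encoder: low_encoder = mlp_encoder(64, 128, 2)
```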
We compare the performance of SOPR-MLP and SOPR-T on both test policy sets. Results of ranking Test Set I on the 4 Hopper tasks are shown in Fig. 5; other results are shown in Appendix A.5. As can be seen, SOPR-T outperforms SOPR-MLP in most tasks. Specifically, when ranking Test Set I, SOPR-T achieves a higher rank correlation in 8 out of 12 tasks and a lower or equal (zero) regret value in 9 out of 12 tasks. For Test Set II, SOPR-T achieves a higher rank correlation in 9 out of 12 tasks and a lower or equal (zero) regret value in all tasks. This demonstrates that the Transformer encoder is superior to the MLP for encoding state-action pairs and representing policies in our SOPR method.
Performance Variance
In this part, we investigate the performance variance of SOPR-T in inference, in particular when using a small amount of data. In many real-world applications, the inference speed of policy ranking matters a lot. Although SOPR-T is efficient due to its end-to-end scoring model, we would also like to investigate whether SOPR-T can use less data in inference than in training, in order to reduce inference time. To this end, in inference, we sample different numbers (1, 5, 10, 50, 100, and 200) of sub-datasets for SOPR-T, compute an average score for each policy, and then rank the policies by their scores. The size of each sub-dataset equals the training batch size, which is provided in Appendix A.1. Given the number of sub-datasets, the variance of the rank correlation/regret value is computed over five results on five random sets of sub-datasets.
The relationship between the variance of the rank correlation/regret value and the number of sampled sub-datasets is shown in Fig. 6 (Hopper) and Appendix A.5. As can be seen, the standard deviations of both rank correlation and regret of SOPR-T are small for all numbers of sub-datasets, even when using only one. As the number of sub-datasets increases, the standard deviation decreases gradually, and SOPR-T achieves more stable performance.
6 Conclusions and Future Work
In this work, we defined a new problem, supervised off-policy ranking (SOPR), which aims to score target policies for the purpose of correct ranking instead of precise return estimation, by leveraging a set of training policies with known performance (ranking) in addition to off-policy data. We proposed a method to learn a Transformer based scoring function, which first represents a policy by a set of state-action pairs and then maps this set to a score indicating the goodness of the policy. The scoring function is trained by minimizing a pairwise ranking loss. Experiments demonstrate the effectiveness and stability of SOPR-T over strong baseline OPE methods in terms of both rank correlation and regret value.
As a newly defined problem, our work is only a first and very preliminary step towards solving it. There are many possible future directions. First, beyond the state-action pair based representation in this work, how can we better represent a policy? For example, given the existence of many OPE methods, the values of a policy estimated by those methods could be used to represent the policy, and one could conduct supervised learning to combine their estimates; those estimated values could also serve as features that enhance the representation adopted in this work. Second, we employed a Transformer based scoring function; it is interesting to explore other possibilities for the scoring function. Third, we used a pairwise loss in our algorithm; other ranking losses (e.g., the listwise loss Cao et al. (2007)) are worth trying. Fourth, since we conduct supervised learning, a limitation of our algorithm is that its effectiveness is heavily impacted by the quality of the training policies; we will study and quantify this impact. Fifth, as a supervised learning problem, there are many theoretical questions to answer Vapnik (1999, 2013). For example, is a learning algorithm consistent for supervised OPE or supervised OPR? What is its rate of convergence? What is its generalization bound? We hope that our work can inspire more research along the direction of OPE/OPR.
References

Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
Critic regularized regression. Hands-On Machine Learning with R, pp. 121-140. Note: CRR. arXiv:2006.15134v2.
Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pp. 89-96.
Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129-136.
Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pp. 1447-1456.
D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
Benchmarking batch deep reinforcement learning algorithms. pp. 1-13. arXiv:1910.01708.
Off-policy deep reinforcement learning without exploration. In 36th International Conference on Machine Learning (ICML 2019), pp. 3599-3609. arXiv:1812.02900.
Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870.
Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pp. 2605-2613.
Bootstrapping with models: confidence intervals for off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
Q(λ) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pp. 305-320.
Off-policy evaluation via off-policy classification. arXiv preprint arXiv:1906.01624.
When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253.
Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652-661.
Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374.
Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475.
Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609.
Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems 32 (NeurIPS). arXiv:1906.00949.
Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.
Batch policy learning under constraints. In 36th International Conference on Machine Learning (ICML 2019), pp. 6589-6600. Note: FQE. arXiv:1903.08738.
Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356-5366.
Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 72.
Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054-1062.
DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866.
Reinforcement learning for solving the vehicle routing problem. arXiv preprint arXiv:1802.04240.
Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055.
Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80.
Reinforcement learning: an introduction. MIT Press.
Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139-2148.
High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
An overview of statistical learning theory. IEEE Transactions on Neural Networks 10(5), pp. 988-999.
The nature of statistical learning theory. Springer Science & Business Media.
Attention is all you need. arXiv preprint arXiv:1706.03762.
Pointer networks. arXiv preprint arXiv:1506.03134.
Minimax model learning. In International Conference on Artificial Intelligence and Statistics, pp. 1612-1620.
Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854.
Critic regularized regression. arXiv preprint arXiv:2006.15134.
Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9668-9678.
Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. arXiv preprint arXiv:2012.10638.
Off-policy evaluation via the regularized Lagrangian. arXiv preprint arXiv:2007.03438.
Autoregressive dynamics models for offline policy evaluation and optimization. In International Conference on Learning Representations.
GenDICE: generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072.
Appendix A
A.1 Model and Training Configurations
Table 1 lists the configurations of our model and training process.

Hyperparameter | Value
Input linear projection layer | ((dim_s + dim_a), 64)
Low-level encoder | n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1
High-level encoder | n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1
Output linear projection layer | (256, 1)
Optimizer | Adam
Learning rate | 0.001
Batch size |
Number of clusters |
A.2 Computational Resource and Time Cost
Our experiments are run on an Nvidia Tesla P100 GPU. Tables 2 and 4 show the training and inference time costs of SOPR-T (in seconds), respectively; Tables 3 and 5 show the time costs of all algorithms. Note that the inference time of SOPR-T depends on the number of sub-datasets used to calculate the average score of a policy. Unless otherwise specified, in inference we use all the sub-datasets that are used in training. As shown in the last part of Section 5.3 and the last part of the Appendix (Fig. 16), SOPR-T with 5 sub-datasets achieves nearly the same performance as with 200 sub-datasets. Namely, to score and rank the 10 policies in each task shown in Table 4, SOPR-T takes less than 25.24 seconds to produce a good ranking, which is comparable to IW and much faster than FQE and DualDICE.
Table 2: Training time cost (seconds) of SOPR-T.

Task Name | Seed0 | Seed1 | Seed2 | Average
halfcheetah-expert | 3937.8 | 3987.3 | 3993.0 | 3972.7
halfcheetah-medium | 3816.1 | 3973.8 | 3923.8 | 3904.6
halfcheetah-medium-replay | 4653.5 | 4987.4 | 4969.4 | 4870.1
halfcheetah-full-replay | 5283.5 | 5151.5 | 6300.0 | 5578.3
hopper-expert | 3204.7 | 3214.5 | 3228.9 | 3216.0
hopper-medium | 3216.3 | 3177.5 | 3298.5 | 3230.8
hopper-medium-replay | 3799.9 | 3811.7 | 4123.5 | 3911.7
hopper-full-replay | 3837.4 | 3795.7 | 3909.6 | 3847.6
walker2d-expert | 3702.3 | 4030.1 | 3678.7 | 3803.7
walker2d-medium | 3601.0 | 3740.9 | 3647.4 | 3663.1
walker2d-medium-replay | 4687.4 | 4719.2 | 4606.0 | 4670.9
walker2d-full-replay | 4683.7 | 4566.4 | 4648.7 | 4632.9
Table 3: Training time cost (seconds) of all algorithms.

Task Name | SOPR-T | FQE | DualDICE | MB | IW
halfcheetah-expert | 3972.7 | / | / | 6610.0 | 3841.2
halfcheetah-medium | 3904.6 | / | / | 4386.1 | 4000.8
halfcheetah-medium-replay | 4870.1 | / | / | 4682.8 | 2762.3
halfcheetah-full-replay | 5578.3 | / | / | 5037.2 | 3370.9
hopper-expert | 3216.0 | / | / | 7614.4 | 3695.9
hopper-medium | 3230.8 | / | / | 4764.0 | 3684.9
hopper-medium-replay | 3911.7 | / | / | 5104.8 | 3034.7
hopper-full-replay | 3847.6 | / | / | 4776.6 | 3523.0
walker2d-expert | 3803.7 | / | / | 4582.4 | 3371.0
walker2d-medium | 3663.1 | / | / | 4629.5 | 3542.5
walker2d-medium-replay | 4670.9 | / | / | 4780.0 | 2933.2
walker2d-full-replay | 4632.9 | / | / | 4732.3 | 3532.7
Table 4: Inference time cost (seconds) of SOPR-T.

Task Name | Test Set I (Seed0 / Seed1 / Seed2 / Average) | Test Set II (Seed0 / Seed1 / Seed2 / Average)
halfcheetah-expert | 799.9 / 770.9 / 775.5 / 782.1 | 600.4 / 515.0 / 523.6 / 546.3
halfcheetah-medium | 741.7 / 775.9 / 746.5 / 754.7 | 532.5 / 543.0 / 539.8 / 538.4
halfcheetah-medium-replay | 854.3 / 864.5 / 849.6 / 856.1 | 658.6 / 660.1 / 668.9 / 662.5
halfcheetah-full-replay | 1038.1 / 1050.7 / 940.1 / 1009.6 | 657.7 / 664.7 / 658.8 / 660.4
hopper-expert | 843.4 / 841.4 / 833.6 / 839.4 | 490.3 / 502.2 / 494.7 / 495.7
hopper-medium | 696.6 / 701.1 / 761.8 / 719.8 | 481.9 / 488.1 / 500.6 / 490.2
hopper-medium-replay | 772.9 / 790.1 / 803.7 / 788.9 | 586.5 / 583.7 / 591.2 / 587.1
hopper-full-replay | 737.2 / 735.0 / 745.6 / 739.3 | 586.0 / 643.9 / 650.3 / 626.7
walker2d-expert | 814.1 / 809.8 / 785.0 / 803.0 | 544.5 / 518.8 / 531.3 / 531.5
walker2d-medium | 906.5 / 874.7 / 893.9 / 891.7 | 544.7 / 547.8 / 520.3 / 537.6
walker2d-medium-replay | 781.7 / 781.8 / 779.9 / 781.2 | 652.7 / 663.5 / 659.0 / 658.4
walker2d-full-replay | 946.9 / 939.6 / 946.7 / 944.4 | 659.3 / 671.7 / 656.3 / 662.4
Table 5: Inference time cost (seconds) of all algorithms on Test Set I.

Task Name | SOPR-T | FQE | DualDICE | MB | IW
halfcheetah-expert | 782.1 | 63262.7 | 70975.3 | 142.4 | 27.8
halfcheetah-medium | 754.7 | 39271.1 | 48123.2 | 121.3 | 19.0
halfcheetah-medium-replay | 856.1 | 38574.8 | 44506.9 | 96.7 | 14.3
halfcheetah-full-replay | 1009.6 | 39221.6 | 48345.0 | 141.0 | 21.1
hopper-expert | 839.4 | 39055.9 | 50544.5 | 135.0 | 19.2
hopper-medium | 719.8 | 72422.3 | 79971.6 | 163.9 | 26.0
hopper-medium-replay | 788.9 | 39408.0 | 53415.2 | 137.6 | 27.1
hopper-full-replay | 739.3 | 299882.7 | 299237.4 | 446.2 | 48.8
walker2d-expert | 803.0 | 38092.9 | 58832.3 | 125.0 | 19.8
walker2d-medium | 891.7 | 39518.1 | 47013.5 | 134.2 | 20.6
walker2d-medium-replay | 781.2 | 39394.2 | 62608.0 | 142.2 | 20.3
walker2d-full-replay | 944.4 | 47131.7 | 51518.7 | 214.9 | 26.6
Table 5 (continued): Inference time cost (seconds) of all algorithms on Test Set II.

Task Name | SOPR-T | FQE | DualDICE | MB | IW
halfcheetah-expert | 546.3 | 48336.0 | 47690.6 | 125.8 | 19.6
halfcheetah-medium | 538.4 | 45660.3 | 52757.4 | 105.4 | 19.7
halfcheetah-medium-replay | 662.5 | 42467.1 | 56625.4 | 82.9 | 14.6
halfcheetah-full-replay | 660.4 | 43887.4 | 47438.6 | 110.1 | 21.1
hopper-expert | 495.7 | 44551.3 | 48764.1 | 103.4 | 19.3
hopper-medium | 490.2 | 51706.9 | 59532.8 | 152.1 | 25.2
hopper-medium-replay | 587.1 | 44676.7 | 58953.9 | 122.5 | 25.9
hopper-full-replay | 626.7 | 60112.8 | 90679.3 | 180.8 | 37.2
walker2d-expert | 531.5 | 52014.5 | 49146.8 | 112.1 | 19.6
walker2d-medium | 537.6 | 49727.6 | 53583.4 | 115.2 | 20.7
walker2d-medium-replay | 658.4 | 44017.6 | 49982.8 | 108.2 | 20.2
walker2d-full-replay | 662.4 | 44859.8 | 56993.5 | 129.3 | 27.3
A.3 Additional Results of Ranking Offline Learned Policies (Corresponding to Section 5.1)
In Section 5.1, we presented the rank correlation and regret value of each method when ranking offline learned policies (Test Set I) on the Hopper game. Here, Fig. 7 shows all the results on the three games.
As can be seen, SOPR-T outperforms the baseline OPE methods consistently on the four tasks of the Hopper game. Though SOPR-T does not hold consistent superiority on Walker2d and HalfCheetah, it performs the most stably: it does not produce negative correlation results in any setting, whereas every baseline OPE method has one or more negative correlation results.
A.4 Additional Results of Ranking In-Distribution Policies (Corresponding to Section 5.2)
In Section 5.2, we presented the results of ranking in-distribution policies (Test Set II, sampled from the same policy set as the training policies) on the Hopper game. Here, Fig. 8 shows the results on all the games. As can be seen, SOPR-T achieves a very high rank correlation and zero regret in all the games and training data settings. In addition, the variance caused by random seeds is very small.
The performance improvement of SOPR-T on Test Set II compared with Test Set I can be explained by the degree of policy distribution shift. As mentioned in Section 5.2, to verify that the policy distribution shift of Test Set I is greater than that of Test Set II, we measure the distance between the set of training policies and that of test policies using the metric

$$D(\Pi_{\text{train}}, \Pi_{\text{test}}) = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \frac{1}{N} \sum_{k=1}^{N} \big\| \pi_i(s_k) - \pi'_j(s_k) \big\|,$$

where $N$ is the number of states in the dataset, $\pi_i \in \Pi_{\text{train}}$, and $\pi'_j \in \Pi_{\text{test}}$.
Fig. 9 shows the distance results for Test Set I and Test Set II. We can see that, in each data setting, the distance between Test Set I and the training policy set is larger than the distance between Test Set II and the training policy set.
A.5 Additional Results of Section 5.3
Effect of Data Size
In Section 5.3, we investigated the overall performance of each method over the 3 games when the size of the off-policy dataset is small, i.e., 16k. Here, Fig. 10 and Fig. 11 show the performance of each method on each individual game, and Fig. 12 shows the average performance of each method with different data sizes.
Note that, for our SOPR-T method, the results in this part (Fig. 10 - Fig. 12) correspond to training and testing on the same dataset, while the results in Fig. 16 correspond to using different amounts of data (sub-datasets) only in inference, with the original dataset used in training. As we aim to investigate how data size affects the performance of ranking policies, the two test policy sets are fixed while we change the data size.
As shown in Fig. 10 and Fig. 11, SOPR-T outperforms the baseline OPE methods in most data settings. In addition, the performance of SOPR-T degrades only slightly as the data size shrinks.
Further, as can be seen from Fig. 12, SOPR-T is robust to data size: even with a very small data size such as 4k, it still performs well and consistently outperforms the baseline OPE methods.
Transformer Encoder vs. MLP Encoder
In Section 5.3, we also investigated the performance difference between using a Transformer encoder and an MLP encoder in our SOPR framework; the two corresponding SOPR models are named SOPR-T and SOPR-MLP, respectively. Here, the performance of SOPR-T and SOPR-MLP on each individual task is shown in Fig. 13 (Test Set I) and Fig. 15 (Test Set II). The overall performance of SOPR-T and SOPR-MLP over all tasks is shown in Fig. 14 (Test Set I) and Fig. 15 (d) (Test Set II). As can be seen, SOPR-T outperforms SOPR-MLP in most tasks in terms of both rank correlation and regret value.
Because the regret values of both SOPR-T and SOPR-MLP on Test Set II are zero over all tasks, we only show the rank correlation results in Fig. 15. The results indicate that both SOPR-T and SOPR-MLP perform well on Test Set II (in-distribution policies); their overall ranks, shown in Fig. 15 (d), demonstrate that SOPR-T is still better than SOPR-MLP.
Effect of Using Different Numbers of Sub-Datasets in Inference
In the last part of Section 5.3, we investigated the effect of using different numbers of sub-datasets in inference. In inference, SOPR-T samples a number (denoted as $n$) of the sub-datasets used in the training phase and calculates the average score of a policy over them; the average score may vary with the sampled sub-datasets, and thus the ranking result of a policy set may also vary. Therefore, we study the average and variance of the performance of SOPR-T and their relationship with the value of $n$. To this end, we set $n$ to 1, 5, 10, 50, 100, and 200. For each value of $n$, we sample $n$ sub-datasets 5 times, with different random seeds, from the sub-datasets used in the training phase, and use SOPR-T to rank the policies on each sample. We then obtain 5 rank correlation/regret results and calculate their average and standard deviation.
Fig. 16 shows the average results with standard deviation bars. As we can see, in all tasks, the number of sub-datasets makes little difference to the average performance of SOPR-T. In addition, as the number of sub-datasets increases, the performance variance of SOPR-T decreases quickly, and in almost all tasks the variance is quite small. Therefore, we conclude that the performance of SOPR-T is stable even when it uses only a small number of sub-datasets (e.g., $n = 5$) in inference. Further, the inference time cost can be reduced in practice by using a small number of sub-datasets.