Off-policy evaluation (OPE), which aims to estimate the online/true performance111In this paper, we will use online performance and true performance interchangeably. of a policy using only pre-collected historical data generated by other policies, is critical to many real-world applications, in which evaluating a poorly performed policy online might be prohibitively expensive (e.g., in trading, advertising, traffic control) or even dangerous (e.g., in robotics, autonomous vehicles, drug trials).
Existing OPE methods can be roughly categorized into three classes: distribution correction based Thomas et al. (2015); Liu et al. (2018); Hanna et al. (2019); Xie et al. (2019); Nachum et al. (2019); Zhang et al. (2020); Nachum and Dai (2020); Yang et al. (2020); Kostrikov and Nachum (2020), model estimation based Mannor et al. (2004); Thomas and Brunskill (2016); Hanna et al. (2017); Zhang et al. (2021), and Q-estimation based Le et al. (2019); Munos et al. (2016); Harutyunyan et al. (2016); Precup (2000); Farajtabar et al. (2018) methods. While those methods are based on different assumptions and with different formulations, most of them (1) focus on precisely estimating the expected return of a target policy using pre-collected historical data, and (2) perform unsupervised estimation without directly leveraging the online performance of previously deployed policies.
We notice that there are some mismatches between the settings of those OPE methods and the OPE problem in real-world applications. First, in many applications, we do not need to estimate the exact true performance of a target policy; instead, what we need is to compare the true performance of a set of candidate policies and identify the best one which will then be deployed into real-world systems. That is, correct ranking of policies rather than precise return estimation is the goal of off-policy evaluation. Second, in real-world applications, we usually know the true performance of polices that have been deployed into real-world systems and interacted with real-world users. Such information is not well exploited in today’s OPE methods.
Based on these observations, in this work, we define a new problem, supervised off-policy ranking (SOPR for short), which is different from previous OPE in two aspects. First, in SOPR, the true performance of a set of pre-collected policies is available in addition to offline-policy trajectory data. Second, SOPR aims to correctly rank new policies, rather than accurately estimate the expected returns of the policies.
A straightforward way to rank policies is to first use a scoring model to score each policy and then rank them based on their scores. Following this idea, we propose a supervised learning based method to train a scoring model. The scoring model takes logged states in historical data and the actions taken offline by a policy as raw features. Considering that there might be plenty of states in pre-collected data, we first adopt k-means clustering to cluster similar states, use a Transformer encoder to encode state-action pairs in each cluster, then another Transformer encoder to encode information of those clusters, and finally map the outputs of the second encoder to a score by a multi-layer perceptron (MLP). That is, our scoring model consists of two Transformer encoders and a MLP, which are jointly trained to correctly rank the pre-collected policies with known performance. Our method is named as SOPR-T, where “T” stands for Transformer based model.
Experiments on multiple Mujoco games and different types of D4RL datasets Fu et al. (2020) show that our SOPR-T outperforms strong baseline OPE methods in terms of both effectiveness (including rank correlation and the performance gap between the ground-truth top 3 policies and the identified top 3 policies) and stability.
The main contributions of this work are summarized as follows:
We point out that OPE should focus on policy ranking rather than exactly estimating the returns of policies, which simplifies the task of OPE.
According to our knowledge, we are the first to introduce supervised learning into OPE. We only take a preliminary step towards supervised OPE in this work and define the problem of supervised off-policy ranking. We hope that our work can inspire more solid work along this direction.
We propose a Transformer encoder based hierarchical model for policy scoring. Experiments demonstrate the effectiveness and stability of our method. Our code and data have been released 222https://github.com/SOPR-T/SOPR-T.
2 Related Work
Distribution correction based methods aim to correct the distribution shift of data from the behavior data distribution to the target data distribution. These methods leverage importance sampling (IS) to correct the weight of reward used to calculate the average return. Early methods mainly focus on the distribution correction of a trajectory or truncated trajectory, such as trajectory-wise IS, step-wise IS, and normalized (weighted) step-wise IS Thomas et al. (2015); Voloshin et al. (2019)
. Because of involving the multiplication of IS weights over time steps, these methods suffer from large variance. Even though using weight normalization, the estimation variance is still too largeLevine et al. (2020). Recently, more studies focus on correcting the stationary state distribution Liu et al. (2018); Xie et al. (2019); Nachum et al. (2019); Zhang et al. (2020); Nachum and Dai (2020); Yang et al. (2020). However, not matter using which kind of distribution correction, if the coverage of dataset against the target dataset is small, big estimation error is inevitable.
Instead of only leveraging data in the given dataset, model-based methods estimate the environment model including a state transition distribution and a reward function, and generate data using the the simulated environment and the target policy Mannor et al. (2004); Hanna et al. (2017); Zhang et al. (2021)
. Then, the expected return of the target policy can be estimated using the returns of Monte-Carlo (MC) rollouts in the simulated environment. Recently, model-based reinforcement learning (RL) has achieved a lot of progress in terms of data efficiency and model estimation accuracyJanner et al. (2019); Voloshin et al. (2021); Kaiser et al. (2019). However, these works focus on online RL, where the estimation bias can be reduced by collecting more data from the environment. In contrast, the dataset is static in OPE. Thus, for the state-action pairs that are within the distribution of the target data but out of the distribution of the given dataset, traditional model-based methods may generate large estimation error in terms of both reward and state transition. Moreover, for the model-based methods that resort to MC rollouts to estimate the return, one-step erroneous transition estimation may cause a great estimation error.
Q-estimation based OPE methods estimate a Q-function by leveraging Bellman expectation backup operator Sutton and Barto (2018) iteratively based on the off-policy data. A representative method is Fitted Q evaluation (FQE) Le et al. (2019)
, which learns a neural network to approximate the Q function of the target policy. However, FQE suffers large uncertainty about the value of the unseen state-action pairs. Moreover, evaluating each target policy needs a training procedure, which is time-consuming. Another important branch of OPE study is doubly robust (DR) OPEThomas and Brunskill (2016); Dudík et al. (2011); Jiang and Li (2016), which combines reward function estimation or Q-function estimation with IS. Theoretical analysis proves that if either the behavior policy estimate or the reward or Q estimate is accurate, the DR estimate will be close to the true value. However, IS-based, model-based, and Q-estimation-based methods all have drawbacks that cannot be offset by each other.
Off-Policy Classification (OPC) based OPE method Irpan et al. (2019) aims at the same goal with our work, i.e. to rank several candidate policies. OPC does not estimate the expected return of a policy but instead uses a statistic to measure the success rate of the actions generated by the policy. However, OPC needs to leverage a Q-function and thus its performance is limited by the performance of Q-function learning. In addition, a recent work Paine et al. (2020) demonstrated that using either some offline RL algorithms or FQE to learn a Q-function in OPC performs worse than FQE in most cases.
Offline RL also involves OPE problem. Good policy evaluation can tell how good the current learned policy is and the direction to improve the policy, which is important for offline RL. Actually, the critic in the actor-critic learning architecture plays the role of policy evaluation. Some works Fujimoto et al. (2019a, b); Kumar et al. (2019); Boehmke et al. (2020) focus on mitigating the extrapolation error caused by the distribution mismatch between the bootstrapping actions (used in the critic learning) and the actions in the dataset. A common method is to constrain the distribution of the data that can be visited by the learned policy from being mismatched with the dataset. However, the performance of the resulting policy is also limited by this kind of constraint. Particularly, when the distribution of the dataset is distant from that corresponding to the optimal policy, the aforementioned constraint will severely hinder policy optimization.
3 The Problem of Supervised Off-policy Ranking
In this section, we first give some notations and then formally describe the problem of supervised off-policy ranking.
We consider OPE in a Markov Decision Process (MDP) defined by a tuple. and denote state and action space. and are state transition distribution and reward function. is a discount factor. The expected return of a policy is defined as , where , , and is the time horizon of the MDP.
The goal of OPE is to evaluate a policy without running it in the environment. Traditional OPE methods estimate the expected return of a policy by leveraging a pre-collected dataset , where , composed of trajectories, which are generated by some other policies (usually called behavioral policy).
In many real-world applications where we need to evaluate a policy offline before deploying it into online systems, in addition to the pre-collected dataset, we can also collect a set of previously deployed policies and their true performance after being deployed into online systems. Clearly, those information is helpful for off-policy evaluation, but unfortunately was ignored in previous OPE methods. In this work, we define a new kind of OPE problem, the supervised OPE problem, as below.
Supervised Off-policy Evaluation: given a pre-collected dataset with trajectories and a set of pre-collected policies with known performance, estimate the performance of a target policy or a set of target policies without interacting with the environment.
Comparing with previously studied OPE problems, our new problem has more information available, the set of pre-collected policies
with known performance, which can serve as supervision while we build an OPE model/method. This is why we call this problem “supervised" OPE, as we have label information (i.e., the true performance) for some policies available in training. In contrast, previous OPE methods can be viewed as unsupervised learning, since they do not directly learn from the online performance of pre-collected policies. In real-world applications such as advertising and recommendation systems, we can get the true performance of previously deployed polices by observing and counting user click behaviors.
In real-world applications we often need to compare a set of candidate policies and choose a good one from them, where what we really need is the correct ordering of a set of policies, rather than the exact value/performance of each policy. Thus, we define a variant of the supervised OPE problem, which is called supervised off-policy ranking.
Supervised Off-policy Ranking: given a pre-collected dataset with trajectories and a set of pre-collected policies together with their relative goodness, rank a set of target policies without interacting with the environment.
It is not difficult to see that policy ranking is relatively easier than policy evaluation, since accurate evaluations of policies inherently implies correct ranking of them, but correct ranking does not need accurate evaluations. In the following parts of this paper, we focus on supervised off-policy ranking (SOPR), considering its practical value and simplicity.
4 Our Method
We introduce our method for SOPR in this section, which trains a scoring model to score and rank target policies. We start with introducing policy representation (Section 4.1
), and then loss function and training pipeline (Section4.2).
4.1 Policy Representation
Policies can be of very different formats: a policy can be a set of expert designed rules, a linear function of states, decision trees over states, or deep neural networks, e.g., convolutional neural networks, recurrent neural networks, and attention based networks. To map a policy to a score, the first question to answer is how to represent policy with very diverse formats.
To handle all kinds of policies, we should leverage their shared points rather than differences. Obviously, we cannot use the internal parameters of policies, because different policies may have different numbers of parameters and some policy may even do not have parameters. Fortunately, all the policies for the same task or game share the same input space, the state space , and the same output space . Therefore, we propose to build the representation of a policy upon state action pairs.
Let denote the set of all the states in the pre-collected data . For a policy to rank, we let it take actions for all the states in and obtain a dataset , where . Now, any policy can be represented by a dataset in the same format. 333For simplicity of description, we consider deterministic policies here. For stochastic policies, we can use the distribution over actions for a state and get a dataset .
Now the question becomes how to map a set of points/pairs444Each state action pair can be viewed as a point in a high-dimensional space. to a score that indicates the goodness of the policy . Following Kool et al. (2018); Nazari et al. (2018); Bello et al. (2016); Xin et al. (2020); Vinyals et al. (2015), we use a Transformer encoder Vaswani et al. (2017)
to encode all those points. In particular, we first project each point into an embedding vector, then use a few self-attention layers to get high level representations of those points, aggregate them by average pooling to get the representation vector of the dataset (and thus the policy), and finally linearly project the vector to a score.
A computational challenge is that there could be millions of states in pre-collected historical data. It is impossible for Transformer to handle such a large scale of data. Our solution is to down sample the states and encode the state action pairs in a hierarchical way as follows:
Randomly sample a sub set of states from and cluster them into clusters by k-means.
Let a policy take actions over states in and obtain a set of state action pairs.
Use a low-level Transformer encoder to encode all the state action pairs in a cluster and get a vector representation by average pooling for each cluster.
Use a high-level Transformer encoder to encode all the clusters and get a vector representation by average pooling for the policy.
Linearly project the vector in the above step to a score.
Figure 1 illustrates our scoring function. Since our scoring function is based on Transformer, we name our method SOPR-T. Let denote our scoring function with parameter , which maps a policy together with a state set to a score.
4.2 Pairwise Loss and Learning Algorithm
As aforementioned, the goal of SOPR is to rank a set of policies correctly. We adopt a pairwise ranking loss following Burges et al. (2005):
where is the parameter of our scoring function to learn, and are two policies with order : means is worse than , means is better than , and means that the two policies perform similarly.
It is not difficult to get that if to minimize the loss, the scoring function needs to correctly rank two policies, i.e., consistent with the ranking of their true performance.
The complete training algorithm is shown in Algorithm 1. Note that, in training, we sample multiple sub sets of states in the manner similar to mini-batch training. In inference, we also sample multiple sub sets, compute a score using the well trained scoring function over each sub set for a test policy, and use the averaged score over those sub sets as the final score of the policy.
We make some discussions about our algorithm in this subsection.
One may notice that we use only the states in the pre-collected historical data , while most previous OPE methods use both the states and immediate rewards in . There are several considerations for only using states. First, a new policy to be evaluated usually takes actions different from behavioral policies that generated the historical data, and is not available for a state with a new action . Thus, we do not directly use immediate rewards in this work. Second, as we study the problem of SOPR, we have a set of training policies with known performance available. The performance information of those policies is more reliable and direct signal for supervised learning than immediate rewards, since it comes from the online interactions with real-world systems and directly indicates the goodness of a policy. Third, consider the following formulation of the expected return of a policy,
where denotes the stationary distribution under . Note that, only depends on policy , while is the same across all policies for the same task/game. Therefore, to rank different policies for a task, the more important part is , rather than the immediate rewards.
While we focus on off-policy ranking in this work, our proposed scoring model can be easily applied to OPE. For this purpose, we need the true performance of training polices, and only performance ranking is not enough. Given the true performance of training policies, we can train the scoring model by minimizing the gap between the true performance and predicted performance of a policy.
We compare SOPR-T with different kinds of baseline OPE methods on various tasks, including evaluating policies learned by different algorithms, in different games, and with different kinds of datasets.
We evaluate SOPR-T and baseline OPE methods with two metrics, i.e., Spearman’s rank correlation coefficient and normalized Regret@k to reflect their performance of ranking candidate policies, which is aligned with a related work Paine et al. (2020). Specifically, Spearman’s rank correlation is the Pearson correlation between the ground truth rank sequence and the evaluated rank sequence of the candidate policies. Normalized Regret@k is the normalized performance gap between the truly best policy of all candidate policies and the truly best policy of the ranked top k policies, i.e., , where and are the ground truth performance of the truly best and the truly worst policy of all candidate policies, respectively. is the ground truth performance of the truly best policy of the ranked top k policies. We use in our experiments.
We compare SOPR-T with four representative baseline OPE methods, i.e., Fitted Q-Evaluation (FQE) Le et al. (2019), DualDICE Nachum et al. (2019), model-based estimation (MB), and weighted importance sampling (IW). We leveraged a popular implementation 555https://github.com/google-research/google-research/tree/master/policy_eval of these baseline OPE algorithms.
Tasks and Datasets
We evaluate SOPR-T and baseline OPE methods on D4RL dataset 666https://github.com/rail-berkeley/d4rl Fu et al. (2020) which is commonly used in offline RL studies. Specifically, we use 12 tasks of three Mujoco games, i.e., Hopper-v2, HalfCheetah-v2, and Walker2d-v2, and four types of datasets for each game, i.e., expert, medium, medium-replay (m-replay for short), and full-replay (f-replay for short). The expert and medium datasets are collected by a single expert and medium policy, respectively. The two policies are trained with Soft Actor-Critic (SAC) algorithm Haarnoja et al. (2018) online, and the medium policy only achieves 1/3 performance of the expert policy. The full-replay and medium-replay datasets contain all the data collected during the training of the expert and medium policy previously mentioned, respectively.
Training Set and Validation Set
Training set and validation set consist of policies and their rank labels. The policies are collected during online SAC training. For each game, we collect 50 policy snapshots during the training process and get their performance by online evaluation using 100 Monte-Carlo rollouts in the real environment. After the training process is finished, we randomly select 30 policies to form training policy set and another 10 policies to form validation policy set. The remaining 10 policies are used to form a test policy set. We will provide a detailed description of the test policy set later. Note that, in the training phase of SOPR-T, only the rank of the policies are used as labels.
In each task, we use two kinds of test policy sets to simulate two kinds of OPE cases.
In Test Set I, we investigate the capability of SOPR-T and baseline OPE methods to rank and select good policies in offline RL settings. Specifically, Test Set I is composed of policies collected by running 3 popular offline RL algorithms, BEAR Kumar et al. (2019), CQL Kumar et al. (2020), and CRR Wang et al. (2020). The implementation of the three algorithms is based on a public codebase 777https://github.com/takuseno/d3rlpy. For each task, we run each algorithm until convergence and collect policy snapshots during training. To get the ground truth rank labels of the performance of these policies, we also evaluate the performance of each policy using 100 Monte-Carlo rollouts in the real environment, and use the rank of the performance as rank labels. Then, we mix these policies and select 10 policies whose performance are evenly spaced in the performance range of all policies. In this way, the selected polices have diverse performance. In addition, mixing policies generated by different algorithms to form a test policy set is aligned with the practical cases where the sources of policies are various and unknown.
In Test Set II, we investigate the capability of SOPR-T to rank policies that are approximately within the distribution of training policies. Because in practice, such as production development and update, the updated policies and the previously evaluated policies usually have many common properties. Therefore, it can be assumed that the policies to be evaluated and the policies that have been evaluated are in two similar distributions. To simulate this case in experiments, we leverage the 10 policies mentioned in the description of training policy set to form Test Set II. Because the 10 policies and the training policies are uniformly sampled from the same policy set, they are approximately within the same distribution.
5.1 Performance on Offline Learned Policies
We first evaluate SOPR-T and baseline OPE methods on Test Set I. In the evaluation process, we use 3 random seeds for each experiment. Due to space limitation, we only present the results on the Hopper game here (Fig.2.(a) and (b)), and other results can be found in Appendix A.3. As we can see from Fig.2.(a) and (b), SOPR-T achieves higher rank correlation coefficient and smaller regret value than baselines, which means SOPR-T can rank different policies with higher accuracy and also figure out good policies from the candidate policies. In addition, SOPR-T performs the most stably. That is, SOPR-T does not have negative correlation results in all the tasks, whereas each of the baseline OPE method has one or more negative rank correlation results. Though in Walker2d and Halfcheetah (shown in Appendix A.3), SOPR-T does not hold consistent superiority, it still performs the most stably. Fig.2.(c) and (d) show the overall performance of all five methods (SOPR-T and four OPE baselines) in 12 tasks. As we can see from the results, SOPR-T has the top performance of rank correlation in 5 tasks and the top performance of regret value in 6 tasks, both of which are the highest among all methods.
5.2 Performance on In-Distribution Policies
Then, we evaluate SOPR-T and OPE baselines on Test Set II. As aforementioned, policies in Test Set II obey a similar distribution with training policies. In addition, because these policies are collected during online training, they are irrelevant to the datasets, which is aligned with many practical cases where the policies to be evaluated are not designed or learned based on the dataset. We also run SOPR-T and each baseline OPE algorithm with 3 different seeds. Results are shown in Fig.2 (the second row) and Appendix A.4. The results demonstrate that SOPR-T outperforms baseline OPE methods dramatically in almost all the tasks. The overall results over 12 tasks are shown in Fig.2. (g) and (f). SOPR-T achieves the top performance of rank correlation in 10 out of 12 tasks and the top performance of regret value in all the 12 tasks.
Compared with the results of ranking offline learned policies, the performance of SOPR-T on ranking in-distribution policies improves. To justify that the performance of SOPR-T is affected by the distribution difference between the training policy set and test policy set, we measure the distribution distance between the two policy sets by:
where is the number of states in the dataset, , and are the number of policies in the training set and test set, respectively. Distance results are shown in Fig.3 and Appendix A.4. As can be seen, the distance between Test Set I and the training policy set is much larger than the distance between Test Set II and the training policy set.
5.3 Further Studies
We further conduct a set of experiments to better understand our algorithm.
Effect of Data Size
We investigate the sensibility of SOPR-T and baseline OPE methods to the size of dataset. For each task, we sample different number of data from the original dataset in trajectory form. For the expert and medium datasets (composed of data collected by a single policy), trajectories are sampled randomly. For the medium-replay dataset (composed of all data during training SAC in sequence), we fetch data sequentially. Note that, in this experiment, we use three datasets, i.e., expert, medium, and medium-replay, as we suppose the first part of data in the full-replay dataset is similar to the medium-replay dataset. We set the number of data (states for SOPR-T, tuples for baseline OPE methods) as 4k, 8k, 16k, and 32k. Note that, in the original datasets, the size of data is not identical among different tasks, but all of them contain more than 200k data. Due to space limitation, we only show the overall performance of all the methods in 9 tasks corresponding to 16k data in Fig.4. We can observe that, when using much less data, SOPR-T still outperforms baseline OPE methods. In Appendix A.5, Fig.12 shows average performance of each method with different data size. The results indicate that SOPR-T is robust with different data size. Even with very small data size like 4k, SOPR-T still performs well. Additionally, SOPR-T outperforms baseline OPE methods consistently.
Effect of Transformer Encoder
We investigate the effect of using Transformer encoder to encode state action pairs. To this end, we replace Transformer encoder with MLP encoder. We name the MLP-based SOPR as SOPR-MLP. The number of hidden layers and units of hidden layers in MLP encoder is aligned with Transformer encoder used in SOPR-T. Details of training and inference methods are also the same as SOPR-T.
We compare the performance of SOPR-MLP and SOPR-T also with two test policy sets. Results of ranking Test Set I in 4 tasks (Hopper) are shown in Fig.5. Other results are shown in Appendix A.5. As can be seen from the results, SOPR-T outperforms SOPR-MLP in most of the tasks. Specifically, when ranking Test Set I, SOPR-T achieves higher rank correlation in 8 out of 12 tasks and lower or the same (zero) regret value in 9 out of 12 tasks. For Test Set II, SOPR-T achieves higher rank correlation in 9 out of 12 tasks, and lower or the same (zero) regret value in all the tasks. It demonstrates that the Transformer encoder is superior to MLP in terms of encoding state action pairs and representing policies in our SOPR method.
In this part, we investigate the performance variance of SOPR-T in inference. Especially, we want to see the variance of SOPR-T with small data in inference. In many real-world applications, the inference speed of policy ranking matters a lot. Although our SOPR-T is efficient due to its end-to-end scoring model, we would also like to investigate whether SOPR-T can use less data in inference than that used in the training phase in order to reduce inference time. To this end, in inference, we sample different number (1, 5, 10, 50, 100, and 200) of sub datasets for SOPR-T to compute an average score for each policy, and then leverage the scores of different policies to rank them. The size of each sub dataset is the same as the batch size of training, which is provided in Appendix A.1. Given the number of sub datasets, the variance of rank correlation/regret value is computed by using five results on five random sets of sub datasets.
The relationship between the variance of rank correlation/regret value and the number of sampled sub datasets is shown in Fig.6 (Hopper) and Appendix A.5. As can be seen from the results, when using different number of sub datasets in inference, even using only one sub dataset, the standard deviations of both rank correlation and regret of SOPR-T are small. As the number of sub datasets increases, the standard deviation decreases gradually, and thus SOPR-T achieves more stable performance.
6 Conclusions and Future Work
In this work, we defined a new problem, supervised off-policy ranking (SOPR), which aims to score target policies for the purpose of correct ranking instead of precise return estimation, by leveraging a set of training policies with known performance (ranking) in addition to the off-policy data. We proposed a method to learn a Transformer based scoring function, which first represents a policy to be evaluated by a set of state action pairs and then maps such a set to a score indicating the goodness of the policy. The scoring function is trained by minimizing pairwise ranking loss. Experiments demonstrate the effectiveness and stability of SOPR-T over strong baseline OPE methods in terms of both rank correlation and regret value.
As a newly defined problem, our work is just a first and very preliminary step towards solving it. There are many possible future directions to explore. First, beyond the state-action pair-based representations in this work, how to better represent a policy? For example, considering the existence of many OPE methods, the values of a policy estimated by those methods can be used to represent the policy, and one can conduct supervised learning to combine the estimations of those methods. Furthermore, the policy values estimated by those OPE methods can serve as features and enhance the representations of a policy adopted in this work. Second, we employed a Transformer based scoring function in this work. It is interesting to explore other possibilities for the scoring function. Third, we used the pairwise loss in our algorithm. There are other ranking losses (e.g., listwise loss Cao et al. (2007)) which are worth trying. Fourth, since we conducted supervised learning, a limitation of our algorithm is that its effectiveness will be heavily impacted by the quality of training policies. We will study and quantify the impact of the quality of training policies. Fifth, as a supervised learning problem, there are many theoretical questions to answer Vapnik (1999, 2013). For example, is a learning algorithm consistent for supervised OPE or supervised OPR? What is the rate of convergence of a learning algorithm? What is the generalization bound of a learning algorithm? We hope that our work can inspire more research along the direction of OPE/OPR.
Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940. Cited by: §4.1.
Critic Regularized Regression.
Hands-On Machine Learning with R, pp. 121–140. Note: CRR External Links: Cited by: §2.
- Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pp. 89–96. Cited by: §4.2.
- Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: §6.
- Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601. Cited by: §2.
- More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pp. 1447–1456. Cited by: §1.
- D4rl: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219. Cited by: §1, §5.
- Benchmarking Batch Deep Reinforcement Learning Algorithms. pp. 1–13. External Links: Cited by: §2.
- Off-policy deep reinforcement learning without exploration _Full. 36th International Conference on Machine Learning, ICML 2019 2019-June, pp. 3599–3609. External Links: Cited by: §2.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. Cited by: §5.
- Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pp. 2605–2613. Cited by: §1.
Bootstrapping with models: confidence intervals for off-policy evaluation. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §1, §2.
- Q () with off-policy corrections. In International Conference on Algorithmic Learning Theory, pp. 305–320. Cited by: §1.
- Off-policy evaluation via off-policy classification. arXiv preprint arXiv:1906.01624. Cited by: §2.
- When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §2.
- Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. Cited by: §2.
- Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §2.
- Attention, learn to solve routing problems!. arXiv preprint arXiv:1803.08475. Cited by: §4.1.
- Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609. Cited by: §1.
- Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems 32 (NeurIPS). External Links: Cited by: §2, §5.
- Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779. Cited by: §5.
- Batch policy learning under constraints. 36th International Conference on Machine Learning, ICML 2019 2019-June (i), pp. 6589–6600. Note: FQE External Links: Cited by: §1, §2, §5.
- Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §2.
- Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366. Cited by: §1, §2.
- Bias and variance in value function estimation. In Proceedings of the twenty-first international conference on Machine learning, pp. 72. Cited by: §1, §2.
- Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062. Cited by: §1.
- Dualdice: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733. Cited by: §1, §2, §5.
- Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866. Cited by: §1, §2.
- Reinforcement learning for solving the vehicle routing problem. arXiv preprint arXiv:1802.04240. Cited by: §4.1.
- Hyperparameter Selection for Offline Reinforcement Learning. External Links: Cited by: §2, §5.
- Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80. Cited by: §1.
- Reinforcement learning: an introduction. MIT press. Cited by: §2.
- Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. Cited by: §1, §2.
- High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
An overview of statistical learning theory. IEEE transactions on neural networks 10 (5), pp. 988–999. Cited by: §6.
- The nature of statistical learning theory. Springer science & business media. Cited by: §6.
- Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §4.1.
- Pointer networks. arXiv preprint arXiv:1506.03134. Cited by: §4.1.
- Minimax model learning. In International Conference on Artificial Intelligence and Statistics, pp. 1612–1620. Cited by: §2.
- Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning. External Links: Cited by: §2.
- Critic regularized regression. arXiv preprint arXiv:2006.15134. Cited by: §5.
- Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9668–9678. Cited by: §1, §2.
Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. arXiv preprint arXiv:2012.10638. Cited by: §4.1.
- Off-policy evaluation via the regularized lagrangian. arXiv preprint arXiv:2007.03438. Cited by: §1, §2.
- Autoregressive dynamics models for offline policy evaluation and optimization. In International Conference on Learning Representations, External Links: Cited by: §1, §2.
- Gendice: generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072. Cited by: §1, §2.
Appendix A Appendix
a.1 Model and Training Configurations.
Table 1 lists the configurations of our model and training process.
|Input linear projection layer||((dim_s+dim_a), 64)|
|Low-level encoder||n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1|
|High-level encoder||n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1|
|Output linear projection layer||(256, 1)|
|Number of clusters|
a.2 Computational Resource and Time Cost
Our experiments are run with a Nvidia Tesla P100 GPU. Table 2 and 4 show training and inference time cost of SOPR-T, respectively. Table 3 and 5 show time cost of all algorithms. Note that, the time cost of SOPR-T in inference depends on the number of sub datasets used to calculate an average score for a policy. Unless otherwise specified, in inference, we use all () sub datasets that are used in training. We observe that, as shown in the last part of Section 5.3 and the last part of Appendix (Fig.16), SOPR-T with 5 sub datasets achieves nearly the same performance as using 200 sub datasets. Namely, to score and rank 10 policies in each task shown in Table 4, SOPR-T takes less than 25.24 seconds to get a good ranking, which is comparable to IW and much faster than FQE and DualDICE.
|Task Name||Test Set I||Test Set II|
a.3 Additional Results of Ranking Offline Learned Policies (Corresponding to Section 5.1)
In Section 5.1, we presented rank correlation and regret value of using each method to rank offline learned policies (Test Set I) in the Hopper game. Here, Fig. 7 shows all the results in three games.
As can be seen from the results, SOPR-T outperforms baseline OPE methods consistently in four tasks of the Hopper game. Though in Walker2d and Halfcheetah, SOPR-T does not hold consistent superiority, it performs the most stably. That is, SOPR-T does not have negative correlation results in all the settings, whereas all the baseline OPE methods have one or more negative correlation results.
a.4 Additional Results of Ranking In-Distribution Policies (Corresponding to Section 5.2)
In Section 5.2, we presented the results of ranking in-distribution policies (i.e., Test Set II, that are sampled from the same policy set as sampling training policies) in the Hopper game. Here, Fig.8 shows the results in all the games. As can be seen from the results, SOPR-T achieves very high rank correlation and zero regret values in all the games and training data settings. In addition, the variance caused by random seeds is very small.
The performance improvement of SOPR-T in Test Set II compared with Test Set I can be explained by the degree of policy distribution shift. As we mentioned in Section 5.2, to justify that the policy distribution shift of Test Set I is greater than Test Set II, we measure the distance between the set of training policies and that of test policies using the metric:
where is the number of states in the dataset, .
Fig.9 shows the distance results of Test Set I and Test Set II. We can see that, in each data setting, the distance between Test Set I and the training policy set is larger than the distance between Test Set II and the training policy set.
a.5 Additional Results of Section 5.3
Effect of Data Size
In Section 5.3, we investigated the overall performance of each method over 3 games when the size of off-policy dataset is small, i.e., 16k. Here, Fig.10 and Fig.11 show performance of each method in each individual game. Fig.12 shows average performance of each method with different data size.
Note that, for our SOPR-T method, results in this part (Fig. 10 - 12) correspond to the case where both training and testing are conducted on the same dataset, while the results in Fig. 16 correspond to using different amount of data (sub datasets) only in inference, and using the original dataset in training. As we aim to investigate how data size affects the performance of ranking policies, the two test policy sets are fixed while we change the data size.
As shown in Fig.10 and Fig.11, SOPR-T outperforms baseline OPE methods in most of the data settings. In addition, the performance of SOPR-T only degrades slightly with respect to the shrinkage of the data size.
Further, as can be seen from Fig.12, SOPR-T is robust with different data size. Even with very small data size like 4k, SOPR-T still performs well. Particularly, SOPR-T outperforms baseline OPE methods consistently.
Transformer Encoder vs. MLP Encoder
In Section 5.3, we also investigated the performance difference between using Transformer encoder and MLP encoder in our SOPR framework. The corresponding two SOPR models are named SOPR-T and SOPR-MLP, respectively. Here, the performance of SOPR-T and SOPR-MLP on ranking policies in each individual task is shown in Fig.13 (Test Set I) and Fig.15 (Test Set II). The overall performance of SOPR-T and SOPR-MLP over all the tasks is shown in Fig.14 (Test Set I) and Fig.15.(d) (Test Set II). As can be seen from the results, SOPR-T outperforms SOPR-MLP in most tasks in terms of both rank correlation and regret value.
Because the regret values of both SOPR-T and SOPR-MLP in Test Set II over all the tasks are zero, we do not show the regret value and only show rank correction results in Fig.15. The results indicate that both SOPR-T and SOPR-MLP performs well in Test Set II (in-distribution policies). Their overall performance rank as shown in Fig.15.(d) demonstrates that SOPR-T is still better than SOPR-MLP.
Effect of Using Different Number of Sub Datasets in Inference
In the last part of Section 5.3, we investigated the effect of using different number of sub datasets in inference. Because in inference, SOPR-T samples several (denoted as ) sub datasets used in the training phase to calculate an average score of a policy over the sub datasets, the average score may be different with respect to the sub datasets, and thus the ranking result of a policy set may also be different. Therefore, here we study the average and variance of the performance of SOPR-T, and their relationship with the value of . To this end, we set 1, 5, 10, 50, 100, and 200. For each value, we sample sub datasets for 5 times and use SOPR-T to rank policies over each sampling. In each time, the sub datasets are randomly sampled from those used in the training phase with different random seeds. Then, we get 5 rank correlation/regret value results and calculate the average and standard deviation of the results.
Fig.16 shows the average results with a std bar. As we can see from the results, in all the tasks, the number of sub datasets makes little difference on the average performance of SOPR-T. In addition, as the number of sub datasets increases, the performance variance of SOPR-T decreases fast. In almost all the tasks, the performance variance is quite small. Therefore, we can draw a conclusion that the performance of SOPR-T is stable even though it only uses little amount of sub datasets (e.g., ) in inference. Further, the inference time cost can be reduced by using a small number of sub datasets in practice.