1 Introduction
Deep Reinforcement Learning (deep RL) is dominated by the online paradigm, where agents must interact with the environment repeatedly to explore and learn. This paradigm has attained considerable success on Atari (mnih2015humanlevel), Go (silver2017mastering), StarCraft II and Dota 2 (vinyals2019grandmaster; berner2019dota), and robotics (andrychowicz2020learning). However, the requirements of extensive interaction and exploration make these algorithms unsuitable and unsafe for many real-world applications. In contrast, in the offline setting (fu2020d4rl; fujimoto2018addressing; gulcehre2020rl; levine2020offline), also known as batch RL (ernst2005tree; lange2012batch), agents learn from a fixed dataset previously logged by other (possibly unknown) agents. While the offline setting would enable the application of RL to real-world applications, current algorithms tend to perform worse than their online counterparts.
A key difference between online and offline RL is the impact of overestimating the value of unobserved actions (see Figure 2). In the online case, this overestimation incentivizes agents to explore actions of high expected reward, thus learning by trial-and-error (schmidhuber1991possibility). After trying a new action, the agent is able to update its values. In contrast, in the offline case, the agent learns from a fixed dataset, and hence does not get the opportunity to interact with the environment to gather new evidence to correct its values. As a result, the agent can become increasingly deluded.
To minimize the harm caused by overestimation, we propose a new offline RL method that we refer to as Regularized Behavior Value Estimation (R-BVE). For clarity of presentation, we introduce the ideas of behaviour value estimation and ranking regularization separately. We also show in the experiments that our proposed ranking regularization may be applied to improve deep Q-learning methods.
Behavior value estimation. Intuitively, instead of aiming to estimate the optimal policy $\pi^*$, we focus on recovering the value of the behavioural policy $\pi_B$. This amounts to removing max operators during training to prevent overestimation. Subsequently, to improve upon the behavioral policy, we conduct a single step of policy improvement at deployment time. We found this simple method to work surprisingly well on the offline RL tasks we tried.
Ranking regularization. Behavior value estimation is, however, not enough to overcome the overestimation problem. For this reason, we introduce a margin regularizer that encourages actions from rewarding episodes to be ranked higher than any other actions. Intuitively, this regularizer pushes down the value of all unobserved state-action pairs, thereby minimizing the chance of a greedy policy selecting actions underrepresented in the dataset. Employing this regularizer during training minimizes the overestimation impact of the max operator when used in combination with Q-learning. However, it also minimizes the impact of the max operator at deployment time, thus making the regularizer useful for behaviour value estimation. The ranking regularizer can further push down the value of actions that do not appear in rewarding sequences, allowing the learned value function to improve upon the one corresponding to the behaviour policy.
We evaluate our approach on the open-source RL Unplugged Atari dataset (gulcehre2020rl), where we show that R-BVE outperforms other offline RL methods. We show that R-BVE also performs better on two more datasets: bsuite (osband2019behaviour) and partially observable DeepMind Lab environments (beattie2016deepmind)^1. We provide careful ablations and analyses that give insight into our proposed method and existing offline RL algorithms. Empirically, we find that R-BVE reduces the overestimation from extrapolation by orders of magnitude and significantly improves sample efficiency (see Figures 6 and 7).
^1 We released these datasets in the RL Unplugged GitHub repository.
2 Background and Problem Statement
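We consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is the set of all possible states and $\mathcal{A}$ the set of all possible actions. For simplicity, we focus on discrete actions, though our methods can be extended to continuous actions. An agent starts in state $s_0 \sim \rho_0(\cdot)$, where $\rho_0$ is a distribution over $\mathcal{S}$, and takes actions according to its policy $\pi$, $a \sim \pi(\cdot \mid s)$, when in state $s$. Then it observes a new state $s'$ and reward $r$ according to the transition distribution $P(s' \mid s, a)$ and reward function $r(s, a)$. The state-action value function $Q^{\pi}$ describes the expected discounted return starting from state $s$ and action $a$ and following $\pi$ afterwards:
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$
$$s_0 = s,\quad a_0 = a,\quad s_t \sim P(\cdot \mid s_{t-1}, a_{t-1}),\quad a_t \sim \pi(\cdot \mid s_t), \tag{2}$$
and $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a)\right]$ is the state value function. In this work, we are particularly interested in the scenario where neural networks are used as function approximators to estimate these functions due to their ability to scale and work on complex tasks.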
The optimal policy $\pi^*$, which we aim to discover through RL, is one that maximizes the expected cumulative discounted rewards, or expected returns, such that $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ for all $s$, $a$ and $\pi$. For notational simplicity, we denote the policy used to generate an offline dataset as $\pi_B$. For a state $s$ in an offline dataset, we use $G^B(s)$ to denote the empirical estimate of $V^{\pi_B}(s)$ computed by summing future discounted rewards over the trajectory that $s$ is part of.
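The empirical return estimate described above can be illustrated with a minimal sketch that sums future discounted rewards at every step of one logged trajectory. The function name and the list-based trajectory format are illustrative assumptions, not part of the original description.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_k gamma**k * r_{t+k} for every step t of one trajectory.

    `rewards` is a list of per-step rewards from a single logged episode
    (a hypothetical data layout chosen for this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Computing the returns in a single backward pass keeps the cost linear in the episode length.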
RL algorithms can be broadly categorized as either on-policy or off-policy. Whereas on-policy algorithms update their current policy based on data generated by that same policy, off-policy approaches can take advantage of data generated by other policies. Algorithms in the mold of fitted Q-iteration make up many of the most popular approaches to deep off-policy RL (mnih2015humanlevel; lillicrap2015continuous; haarnoja2018soft). This class of algorithms learns a Q-function by minimizing the Temporal Difference (TD) error. To increase stability and sample efficiency, experience replay is typically employed. For example, DQN (mnih2015humanlevel) minimizes the following loss function with respect to $\theta$:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_{\theta}(s, a)\right)^{2}\right] \tag{3}$$
where $Q_{\theta'}$ is a slowly changing target network, and $\mathcal{D}$ is the replay dataset generated by a behaviour policy. Typically, for off-policy algorithms the behavior policy is periodically updated to remain close to the policy being optimized. A deterministic policy can be derived by defining $\pi(s) = \arg\max_{a} Q_{\theta}(s, a)$. Various extensions have been proposed, including variants with continuous actions (lillicrap2015continuous; haarnoja2018soft), distributional critics (bellemare2017a), prioritized replays (schaul2015prioritized), and n-step returns (kapturowski2018recurrent; barthmaron2018distributional; hessel2017rainbow).
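To make the structure of the loss in Equation (3) concrete, here is a minimal sketch that evaluates it on a batch of transitions. It assumes tabular stand-ins (nested lists) for the online and target networks and a hypothetical `(s, a, r, s_next, done)` transition format; a real implementation would use neural networks and automatic differentiation.

```python
def q_learning_td_loss(q, q_target, batch, gamma):
    """Mean squared TD error with a max over next-state actions.

    q, q_target: lists of per-state action-value lists (illustrative tabular
        stand-ins for the online and target networks).
    batch: list of (s, a, r, s_next, done) transitions.
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The max ranges over ALL actions in s_next, including ones that the
        # dataset never contains for that state.
        bootstrap = 0.0 if done else max(q_target[s_next])
        target = r + gamma * bootstrap
        total += (target - q[s][a]) ** 2
    return total / len(batch)


q = [[0.0, 1.0], [2.0, 0.0]]
batch = [(0, 1, 1.0, 1, False)]
print(q_learning_td_loss(q, q, batch, gamma=0.5))  # 1.0
```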
In the offline RL setting, agents learn from fixed datasets generated via other processes, thus rendering off-policy RL algorithms particularly pertinent. Many existing offline RL algorithms adopt variants of Equation (3) to learn value functions; e.g. (agarwal2019striving). Offline RL, however, is different from off-policy learning in the online setting. The offline RL datasets are usually finite and fixed, and the behavior policy does not track the policy being learned. When a policy moves towards a part of the state space not covered by the behavior policies, for example, one cannot effectively learn the value function. We will explore this in more detail in the next subsection.
2.1 Extrapolation and overestimation in offline RL
In the offline setting, when considering all possible actions for the next state in Equation (3), some of the actions will be out-of-distribution (OOD). This is because some actions are never picked in that state by the behavior policy used to construct the training set. That is, the state-action pairs are not present in the dataset. In such circumstances, we have to rely on the current Q-network to extrapolate beyond the training data, resulting in extrapolation errors when evaluating the loss^2. Importantly, this extrapolation can lead to value overestimation, as explained below.
^2 Neural networks trained by gradient descent typically struggle to fit underrepresented modes of the data. Hence this argument also holds for rarely observed state-action pairs.
Value overestimation (see Figure 2) happens when the function approximator predicts a larger value than the ground truth. In short, taking the max over actions of several Q-network predictions, as in Equation (3), leads to overconfident estimates of the true value of the state that are propagated in the learning process.
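The effect of taking a max over noisy predictions can be seen in a small simulation. In the sketch below, every action has a true value of zero and each estimate is corrupted by zero-mean Gaussian noise; the setup, function name, and parameters are illustrative assumptions rather than an experiment from this paper.

```python
import random


def mean_max_of_noisy_estimates(num_actions, num_trials, noise_std, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is exactly 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        # Each estimate is unbiased on its own: true value 0 plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / num_trials


# With one action the average stays near the true value of 0; with ten actions
# the max systematically picks up positive noise.
print(mean_max_of_noisy_estimates(1, 2000, 1.0))
print(mean_max_of_noisy_estimates(10, 2000, 1.0))
```

Although each individual estimate is unbiased, the maximum over them is biased upwards, and the bias grows with the number of actions.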
For OOD actions, we depend on extrapolated values provided by the model. We focus on neural networks as the function approximators. While neural networks are efficient learners, they will produce erroneous predictions on unobserved state-action pairs. Sometimes, these will be arbitrarily high. These errors will be propagated into the value of other states via bootstrapping. Due to the smoothness of neural networks, increasing the value of actions in the neighborhood of an OOD state-action pair might increase the overestimated value itself, creating a vicious loop. We remark that, in such a scenario, typical gradient descent optimization can diverge and escape towards infinity. See Appendix A for a formal statement and proof, though similar observations have been made before by fujimoto2019off and achiam2019towards. When the agent overestimates some state-action pairs in the online setting, they will be chosen more often. This leads to optimistic exploration, which happens in both the on-policy and off-policy settings; the distinction is that in the off-policy setting the behavior policy trails the learned one instead of being identical to it. The online agent would then act, collect data, and correct extrapolation errors. This form of self-correction is absent in the offline setting, and due to the overestimation from extrapolation, the results can be catastrophic. Additionally, the impact of OOD actions becomes larger in the low data regime, where the chance of encountering OOD actions is greater and neural networks are more prone to extrapolation errors.
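The vicious loop described above can be reproduced in a well-known toy setting that is deliberately simpler than the construction in Appendix A: a linear value function with a single weight, trained by semi-gradient TD on one transition whose successor state is never itself corrected by data. All names and constants in this sketch are illustrative assumptions.

```python
def semi_gradient_td_weights(w0, gamma, lr, steps):
    """Repeatedly fit the single transition s -> s' with reward 0.

    Values are linear in one weight: V(s) = 1 * w and V(s') = 2 * w, so raising
    V(s) also raises the bootstrap target gamma * V(s'). Only this transition
    is ever sampled, so V(s') is never anchored by data of its own.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        w = w + lr * td_error * 1.0  # dV(s)/dw = 1
        history.append(w)
    return history


diverging = semi_gradient_td_weights(w0=1.0, gamma=0.9, lr=0.1, steps=50)
shrinking = semi_gradient_td_weights(w0=1.0, gamma=0.4, lr=0.1, steps=50)
print(diverging[-1] > diverging[0], shrinking[-1] < shrinking[0])  # True True
```

Whenever `2 * gamma > 1`, each update multiplies the weight by a factor larger than one, so the estimate grows without bound even though the true value is zero.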
We make a final note on the hardness of the setup we are interested in. To solve our problems, we require powerful function approximators that make analytical studies difficult without strong assumptions that might not hold in practice. Additionally, the tasks are complex large-scale tasks, which can make assumptions on the coverage of the state-action space by the training set improbable or at least impossible to assess. Therefore we mainly rely on empirical evidence to motivate our approach.
3 Regularized Behavior Value Estimation
3.1 Behavior Value Estimation (BVE)
One particular starting point to deal with the vicious loop caused by overestimation could be simply to phrase the problem as a supervised one, where given the data, a neural network is trained to produce for a given state the observed action. For the discrete action setting, the one we are focusing on, this is a well defined classification problem that will mimic the behaviour policy used to collect the data.
However, offline RL aims to improve upon the behavior policy. To do so, many algorithms borrow machinery from the online scenario, for example Equation (3), which through the max operator tries to improve the current policy by greedily picking the action with the largest estimated value. The repeated application of this improvement step can cause learning to diverge in the offline regime, as discussed earlier. Therefore, to break this cycle, we propose to limit the number of improvement steps, bounding the impact of overestimation. In particular, given that policy iteration algorithms typically do not require more than a few steps to converge to an optimal policy (lagoudakis2003least; sutton2018reinforcement, Chapter 4.3), we focus on the extreme scenario where we are allowed to take a single improvement step.
We start by drawing inspiration from the SARSA update (rummery1994line; van2009theoretical). SARSA relies on the consecutive application of a policy evaluation step, followed by a policy improvement step. The policy evaluation step is described by the update below:
$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s', a') \sim \mathcal{D}}\left[\left(r + \gamma Q_{\theta'}(s', a') - Q_{\theta}(s, a)\right)^{2}\right] \tag{4}$$
where $a'$ is the action taken in the next state $s'$, $Q_{\theta'}$ is a slowly changing target network, and $\gamma$ is the discount factor.
The policy improvement step is implicit: the policy used to collect data is defined greedily in terms of $Q_{\theta}$. Therefore we can remove the policy improvement step by simply eliminating this dependency of the data-collecting policy on $Q_{\theta}$. We propose Behaviour Value Estimation as applying Equation (4) on transitions collected by the behaviour policy $\pi_B$, leading to a Q-function reflecting the value of this policy.
Finally, we take a single policy improvement step at deployment by acting greedily with respect to $Q_{\theta}$, namely $\pi(s) = \arg\max_{a} Q_{\theta}(s, a)$. A single improvement step cannot guarantee convergence to an optimal behaviour in general. However, our approach will effectively bound the impact of extrapolation errors and empirically provides significant improvement on the tasks considered; see for example Figure 8. This is particularly true in the low data regime, where neural networks are less reliable and repeated policy improvement steps are prone to amplify the overestimation of OOD actions, as observed empirically in Figure 6. To highlight the efficiency of a single improvement step, we provide, through the lemmas in Appendix B, some sufficient conditions for a single improvement step to lead to optimal behaviour.
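The two ingredients described in this subsection can be sketched as follows: a SARSA-style evaluation loss that bootstraps from the logged next action, and a greedy action choice used only at deployment. The tabular Q representation and the `(s, a, r, s_next, a_next, done)` transition layout are assumptions made for illustration.

```python
def bve_td_loss(q, q_target, batch, gamma):
    """Mean squared TD error that evaluates the behaviour policy (no max).

    batch: list of (s, a, r, s_next, a_next, done), where a_next is the action
        the behaviour policy actually took in s_next.
    """
    total = 0.0
    for s, a, r, s_next, a_next, done in batch:
        # Bootstrap only from the logged next action, never from unseen ones.
        bootstrap = 0.0 if done else q_target[s_next][a_next]
        total += (r + gamma * bootstrap - q[s][a]) ** 2
    return total / len(batch)


def greedy_action(q, s):
    """Single policy improvement step, applied only at deployment time."""
    values = q[s]
    return max(range(len(values)), key=values.__getitem__)


q = [[0.0, 1.0], [2.0, 0.5]]
batch = [(0, 1, 1.0, 1, 1, False)]
print(bve_td_loss(q, q, batch, gamma=0.5))  # 0.0625
print(greedy_action(q, 1))  # 0
```

Note that the greedy action in state 1 (action 0) differs from the logged next action (action 1): training never queries it, but deployment can still select it.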
3.2 Ranking regularization ()
Behaviour Value Estimation effectively reduces overestimation during training by avoiding policy improvement steps but it also avoids estimating the values of OOD actions and focuses on recovering the values of the observed data under the behaviour policy. In contrast to the tabular case, where all OOD actions will have a default value of , neural networks will extrapolate based on the observed data.
We choose to learn a more conservative function that avoids picking actions not seen during training, to mitigate the impact of the extrapolation. To this end, we simply regularize the output of the Q-network to rank the observed actions higher than the OOD ones for successful episodes.
This ranking strategy can be formulated with a hinge loss (chen2009ranking; burges2005learning). We adopt the squared hinge loss, which is commonly used for RankSVM (chapelle2010efficient). Given a transition $(s_t, a_t, r_t, s_{t+1})$ from the dataset, we introduce the following loss:
$$\mathcal{C}(\theta) = \sum_{a \neq a_t} \max\left(Q_{\theta}(s_t, a) - Q_{\theta}(s_t, a_t) + \nu,\ 0\right)^{2} \tag{5}$$
which states that if the value of the dataset action $a_t$ does not exceed the value of every other action $a$ by at least a margin $\nu$, then the value function needs to be adjusted.
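A minimal sketch of a squared hinge ranking penalty for a single state is given below. It assumes the loss sums over all actions other than the dataset action and uses a scalar margin; the function and argument names are hypothetical.

```python
def ranking_loss(q_values, dataset_action, margin):
    """Squared hinge penalty for one state.

    q_values: list of Q(s, a) for every action a in the state.
    dataset_action: index of the action observed in the dataset.
    margin: required gap between the dataset action and every other action.
    """
    q_data = q_values[dataset_action]
    return sum(
        max(q_a - q_data + margin, 0.0) ** 2
        for a, q_a in enumerate(q_values)
        if a != dataset_action
    )


# The dataset action (index 0) is not ranked above the others by the margin.
print(ranking_loss([1.0, 0.5, 2.0], dataset_action=0, margin=1.0))  # 4.25
# Here it already leads by at least the margin, so there is no penalty.
print(ranking_loss([3.0, 0.5, 1.0], dataset_action=0, margin=1.0))  # 0
```

The penalty vanishes once the dataset action leads every other action by the margin, so the regularizer only acts where the ranking is violated.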
Previously, hinge losses were used for regularization in RL, but with different goals from ours (su2020conqur; pohlen2018observe).
Blindly applying this ranking loss on suboptimal data can have an undesirable effect. This ranking loss mostly focuses on the frequency of an action in the dataset, promoting frequent actions to have larger value than infrequent ones. For suboptimal behaviour policies, it is likely that frequency does not correlate with high values.
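One practical answer to this problem is to filter out trajectories that are not sufficiently rewarding, reducing the frequency of state-action pairs that are not relevant for good behaviour. We adopt a soft filtering approach, which simply reweights each transition with the normalized value of the trajectory:
$$\mathcal{L}_{\text{rank}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\exp\left(\frac{G^B(s_t) - \mathbb{E}_{s \sim \mathcal{D}}\left[G^B(s)\right]}{\beta}\right) \sum_{a \neq a_t} \max\left(Q_{\theta}(s_t, a) - Q_{\theta}(s_t, a_t) + \nu,\ 0\right)^{2}\right] \tag{6}$$
where $G^B(s)$ again denotes the empirical estimate of $V^{\pi_B}(s)$ computed by summing future discounted rewards over the trajectory that $s$ is part of, $\nu$ is the margin, and $\beta$ is a temperature. The expectation $\mathbb{E}_{s \sim \mathcal{D}}[G^B(s)]$ is estimated with an average over minibatches.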
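The soft filter can be sketched as a per-transition weight computed from empirical returns. The exact functional form is an assumption here: following the CRR-style exponential filter the text refers to, the sketch exponentiates the return minus its minibatch average, scaled by a hypothetical temperature `beta`.

```python
import math


def success_weights(returns, beta):
    """Soft filter weights for one minibatch of empirical returns.

    returns: list of empirical discounted returns, one per sampled transition.
    beta: temperature (assumed hyperparameter); smaller values filter harder.
    """
    # The expectation over the dataset is approximated by the minibatch mean.
    baseline = sum(returns) / len(returns)
    return [math.exp((g - baseline) / beta) for g in returns]


weights = success_weights([0.0, 2.0], beta=1.0)
print(weights[0] < 1.0 < weights[1])  # True
```

Transitions from above-average trajectories receive weights above one and below-average ones are damped, which is the sense in which the weight acts as a success filter.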
The weight can be thought of as a success filter. The formulation of this filtering mechanism is based on the advantage-based filtering mechanism used for the policy in CRR (wang2020critic). In our preliminary experiments, we tried both binary filtering with an indicator function and advantage-based filtering, but we found that the filtering mechanism introduced above works best on the discrete control algorithms we study here. In the online regime, the filter amplifies the choice of actions in trajectories that fare better than expected, behaving similarly to a policy improvement step. In particular, our loss is closely related to Ranking Policy Gradient (DBLP:conf/iclr/LinZ20). While the machinery of that work cannot be used to prove convergence of our loss due to the offline character of our regime, it provides intuition for why the filter biases the learning in the right direction. Compared with a typical policy improvement step, this loss, however, also ensures that OOD actions (and actions in non-rewarding sequences) rank lower in the policy.
The ranking loss can be successfully combined with Q-learning, which we will refer to as R-DQN, or, better, with behaviour value estimation, R-BVE. It is due to this reweighting that R-BVE can not only further mitigate overestimation compared to BVE but also improve on the behaviour policy during training. In all our experiments, we fixed the hyperparameter of this filter to a value borrowed from CRR (wang2020critic).
4 Related work
Early examples of offline/batch RL include least-squares temporal difference methods (bradtke1996linear; lagoudakis2003least) and fitted Q iteration (ernst2005tree; riedmiller2005neural). More recent offline RL approaches fall into three broad categories: 1) policy-constraint approaches regularize the learned policy to stay close to the behavior policy either explicitly or implicitly (fujimoto2019off; kumar2019stabilizing; jaques2019way; siegel2020keep; wang2020critic; ghasemipour2020emaq); 2) value-based approaches lower value estimates for unseen state-action pairs, either through regularization or uncertainty (kumar2020conservative; agarwal2019optimistic); 3) model-based approaches similarly lower reward estimates for unseen state-action pairs (yu2020mopo; kidambi2020morel).
The above methods perform policy improvement during training, and use the mechanisms described above to mitigate extrapolation error and divergence. Our approach is fundamentally different. We perform no policy improvement during training, instead estimating the value of the behavior policy. This approach is related to policy-constraint methods but has three advantages: 1) we do not need to estimate a behavior policy, 2) we do not constrain our policy to be similar to the behavior policy, which may be a poor constraint when the behavior policy is a mixture of many policies, some of which are suboptimal, and 3) we are guaranteed to avoid extrapolation errors during training. Additionally, our method uses value-based regularization, but only to mitigate extrapolation error when we deploy our policy. Our value-based regularization is most similar to the regularization used in CQL (kumar2020conservative). There are two main differences: 1) our regularization is weighted by the success of the trajectory and 2) we use a max-margin ranking loss. We find these work better in practice.
5 Experiments
We investigate the performance of offline RL algorithms with discrete actions across three domains: bsuite, Atari, and DeepMind Lab. The main question we want to answer is: how do algorithms perform when there is low coverage of state-action pairs? In that context, we study multiple factors that affect coverage, including dataset size (see Figures 6 and 7), noise (see Figures 3 and 9), and partial observability (Figure 8).
We compare our proposed model R-BVE (Regularized Behavior Value Estimation), its ablations BVE (Behavior Value Estimation) and R-DQN (Regularized Deep Q-learning), and several offline reinforcement learning baselines. For the Atari domain we used the RL Unplugged Atari benchmark (gulcehre2020rl). We also create additional datasets using bsuite to have fast diagnostic tests, and DeepMind Lab to have a challenging partially observable domain. We are working to open-source these datasets. The details of the datasets are provided in Appendix D. In all our experiments, for our CQL baseline, we used CQL() with a fixed as prescribed in (kumar2020conservative) for Atari.
Overall, our method has two hyperparameters that we tuned: for the margin, and the regularization coefficient for the ranking regularization. Our method seems to be quite robust to those values. We tuned only on Atari, and found out that worked best on nine Atari online policy selection games. We used this value for all other tasks and games. We fixed for both Atari and bsuite without any hyperparameter tuning. In particular, these hyperparameters fared well for the offline policy selection games. This provides evidence that, for a real-world application, one might be able to tune the regularization hyperparameters in a simulated version of the environment and deploy them in the setting of interest. However, on DeepMind Lab, based on a coarse grid for the margin hyperparameter , we found some variation for the optimal value. On our bsuite and DeepMind Lab plots, we report the median and standard error bars across different seeds. On Atari, as is customary in the literature, we only report the median across different Atari games.
5.1 bsuite Experiments
bsuite (osband2019behaviour) is a proposed benchmark designed to highlight key aspects of an agent’s scalability such as exploration, memory or credit assignment. We generated low-coverage offline RL datasets for catch, mountain_car and cartpole by recording the experiences of an online agent during training, as described by (agarwal2019optimistic), and then subsampling them (see Appendix D.1 for details). We generated two versions of the dataset for each task: stochastic and deterministic. The stochastic data is obtained by injecting noise into transitions, replacing the agent’s action with a random action with a given probability.
On the bsuite experiments, our goals are, first, to have a computationally cheap setting in which to compare our proposed methods against state-of-the-art baselines on diagnostic tasks and, second, to verify that our method is robust to stochasticity in the transitions.
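The noise injection used for the stochastic datasets can be sketched as follows: each action is independently replaced by a uniformly random action with some probability. The name `epsilon`, the uniform sampling, and the fixed seed are assumptions of this sketch, not settings reported above.

```python
import random


def inject_action_noise(actions, num_actions, epsilon, seed=0):
    """Replace each action with a uniformly random one with probability epsilon.

    actions: list of integer actions chosen by the agent.
    num_actions: size of the discrete action space.
    epsilon: replacement probability (hypothetical parameter name).
    """
    rng = random.Random(seed)
    return [
        rng.randrange(num_actions) if rng.random() < epsilon else a
        for a in actions
    ]


agent_actions = [0, 1, 2, 1, 0]
print(inject_action_noise(agent_actions, num_actions=3, epsilon=0.0))  # [0, 1, 2, 1, 0]
```

A replaced action may coincide with the original one by chance, so the fraction of actions that actually change is somewhat below `epsilon`.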
In Figure 3, we compare the effectiveness of  and  against four baselines: DDQN (hasselt2010double), CQL() (kumar2020conservative), REM (agarwal2019optimistic) and BCQ (fujimoto2018addressing). On the harder dataset (cartpole),  and , our proposed methods, outperform all other approaches on the noisy datasets, showing the efficiency and robustness of our approach. Two other methods, REM and CQL, also perform relatively well in the noisy setting. The results for catch are similar, with the exception that BCQ also has a better normalized score, which re-emphasises the importance of restricting behavior to stay close to the observed data. On mountain_car, all methods almost reach the performance of the expert that generated the dataset on the noisy dataset. However, DDQN performed poorly when there is no noise in the transitions. We think this might be because the noise injected into the dataset, although it increases stochasticity, also increases the state-action coverage of the dataset. Overall,  and  seem to be less affected by the injected noise.
5.2 Atari Experiments
Atari is an established online RL benchmark (bellemare2013arcade), which has recently attracted the attention of the offline RL community (agarwal2019optimistic; fujimoto2019benchmarking). Here, we used the experimental protocol and datasets from the RL Unplugged Atari benchmark (gulcehre2020rl). We report the median normalized score across the Atari games as prescribed by (gulcehre2020rl).
In Figure 4, we show that R-BVE outperforms all baselines reported in the RL Unplugged benchmark as well as CQL() (kumar2020conservative) on offline policy selection games. Both R-BVE and BVE outperform other SOTA offline RL methods. This experiment highlights an important point: on a large benchmark suite (including 37 Atari games), a single step of policy improvement is sufficient to outperform several offline RL algorithms, including policy-constraint (BCQ), value-based regularization (CQL), and value-based uncertainty methods (REM, IQN). We should note, though, that in this setting there is enough data for the neural networks to learn reasonable approximations of the Q-value (exploiting the structure of the state space to extrapolate reasonably for unobserved state-action pairs). This figure also highlights the robustness of ranking regularization to its hyperparameters, since we did not do any hyperparameter search on the offline policy selection games.
Ablation Experiments on Atari
We ablate two different aspects of our algorithm on nine online policy selection games from the RL Unplugged Atari suite: i) the choice of TD backup updates (Q-learning or behavior value estimation), and ii) the effect of ranking regularization. We show the ablation of these components in Figure 5. We observed the largest improvement when using ranking regularization. In general, we found that directly using Monte-Carlo estimation for the value function (we refer to this in our plot as "MC Learning") does not work on Atari. We also provide results with behavior cloning (BC) (pomerleau1989alvinn) and filtered BC, that is, BC trained only on highly rewarding episodes. These are episodes whose episodic return is greater than a threshold, which we set to the mean episodic return over the whole dataset. We showed that BC on the whole dataset works very poorly, but filtered BC works considerably better. Nevertheless, filtered BC still performs considerably worse than other offline RL methods such as .
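The episode selection behind filtered BC can be sketched in a few lines: keep only episodes whose return exceeds the mean return over the dataset. The dictionary-based episode format and the use of undiscounted returns are assumptions of this sketch.

```python
def filter_episodes(episodes):
    """Keep episodes whose episodic return is above the dataset mean.

    episodes: list of dicts with a "rewards" list (hypothetical layout).
    """
    returns = [sum(ep["rewards"]) for ep in episodes]
    threshold = sum(returns) / len(returns)
    return [ep for ep, g in zip(episodes, returns) if g > threshold]


dataset = [
    {"id": "a", "rewards": [0, 1]},
    {"id": "b", "rewards": [1, 2]},
    {"id": "c", "rewards": [2, 3]},
]
print([ep["id"] for ep in filter_episodes(dataset)])  # ['c']
```

Behavior cloning would then be trained only on the transitions of the retained episodes.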
Overestimation Experiments
One source of overestimation in Q-learning is the maximization bias (hasselt2010double). In the offline setting, another source, as discussed in Section 2, is extrapolation error. Double DQN (DDQN by hasselt2010double) is supposed to address the first problem, but it is unclear whether it can address the second. In Figure 6, we show that in the offline setting DDQN still overestimates severely when we evaluate the critic’s predictions in the environment. This suggests that the second source dominates and that DDQN does not address it. In comparison, R-DQN overestimates significantly less, highlighting the efficiency of the ranking regularization; however, its performance degrades by many orders of magnitude in the low data regime. Finally, BVE and R-BVE drastically reduce overestimation in all settings. In the figure, we compute the overestimation error by evaluating the methods in the environment and computing over episodes, where corresponds to the discounted sum of rewards from state till the end of episode by following the policy .
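One way to measure overestimation along an evaluation episode is sketched below. The exact expression used for the figure did not survive in the text above, so this sketch assumes the error is the average gap between the critic's prediction at each visited state and the discounted return actually realized from that state onward; all names are illustrative.

```python
def overestimation_error(predicted_values, rewards, gamma):
    """Mean of (predicted value - realized discounted return) over one episode.

    predicted_values: critic predictions for the action taken at each step.
    rewards: rewards observed at each step of the same episode.
    """
    realized = 0.0
    gaps = []
    # Accumulate realized returns backwards from the end of the episode.
    for q_pred, r in zip(reversed(predicted_values), reversed(rewards)):
        realized = r + gamma * realized
        gaps.append(q_pred - realized)
    return sum(gaps) / len(gaps)


print(overestimation_error([1.0, 1.0], [0.0, 1.0], gamma=0.5))  # 0.25
```

A positive result means the critic promised more than the policy delivered; averaging this quantity over many episodes gives a per-method estimate.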
Robustness Experiments
According to Figure 5, BVE and Q-learning have similar performance on the full dataset; however, in low data regimes, the behavior value policy considerably outperforms Q-learning (see Figure 7). The poor performance of DQN and R-DQN in the lower data regime is potentially due to the overestimation that we showed in Figure 6. Furthermore, in Appendix C.2 (see Figure 14), we investigate the robustness of and DDQN with respect to the reward distribution. We found that the performance of is more robust than DDQN to variations of the reward distribution.
5.3 DeepMind Lab Experiments
Offline RL research has mainly focused on fully observable environments such as Atari. However, in a complex partially observable environment such as DeepMind Lab, it is very difficult to obtain good coverage in the dataset even after collecting billions of transitions. To highlight this, we have generated datasets by training an online R2D2 agent on DeepMind Lab levels. Specifically, we have generated datasets for four of the levels: , , , and . The details of the datasets are provided in Appendix D.2.
We compare offline BC, CQL, R2D2, , and  on our DeepMind Lab datasets. We use the same architecture, so the main difference is in the loss function. To test our hypothesis that partially observable environments accentuate the coverage issue, we consider the large data regime, using 300M transitions stored during the online training of R2D2. In Figure 8, we show the performance of each algorithm on four different levels. Our proposed modifications,  and , outperform other offline RL approaches on all DeepMind Lab levels. We argue that the poor performance of R2D2 in the offline setting is due to the implicit low coverage of the dataset. Even 300M transitions do not seem to be enough, potentially due to the partially observable nature of the environment as well as its diversity.
The importance of coverage in offline RL:
Here, we investigate the effect of coverage on the DeepMind Lab level on a dataset generated by a fixed policy. To do so, we generated two datasets. The first relies on an R2D2 snapshot as the behaviour policy, which we refer to as Expert Data. The second uses a noisy version of this policy, where a fraction of decisions are taken by the R2D2 agent and the remainder by a uniform policy over actions, referred to as Noisy Expert Data. Figure 9 depicts the results on these two datasets. BC outperforms all offline RL approaches on the noiseless expert dataset. However, introducing noise into the actions deteriorates BC’s performance considerably, and  outperforms it. Overall, the episodic returns obtained by offline RL methods either improve (R2D2, , ) or stay unaffected (CQL, ) on the noisy expert data compared to their performance on the expert data. This is most likely due to the fact that the noisy expert data provides better coverage. Given that the environment is deterministic, the expert data mostly follows the optimal trajectory, offering relatively low coverage of the state-action space.
6 Discussion
In this work we investigate the deep offline RL setting with discrete actions. In such settings, overestimation errors of the neural network approximator get propagated in the value function through bootstrapping. This can lead to a cycle where the overestimation gets amplified with every policy improvement step, in the worst case leading these values to escape to infinity. We propose behaviour value estimation, an algorithm that limits the number of policy improvement steps to one. Empirically, we showed that this single policy improvement step at deployment is enough for the relatively complex tasks we have considered.
Behavior value estimation is, however, not enough to overcome the overestimation problem. For this reason, we introduce a max-margin regularizer that encourages actions from rewarding episodes to be ranked higher than any other actions. This leads to the proposed algorithm R-BVE, which outperforms all the other SOTA offline RL methods we have compared against for discrete control on bsuite, RLU Atari and DeepMind Lab tasks. We note that our algorithm is particularly effective for low data regimes, where the overestimation issue discussed in this work is acute.
Finally, we have also proposed two new offline RL datasets: (i) DeepMind Lab: large scale, partially observable environment, (ii) bsuite: small scale, low data regime. We believe these datasets possess unique characteristics and that they will be of interest to the community. As future work, we plan to extend R-BVE to continuous control and other real-world applications.
References
Appendix A Q-learning can escape to infinity in the offline case
Remark 1.
Q-learning, using neural networks as a function approximator, can diverge in the offline RL setting given that the collected dataset does not include all possible state-action pairs, even if it contains all transitions along optimal paths. Furthermore, the parameters (and hence the Q-values themselves) can diverge towards infinity under gradient descent dynamics.
Proof.
The proof relies on providing a particular instance where Q-learning diverges towards infinity. This is sufficient to show that divergence can happen. Note that the remark does not make any statement about how likely it is for this to happen, nor does it provide sufficient conditions under which such divergence has to happen.
Let us consider a simple deterministic MDP depicted in the figure below (left).
is the set of all states, where is deterministically the starting state and is the terminal state of the MDP. Let be the set of all possible actions. Let the reward function be for all state-action pairs except , for which it is 1. Let the transition probabilities be deterministic as defined by the depicted arrows. That is, for any state-action pair, transitioning to exactly one state has probability , while the rest have probability . For example, only , while . For only and so on and so forth.
The first observation is that the optimal behavior is to pick action (as it is the only rewarding transition in the entire MDP).
The features describing each state are given by a single real number, where , with , where is the discount factor. Assume actions are provided to the neural network as one-hot vectors, i.e. [Footnote 3: The one-hot representation is the typical representation for actions in discrete spaces.], where we will refer to as the th element of the vector that represents the action . For example and .
Let us consider the Q-function parametrized as a simple MLP (depicted in the figure above, left). The MLP uses rectifier activations and takes as input both the state and the action, returning a single scalar value, which is the Q-value for that particular state-action combination. Rewriting the diagram in analytical form, we have that for and :
(7) 
A note on initialization. The weights of the first layer are given as constants. The process would also work if we left them learnable, but the analysis would become considerably harder. The exact values used, , are not important. In principle, we need the negative weights connecting to be larger in magnitude than those from to , and we need the weight between and to be negative. They can be scaled arbitrarily small and do not need to be identical.
What we will rely on in the rest of the analysis is that the pre-activation of is negative for state and . This will be in the zero region of the rectifier, meaning no gradient will flow through those units. Since and , it is sufficient for the weight from to to be larger in magnitude than the weight from to . This ensures that for , the Q-function is not a function of , as will get multiplied by . [Footnote 4: The fact that no gradient gets propagated in the first layer is only important if we attempt to consider the case when the first layer weights are learnable.] Also, we want the function to never depend on to simplify our analysis, which is easily achievable if the weight going from to is negative.
Given the observations above, if we plug the different values of and into the formula, we get:
(8) 
Note that this implies that
(9) 
Assume . Let the dataset collected by the behavior policy contain the following 3 transitions:
We can now construct the Q-learning loss that we will use to learn the function in the offline case, which is
(10) 
Note that we relied on Equation (9) to evaluate the operator, and is a copy of that is used for bootstrapping. This is the standard definition of Q-learning; see Equation (3). In particular, in this toy example is numerically always identical to (in general it can be a trailing copy of from k steps back) and is used mainly to indicate that when we take the derivative of the loss with respect to we do not differentiate through . From Equation (10) we notice that only the first transition in the dataset contributes to the gradient of , only the second transition contributes to the gradient of , and only the third transition contributes to the gradient of . We can now evaluate the gradient with respect to of the loss over the entire dataset:
(11) 
Note that we assumed , and for simplicity we exploited that numerically, to be able to better understand the dynamics of the update. Given that , will always be negative as long as (and implicitly ) stays positive. Given that for some learning rate , the update creates a vicious loop that will increase the norm of at every iteration, such that . Given that the gradient on tracks , it means that the path that takes action in the initial state will have as value. Note that all transitions along the optimal path of this deterministic MDP are part of the dataset.
Also note that, in our example, the same will happen if we rely on SGD rather than batch GD (as the different examples affect different parameters of the model independently and there is no effect from averaging). Preconditioning the updates (as is done, e.g., by Adam or RMSprop) will also not change the result, as it will not affect the sign of the gradient (the preconditioning matrix needs to be positive definite). Nor will momentum prevent the divergence of learning, as it will not affect the sign of the update.
This means that Q-learning on the provided MDP will diverge towards infinity under the updates of the most commonly used gradient-based algorithms.
∎
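The feedback loop described in the proof can be illustrated numerically with a much simpler construction. The sketch below is not the MDP or the MLP from this appendix; it is the classic two-state linear example, in which the only logged transition goes from a state with feature 1 to a state with feature 2 (reward 0), and the successor state never appears as a source state in the data. The feature values, learning rate, and function name are assumptions chosen for illustration.

```python
def semi_gradient_td(w, gamma, lr, steps):
    """Run semi-gradient TD updates on a single logged transition s -> s'.

    Assumed toy setup (not the MLP from the proof): V(s) = w * 1 and
    V(s') = w * 2, the reward is 0, and s' is never a source state in the
    dataset, so its estimated value is never corrected by data.
    """
    history = [w]
    for _ in range(steps):
        target = 0.0 + gamma * (2.0 * w)  # bootstrapped target, treated as a constant
        td_error = target - 1.0 * w
        w = w + lr * td_error * 1.0       # dV(s)/dw is the feature of s, i.e. 1
        history.append(w)
    return history


if __name__ == "__main__":
    trace = semi_gradient_td(w=1.0, gamma=0.9, lr=0.1, steps=50)
    print(trace[0], trace[-1])
```

Each update multiplies the parameter by 1 + lr * (2 * gamma - 1), so in this sketch the parameter grows geometrically whenever gamma > 0.5 and shrinks otherwise, regardless of how small the learning rate is.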
Appendix B The surprising efficiency of one step of policy improvement
In this section, we show examples where one step of policy improvement is surprisingly effective. We start by going over some theoretical results.
Lemma 1.
Assume a Markov Decision Process that satisfies the following conditions: (1) the transition distribution is deterministic, (2) any trajectory has a finite length, (3) we are interested in the episodic return, i.e., the discount factor , and (4) the reward function for any non-terminating state and for any terminating state with constants . Denote by the set of states from which there exists a trajectory with a reward of . If a behavior policy can generate a trajectory with a reward of with a nonzero probability from any initial state , then one step of policy improvement on the exact behavior value function will give an optimal policy .
Proof.
Note that our proof assumes the tabular case and that, by our definition, any trajectory receives an episodic return of either or . A policy is optimal in this environment if and only if it receives a reward of with a probability of 1 from any state . For a behavior policy satisfying the premise, its value function is as follows:
(12) 
where denotes the probability of trajectories sampled from with a reward of . The policy after one step of policy improvement is defined as follows, with ties broken arbitrarily.
(13) 
From any state , let be any sample from . Because , we know , which implies that there exists a trajectory starting from with a reward of . Let be the next state following action in the deterministic environment. Consequently, we know . We can apply repeatedly to sample a trajectory . Because by induction and any trajectory is finite according to the premise, any sampled trajectory from will eventually reach a terminating state with a reward of . Therefore, the policy receives an episodic return of with a probability of 1 from any state and is an optimal policy. ∎
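The argument of Lemma 1 can be checked on a small tabular example. The following is a minimal sketch under stated assumptions: the MDP, its state and action names, and the terminal rewards (0 for "bad", 1 for "good") are hypothetical, and the behavior policy is taken to be uniformly random.

```python
def behavior_q_values(transitions, terminal_reward, behavior):
    """Exact Q-values of a behavior policy in a deterministic, finite-horizon MDP."""
    cache = {}

    def value(state):
        if state in terminal_reward:  # the episodic return is the terminal reward
            return terminal_reward[state]
        if state not in cache:
            cache[state] = sum(
                behavior[state][action] * value(next_state)
                for action, next_state in transitions[state].items()
            )
        return cache[state]

    return {
        state: {action: value(next_state) for action, next_state in actions.items()}
        for state, actions in transitions.items()
    }


def greedy_rollout(transitions, terminal_reward, q, state):
    """Follow the one-step improved (greedy) policy until a terminal state is reached."""
    while state not in terminal_reward:
        action = max(q[state], key=q[state].get)
        state = transitions[state][action]
    return terminal_reward[state]


# Hypothetical deterministic MDP with two terminal states.
TRANSITIONS = {
    "s0": {"left": "s1", "right": "s2"},
    "s1": {"left": "bad", "right": "bad"},
    "s2": {"left": "bad", "right": "good"},
}
TERMINAL_REWARD = {"bad": 0.0, "good": 1.0}
UNIFORM = {s: {a: 1.0 / len(acts) for a in acts} for s, acts in TRANSITIONS.items()}

Q = behavior_q_values(TRANSITIONS, TERMINAL_REWARD, UNIFORM)
```

In this sketch the uniformly random behavior policy reaches "good" from "s0" with probability 0.25 only, yet acting greedily with respect to its Q-values reaches "good" from both "s0" and "s2". The state "s1" has no rewarding trajectory and is therefore outside the set of states the lemma speaks about.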
Remark 1.
The conditions of the MDP can be satisfied by a broad range of reinforcement learning problems, such as a goal reaching task that receives a constant positive reward only if it reaches the goal within a time limit, or an autonomous driving task that will receive a constant penalty if it crashes in a given time period.
Remark 2.
The assumption on the behavior policy assumes a sufficient exploration in the dataset so that a good trajectory exists for any initial state, but it doesn’t impose any restriction on the average performance.
Lemma 2.
In the chain MDP (Figure 10) of at least 2 states with discount , under a uniformly random behavior policy, one step of policy improvement (given the true value function) would lead to an optimal policy.
Proof.
We name the states in this class of environments numerically. That is, the leftmost state is state and the last state is . Denote by the value function of the uniform random policy at state . Let . It is easy to show that , where is the state transition matrix . We know
Since all rewards are positive, we know . Since , , and , we know . Assume that where , then since . Therefore, by induction, we show that the values are monotonically increasing.
In every state, going left along the chain is therefore of lower value compared to going right (if going right in the rightmost states entails staying). We know the optimal policy in this environment is to go right. Therefore 1 step of policy improvement provides the optimal policy. ∎
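The monotonicity argument can also be checked numerically. The sketch below assumes a particular chain, since the figure is not reproduced here: "left" moves one state to the left (staying in place at the leftmost state), "right" moves one state to the right (staying in place at the rightmost state), and the only reward, +1, is given for taking "right" in the rightmost state. The discount value and function names are likewise assumptions.

```python
def random_policy_values(num_states, gamma, sweeps=2000):
    """Iterative policy evaluation of the uniformly random policy on the assumed chain."""
    last = num_states - 1
    values = [0.0] * num_states
    for _ in range(sweeps):
        new_values = []
        for s in range(num_states):
            q_left = gamma * values[max(s - 1, 0)]
            q_right = (1.0 if s == last else 0.0) + gamma * values[min(s + 1, last)]
            new_values.append(0.5 * q_left + 0.5 * q_right)
        values = new_values
    return values


def improved_policy(values, gamma):
    """One step of policy improvement: act greedily w.r.t. the random policy's values."""
    last = len(values) - 1
    policy = []
    for s in range(len(values)):
        q_left = gamma * values[max(s - 1, 0)]
        q_right = (1.0 if s == last else 0.0) + gamma * values[min(s + 1, last)]
        policy.append("right" if q_right > q_left else "left")
    return policy


if __name__ == "__main__":
    v = random_policy_values(num_states=6, gamma=0.9)
    print(v)
    print(improved_policy(v, gamma=0.9))
```

Under these assumptions the values of the random policy increase from left to right, so the greedy policy chooses "right" in every state.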
Remark 3.
In a 2D grid world, we can show an analogous result. We, however, omit it, as the proof follows a similar structure, albeit a more complicated one.
One step of policy improvement, despite the examples we have shown in this section, does not always lead to an optimal policy. We argue, however, that it is surprisingly good in many scenarios. Here, we present such an example in a 2D grid world (see Figure 11). In this example, one step of policy improvement from a random policy is not sufficient to derive an optimal policy. The resulting policy, however, is very close in performance to the optimal policy when measured by the value of the initial state. It is reasonable to assume that many environments share the properties for which BVE is an effective technique.
Appendix C Additional Results and Ablations
C.1 On the Effects of Regularization
In this section, we study the effect of the regularization on the action gap and the overestimation error. In Figure 12, we show that increasing the regularization coefficient of the ranking regularization increases the action gap across the Atari online policy selection games, which can result in lower estimation error and better optimization.
In Figure 13, we investigate the effect of increasing the regularization on the overestimation of the Q-network when evaluated in the environment. We visualize the mean overestimation across the online policy selection games for Atari.
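The two quantities studied in this section can be made concrete with a short sketch. It assumes the common definitions, which are not spelled out in this appendix: the action gap as the difference between the largest and second-largest Q-value in a state, and the overestimation as the positive part of the difference between the predicted value and the discounted return actually observed in the environment. The function names are hypothetical.

```python
def action_gap(q_values):
    """Difference between the best and second-best Q-value in one state."""
    top, second = sorted(q_values, reverse=True)[:2]
    return top - second


def discounted_return(rewards, gamma):
    """Discounted sum of the rewards observed after a given state."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def overestimation(predicted_value, rewards, gamma):
    """Positive part of (predicted value - observed discounted return)."""
    return max(predicted_value - discounted_return(rewards, gamma), 0.0)
```

For example, with Q-values [1.0, 3.0, 2.5] the action gap is 0.5, and a predicted value of 3.0 against observed rewards [1, 1, 1] with a discount of 0.5 gives an overestimation of 1.25.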
C.2 Atari: Robustness to Data
Robustness to the reward distribution in the dataset is an important property for deploying offline RL algorithms in the real world. We would like to understand the robustness of behavior value estimation in the offline RL setting. Thus, we first investigate the robustness of in contrast to Q-learning with respect to the datasets’ size and reward distribution. In Figure 14, we split the dataset into two smaller datasets: (i) transitions coming only from highly rewarding episodes and (ii) transitions coming only from poorly performing episodes. We show that outperforms Q-learning in both settings.
C.3 Online Policy Selection Games Results
In Figure 15, we compare the performance of DDQN, , , and  with respect to the rewards they achieve over the course of training on Atari online policy selection games.
C.4 Overestimation on Online Policy Selection Games
In Figures 16 and 17, we report the value error and the overestimation error, respectively, of , ,  and DDQN. With respect to both metrics, we observed that DDQN has the highest error. Using significantly alleviates the problem with the value and overestimation errors, but ranking regularization further reduces the problem both for DDQN and .
Appendix D Details of Datasets
D.1 BSuite Dataset
BSuite (osband2019behaviour) data was collected by training DQN agents (mnih2015humanlevel) from scratch with the default settings in Acme (hoffman2020acme) in each of the three tasks: cartpole, catch, and mountain_car. We convert the originally deterministic environments into stochastic ones by randomly replacing the agent’s action with a uniformly sampled action with a probability of (i.e., corresponds to the original environment). We train agents (separately for each randomness level and 5 seeds, i.e., 25 agents per game) for 1000, 2000, and 500 episodes in cartpole, catch and mountain_car, respectively. The number of episodes is chosen so that agents at all levels can reach their best performance. We record all the experience generated through the training process. Then, to reduce the coverage of the datasets and make them more challenging, we use only 10% of the data by subsampling it. More details of the dataset are provided in Table 1. The results presented in the paper are averaged over the 5 random seeds.
Environments  Number of episodes  Number of transitions  Average episode length 

cartpole ()  1000  710K  710 
cartpole ()  1000  773K  773 
cartpole ()  1000  649K  649 
cartpole ()  1000  607K  607 
cartpole ()  1000  672K  672 
cartpole ()  1000  643K  643 
catch ()  200  1.8K  9 
catch ()  200  1.8K  9 
catch ()  200  1.8K  9 
catch ()  200  1.8K  9 
catch ()  200  1.8K  9 
catch ()  200  1.8K  9 
mountain_car ()  50  10K  205 
mountain_car ()  50  10K  210 
mountain_car ()  50  22K  447 
mountain_car ()  50  13K  277 
mountain_car ()  50  12K  250 
mountain_car ()  50  24K  494 
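The two data-generation steps described in this subsection, action noise and subsampling, can be sketched as follows. This is a minimal illustration rather than the code used to build the datasets: the function names are hypothetical, and it is an assumption that subsampling is uniform and without replacement over whatever unit (transitions or episodes) is being stored.

```python
import random


def noisy_action(action, num_actions, noise_prob, rng):
    """With probability noise_prob, replace the agent's action by a uniform one."""
    if rng.random() < noise_prob:
        return rng.randrange(num_actions)
    return action


def subsample(items, fraction, rng):
    """Keep a uniformly chosen fraction of the logged items (without replacement)."""
    keep = round(len(items) * fraction)
    return rng.sample(items, keep)


if __name__ == "__main__":
    rng = random.Random(0)
    logged = list(range(1000))            # stand-in for logged experience
    print(len(subsample(logged, 0.1, rng)))
    print(noisy_action(2, num_actions=3, noise_prob=0.0, rng=rng))
```

With a noise probability of zero the wrapper always returns the agent's own action, which matches the statement that this setting corresponds to the original environment.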
D.2 DeepMind Lab Dataset
DeepMind Lab (beattie2016deepmind) data was collected by training distributed R2D2 (kapturowski2018recurrent) agents from scratch on individual tasks. First, we tuned the hyperparameters of a distributed version of the Acme (hoffman2020acme) R2D2 agent independently for every task to achieve fast learning in terms of actor steps. Then, we recorded the experience across all actors during entire training runs a few times for every task. Training was stopped after there was no further progress in learning across all runs, with a resulting number of steps for each run between 50 million for the easiest task () and 200 million for some of the hard tasks. Finally we built a separate offline RL dataset for every run and every task. See more details about these datasets in Table 2.
Additionally, for the task we ran two fully trained snapshots of our R2D2 agents on the environment with different levels of noise ( for greedy action selection). We recorded all interactions with the environment and generated a different offline RL dataset containing 10 million actor steps for every agent and every value of .
Task  Episode Length  Datasets  Episodes (K)  Steps (M)  Reward 

300  5  667.1  200.1  39.0  
snapshot ()  300  2  66.7  20  40.4 
snapshot ()  300  2  66.7  20  40.1 
snapshot ()  300  2  66.7  20  36.9 
snapshot ()  300  2  66.7  20  29.7 
1350  3  178.3  240.7  51.5  
1800  3  334.1  601.4  64.5  
180  3  2001.1  360.2  32.5  
1800  3  201.8  363.3  48.8 
Appendix E Experiment Details
We used the Adam optimizer (kingma2014adam) for all our experiments. For details on the hyperparameters used, refer to Table 4 for bsuite, Table 3 for Atari, and Table 5 for DeepMind Lab. Our evaluation protocol is described below, in Section E.1. On Atari experiments, we have normalized the agents’ scores as described in (gulcehre2020rl). In all our Atari experiments, we report only the median normalized score, as is customary in the literature for reporting results on Atari.
Atari Hyperparameters:
On Atari, we directly used the baselines and the hyperparameters reported in (gulcehre2020rl); to get the detailed Atari results on the test set, we communicated with the authors. We have run additional CQL experiments and our own models with ranking regularization and reparameterization. For CQL, we have fine-tuned the learning rate from the grid . For our own proposed models, we have only tuned the learning rate from the grid and the ranking regularization hyperparameter from the grid . We have fixed the rest of the hyperparameters. As mentioned earlier, we have only used the online policy selection games for fine-tuning the hyperparameters. As a result of our grid search, we have used a learning rate of for CQL and our models. We have used for the hyperparameter of CQL. seems to be the best choice for the ranking regularization hyperparameter on most tasks that we have tried it on. We provide the details of the Atari hyperparameters and compute infrastructure in Table 3.
Hyperparameter  setting (for both variations)  

Discount factor  0.99  
Minibatch size  256  
Target network update period  every 2500 updates  
Evaluation  
network: channels  32, 64, 64  
network: filter size  , ,  
network: stride  4, 2, 1  
network: hidden units  512  
Training Steps  2M learning steps  
Hardware  Tesla V100 GPU  
Replay Scheme  Uniform  
Hyperparameter  Online  Offline 
Min replay size for sampling  20,000   
Training (for greedy exploration)  0.01   
decay schedule  250K steps   
Fixed Replay Memory  No  Yes 
Replay Memory size  1M steps  2M steps 
Double DQN  No  Yes 
bsuite Hyperparameters:
Our hyperparameter selection protocol for bsuite is the same as for Atari. The main difference between the Atari and bsuite experiments is the network architecture used. We provide the details of the hyperparameters used for bsuite in Table 4.
Hyperparameter  setting (for both variations)  

Discount factor  0.99  
Minibatch size  128  
Target network update period  every 2500 updates  
Evaluation  
network:  an MLP  
network: hidden units  
Training Steps  2M learning steps  
Hardware  Tesla V100 GPU  
Replay Scheme  Uniform  
Hyperparameter  Online  Offline 
Min replay size for sampling  20,000   
Training (for greedy exploration)  0.01   
decay schedule  250K steps   
Fixed Replay Memory  No  Yes 
Replay Memory size  1M steps  2M steps 
Double DQN  No  Yes 
DeepMind Lab Hyperparameters:
On DeepMind Lab experiments, we tuned the hyperparameters of each model separately on each level. We have tuned the learning rate and the method-specific hyperparameters for each model from the same grid that we have used for Atari. For CQL, the specific hyperparameter that we tuned in addition to the learning rate is the regularization hyperparameter . For our own baselines with the ranking regularization, we fixed the ranking regularization coefficient to the best value we found on Atari, and only tuned the margin hyperparameter from the grid . All our algorithms use n-step returns in our DeepMind Lab experiments, where n is fixed to 5 in all our experiments. Thus, both behavior value estimation and Q-learning experiments use 5 steps of unrolls for learning. We provide the details of the DeepMind Lab hyperparameters and compute infrastructure in Table 5.
Hyperparameter  setting (for both variations)  

Discount factor  0.997  
Target network update period  every 400 updates  
Evaluation  
Importance sampling exponent  0.6  
Architecture  Canonical R2D2 (kapturowski2018recurrent)  
Hyperparameter  Online  Offline 
Hardware  4x TPUv2  4x Tesla V100 GPU 
Training Steps  50-200M actor steps  50K learning steps 
Sequence Length  120 (40 burn-in)  Full episode 
Minibatch size  32  8 
Training (for greedy exploration)    
Replay Scheme  Prioritized (exponent 0.9)   
Min replay size for sampling  600K steps   
Replay Memory size  12M steps  50-200M steps 
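The n-step returns used in the DeepMind Lab experiments can be sketched as below. This is a minimal illustration with hypothetical function names. It assumes, following the description of behavior value estimation as removing the max operator, that Q-learning bootstraps from the maximum next-state Q-value while behavior value estimation bootstraps from the Q-value of the action logged in the dataset.

```python
def bootstrap_value(q_next, logged_next_action, use_max):
    """Value used to bootstrap at the end of the n-step unroll.

    use_max=True corresponds to Q-learning; use_max=False evaluates the
    action that the behavior policy actually took (assumed BVE variant).
    """
    return max(q_next) if use_max else q_next[logged_next_action]


def n_step_target(rewards, bootstrap, gamma):
    """Discounted sum of n rewards plus gamma**n times the bootstrap value."""
    target = bootstrap
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


if __name__ == "__main__":
    q_next = [0.5, 2.0, 1.0]              # hypothetical Q-values after 5 steps
    rewards = [1.0, 0.0, 0.0, 1.0, 0.0]   # n = 5 logged rewards
    for use_max in (True, False):
        boot = bootstrap_value(q_next, logged_next_action=2, use_max=use_max)
        print(n_step_target(rewards, boot, gamma=0.997))
```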
E.1 Evaluation protocol
To evaluate the performance of the various methods, we use the following protocol:

1. We sweep over a small set (5-10) of hyperparameter values for each of the methods.

2. We independently train each of the models on 5 datasets generated by running the behavior policy with 5 different seeds (i.e., producing 25-50 runs per problem setting and method).

3. We evaluate the produced models in the original environments (without the noise).

4. We average the results over seeds and report the results of the best hyperparameter for each method.
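The final selection step of this protocol (average over seeds, then report the best hyperparameter setting) can be sketched as follows. The setting names and scores are made-up placeholders, not results from the paper, and the function name is hypothetical.

```python
from statistics import mean


def select_best(scores_by_setting):
    """Average each setting's scores over seeds and return the best setting."""
    averaged = {
        setting: mean(scores) for setting, scores in scores_by_setting.items()
    }
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


if __name__ == "__main__":
    # Placeholder scores: one list entry per dataset seed (5 seeds per setting).
    sweep = {
        "setting_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "setting_b": [2.0, 3.0, 4.0, 5.0, 6.0],
    }
    print(select_best(sweep))
```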
E.1.1 Evaluation method
To evaluate models (step 3 above), in the case of bsuite and DeepMind Lab we ran an evaluation job in parallel to the training one. It repeatedly read the learner’s checkpoint and produced evaluation results during training. We report the average of the evaluation scores over the last 100 learning steps.
In the case of the Atari environments, instead of averaging performance during the final steps of learning, we take the final snapshot produced by a given method and evaluate it for 100 environment steps after training has finished.
E.2 Atari Offline Policy Selection Results
In Table 6, we show the performance of our baselines on the different Atari offline policy selection games. We show that  outperforms other approaches significantly.
Name  Normalized Score 

BC  50.8 
DDQN  83.1 
CQL  98.9 
BCQ  102.6 
IQN  104.8 
REM  104.7 
  108.2 
  109.1 
E.3 DeepMind Lab Detailed Results
In Table 7, we show the results on the DeepMind Lab datasets. It is possible to see from these numerical results that  outperforms other approaches and is still very competitive.
BC  R2D2  CQL    

1.8 1.0  19.8 4.0  23.8 5.1  23.7 3.8  31.4 1.7  
2.9 1.4  8.5 3.4  9.3 2.5  7.6 2.1  13.4 2.6  
0.1 0.1  2.7 1.4  4.0 3.7  9.9 2.7  14.1 4.2  
1.1 4.6  5.4 2.3  3.4 2.4  9.4 2.3  9.6 3.2  
,  28.02 7.6  4.7 3.0  12.8 10.7  4.4 0.9  17.07 10.1 
,  32.4 1.3  5.5 1.6  12.7 5.4  4.1 1.8  19.8 4.9 
,  18.9 14.4  8.6 3.0  16.3 7.7  11.775 4.5  31.8 4.7 
,  17.46 7.5  13.5 5.1  13.5 5.06  9.0 0.25  25.57 7.0 
Appendix F Ranking Regularization
We propose a family of methods that prevent extrapolation error by suppressing the values of the actions that are not in the dataset. We achieve that by ranking the actions in the training set higher than the ones that are not in the training set. For the learned Q-function, the absolute values of actions do not matter; we are rather interested in the relative ranking of the actions. Let be the action from the dataset. For all and for illustration purposes, the value iteration can be written as:
where is an irreducible noise, because we cannot gather additional data on , and we don’t know the corresponding reward for it. This causes extrapolation error, which accumulates through the bootstrapping in the backups, as noted by (kumar2019stabilizing). We implicitly pull down the by ranking the actions in the dataset higher, which pushes up . As a result, the extrapolation error in Q-learning would also be reduced.
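The idea of ranking the dataset action above all other actions can be illustrated with a simple hinge penalty. The sketch below assumes a plain max-margin form with a hypothetical `margin` parameter; the regularizer actually used is defined in the main text and may differ, for instance in how it weights transitions from rewarding episodes.

```python
def ranking_regularizer(q_values, dataset_action, margin):
    """Hinge penalty that is zero only when the dataset action outranks
    every other action by at least `margin` (assumed form, for illustration)."""
    q_data = q_values[dataset_action]
    return sum(
        max(q_other + margin - q_data, 0.0)
        for action, q_other in enumerate(q_values)
        if action != dataset_action
    )


if __name__ == "__main__":
    # The unseen action 1 is ranked above the dataset action 0, so it is penalized.
    print(ranking_regularizer([1.0, 3.0, 0.0], dataset_action=0, margin=1.0))
    # The dataset action already leads by more than the margin: no penalty.
    print(ranking_regularizer([5.0, 3.0, 0.0], dataset_action=0, margin=1.0))
```

Minimizing such a penalty lowers the values of actions outside the dataset relative to the dataset action, and it depends only on differences between Q-values, not on their absolute scale.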
F.1 Pairwise Ranking Loss for Q-learning
In this section, we discuss the relationship between the pairwise ranking loss for Q-learning and the listwise pairwise ranking losses.
We use a common approximation (chen2009ranking; burges2005learning) of the softplus with the function: