## 1 Introduction

Reinforcement learning (RL) is an increasingly important technology for developing highly capable AI systems. It has achieved great success in domains such as games mnih2013playing, recommendation systems choi2018reinforcement, and robotics levine2016end. However, the fundamental online learning paradigm of RL is also one of the biggest obstacles to its widespread adoption, as interacting with the environment can be costly and dangerous in real-world settings. Furthermore, even in domains where online interactions are feasible, we might still prefer to utilize previously collected offline data, as online data collection itself has been shown to cause poor generalization kumar2020discor.

Offline reinforcement learning, also known as batch RL or data-driven RL, aims to solve the abovementioned problems by learning effective policies solely from static offline data, without any additional online interaction. This setting offers the promise of reproducing the most successful formula in machine learning (e.g., in CV and NLP), where large and diverse datasets (e.g., ImageNet) are combined with expressive function approximators to enable effective generalization in complex real-world tasks zhan2021deepthermal.

In contrast to offline RL, off-policy RL uses a replay buffer that stores transitions actively collected by the policy throughout the training process. Although off-policy RL algorithms are in principle able to leverage any data to learn skills, past work has shown that they cannot be directly applied to static datasets, due to extrapolation error of the Q-function caused by out-of-distribution actions fujimoto2019off; this error cannot be eliminated without a growing batch of online samples.

Prior works tackle this problem by ensuring that the learned policy stays "close" to the behavior policy via behavior regularization. This is achieved either by explicitly constraining the learned policy to only select actions that have sufficient support under the behavior distribution fujimoto2019off; ghasemipour2021emaq, or by adding a regularization term based on some divergence metric between the learned policy and the behavior policy wu2019behavior; siegel2019keep; zhang2020brac+; dadashi2021offline, e.g., the KL divergence jaques2020human or the Maximum Mean Discrepancy (MMD) kumar2019stabilizing. While straightforward, these methods lack guaranteed performance improvement over the behavior policy. They also require an accurate estimate of the learned policy's Q-function, which is hard to obtain due to the abovementioned distribution shift issue and the iterative error exploitation caused by dynamic programming.

In this work, we propose an alternative approach. We start from a conceptual derivation from the perspective of guaranteed policy improvement over the behavior policy. However, the derived objective requires states sampled from the learned policy, which are impossible to obtain in the offline setting. We therefore derive a new policy learning objective that can be used offline: it corresponds to the advantage function of the behavior policy, multiplied by a discounted marginal state density ratio. We propose a practical way to compute this density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike the state-independent regularization used in prior approaches, our regularization term is softer: it allows more freedom of policy deviation at high-confidence states, leading to better performance, and it also alleviates the distributional shift problem, making the learned policy more stable and robust when evaluated in the testing environment.

We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). We present an extensive evaluation of SBAC on standard MuJoCo benchmarks and on Adroit hand tasks with human demonstrations. We find that SBAC achieves state-of-the-art performance compared to a variety of existing model-free offline RL methods. We further show that, besides learning better policies, SBAC also achieves much more stable and robust results during online evaluation fujimoto2021minimalist.

## 2 Background

In this section, we introduce the notation and assumptions used in this paper, as well as provide a review of related methods in the literature.

### 2.1 Reinforcement Learning

We consider the standard fully observed Markov Decision Process (MDP) setting sutton1998introduction. An MDP can be represented as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the transition probability distribution function, $r(s, a)$ is the reward function, $\rho_0$ is the initial state distribution and $\gamma$ is the discount factor; we assume $\gamma \in (0, 1)$ in this work. The goal of RL is to find a policy $\pi(a \mid s)$ that maximizes the expected cumulative discounted reward starting from $\rho_0$:

$$J(\pi) = \mathbb{E}_{s_0 \sim \rho_0, \, a_t \sim \pi, \, s_{t+1} \sim P} \Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \Big].$$

In the commonly used actor-critic paradigm, one optimizes a policy by alternately learning a Q-function to minimize squared Bellman evaluation errors over single-step transitions,

$$\min_Q \; \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( r + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} [\bar{Q}(s', a')] - Q(s, a) \big)^2 \Big], \tag{1}$$

where $\bar{Q}$ denotes a target Q-function, which is periodically synchronized with the current Q-function. Then, the policy is updated to maximize the Q-value, $\max_\pi \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} [Q(s, a)]$.
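As a toy illustration of this alternating scheme, the sketch below computes the squared Bellman evaluation error of Equation (1) on a tabular stand-in for the critic, with a periodically synced target table; all names here are illustrative, not the paper's implementation:

```python
import numpy as np

def bellman_evaluation_loss(Q, Q_target, batch, pi, gamma=0.99):
    """Mean squared Bellman evaluation error over a batch of transitions.

    Q, Q_target: arrays of shape (n_states, n_actions), toy tabular stand-ins
    for the critic and the periodically synchronized target critic.
    batch: list of (s, a, r, s') transitions.
    pi: array of shape (n_states, n_actions) with the current policy's
    action probabilities (used to form the expectation over a').
    """
    err = 0.0
    for s, a, r, s_next in batch:
        # the target uses the *target* table and actions from the current policy
        target = r + gamma * np.dot(pi[s_next], Q_target[s_next])
        err += (Q[s, a] - target) ** 2
    return err / len(batch)
```

In practice `Q_target` is a lagged copy of `Q`, refreshed every few gradient steps, which stabilizes the bootstrapped regression targets.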

### 2.2 Offline Reinforcement Learning

In this work, we focus on the offline setting, in which the agent cannot generate new experience data and the goal is to learn from a fixed dataset $\mathcal{D}$ consisting of single-step transitions $(s, a, r, s')$. The dataset is assumed to be collected by an unknown behavior policy $\pi_\beta$, which denotes the conditional distribution $\pi_\beta(a \mid s)$ observed in the dataset. Note that $\pi_\beta$ may be a multimodal distribution.

We do not assume complete state-action coverage of the offline dataset; this is contrary to the common assumption in the batch RL literature lange2012batch, but is more realistic in real-world settings. Under this setting, standard reinforcement learning methods such as SAC haarnoja2018soft or DDPG lillicrap2016continuous suffer when applied to these datasets due to extrapolation errors fujimoto2019off; kumar2019stabilizing.

To avoid such cases, behavior regularization is adopted to force the learned policy to stay close to the behavior policy fujimoto2019off; kumar2019stabilizing; wu2019behavior. These approaches share a similar framework, given by

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ Q^\pi(s, a) \big] - \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \big[ D\big( \pi(\cdot \mid s), \pi_\beta(\cdot \mid s) \big) \big], \tag{2}$$

where $D$ is some divergence measure and $\alpha$ is a state-independent regularization weight.
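The shared framework of Equation (2) can be sketched at a single state as follows. We use a discrete KL penalty purely for illustration (the baselines in this family use various divergences), and all names are hypothetical:

```python
import numpy as np

def regularized_actor_objective(pi, beta, Q_s, alpha=1.0):
    """Eq. (2) sketch at a single state: expected Q-value under the learned
    policy minus a divergence penalty to the behavior policy.

    pi, beta: action probabilities of the learned / behavior policy at state s.
    Q_s: Q-values for each action at state s.
    """
    pi, beta = np.asarray(pi), np.asarray(beta)
    expected_q = np.dot(pi, Q_s)
    kl = np.sum(pi * np.log(pi / beta))  # D(pi || pi_beta); any divergence works
    return expected_q - alpha * kl
```

Note that the weight `alpha` here is the same for every state, which is precisely the state-independent conservatism the paper later relaxes.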

Although these approaches show promise, none provide improvement guarantees relative to the behavior policy, and they share some common problems. The first problem is that these methods require an accurately estimated $Q^\pi$. However, this is hard to achieve due to the distribution shift between the behavior policy and the policy to be evaluated, which introduces evaluation errors in Equation (1) levine2020offline. Furthermore, these errors accumulate and propagate across the state-action space through the iterative dynamic programming procedure. The second problem is that, even if we could estimate an accurate Q-function, the behavior regularization term may be too restrictive, which hinders the performance of the learned policy. An ideal behavior regularization term should be state-dependent sohn2020brpo. This makes the policy less conservative, exploiting large policy changes at high-confidence states without risking poor performance at low-confidence states. It also admits stronger theoretical guarantees lee2020batch, which are typically missing in current approaches.

## 3 Soft Behavior-regularized Actor Critic

We now describe our approach to behavior regularization in the offline RL setting, which aims to circumvent the following issues associated with the competing methods described in the previous section: (1) the lack of a policy improvement guarantee; (2) the difficulty of estimating $Q^\pi$ under distributional shift; (3) excessive conservatism due to a state-independent regularization weight. We begin with a conceptual derivation of our method from the perspective of guaranteed policy improvement over the behavior policy. We then examine our method from a different perspective: we show its equivalence to a kind of soft divergence regularization, which accounts for the marginal state visitation difference and is state-dependent.

### 3.1 Policy Improvement Derivation

We start from the difference in performance between two policies. The following lemma allows us to compactly express the difference using the discounted marginal state visitation distribution $d_\pi(s)$, defined by $d_\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi)$.
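To make the definition concrete: for a small tabular MDP, $d_\pi$ can be computed in closed form as $d_\pi = (1 - \gamma)\,(I - \gamma P_\pi^\top)^{-1} \rho_0$, where $P_\pi$ is the state-to-state transition matrix under $\pi$. The sketch below is illustrative only:

```python
import numpy as np

def discounted_state_visitation(P, pi, rho0, gamma=0.99):
    """d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | pi), computed in
    closed form for a tabular MDP.

    P: transition tensor of shape (S, A, S); pi: policy of shape (S, A);
    rho0: initial state distribution of shape (S,).
    """
    S = P.shape[0]
    # state-to-state transition matrix under pi: P_pi[s, t] = sum_a pi[s,a] P[s,a,t]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # solve (I - gamma * P_pi^T) d = (1 - gamma) * rho0
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
```

The result is a proper distribution over states (it sums to one), which is what allows the performance-difference lemma below to be stated as an expectation.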

###### Lemma 1 (Performance difference kakade2002approximately).

For any two policies $\pi$ and $\pi_\beta$,

$$J(\pi) - J(\pi_\beta) = \frac{1}{1 - \gamma} \, \mathbb{E}_{s \sim d_\pi, \, a \sim \pi(\cdot \mid s)} \big[ A^{\pi_\beta}(s, a) \big], \tag{3}$$

where $A^{\pi_\beta}(s, a) = Q^{\pi_\beta}(s, a) - V^{\pi_\beta}(s)$ is the advantage function of the behavior policy.

This lemma implies that maximizing the right-hand side of Equation (3) will yield a new policy with guaranteed performance improvement over the given policy $\pi_\beta$. Unfortunately, this objective cannot be used in the offline setting, as it requires access to on-policy samples from $d_\pi$, which are only available by interacting with the environment.

One way to mitigate this issue, which has been widely used in the on-policy RL literature, is to assume that policies $\pi$ and $\pi_\beta$ give rise to similar state visitation distributions if they are close to each other — in other words, that $\pi$ is in the "trust region" of $\pi_\beta$ schulman2015trust; achiam2017constrained; levin2017markov — so that we can use $d_{\pi_\beta}$ in Equation (3) to get a surrogate objective that approximates $J(\pi) - J(\pi_\beta)$:

$$J(\pi) - J(\pi_\beta) \geq \frac{1}{1 - \gamma} \, \mathbb{E}_{s \sim d_{\pi_\beta}, \, a \sim \pi(\cdot \mid s)} \big[ A^{\pi_\beta}(s, a) \big] - C \, \mathbb{E}_{s \sim d_{\pi_\beta}} \big[ D_{\mathrm{TV}}\big( \pi(\cdot \mid s), \pi_\beta(\cdot \mid s) \big) \big], \tag{4}$$

where $C$ is a constant depending on $\gamma$ and the maximum advantage, and $D_{\mathrm{TV}}$ is the total variation distance. There are also other ways to keep two policies close, e.g., constraining action probabilities by mixing policies

kakade2002approximately; pirotta2013safe, or clipping the surrogate objective schulman2017proximal; wang2020truly. However, as discussed in Section 2.2, constraining the learned policy to stay close to the behavior policy is too restrictive (note that in the online setting this issue is not problematic, as we replace $\pi_\beta$ with the current policy $\pi_k$ at each iteration $k$, thus ensuring that the learned policy has monotonically increasing performance).
Furthermore, constraining the distributions of the immediate future actions might not be enough to ensure that the resulting surrogate objective (4) is still a valid estimate of the performance of the next policy. This will result in instability and premature convergence, especially in long-horizon problems.
Instead, we directly apply importance sampling degris2012off to Equation (3), yielding

$$J(\pi) - J(\pi_\beta) = \frac{1}{1 - \gamma} \, \mathbb{E}_{s \sim d_{\pi_\beta}, \, a \sim \pi(\cdot \mid s)} \Big[ \frac{d_\pi(s)}{d_{\pi_\beta}(s)} \, A^{\pi_\beta}(s, a) \Big]. \tag{5}$$

This objective reasons about the long-term effect of the policies on the distribution of future states, and it can be optimized fully offline provided that we can compute the state visitation ratio $d_\pi(s) / d_{\pi_\beta}(s)$. Let $w(s) := d_\pi(s) / d_{\pi_\beta}(s)$; the remaining question is how to estimate $w(s)$ using offline data.
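The importance-sampled objective (5) lends itself to a simple Monte Carlo estimate over offline samples. In the sketch below, `w`, `adv`, `pi`, and `beta` are toy callables standing in for the learned ratio, the behavior advantage, and the two policies (all names are illustrative):

```python
import numpy as np

def offline_surrogate(batch, w, adv, pi, beta, gamma=0.99):
    """Eq. (5) sketch: estimate J(pi) - J(pi_beta) from offline samples
    (s, a) drawn from d_{pi_beta}, reweighting by the state visitation
    ratio w(s) = d_pi(s)/d_{pi_beta}(s) and the action ratio pi/pi_beta.
    """
    vals = [w(s) * (pi(s, a) / beta(s, a)) * adv(s, a) for s, a in batch]
    return np.mean(vals) / (1 - gamma)
```

When `w` is identically one, this reduces to the trust-region surrogate (4) without the penalty term, which is exactly the degenerate case examined in Ablation 1.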

We aim to estimate it by using the steady-state property of Markov processes liu2018breaking; gelada2019off, given by the following theorem; the proof is provided in Appendix A.

###### Theorem 1.

Assume $d_{\pi_\beta}(s) > 0$ whenever $d_\pi(s) > 0$. Then $w(s) = d_\pi(s) / d_{\pi_\beta}(s)$ if and only if it satisfies

$$D\big( \mathcal{B}^\pi w, \; w \big) = 0,$$

where $D$ is some discrepancy function between distributions and $\mathcal{B}^\pi$ is built from a time-reversed conditional probability: the conditional distribution of $(s, a)$ given that the next state is $s'$, following policy $\pi$.

Note that the operator $\mathcal{B}^\pi$ is different from the standard Bellman operator, although they bear some similarities. More specifically, given some state $s'$, the Bellman operator is defined using the next states of $s'$, while $\mathcal{B}^\pi$ is defined using the previous state-actions $(s, a)$ that transition into $s'$. In this sense, $\mathcal{B}^\pi$ is time-backward. The function $w \cdot d_{\pi_\beta}$ actually has the interpretation of a distribution over $\mathcal{S}$. Therefore, $\mathcal{B}^\pi$ describes how visitation flows from a previous state $s$ to a next state $s'$, and it is called the backward flow operator mousavi2019black. In fact, the state visitation ratio is the unique fixed point of $\mathcal{B}^\pi$, that is, $w = \mathcal{B}^\pi w$; we use this important property to formulate Theorem 1. We also note that similar forms of $\mathcal{B}^\pi$ have appeared in the safe RL literature, usually to encode constraints in a dual linear program for an MDP (e.g., wang2008stable; satija2020constrained). However, to the best of our knowledge, using $\mathcal{B}^\pi$ to estimate the state visitation ratio and applying it to the offline RL problem is new.

There are many choices for the discrepancy $D$ and different solution approaches (e.g., nguyen2010estimating; dai2017learning). In this work, we use an approach based on the kernel Maximum Mean Discrepancy (MMD) muandet2017kernel. Given real-valued functions $f$ and $g$ defined on $\mathcal{S}$, we define the following bilinear functional:

$$\langle f, g \rangle_k := \iint f(x) \, k(x, x') \, g(x') \, dx \, dx', \tag{6}$$

where $k(\cdot, \cdot)$ is a positive definite kernel function defined on $\mathcal{S}$, such as the Laplacian and Gaussian kernels.

Let $\mathcal{H}_k$ be the reproducing kernel Hilbert space (RKHS) associated with the kernel function $k$. The MMD between two distributions $p$ and $q$ is defined by

$$\mathrm{MMD}(p, q) = \sup_{h \in \mathcal{H}_k, \, \|h\|_{\mathcal{H}_k} \leq 1} \; \mathbb{E}_{x \sim p}[h(x)] - \mathbb{E}_{x \sim q}[h(x)].$$

Here, $h$ may be considered as a discriminator, playing a similar role to the discriminator network in generative adversarial networks goodfellow2014generative: it measures the difference between $p$ and $q$. A useful property of MMD is that it admits a closed-form expression gretton2012kernel:

$$\mathrm{MMD}^2(p, q) = \langle p, p \rangle_k - 2 \langle p, q \rangle_k + \langle q, q \rangle_k,$$

where $\langle \cdot, \cdot \rangle_k$ is defined in Equation (6), and we use its bilinear property. The expression for MMD does not involve the density of either distribution $p$ or $q$, and it can be optimized directly through sample-based computation.
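The closed-form, sample-based estimate of $\mathrm{MMD}^2$ takes only a few lines. The sketch below uses a Gaussian kernel and the biased (V-statistic) estimator; it is illustrative, not the paper's exact implementation:

```python
import numpy as np

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased sample estimate of MMD^2 between samples X ~ p and Y ~ q:
    mean k(x, x') - 2 * mean k(x, y) + mean k(y, y'), Gaussian kernel.

    X, Y: arrays of shape (n, d) and (m, d).
    """
    def gram(A, B):
        # pairwise squared distances via broadcasting, then the kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return gram(X, X).mean() - 2 * gram(X, Y).mean() + gram(Y, Y).mean()
```

Identical samples give exactly zero under this biased estimator, and well-separated samples give a value close to two (the kernel's self-similarity terms dominate).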

Given independent transition samples $(s_i, a_i, s'_i) \sim \mathcal{D}$, applying MMD to Theorem 1 produces

$$\min_w \; \mathrm{MMD}^2\big( \mathcal{B}^\pi w, \; w \big). \tag{7}$$

In the above formulation, both $\mathcal{B}^\pi w$ and $w$ are treated as probability mass functions over the state-actions encountered in the offline data $\mathcal{D}$. Therefore, we can optimize this objective by mini-batch training.

Recall that Theorem 1 requires $d_{\pi_\beta}(s) > 0$ whenever $d_\pi(s) > 0$, to prevent $w(s)$ from being unbounded. This condition effectively means that $\pi$ should lie in the support of $\pi_\beta$. We accomplish this by using a log-barrier $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\log \pi_\beta(a \mid s)]$ as a support measure, as the log function decreases sharply (toward $-\infty$) when the probability densities of actions sampled from the learned policy are small under the behavior policy (i.e., when $\pi_\beta(a \mid s)$ is small). In practice, $\pi_\beta$ is not explicitly provided, and one usually applies behavioral cloning to the offline dataset to approximate it wu2019behavior. Note that since the cloned behavior policy is trained as a normalized probability distribution, it will assign low probabilities to actions outside of $\mathcal{D}$.
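A minimal behavioral-cloning stand-in for this log-barrier might look as follows. For brevity we fit a single, state-independent Gaussian to the dataset actions, whereas a practical implementation would condition on the state; all names are hypothetical:

```python
import numpy as np

def fit_behavior_gaussian(actions):
    """Behavioral cloning sketch: fit a Gaussian to the dataset actions
    as a crude stand-in for the cloned behavior policy pi_beta."""
    mu = actions.mean(0)
    sigma = actions.std(0) + 1e-6  # floor to keep the density well-defined
    return mu, sigma

def log_barrier(a, mu, sigma):
    """log pi_beta(a): drops sharply (toward -inf) for actions far outside
    the data support, which is what makes it act as a support constraint."""
    return np.sum(-0.5 * ((a - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2 * np.pi))
```

Actions near the data mean receive a much higher log-density than out-of-support actions, so penalizing with this term steers the learned policy back toward the dataset's action support.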

Having introduced our method for estimating the state visitation ratio $w(s)$, we now return to Equation (5). Recall that $w(s) = d_\pi(s) / d_{\pi_\beta}(s)$ and that $J(\pi_\beta)$ is a constant with respect to $\pi$. Compiling all the above results, we obtain the learning objective of $\pi$ as

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ w(s) \, A^{\pi_\beta}(s, a) \big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ \log \pi_\beta(a \mid s) \big] \geq \epsilon.$$

We convert the constrained optimization problem to an unconstrained one by treating the constraint as a penalty term bertsekas1997nonlinear; le2019batch; wu2019behavior, and we finally obtain the learning objective

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ w(s) \, A^{\pi_\beta}(s, a) + \alpha \log \pi_\beta(a \mid s) \big]. \tag{8}$$

Notice that the added log-barrier penalty is non-positive, so we are optimizing a lower bound of the performance difference $J(\pi) - J(\pi_\beta)$, and this lower bound is much tighter than (4) in most cases. Such a policy improvement guarantee is not enforced in other offline RL algorithms.
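Putting the pieces together, the penalized objective (8) can be estimated over a mini-batch as in the sketch below, where every argument is a toy callable standing in for the corresponding learned network (ratio, behavior advantage, cloned behavior log-density, action probability ratio); the names are illustrative:

```python
import numpy as np

def sbac_actor_objective(batch, w, adv, log_beta, pi_ratio, alpha=0.1):
    """Eq. (8) sketch:
    E_{(s,a)~D}[ w(s) * (pi/pi_beta)(a|s) * A^{pi_beta}(s,a) ]
      + alpha * E_{s~D, a~pi}[ log pi_beta(a|s) ].

    batch: iterable of (s, a_data, a_pi), where a_data comes from the
    dataset and a_pi is sampled from the learned policy at s.
    """
    improve = np.mean([w(s) * pi_ratio(s, a) * adv(s, a) for s, a, _ in batch])
    barrier = np.mean([log_beta(s, a_pi) for s, _, a_pi in batch])
    return improve + alpha * barrier
```

Maximizing this quantity with respect to the policy parameters (e.g., by gradient ascent through reparameterized samples `a_pi`) is the actor step; the critic and ratio networks are trained separately as described above.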

### 3.2 Soft Regularization Derivation

We now provide a different view of our learning objective (8). We can rearrange its form by moving the weight $w(s)$ from the advantage term onto the log-barrier, yielding

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}} \Big[ w(s) \, \mathbb{E}_{a \sim \pi(\cdot \mid s)} \Big[ A^{\pi_\beta}(s, a) + \frac{\alpha}{w(s)} \log \pi_\beta(a \mid s) \Big] \Big].$$

Compared to (2), the objective used in previous offline RL algorithms, we make the following three changes: (1) We no longer need to estimate $Q^\pi$; instead, we only need to estimate $A^{\pi_\beta}$, which can be estimated much more easily and robustly than $Q^\pi$, as the behavior policy remains fixed during training. It no longer suffers from the over-estimation issue when computing target Q-values. (2) We use a KL divergence that excludes the learned policy's entropy as the regularization term, since $\mathbb{E}_{a \sim \pi}[-\log \pi_\beta(a \mid s)] = D_{\mathrm{KL}}(\pi \| \pi_\beta) + \mathcal{H}(\pi)$. The entropy term results in a more stochastic policy distribution; we argue that this property is only useful for exploration in the online setting haarnoja2018soft and is harmful in the offline setting, as the optimal policy induced from the offline data is close to deterministic. (3) We use a state-dependent regularization weight $\alpha / w(s)$, instead of the state-independent weight used in prior approaches. When $w(s)$ is small, the regularization term is enlarged to make the policy match the state visitation distribution under the behavior policy, $d_{\pi_\beta}$. This alleviates the distribution shift problem levine2020offline; fujimoto2021minimalist, making the learned policy more stable and robust when evaluated in the environment. When $w(s)$ is large, the policy has a higher chance of visiting the state $s$, so the regularization term becomes small, allowing more freedom to deviate from the behavior policy and thus leading to better performance.

These three changes, together with our policy improvement guarantee, address the three aforementioned challenges. We call our algorithm Soft Behavior-regularized Actor Critic (SBAC); we present its pseudocode in Algorithm 1 and implementation details in Appendix B.

## 4 Experiments

We conduct experiments on both the widely used D4RL MuJoCo datasets and the more complex Adroit hand manipulation environments (visualized in Figure 1).

We compare SBAC with several strong baselines, including policy regularization methods such as BCQ fujimoto2019off, BEAR kumar2019stabilizing, and BRAC-p/v wu2019behavior; critic penalty methods such as CQL kumar2020conservative and F-BRC kostrikov2021offline. We also compare our method with BRAC+ zhang2020brac+, which employed state-dependent behavior regularization, and AlgaeDICE nachum2019algaedice, which constrained the state distribution shift by applying state-visitation-ratio regularization.

Table 1: Results on the D4RL MuJoCo datasets (top) and the Adroit human-demonstration tasks (bottom). BCQ, BEAR, BRAC-p/v, BRAC+ and SBAC are policy regularization methods; AlgaeDICE, CQL and F-BRC are critic penalty methods.

| Task name | BCQ | BEAR | BRAC-p/v | BRAC+ | SBAC (Ours) | AlgaeDICE | CQL | F-BRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cheetah-random | 2.2 | 25.1 | 24.1/31.2 | 26.4±1.0 | 28.7±1.2 | -0.3 | 35.4 | 33.3±1.3 |
| walker-random | 4.9 | 7.3 | -0.2/1.9 | 16.7±2.3 | 1.3±1.7 | 0.5 | 7.0 | 1.5±0.7 |
| hopper-random | 10.6 | 11.4 | 11.0/12.2 | 12.5±0.3 | 16.7±1.3 | 0.9 | 10.8 | 11.3±0.2 |
| cheetah-medium | 40.7 | 41.7 | 43.8/46.3 | 46.6±0.6 | 51.4±0.3 | -2.2 | 44.4 | 41.3±0.3 |
| walker-medium | 53.1 | 59.1 | 77.5/81.1 | 75.1±3.5 | 81.1±1.1 | 0.3 | 79.2 | 78.8±1.0 |
| hopper-medium | 54.5 | 52.1 | 32.7/31.1 | 53.2±3.1 | 99.6±1.7 | 1.2 | 58.0 | 99.4±0.3 |
| cheetah-medium-replay | 38.2 | 38.6 | 45.4/47.7 | 46.1±0.2 | 48.2±0.2 | -2.1 | 46.2 | 43.2±1.5 |
| walker-medium-replay | 15.0 | 19.2 | -0.3/0.9 | 39.0±4.6 | 78.3±1.2 | 0.6 | 26.7 | 41.8±7.9 |
| hopper-medium-replay | 33.1 | 33.7 | 0.6/0.6 | 72.7±18.9 | 98.9±0.8 | 1.1 | 48.6 | 35.6±1.0 |
| cheetah-expert | / | 108.2 | 3.8/-1.1 | / | 98.7±0.1 | / | 104.8 | 108.4±0.5 |
| walker-expert | / | 106.1 | -0.2/0.0 | / | 113.7±0.2 | / | 153.9 | 103.0±5.3 |
| hopper-expert | / | 110.3 | 6.6/3.7 | / | 112.2±0.1 | / | 109.9 | 112.3±0.1 |
| cheetah-medium-expert | 64.7 | 53.4 | 44.2/41.9 | 61.2±2.8 | 93.1±0.8 | -0.8 | 62.4 | 93.3±10.2 |
| walker-medium-expert | 57.5 | 40.1 | 76.9/81.6 | 95.3±5.9 | 112.4±1.5 | 0.4 | 98.7 | 105.2±3.9 |
| hopper-medium-expert | 110.9 | 96.3 | 1.9/0.8 | 112.9±0.1 | 112.8±2.3 | 1.1 | 111.0 | 112.4±0.3 |
| Mean (MuJoCo) | 40.5 | 53.5 | 24.5/25.3 | 54.8±3.6 | 76.4±1.0 | 0.1 | 66.5 | 68.0±2.3 |
| pen-human | 68.9 | -1.0 | 8.1/0.6 | 64.9±1.6 | 78.8±2.0 | -3.3 | 55.8 | / |
| hammer-human | 0.5 | 0.3 | 0.3/0.2 | 3.9±0.9 | 7.7±1.1 | 0.3 | 2.1 | / |
| door-human | 0.0 | -0.3 | -0.3/-0.3 | 11.5±1.2 | 10.1±3.5 | -0.0 | 9.1 | / |
| relocate-human | -0.1 | -0.3 | -0.3/-0.3 | 0.2±0.11 | 0.6±0.3 | -0.1 | 0.4 | / |
| Mean (Adroit) | 17.3 | -0.3 | 1.0/0.1 | 20.1±1.0 | 24.3±1.7 | -0.8 | 16.9 | / |

### 4.1 Performance on benchmarking datasets for offline RL

We evaluate our method on the MuJoCo datasets in the D4RL benchmarks fu2020d4rl, including three environments (halfcheetah, hopper, and walker2d) and five dataset types (random, medium, medium-replay, medium-expert, expert), yielding a total of 15 problem settings. The datasets in this benchmark have been generated as follows: random: roll out a randomly initialized policy for 1M steps. expert: 1M samples from a policy trained to completion with SAC. medium: 1M samples from a policy trained to approximately 1/3 the performance of the expert. medium-replay: replay buffer of a policy trained up to the performance of the medium agent. medium-expert: 50-50 split of medium and expert data.

The results of SBAC and competing baselines on the MuJoCo datasets are reported in Table 1. SBAC consistently outperforms all policy regularization baselines (BCQ, BEAR, BRAC-p/v, BRAC+) on almost all tasks, sometimes by a large margin. This is not surprising, as most policy regularization baselines use restrictive, state-independent regularization penalties, which lead to over-conservative policy learning. Compared to the less restrictive critic penalty methods, SBAC also surpasses strong baselines like CQL and F-BRC on a large fraction of tasks. On the few tasks where F-BRC performs better, SBAC achieves comparable performance. We also observe that SBAC indeed learns more robust and low-variance policies, with a small standard deviation over seeds compared with the policy regularization (BRAC+) and critic penalty (F-BRC) baselines.

### 4.2 Performance on Adroit hand tasks with human demonstrations

We then experiment with a more complex robotic hand manipulation dataset. The Adroit dataset rajeswaranlearning involves controlling a 24-DoF simulated hand to perform four tasks: hammering a nail, opening a door, twirling a pen, and picking up and moving a ball. This dataset is particularly hard for previous state-of-the-art methods in that it contains narrow human demonstrations on high-dimensional robotic manipulation tasks.

We find that SBAC dominates almost all baselines and achieves state-of-the-art performance. The only exception is the door-human task, on which SBAC reaches performance comparable with BRAC+. This demonstrates the ability of SBAC to learn from complex and non-Markovian human datasets.

### 4.3 Analysis of the effect of distribution correction

A notable issue with existing offline RL algorithms is that they exhibit huge variance in performance during evaluation compared to online-trained policies, likely caused by distributional shift and poor generalization fujimoto2021minimalist. In Figure 3, we report the instability test results of SBAC as well as the baseline methods CQL, FisherBRC, and an online TD3 fujimoto2018addressing policy across 10 evaluations, both at a specific point and over a period of time. SBAC achieves surprisingly low variance in performance, beating all other offline RL algorithms and exhibiting a strong ability of distribution error correction. This is extremely important for many safety-aware real-world tasks, which require robust and predictable policy outputs.

As discussed in Section 3.2, this mainly results from three factors. First, the value function is learned entirely from the data via fitted Q-evaluation le2019batch, which is much easier and more robust to estimate. Since SBAC does not rely on a Q-function bound to the learned policy as in typical RL algorithms, it no longer suffers from accumulating exploitation error in the target values during the bootstrapped updates of the Q-values. Second, removing the entropy term from the regularization penalty reduces policy entropy, leading to more deterministic behavior. Lastly, the state-dependent regularization in SBAC places different weights depending on the state visitation ratio, encouraging the learned policy to match the state visitation distribution of the behavior policy.

### 4.4 Ablations

In Section 3.2, we discussed that, compared to objective (2), SBAC replaces $Q^\pi$ with $A^{\pi_\beta}$ and uses the state visitation ratio $w(s)$ as a state-dependent regularization weight. We now dive deeper into each component and investigate its impact on training performance.

Ablation 1 Our first ablation isolates the effect of the state visitation ratio $w(s)$ on performance. Formally, we use the following objective to learn the policy, with everything else unchanged:

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ A^{\pi_\beta}(s, a) + \alpha \log \pi_\beta(a \mid s) \big].$$

This objective is similar to objective (4), the surrogate objective of the online RL methods introduced in Section 3.1. As Figure 4 shows, using this objective amounts to one-step policy optimization against the behavior policy, which is expected to be suboptimal brandfonbrener2021offline.

Ablation 2 Our second ablation isolates the effect of replacing $Q^\pi$ with $A^{\pi_\beta}$ on performance. Formally, we use the following objective to learn the policy:

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}, \, a \sim \pi(\cdot \mid s)} \big[ Q^\pi(s, a) + \alpha \log \pi_\beta(a \mid s) \big].$$

This objective is almost the same as objective (2) with a KL divergence, but with the learned policy's entropy removed. Figure 4 shows that the learning procedure is more unstable due to the inaccurate estimation of $Q^\pi$. Furthermore, the performance of this ablation is lower than SBAC's, suggesting that this objective is too restrictive.

## 5 Related Work

Our work contributes to the literature on behavior regularization methods for offline RL wu2019behavior. The primary ingredient of this class of methods is a policy regularizer that ensures the learned policy does not stray too far from the data generated by the behavior policy. These regularizers can appear explicitly as divergence penalties wu2019behavior; kumar2019stabilizing; fujimoto2021minimalist, implicitly through weighted behavior cloning wang2020critic; peng2019advantage; nair2020accelerating, or more directly through careful parameterization of the policy fujimoto2018addressing; zhou2020latent. Another way to apply behavior regularization is to modify the critic learning objective to encourage staying near the behavioral distribution and being pessimistic about unknown state-action pairs nachum2019algaedice; kumar2020conservative; kostrikov2021offline. However, properly quantifying the uncertainty of unknown states is difficult when dealing with neural network value functions buckman2020importance; xu2021constraints.

In this work, we demonstrate the benefit of a state-dependent behavior regularization term. This draws a connection to other methods laroche2019safe; sohn2020brpo that bootstrap the learned policy with the behavior policy only when the current state-action pair's uncertainty is high, allowing the learned policy to differ from the behavior policy where the largest improvements are possible. However, these methods measure uncertainty by the visitation frequency of state-action pairs in the dataset, which is computationally expensive and nontrivial to apply in continuous control settings. There are also methods that model state-dependent regularization weights with neural networks zhang2020brac+; lee2020batch; however, these weights are hard to train and lack physical interpretation, while also performing worse than our method.

In our learning objective, we need to estimate the discounted marginal state visitation ratio $w(s)$. This draws a connection to the off-policy evaluation (OPE) literature, which aims to estimate the performance of a policy given data from a different policy hallak2017consistent; liu2018breaking; nachum2019dualdice; gelada2019off; mousavi2019black. Some works also incorporate the marginal state visitation difference between the learned policy and the offline data into the learning objective nachum2019algaedice; lee2021optidice; however, optimizing the resulting objectives requires solving difficult min-max optimization problems, which are susceptible to instability and lead to poor performance. This is reflected in our experiments, where AlgaeDICE performs poorly and barely solves any task. Note that our work is unique in using the marginal state visitation ratio as the regularization weight.

## 6 Conclusions and Limitations

In this paper, we propose a simple yet effective offline RL algorithm. Existing model-free behavior-regularized offline RL methods are overly restrictive and often have unsatisfactory performance. Motivated by this limitation, we design a new behavior regularization scheme for offline RL that enables a policy improvement guarantee and state-dependent policy regularization. The resulting algorithm, SBAC, performs strongly against state-of-the-art methods on both the standard D4RL datasets and the more complex Adroit tasks. One limitation of SBAC is the need to estimate a behavior policy, whose quality may be limited by the expressiveness of the chosen policy class. We leave avoiding this estimation as future work.

## References

## Appendix A Proof

Notations Given a learned policy $\pi$, let $d_\pi(s)$ be the discounted marginal distribution of states under $\pi$, that is, $d_\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi)$, and let the state-action policy generating distribution be expressed as $d_\pi(s, a) = d_\pi(s) \, \pi(a \mid s)$. Similarly, denote the state-action data generating distribution as $d_{\pi_\beta}(s, a)$, induced by the data-generating (behavior) policy $\pi_\beta$, that is, $d_{\pi_\beta}(s, a) = d_{\pi_\beta}(s) \, \pi_\beta(a \mid s)$. Note that the dataset $\mathcal{D}$ is formed by multiple trajectories generated by $\pi_\beta$. For each $(s, a, s') \in \mathcal{D}$, we have $(s, a) \sim d_{\pi_\beta}$ and $s' \sim P(\cdot \mid s, a)$.

###### Theorem 2 (Restatement of Theorem 1).

Assume $d_{\pi_\beta}(s) > 0$ whenever $d_\pi(s) > 0$. Then $w(s) = d_\pi(s) / d_{\pi_\beta}(s)$ if and only if it satisfies

$$D\big( \mathcal{B}^\pi w, \; w \big) = 0,$$

where $D$ is some discrepancy function between distributions and $\mathcal{B}^\pi$ is built from a time-reversed conditional probability: the conditional distribution of $(s, a)$ given that the next state is $s'$, following policy $\pi$.

###### Proof.

According to the definition of $d_\pi$, we have

$$\Pr(s_{t+1} = s' \mid \pi) = \sum_{s, a} \Pr(s_t = s \mid \pi) \, \pi(a \mid s) \, P(s' \mid s, a).$$

Then we have, multiplying both sides by $(1 - \gamma) \gamma^{t+1}$,

$$(1 - \gamma) \gamma^{t+1} \Pr(s_{t+1} = s' \mid \pi) = \gamma \sum_{s, a} (1 - \gamma) \gamma^t \Pr(s_t = s \mid \pi) \, \pi(a \mid s) \, P(s' \mid s, a).$$

Summing both sides over $t \geq 0$, we get

$$d_\pi(s') = (1 - \gamma) \rho_0(s') + \gamma \sum_{s, a} d_\pi(s) \, \pi(a \mid s) \, P(s' \mid s, a).$$

The above equation is equivalent to, after substituting $d_\pi(s) = w(s) \, d_{\pi_\beta}(s)$,

$$w(s') \, d_{\pi_\beta}(s') = (1 - \gamma) \rho_0(s') + \gamma \sum_{s, a} w(s) \, d_{\pi_\beta}(s) \, \pi(a \mid s) \, P(s' \mid s, a).$$

Therefore,

$$w(s') = \frac{(1 - \gamma) \rho_0(s')}{d_{\pi_\beta}(s')} + \gamma \sum_{s, a} \frac{d_{\pi_\beta}(s) \, \pi(a \mid s) \, P(s' \mid s, a)}{d_{\pi_\beta}(s')} \, w(s).$$

Denoting the right-hand side as $(\mathcal{B}^\pi w)(s')$, we see that $w = d_\pi / d_{\pi_\beta}$ is a fixed point of $\mathcal{B}^\pi$; under the support assumption this fixed point is unique mousavi2019black, which gives Theorem 1.

∎

∎

## Appendix B Implementation Details

All experiments are implemented in TensorFlow and executed on NVIDIA V100 GPUs. For all function approximators, we use fully connected neural networks with ReLU activations. For policy networks, we use a tanh-squashed Gaussian on the outputs. As in other deep RL algorithms, we maintain source and target Q-functions with an update rate of 0.005 per iteration. We use Adam for all optimizers. The batch size is 256 and
$\gamma$ is 0.99. We rescale the rewards to $[0, 1]$ as $\tilde{r} = (r - r_{\min}) / (r_{\max} - r_{\min})$, where $r_{\max}$ and $r_{\min}$ are the maximum and minimum reward in the dataset; note that any affine transformation of the reward function does not change the optimal policy of the MDP. The behavior, actor, critic and density ratio networks are all 3-layer MLPs with 256 hidden units per layer. The learning rate of the actor and behavior networks is , and the learning rate of the critic and density ratio networks is . We search $\alpha$ in . We clip $w(s)$ at 2.0 to avoid numerical instability.
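The reward rescaling and ratio clipping described above are straightforward; a sketch (assuming the $[0, 1]$ target range stated above):

```python
import numpy as np

def rescale_rewards(r):
    """Affine min-max reward rescaling to [0, 1]; affine transformations
    of the reward leave the optimal policy of the MDP unchanged."""
    r = np.asarray(r, dtype=float)
    return (r - r.min()) / (r.max() - r.min())

def clip_ratio(w, max_w=2.0):
    """Clip the estimated visitation ratio to avoid numerical instability."""
    return np.clip(w, 0.0, max_w)
```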