## 1 Introduction

Deep reinforcement learning (RL) methods are
increasingly successful in domains such as games (mnih2013playing), recommender systems (gauci2018horizon), and robotic manipulation (nachum2019multi).
Much of this success relies on the ability to collect new data through *online* interactions with the environment during training, often relying on simulation. Unfortunately, this approach is impractical in many real-world applications where faithful simulators are rare, and in
which active data collection through interactions with the environment is costly, time consuming, and risky.

*Batch (or offline) RL* (lange2012batch) is an emerging research direction that aims to circumvent the need for
online data collection, instead learning a new policy using only offline trajectories generated by some behavior policy
(e.g., the currently deployed policy in some application domain).
In principle, any off-policy RL algorithm (e.g., DDPG (lillicrap2015continuous), DDQN (van2016deep)) may be used in this batch (or more accurately, “offline”)
fashion; but in practice, such methods have been shown to fail to learn when presented with arbitrary, static, off-policy data.
This can arise for several reasons: lack of exploration (lange2012batch), generalization error on out-of-distribution samples in value estimation (kumar2019stabilizing), or high-variance policy gradients induced by covariate shift (mahmood2014weighted). Various techniques have been proposed to address these issues, many of which can be interpreted as constraining or regularizing the learned policy to be *close* to the behavior policy
(fujimoto2018off; kumar2019stabilizing) (see further discussion below).
While these batch RL methods show promise,
none provide improvement guarantees relative to the behavior policy. In domains for which batch RL is well-suited (e.g., due to the risks of active data collection), such guarantees can be critical to deployment of the resulting RL policies.

In this work, we use the well-established methodology of conservative policy improvement (CPI) (kakade2002approximately) to develop a theoretically principled use of behavior-regularized RL in the batch setting.
Specifically, we parameterize the learned policy as a residual policy, in which a base (behavior) policy is combined linearly with a learned candidate policy using a mixing factor called the confidence.
Such residual policies are motivated by several practical considerations. First, one often has access
to offline data or logs generated by a deployed base policy which is known to perform reasonably well. The offline
data can be used by an RL method to learn a candidate policy with better *predicted* performance, but if confidence in
parts of that prediction is weak, relying on the base policy may be desirable. The base policy may also
incorporate soft business constraints or some form of interpretability.
Our residual policies blend the two in a learned,
non-uniform fashion.
When deploying a new policy, we use the CPI framework to derive updates
that learn both the candidate policy and the confidence that jointly maximize a lower bound on performance improvement relative to the behavior policy.
Crucially, while traditional applications of CPI, such as TRPO (schulman2015trust), use a constant or state-independent confidence, our performance bounds and learning rules are based on state-action-dependent confidences—this gives rise to bounds that are less conservative than their CPI counterparts.

In Sec. 2, we formalize residual policies and in Sec. 3 analyze a novel
*difference-value function*.
Sec. 4 holds our main result, a tighter lower bound on policy improvement for our residual approach (vs. CPI and TRPO). We derive the BRPO algorithm in Sec. 5 to jointly learn the candidate policy and confidence; experiments in Sec. 6 show its effectiveness.

## 2 Preliminaries

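We consider a *Markov decision process (MDP)* $\mathcal{M} = (\mathcal{S}, \mathcal{A}, R, T, P_0)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, transition kernel $T$, and initial state distribution $P_0$.
A policy $\pi$ interacts with the environment, starting at $s_0 \sim P_0$. At step $t$, the policy samples an action $a_t$ from a distribution $\pi(\cdot \mid s_t)$ over $\mathcal{A}$ and applies it.
The environment emits a reward $r_t = R(s_t, a_t)$ and next state $s_{t+1} \sim T(\cdot \mid s_t, a_t)$.
In this work, we consider discounted infinite-horizon problems with discount factor $\gamma \in [0, 1)$.

Let $\Delta$ be the set of Markovian stationary policies. The expected (discounted) cumulative return of policy $\pi \in \Delta$ is $J_\pi := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid P_0, \pi, T\right]$. Our aim is to find an optimal policy $\pi^* \in \arg\max_{\pi \in \Delta} J_\pi$. In reinforcement learning (RL), we must do so without knowledge of $R$ and $T$, using only trajectory data generated from the environment (see below) or access to a simulator (see above).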

We consider pure offline or *batch RL*, where
the learner has access to a fixed data set (or *batch*) of state-action-reward-next-state samples
$B = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{|B|}$, generated by a (known) *behavior policy* $\beta$.
No additional data collection is permitted. We denote by $d_\beta$ the $\gamma$-discounted occupation measure of the MDP w.r.t. $\beta$.
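For a small tabular MDP the discounted occupation measure can be computed exactly, which may help make the definition concrete. The sketch below is illustrative only: the names `discounted_occupancy`, `P0`, `T`, and `beta`, the random MDP, and the $(1-\gamma)$ normalization (which makes the measure sum to one) are assumptions, not taken from the paper.

```python
import numpy as np


def discounted_occupancy(P0, T, beta, gamma):
    """Normalized discounted state occupancy of policy `beta` in a tabular MDP.

    P0:   (S,) initial state distribution
    T:    (S, A, S) transition kernel, T[s, a, s'] = Pr(s' | s, a)
    beta: (S, A) policy, beta[s, a] = Pr(a | s)
    """
    # State-to-state transitions under beta: P[s, s'] = sum_a beta[s, a] * T[s, a, s'].
    P_beta = np.einsum("sa,sax->sx", beta, T)
    n_states = P0.shape[0]
    # d^T = (1 - gamma) * P0^T (I - gamma * P_beta)^{-1}, solved as a linear system.
    return (1.0 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_beta).T, P0)


# Hypothetical random MDP and behavior policy, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)
beta = rng.random((S, A))
beta /= beta.sum(axis=1, keepdims=True)
P0 = np.full(S, 1.0 / S)

d_beta = discounted_occupancy(P0, T, beta, gamma)
```

In the batch setting this quantity is never computed explicitly; expectations under it are instead approximated by averaging over the states in the batch.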

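In this work, we study the problem of *residual policy optimization (RPO)* in the batch setting.
Given the behavior policy $\beta$, we would like to learn a *candidate policy* $\rho$ and a state-action *confidence* $\lambda(s, a)$, such that the final *residual policy*

$$\pi(a \mid s) = \big(1 - \lambda(s, a)\big)\,\beta(a \mid s) + \lambda(s, a)\,\rho(a \mid s)$$

maximizes total return. As discussed above,
this type of mixture allows one to exploit an existing, “well-performing” behavior policy. Intuitively, $\lambda(s, a)$ should capture how much we can trust $\rho$ at each $(s, a)$ pair, given the available data. To ensure that the residual policy is a probability distribution at every state $s$, we constrain the confidence $\lambda$ to lie in the set

$$\Lambda(s) = \Big\{ \lambda(s, \cdot) \in [0, 1]^{|\mathcal{A}|} : \sum_{a} \lambda(s, a)\,\big(\beta(a \mid s) - \rho(a \mid s)\big) = 0 \Big\}.$$

#### Related Work.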

Similar to the above policy formulation, CPI (kakade2002approximately) also develops a policy mixing methodology that guarantees performance improvement when the confidence is a constant. However, CPI is an online algorithm, and it learns the candidate policy independently of (not jointly with) the mixing factor; thus, how to extend CPI to the offline, batch setting is unclear. Other existing work also deals with online residual policy learning without jointly learning mixing factors (johannink2019residual; silver2018residual). Common applications of CPI may treat the confidence as a hyper-parameter, which specifies the maximum total-variation distance between the learned and behavior policy distributions (see standard proxies in schulman2015trust; pirotta2013safe for details).

*Batch-constrained Q-learning (BCQ)* (fujimoto2018off; fujimoto2019benchmarking) incorporates the behavior policy
when defining
the admissible action set in Q-learning for selecting the highest-valued actions that are similar to data samples in the batch.
*BEAR* (kumar2019stabilizing) is motivated as a means to control the accumulation of out-of-distribution value errors; but its main algorithmic contribution
is realized by adding a regularizer to the loss that measures the kernel maximum mean discrepancy (MMD) (gretton2007kernel) between the learned and behavior policies similar to KL-control (jaques2019way).
Algorithms such as SPI (ghavamzadeh2016safe) and SPIBB (laroche2017safe) bootstrap the learned policy with the behavior policy when the uncertainty in the update for the current state-action pair is high, where the uncertainty is measured by the visitation frequency of state-action pairs in the batch data. While these methods work well in some applications, it is unclear if they have any performance guarantees.
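Returning to the residual policy introduced at the start of this section, a minimal single-state sketch may help show why the confidence has to be constrained. It assumes the mixture takes the form $(1-\lambda)\beta + \lambda\rho$ with a per-action confidence; the function names and the numbers are hypothetical.

```python
def residual_policy(beta, rho, lam):
    """Mix behavior and candidate action probabilities at a single state.

    beta, rho: lists of action probabilities (each sums to 1)
    lam:       list of per-action confidences in [0, 1]
    """
    return [(1.0 - l) * b + l * r for b, r, l in zip(beta, rho, lam)]


def constraint_gap(beta, rho, lam):
    """sum_a lam(a) * (beta(a) - rho(a)); the mixture sums to 1 minus this gap."""
    return sum(l * (b - r) for b, r, l in zip(beta, rho, lam))


beta = [0.5, 0.3, 0.2]
rho = [0.2, 0.3, 0.5]

# A confidence that is constant across actions always satisfies the constraint.
pi_const = residual_policy(beta, rho, [0.4, 0.4, 0.4])

# An action-dependent confidence must be chosen so that the gap vanishes:
# 0.6 * (0.5 - 0.2) + 0.1 * 0.0 + 0.6 * (0.2 - 0.5) = 0.
lam_sa = [0.6, 0.1, 0.6]
pi_sa = residual_policy(beta, rho, lam_sa)

# An arbitrary confidence generally breaks normalization.
pi_bad = residual_policy(beta, rho, [0.9, 0.1, 0.1])
```

Here `pi_const` and `pi_sa` are valid distributions, while `pi_bad` sums to less than one because its constraint gap is non-zero.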

## 3 The Difference-value Function

We begin by defining and characterizing the *difference-value function*,
a concept we exploit in the derivation of our batch RPO method in Secs. 4 and 5.
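For any $s \in \mathcal{S}$, let $V_\pi(s)$ and $V_\beta(s)$ be the value functions induced by policies $\pi$ and $\beta$, respectively.
Using the structure of the residual policy, we establish two characterizations of the *difference-value function* $\Delta V_{\pi,\beta}(s) := V_\pi(s) - V_\beta(s)$.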

###### Lemma 1.

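Let $A_\pi(s, a) := Q_\pi(s, a) - V_\pi(s)$ be the advantage function w.r.t. residual policy $\pi$, where $Q_\pi$ is the state-action value. The difference-value is $\Delta V_{\pi,\beta}(s) = \mathbb{E}_{T,\beta}\left[\sum_{t=0}^{\infty} \gamma^t\, \Delta A^{\pi}_{\rho,\lambda}(s_t) \mid s_0 = s\right]$, where

$$\Delta A^{\pi}_{\rho,\lambda}(s) = \sum_{a} \lambda(s, a)\,\big(\rho(a \mid s) - \beta(a \mid s)\big)\, A_\pi(s, a)$$

is the residual reward that depends on $\lambda$ and the difference of candidate policy $\rho$ and behavior policy $\beta$.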

This result establishes that the difference value is essentially a value function w.r.t. the residual reward. Moreover, it is proportional to the advantage of the target policy, the confidence, and the difference of policies.
While the difference value can be estimated from behavior data batch , this formulation requires knowledge of the advantage function w.r.t. the target policy, which must be re-learned at every -update in an off-policy fashion. Fortunately, we can show that the difference value can
also be expressed as a function of the advantage w.r.t. the *behavior policy* :

###### Theorem 2.

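Let $A_\beta(s, a) := Q_\beta(s, a) - V_\beta(s)$ be the advantage function induced by $\beta$, in which $Q_\beta$ is the state-action value. The difference-value is given by $\Delta V_{\pi,\beta}(s) = \mathbb{E}_{T,\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, \Delta A^{\beta}_{\rho,\lambda}(s_t) \mid s_0 = s\right]$, where

$$\Delta A^{\beta}_{\rho,\lambda}(s) = \sum_{a} \lambda(s, a)\,\big(\rho(a \mid s) - \beta(a \mid s)\big)\, A_\beta(s, a)$$

is the residual reward that depends on $\lambda$ and the difference of candidate policy $\rho$ and behavior policy $\beta$.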

In our RPO approach, we exploit the nature of the difference-value function to solve the maximization w.r.t. the confidence and candidate policy: , . Since implies , the optimal difference-value function is always lower-bounded by . We motivate computing with the above difference-value formulation rather than as a standard RL problem as follows. In the tabular case, optimizing with either formulation gives an identical result. However, both the difference-value function in Theorem 2 and the standard RL objective require sampling data generated by the updated policy . In the batch setting, when fresh samples are unavailable, learning with off-policy data may incur instability due to high generalization error (kumar2019stabilizing). While this can be alleviated by adopting the CPI methodology, applying CPI directly to RL can be overly conservative (schulman2015trust). By contrast, we leverage the special structure of the difference-value function (e.g., non-negativity) below, using this new formulation together with CPI to derive a less conservative RPO algorithm.
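The two characterizations of the difference-value function can be checked numerically on a small tabular MDP. The sketch below assumes the residual reward has the form $\sum_a \lambda(s,a)\,(\rho(a|s) - \beta(a|s))\,A(s,a)$, which is what the performance-difference lemma yields for a linear mixture of $\beta$ and $\rho$; the random MDP, the state-dependent confidence, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random tabular MDP and policies (illustration only).
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((S, A))
beta = rng.random((S, A))
beta /= beta.sum(axis=1, keepdims=True)
rho = rng.random((S, A))
rho /= rho.sum(axis=1, keepdims=True)
lam = rng.random((S, 1)) * np.ones((S, A))  # state-dependent confidence, always feasible
pi = (1.0 - lam) * beta + lam * rho         # residual policy


def evaluate(policy):
    """Exact policy evaluation: returns V of shape (S,) and the state transition matrix (S, S)."""
    P = np.einsum("sa,sax->sx", policy, T)
    r = (policy * R).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P, r), P


def advantage(V):
    """A[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s'] - V[s]."""
    return R + gamma * T @ V - V[:, None]


V_beta, P_beta = evaluate(beta)
V_pi, P_pi = evaluate(pi)
diff_value = V_pi - V_beta

# Lemma 1 form: residual reward built from A_pi, accumulated along beta's dynamics.
reward_pi = (lam * (rho - beta) * advantage(V_pi)).sum(axis=1)
lemma1 = np.linalg.solve(np.eye(S) - gamma * P_beta, reward_pi)

# Theorem 2 form: residual reward built from A_beta, accumulated along pi's dynamics.
reward_beta = (lam * (rho - beta) * advantage(V_beta)).sum(axis=1)
theorem2 = np.linalg.solve(np.eye(S) - gamma * P_pi, reward_beta)
```

Under these assumptions both `lemma1` and `theorem2` agree with `diff_value` up to floating-point error. The first needs the advantage of the residual policy, the second needs rollouts under it, which is the trade-off discussed above.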

## 4 Batch Residual Policy Optimization

We now develop an RPO algorithm that has stable learning performance in the batch setting and *performance improvement guarantees*. For the sake of brevity, in the following we only present the main results on performance guarantees of RPO. Proofs of these results can be found in the appendix of the extended paper.
We begin with the following baseline result, directly applying Corollary 1 of the TRPO result to RPO to
ensure the residual policy performs no worse than the behavior policy.

###### Lemma 3.

For any value function , the difference-return satisfies where the surrogate objective and the penalty weight are

where .

When , one has , , which implies that the inequality is tight—this lemma then
coincides with Lemma 1. While this CPI result forms the basis of many RL algorithms (e.g., TRPO, PPO), in many cases it is very loose since it involves a maximum over all states. Thus, using this bound for policy optimization may be *overly conservative*, i.e., algorithms which rely on this bound must take very small policy improvement steps, especially when the penalty weight is large. While this approach may be reasonable in online settings, where collection of new data (with an updated behavior policy) is allowed, in the batch setting
it is challenging to overcome such conservatism.

To address this issue, we develop a CPI method that is specifically tied to the difference-value formulation, and uses a state-action-dependent confidence . We first derive the following theorem, which bounds the difference returns that are generated by and .

###### Theorem 4.

The difference return of satisfies

where the surrogate objective function, regularization, and penalty weight are given by

respectively, in which is the discounted occupancy measure w.r.t. given initial state .

Unlike the difference-value formulations in Lemma 1 and Theorem 2, which require knowledge of the advantage function of the residual policy or trajectory samples generated by it, the lower bound in Theorem 4 consists only of terms that can be estimated directly using the data batch (i.e., data generated by the behavior policy). This makes it a natural objective function for batch RL. Notice also that the surrogate objective, the regularization, and the penalty weight in the lower bound are each proportional to the confidence and to the relative difference of the candidate and behavior policies. However, the $\max$ operator requires state enumeration to compute this lower bound, which is intractable when the state space is large or uncountable.

We address this by introducing a slack variable to replace the $\max$ operator with suitable constraints. This allows the bound on the difference return to be rewritten as a constrained problem, and we then consider the Lagrangian of this lower bound.

To simplify this saddle-point problem, we restrict the Lagrange multiplier to a form parameterized by a single scalar multiplier. Using this approximation and the strong duality of linear programming (boyd2004convex) over the primal-dual variables, the saddle-point problem can be re-written as

(1) |

where . The equality is based on the KKT condition on . Notice that the only difference between the CPI lower bound in Theorem 4 and the objective function is that the operator is replaced by expectation w.r.t the initial distribution.

With certain assumptions on the approximation error of the Lagrange multiplier parametrization , we can characterize the gap between the original CPI objective function in Theorem 4 and . One approach is to look into the KKT condition of the original saddle-point problem and bound the sub-optimality gap introduced by this Lagrange parameterization. Similar derivations can be found in the analysis of approximate linear programming (ALP) algorithms (abbasi2019large; de2003linear).

Compared with the vanilla CPI result from Lemma 3, there are two characteristics in problem (1) that make the optimization w.r.t. less conservative. First, the penalty weight here is smaller than in Lemma 3, which means that the corresponding objective has less incentive to force to be close to .
Second, compared with entropy regularization in vanilla CPI, here the regularization and penalty weight are both linear in $\lambda$; thus, unlike vanilla CPI, whose objective is linear in $\lambda$,
our objective is quadratic in $\lambda$. This modification ensures the optimal value is not
a *degenerate extreme point* of $\Lambda$. (Footnote 2: For example, when $\lambda$ is state-dependent, which automatically satisfies the equality constraints in $\Lambda$, the linear objective in vanilla CPI makes the optimal value a *0-1 vector*.)

## 5 The BRPO Algorithm

We now develop the BRPO algorithm, for which the general pseudo-code is given in Algorithm 1. Recall that if the candidate policy and confidence are jointly optimized

(2) |

then the residual policy performs no worse than the behavior policy. Generally, solutions for problem (2) use a form of minorization-maximization (MM) (hunter2004tutorial), a class of methods that also includes expectation maximization. In the terminology of MM algorithms, the objective of problem (2) is a surrogate function satisfying the following *MM properties*:

(3) |

which guarantees that it minorizes the difference-return with equality at (with arbitrary ) or at (with arbitrary ). This algorithm is also reminiscent of proximal gradient methods.
We optimize
and in RPO with a simple two-step *coordinate-ascent*. Specifically, at iteration , given confidence , we first compute an updated candidate policy , and with fixed, we update , i.e., . When and are represented tabularly or with linear function approximators, under certain regularity assumptions (the Kurdyka-Lojasiewicz property (xu2013block)) coordinate ascent guarantees global convergence (to the limit point) for BRPO.
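The two-step coordinate-ascent pattern can be illustrated on a toy problem. The sketch below does not use the BRPO surrogate (whose closed form is not reproduced here); it alternates exact maximizations of a hypothetical jointly concave function of two scalars, with `x` standing in for the candidate-policy block and `y` for the confidence block.

```python
def objective(x, y):
    """Toy jointly concave objective standing in for the surrogate (not the BRPO bound)."""
    return -(x - 1.0) ** 2 - (y - 2.0) ** 2 - x * y


def coordinate_ascent(x, y, num_iters=50):
    """Alternate exact block maximizations and record the objective after each sweep."""
    history = [objective(x, y)]
    for _ in range(num_iters):
        x = 1.0 - y / 2.0  # argmax over x with y fixed
        y = 2.0 - x / 2.0  # argmax over y with the updated x fixed
        history.append(objective(x, y))
    return x, y, history


x_star, y_star, history = coordinate_ascent(x=3.0, y=-1.0)
```

Because each step maximizes the objective exactly over one block, the recorded objective values never decrease; for this strongly concave toy problem the iterates approach the unique maximizer at `x = 0`, `y = 2`.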

However, when more complex representations (e.g., neural networks) are used to parameterize these decision variables, this property no longer holds. While one may still compute these updates with first-order methods (e.g., SGD), convergence to local optima is not guaranteed. To address this, we next further restrict the MM procedure to develop closed-form solutions for both the candidate policy and the confidence.

#### The Closed-form Candidate Policy.

To effectively update the candidate policy when given the confidence , we develop a closed-form solution for . Our approach is based on maximizing the following objective, itself a more conservative version of the CPI lower bound :

(4) |

where for any arbitrary non-negative function . To show that in (4) is an eligible lower bound (so that the corresponding -solution is an MM), we need to show that it satisfies the properties in (3). When , by the definition of the second property holds. To show the first property, we first consider the following problem:

(5) |

where is given in Theorem 4, and

The concavity of (i.e., ) and monotonicity of expectation imply that the objective in (4) is a lower bound of that in (6) below. Furthermore, by the weighted Pinsker’s inequality (bolley2005weighted) , we have: (i) ; and (ii) , which implies the objective in (5) is a lower-bound of that in (2) and validates the first MM property.

Now recall the optimization problem: . Since this optimization is over the state-action mapping , the Interchangeability Lemma (shapiro2009lectures) allows swapping the order of and . This implies that at each the candidate policy can be solved using:

(6) |

where is the state-dependent penalty weight of the relative entropy regularization. By the KKT condition of (6), the optimal candidate policy has the form

(7) |

Notice that the optimal candidate policy is a *relative softmax policy*, which is a common solution policy for many entropy-regularized RL algorithms
(haarnoja2018soft).
Intuitively, when the mixing factor vanishes (i.e., $\lambda(s, a) = 0$), the candidate policy equals the behavior policy, and with non-zero confidence we obtain the candidate policy by modifying the behavior policy via *exponential twisting*.
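A relative softmax of this kind can be sketched for a single state with discrete actions. The exact expression of (7) is not reproduced here, so the form below, $\rho(a) \propto \beta(a)\exp(\lambda(a) A(a) / \tau)$, is an assumption consistent with the description above; the function name, the advantages, and the temperatures are hypothetical.

```python
import math


def relative_softmax(beta, advantage, lam, tau):
    """Candidate policy at one state: rho(a) proportional to beta(a) * exp(lam(a) * A(a) / tau)."""
    logits = [l * adv / tau for l, adv in zip(lam, advantage)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [b * math.exp(z - shift) for b, z in zip(beta, logits)]
    total = sum(weights)
    return [w / total for w in weights]


beta = [0.5, 0.3, 0.2]
advantage = [-1.0, 0.0, 2.0]  # hypothetical behavior-policy advantages at this state

rho_zero = relative_softmax(beta, advantage, lam=[0.0, 0.0, 0.0], tau=1.0)
rho_conf = relative_softmax(beta, advantage, lam=[1.0, 1.0, 1.0], tau=1.0)
rho_cold = relative_softmax(beta, advantage, lam=[1.0, 1.0, 1.0], tau=100.0)
```

With zero confidence the behavior policy is returned unchanged, with positive confidence probability mass moves toward the high-advantage action, and with a very large temperature the twisting is almost switched off.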

#### The Closed-form Confidence.

Given the candidate policy, we derive an efficient scheme for computing the confidence that solves the MM problem. Recall that this optimization can be reformulated as a concave quadratic program (QP) with linear equality constraints, which has a unique optimal solution (faybusovich1997infinite). However, since the decision variable (i.e., the confidence mapping) is infinite-dimensional, solving this QP is intractable without some assumptions about this mapping. To resolve this issue, instead of using the surrogate objective in MM, we turn to its sample-based estimate. Specifically, given a batch of data generated by the behavior policy, denote by

the sample-average approximation (SAA) of functions , , and respectively, where , , and are -dimensional vectors, where each element is generated by a state sample from , and is a -dimensional decision vector, where each -dimensional element vector corresponds to the confidence w.r.t. state samples in .
Since the expectation in , , and is over the stationary distribution induced by the behavior policy, all the SAA functions are *unbiased* Monte-Carlo estimates of their population-based counterparts. We now define as the SAA-MM objective and use this to solve for the confidence vector over the batch samples.

Now consider the following maximization problem:

(8) |

where the feasible set only imposes constraints on the states that appear in the batch .

This finite-dimensional QP problem can be expressed in the following quadratic form:

where the symmetric matrix is given by

and is a -diagonal matrix whose elements are the absolute advantage function. By definition, is positive-semi-definite, hence the QP above is concave. Using its KKT condition, the unique optimal confidence vector over batch is given as

(9) |

where is a -matrix, and the Lagrange multiplier w.r.t. constraint is given by

(10) |
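The structure of such a closed-form solution can be seen on a generic equality-constrained concave QP. The sketch below is not the paper's QP: the matrices `H`, `c`, and `M` are random placeholders, `H` is assumed positive definite (rather than only positive semi-definite), and the box constraint on the confidence is ignored.

```python
import numpy as np


def solve_equality_qp(H, c, M):
    """Maximize c^T x - 0.5 * x^T H x subject to M x = 0, for positive-definite H.

    KKT conditions: c - H x - M^T nu = 0 and M x = 0.
    """
    Hinv_c = np.linalg.solve(H, c)
    Hinv_Mt = np.linalg.solve(H, M.T)
    nu = np.linalg.solve(M @ Hinv_Mt, M @ Hinv_c)  # Lagrange multiplier of M x = 0
    x = Hinv_c - Hinv_Mt @ nu
    return x, nu


# Random placeholder problem (illustration only).
rng = np.random.default_rng(0)
n, m = 6, 2
G = rng.standard_normal((n, n))
H = G @ G.T + np.eye(n)          # positive definite by construction
c = rng.standard_normal(n)
M = rng.standard_normal((m, n))  # full row rank with probability one

x_opt, nu_opt = solve_equality_qp(H, c, M)
```

The solution has the same two-part shape as the text describes: an unconstrained maximizer corrected by a multiplier term that restores feasibility.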

We first construct the confidence function from the confidence vector over , in the following tabular fashion:

While this construction preserves optimality w.r.t. the CPI objective (2), it may be overly conservative, because the residual policy reduces to the behavior policy (the confidence is set to zero) at state-action pairs that are not in the batch, i.e., there is no policy improvement there. To alleviate this conservatism, we propose to learn a confidence function that generalizes to out-of-distribution samples.

Table 1: Average return of BRPO and the baselines under the best hyper-parameter configuration in each task setting. The number after each environment name is the exploration level $\epsilon$ used to generate the batch.

Environment-$\epsilon$ | DQN | BRPO-C | BRPO (ours) | BCQ | KL-Q | SPIBB | BC | Behavior Policy |
---|---|---|---|---|---|---|---|---|
Acrobot-0.05 | -91.2 ± 9.1 | -94.6 ± 3.8 | -91.9 ± 9.0 | -96.9 ± 3.7 | -93.0 ± 2.6 | -103.5 ± 24.1 | -102.3 ± 5.0 | -103.9 |
Acrobot-0.15 | -83.1 ± 5.2 | -91.7 ± 4.0 | -86.1 ± 10.1 | -97.1 ± 3.3 | -92.1 ± 3.2 | -91.1 ± 44.8 | -113.1 ± 5.6 | -114.3 |
Acrobot-0.25 | -83.4 ± 3.9 | -91.2 ± 4.1 | -85.3 ± 4.8 | -96.7 ± 3.1 | -90.0 ± 2.9 | -86.0 ± 5.8 | -124.1 ± 7.0 | -127.2 |
Acrobot-0.50 | -84.3 ± 22.6 | -90.9 ± 3.4 | -83.7 ± 16.6 | -77.8 ± 13.5 | -84.5 ± 3.8 | -106.8 ± 102.7 | -173.7 ± 8.1 | -172.4 |
Acrobot-1.00 | -208.9 ± 174.8 | -156.8 ± 22.0 | -121.7 ± 10.2 | -236.0 ± 85.6 | -227.5 ± 148.1 | -184.8 ± 150.2 | -498.3 ± 1.7 | -497.3 |
CartPole-0.05 | 82.7 ± 0.5 | 220.8 ± 117.0 | 336.3 ± 122.6 | 255.4 ± 11.1 | 323.0 ± 13.5 | 28.8 ± 1.2 | 205.6 ± 19.6 | 219.1 |
CartPole-0.15 | 299.3 ± 133.5 | 305.6 ± 95.2 | 409.9 ± 64.4 | 255.3 ± 11.4 | 357.7 ± 84.1 | 137.7 ± 11.7 | 151.6 ± 27.5 | 149.5 |
CartPole-0.25 | 368.5 ± 129.3 | 405.1 ± 74.4 | 316.8 ± 64.1 | 247.4 ± 128.7 | 441.4 ± 79.8 | 305.2 ± 119.7 | 103.0 ± 20.4 | 101.9 |
CartPole-0.50 | 271.5 ± 52.0 | 358.3 ± 114.1 | 433.8 ± 93.5 | 282.5 ± 111.8 | 314.1 ± 107.0 | 310.4 ± 128.0 | 39.7 ± 5.1 | 37.9 |
CartPole-1.00 | 118.3 ± 0.3 | 458.6 ± 51.5 | 369.0 ± 42.3 | 194.0 ± 25.1 | 209.7 ± 48.4 | 147.1 ± 0.1 | 22.6 ± 1.5 | 21.9 |
LunarLander-0.05 | -236.4 ± 177.6 | 35.6 ± 61.7 | 88.2 ± 32.0 | 81.5 ± 14.9 | 84.4 ± 26.3 | -200.4 ± 81.7 | 75.8 ± 17.7 | 73.7 |
LunarLander-0.15 | -215.6 ± 140.4 | 79.6 ± 29.7 | 103.9 ± 49.8 | 80.3 ± 16.8 | 61.4 ± 39.0 | 86.1 ± 73.3 | 76.4 ± 16.6 | 84.9 |
LunarLander-0.25 | 2.5 ± 101.3 | 109.5 ± 40.7 | 141.6 ± 11.0 | 83.5 ± 14.6 | 78.7 ± 48.8 | 166.0 ± 90.6 | 57.9 ± 13.1 | 57.3 |
LunarLander-0.50 | -104.6 ± 68.3 | 42.5 ± 71.4 | 101.0 ± 39.6 | -13.2 ± 44.9 | 66.2 ± 78.0 | -134.6 ± 17.1 | -32.6 ± 6.5 | -36.0 |
LunarLander-1.00 | -65.6 ± 45.9 | 53.5 ± 44.1 | 81.8 ± 42.1 | -69.1 ± 44.0 | -139.2 ± 29.1 | -107.1 ± 94.4 | -177.4 ± 13.1 | -182.6 |

#### Learning the Confidence.

Given a confidence vector corresponding to samples in the batch, we learn the confidence function in supervised fashion. To ensure that the confidence function satisfies the constraint $\lambda \in \Lambda$, i.e., $\sum_{a} \lambda(s, a)\,\big(\beta(a \mid s) - \rho(a \mid s)\big) = 0$ for all $s$ (footnote 3: if one restricts $\lambda$ to be only *state*-dependent, this constraint immediately holds), we parameterize it as

(11) |

where is a learnable policy mapping, such that , . We then learn via the following KL distribution-fitting objective (rusu2015policy):

While this approach learns
by generalizing the confidence vector to out-of-distribution samples, when is a NN, one challenge is to
enforce the constraint: , . Instead,
using an in-graph convex optimization NN (amos2017optnet), we parameterize with a NN with the following *constraint-projection layer* before the output:

s.t. | (12) |

where, at any , the -dimensional confidence vector label is equal to chosen from the batch confidence vector such that in is closest to . Indeed, analogous to the closed-form solution in (9), this projection layer has a closed-form QP formulation with linear constraints: , where Lagrange multiplier is given by

Although the -update is theoretically justified, in practice, when the magnitude of becomes large (due to the conservatism of the weighted Pinsker inequality), the relative-softmax candidate policy (7) may be too close to the behavior policy , impeding learning of the residual policy (i.e., ). To avoid this in practice, we can upper bound the temperature, i.e., , or introduce a weak temperature-decay schedule, i.e., , with a tunable .
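Both remedies are simple to express in code. The exact formulas are not reproduced above, so the sketch below makes assumptions: the cap is a plain `min` with a hypothetical `tau_max`, and the decay schedule is taken to be geometric in the iteration index with a hypothetical tunable factor `decay`.

```python
def capped_temperature(tau, tau_max):
    """Upper-bound the temperature so the candidate policy can move away from the behavior policy."""
    return min(tau, tau_max)


def decayed_temperature(tau, iteration, decay=0.99):
    """Geometric decay (assumed form): shrink the temperature by a factor `decay` per iteration."""
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must lie in (0, 1]")
    return tau * decay ** iteration


taus = [decayed_temperature(10.0, k, decay=0.9) for k in range(4)]
capped = capped_temperature(250.0, tau_max=5.0)
```

Either variant lowers the effective temperature, which lets the relative-softmax candidate policy depart further from the behavior policy than the theoretically derived value would allow.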

## 6 Experimental Results

To illustrate the effectiveness of BRPO, we compare against six baselines: DQN (mnih2013playing),
discrete BCQ (fujimoto2019benchmarking),
KL-regularized Q-learning (KL-Q) (jaques2019way),
SPIBB (laroche2017safe), Behavior Cloning (BC) (kober2010imitation), and BRPO-C, which is a simplified version of BRPO that uses a constant (tunable) parameter as confidence weight. (Footnote 4: For algorithms designed for online settings, we modify data collection to sample only from offline / batch data.)
We do not consider ensemble models, and thus do not include methods like BEAR (kumar2019stabilizing) among our baselines. CPI is also excluded since it is subsumed by BRPO-C with a grid search on the confidence. It is also generally inferior to BRPO-C because candidate policy learning does not optimize the performance of the final mixture policy. We evaluated on three discrete-action OpenAI Gym tasks (openaigym): CartPole-v1, LunarLander-v2, and Acrobot-v1.

The behavior policy in each environment is trained using standard DQN until it reaches a fixed fraction of optimal performance, similar to the process adopted in related work (e.g., fujimoto2018off). To assess how exploration and the quality of behavior policy affect learning, we generate five sets of data for each task by injecting different random exploration into the same behavior policy. Specifically, we add $\epsilon$-greedy exploration for $\epsilon = 1.0$ (fully random), $0.5$, $0.25$, $0.15$, and $0.05$, generating a fixed number of transitions each for batch RL training.
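The exploration injection can be sketched as a thin wrapper around the behavior policy's greedy action. This is a minimal illustration only: the function name, the fixed greedy action `2`, and the four-action setting are hypothetical, and no environment is simulated.

```python
import random


def epsilon_greedy_action(greedy_action, num_actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else keep the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return greedy_action


rng = random.Random(0)
fully_random = [epsilon_greedy_action(2, 4, 1.0, rng) for _ in range(1000)]
mostly_greedy = [epsilon_greedy_action(2, 4, 0.05, rng) for _ in range(1000)]
never_random = [epsilon_greedy_action(2, 4, 0.0, rng) for _ in range(1000)]
```

Small values of `epsilon` give data concentrated on the behavior policy's own choices, while `epsilon = 1.0` ignores the behavior policy entirely; this is the axis along which the five data sets differ.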

All models use the same architecture for a given environment—details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator, at every training iterations. Each measurement is the average return w.r.t. evaluation episodes and random seeds, and results are averaged over a sliding window of size .

Table 1 shows the average return of BRPO and the other baselines under the best hyper-parameter configurations in each task setting. Behavior policy performance decreases as increases, as expected, and BC matches that very closely. DQN performs poorly in the batch setting. Its performance improves as increases from to , due to increased state-action coverage, but as goes higher (, ), the state space coverage decreases again since the (near-) random policy is less likely to reach a state far away from the initial state.

Baselines like BCQ, KL-Q and SPIBB follow the behavior policy in some ways, and show different performance characteristics over the data sets. The underperformance relative to BRPO is more prominent for very low or very high $\epsilon$, suggesting deficiency due to overly conservative updates or following the behavior policy too closely, in regimes where BRPO is still able to learn.

Since BRPO exploits the statistics of each state-action pair in the batch data, it achieves good performance in almost all scenarios and often outperforms the baselines. The stable performance and robustness across various scenarios make BRPO an appealing algorithm for batch/offline RL in real-world settings, where it is usually difficult to estimate the amount of exploration required prior to training, given access only to batch data.

## 7 Concluding Remarks

We have presented Batch Residual Policy Optimization (BRPO) for learning residual policies in batch RL settings.
Inspired by CPI, we derived learning rules for jointly optimizing both the candidate policy and
*state-action dependent* confidence mixture of a residual policy to maximize a conservative lower bound on policy performance.
BRPO is thus more exploitative in areas of state space that are well-covered by the batch data and more conservative in others.
While we have shown successful application of BRPO to various benchmarks, future work includes deriving finite-sample analysis of BRPO,
and applying BRPO to more practical batch domains (e.g., robotic manipulation, recommendation systems).

## References

## Appendix A Proofs for Results in Section 3

### A.1 Proof of Lemma 1

Before going into the derivation of this theorem, we first have the following technical result that studies the distance of the occupation measures that are induced by and .

###### Lemma 5.

The following expression holds for any state-next-state pair :

where and represent the transition probabilities from state to next-state following policy and respectively, and for any state-next-state pair ,
