## Introduction

Offline reinforcement learning (RL) involves using a previously collected static dataset, without any online interaction, to learn an output policy. This problem setting is important for a variety of real-world problems where learning online can be dangerous, such as for self-driving cars, or where building a good simulator is costly, such as for healthcare. It is also a useful setting for applications where a large amount of data is available, such as dynamic search advertising.

A challenge of offline RL is that the quality of the output policy can be highly dependent on the data. The most obvious issue is that the data might not cover parts of the environment, which creates two problems. The first is that the learned policy, when executed in the environment, is likely to deviate from the behavior policy and reach a state-action pair that was unseen in the dataset. For these unseen state-action pairs, the algorithm has no information with which to choose a good action. The second is that if the dataset does not contain transitions in the high-reward regions of the state-action space, it may be impossible for any algorithm to return a good policy. We can easily construct a family of adversarial MDPs with missing data such that no algorithm can identify the true MDP, and any output policy suffers a large suboptimality gap.

Works that provide guarantees on the suboptimality of the output policy usually rely on strong assumptions about good data coverage and mild distribution shift. The theoretical results are for methods based on approximate value iteration (AVI) and approximate policy iteration (API), showing that the output policy is close to an optimal policy (farahmand2010error; munos2003error; munos2005error; munos2007performance). They assume a small concentration coefficient, which bounds the ratio between the state-action distribution induced by any policy and the data distribution (munos2003error). However, the concentration coefficient can be very large or even infinite in practice, so assuming it is small can be unrealistic for many real-world problems.

To avoid strong assumptions on the concentration coefficient, several works constrain the divergence between the behavior policy and the output policy in the policy improvement step of API algorithms. The constraints can be enforced either as direct policy constraints or via a penalty added to the value function (levine2020offline; wu2019behavior). Another direction is to constrain the policy set so that it only chooses actions or state-action pairs with sufficient data coverage when applying updates in AVI algorithms (kumar2019stabilizing; liu2020provably). However, these methods only work when the data collection policy covers an optimal policy (see our discussion in Section 3).

Another direction has been to assume pessimistic values for unknown state-action pairs, to encourage the agent to learn an improved policy that stays within the parts of the space covered by the data. CQL (kumar2020conservative) penalizes the values for out-of-distribution actions and learns a lower bound on the value estimates. MOReL (kidambi2020morel) learns a model and an unknown state-action detector to partition states, similar to the R-Max algorithm (brafman2002r), but then applies the principle of pessimism for these unknown states rather than optimism. Safe policy improvement methods (thomas2015high; laroche2019safe) rely on a high-confidence lower bound on the output policy's performance, and perform policy improvement only when the performance is higher than a threshold. These methods have been shown to be effective empirically on several benchmark environments such as the D4RL dataset (fu2020d4rl). However, these methods often introduce extra hyperparameters which are not easy to tune in the offline setting (wu2019behavior). Some methods require an estimate of the behavior policy or the data distribution (kumar2019stabilizing; liu2020provably). Others can be too conservative and fail drastically when the behavior policy is not near-optimal (kumar2020conservative; liu2020provably), as we reaffirm in our experiments.

Intuitively, however, there are settings where offline RL should be effective. Consider a trading agent in a stock market. A policy that merely observes the stock prices and volumes without buying or selling any shares provides useful information about the environment. For this collected dataset, an offline agent could counterfactually reason about the utility of many different actions, because its actions have limited impact on the prices and volumes. In fact, this is a common assumption in the literature (abernethy2013adaptive; nevmyvaka2006reinforcement), namely that we can choose any action without affecting stock prices. In such a setting, even if we do not take an action in a state, we still have information about the return for that action.

In this paper, we first formalize this intuition and introduce the action impact regularity (AIR) property. We say an MDP has the AIR property if the actions have limited impact on the environment dynamics, and instead primarily affect state variables specific to the agent. We discuss several real-world examples where the regularity holds. Then, we design an efficient algorithm that utilizes the regularity. The algorithm has several advantages over existing algorithms: (1) it does not require an estimate of the behavior policy or the data distribution; (2) there is a straightforward approach to select hyperparameters using just the given data; and (3) its performance is much less sensitive to the quality of the data collection policy, provided AIR holds. Finally, we provide a theoretical analysis of the suboptimality of the output policy, and an empirical study comparing our new algorithm to other offline RL baselines in several simulated environments with AIR.

## Problem Formulation

In reinforcement learning (RL), an agent interacts with its environment, receiving observations and selecting actions to maximize a reward signal. This interaction can be formalized as a finite horizon Markov decision process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, R, H, \mu)$. $\mathcal{S}$ is a set of states, and $\mathcal{A}$ is a set of actions; for simplicity, we assume that both sets are finite. $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition probability, where $\Delta(\mathcal{S})$ is the set of probability distributions on $\mathcal{S}$, and, slightly abusing notation, we write $P(s' \mid s, a)$ for the probability that the process transitions into state $s'$ when it takes action $a$ in state $s$. The function $R: \mathcal{S} \times \mathcal{A} \to [-R_{\max}, R_{\max}]$ gives the reward when taking action $a$ in state $s$, where $R_{\max}$ is a positive constant. $H$ is the planning horizon, and $\mu$ is the initial state distribution.

In the finite horizon setting, policies are non-stationary. A non-stationary policy is a sequence of memoryless policies $(\pi_1, \dots, \pi_H)$, where $\pi_h: \mathcal{S} \to \Delta(\mathcal{A})$. We assume without loss of generality that the sets $\mathcal{S}_h$ of states reachable at time step $h$ are disjoint from each other, since we can always redefine the MDP with the new state space $\mathcal{S} \times [H]$. Then it is sufficient to consider stationary policies $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. We abuse notation and write $\pi(a \mid s)$ to denote the probability of choosing action $a$ at state $s$.

Given a policy $\pi$, $h \in [H]$, and $s \in \mathcal{S}_h$, we define the value function and the action-value function as $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=h}^{H} R(S_t, A_t) \mid S_h = s\right]$ and $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=h}^{H} R(S_t, A_t) \mid S_h = s, A_h = a\right]$, where the expectation is with respect to $\mathbb{P}_{\pi, M}$ (we may drop the subscript when it is clear from the context). $\mathbb{P}_{\pi, M}$ is the probability measure on the random element in $(\mathcal{S} \times \mathcal{A})^H$ induced by the policy $\pi$ and the MDP $M$ such that, for the trajectory of state-action pairs $(S_1, A_1, \dots, S_H, A_H)$, we have $\mathbb{P}(S_1 = s) = \mu(s)$, $\mathbb{P}(A_t = a \mid S_t = s) = \pi(a \mid s)$, and $\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = P(s' \mid s, a)$ (lattimore2020bandit). The optimal value function is defined by $V^{*}(s) = \max_{\pi} V^{\pi}(s)$, and the Bellman operator $\mathcal{T}$ is defined by

$$(\mathcal{T} Q)(s, a) = R(s, a) + \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a' \in \mathcal{A}} Q(s', a').$$

In the batch setting, we are given a fixed set of transitions $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$ with samples drawn from a data distribution $\nu$ over $\mathcal{S} \times \mathcal{A}$. In this paper, we consider a more specific setting where the data is collected by a data collection policy $\pi_b$. That is, $\mathcal{D}$ consists of $n$ trajectories induced by the interaction of the policy $\pi_b$ and the MDP $M$.

A representative algorithm for the batch setting is *Fitted Q Iteration* (FQI) (ernst2005tree). In the finite horizon setting, FQI learns an action-value function $\hat{Q}_h \in \mathcal{F}$ for each time step, where $\mathcal{F}$ is the value function class. The algorithm is defined recursively from the end of the episode: for each time step $h$ from $H$ down to $1$, $\hat{Q}_h$ is the function in $\mathcal{F}$ that best fits the targets given by the empirical Bellman operator $\hat{\mathcal{T}}$ applied to $\hat{Q}_{h+1}$, where $\hat{\mathcal{T}}$ is defined by replacing expectations with sample averages in the Bellman operator and $\hat{Q}_{H+1} = 0$. The output policy is obtained by greedifying according to these action-value functions.
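As a minimal illustration, finite-horizon FQI with a tabular "function class" (where the regression step reduces to sample averages) can be sketched as follows; the dataset format and function names here are our own, not the paper's:

```python
import numpy as np

def fitted_q_iteration(dataset, num_states, num_actions, horizon):
    """Finite-horizon FQI with a tabular function class (a minimal sketch).

    dataset: list of (h, s, a, r, s_next) transitions, with s_next = None
    at the final step h == horizon.
    Returns Q[h] (a num_states x num_actions array) for h = 1..horizon,
    and the greedy policy for each time step.
    """
    Q = {horizon + 1: np.zeros((num_states, num_actions))}
    for h in range(horizon, 0, -1):
        targets = np.zeros((num_states, num_actions))
        counts = np.zeros((num_states, num_actions))
        for (t, s, a, r, s_next) in dataset:
            if t != h:
                continue
            # Empirical Bellman target: reward plus bootstrapped next value.
            boot = 0.0 if s_next is None else Q[h + 1][s_next].max()
            targets[s, a] += r + boot
            counts[s, a] += 1
        # The regression step degenerates to a sample average in the
        # tabular case; unseen pairs default to zero.
        Q[h] = targets / np.maximum(counts, 1)
    policy = {h: Q[h].argmax(axis=1) for h in range(1, horizon + 1)}
    return Q, policy
```

Note that unseen state-action pairs silently receive the value zero, which is one source of the coverage issues discussed in the next section.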

## Common Assumptions for Offline Algorithms

In this section we discuss common assumptions used in offline RL. We highlight that those that enable theoretical guarantees are typically impractical in many real world applications, and that the focus on only the data distribution and behavior policy is insufficient to obtain guarantees. This motivates considering assumptions on the MDP, as we do in this work.

The most common assumptions have been on the data distribution, either to ensure sufficient coverage overall or to ensure coverage of an optimal policy. The first setting has primarily been tackled through the idea of *concentration coefficients* (munos2003error). Given a data distribution $\nu$, the concentration coefficient is defined to be the smallest value $C$ such that, for any policy $\pi$, $\max_{s,a} \mu^{\pi}(s,a) / \nu(s,a) \le C$, where $\mu^{\pi}$ is the state-action distribution induced by $\pi$. Several results bound the suboptimality of batch API and AVI algorithms in terms of the concentration coefficient (chen2019information; farahmand2010error; munos2003error; munos2007performance). For example, it has been shown that FQI outputs a near-optimal policy when $C$ is small and the value function class is rich enough, where the upper bound on the suboptimality of the output policy scales linearly with $C$ (chen2019information). More speculatively, recent empirical work (agarwal2020optimistic) highlights that a sufficiently large and diverse dataset can lead to empirical success of offline RL in Atari games, where the diversity may have promoted a small concentration coefficient.

In practice, the concentration coefficient can be very large or even infinite, and so another direction has been to consider approaches where the data covers a near-optimal policy. The key idea behind these methods is to restrict the policy to choose actions that have sufficient data coverage, which is effective if the given data has near-optimal action selection. For example, BCQL (fujimoto2019off) and BEAR (kumar2019stabilizing) only bootstrap values from actions whose estimated probability under the behavior policy is above a threshold. MBS-QI (liu2020provably) extends this to consider state-action probabilities, only bootstrapping from state-action pairs whose estimated probability under the data distribution is above a threshold $b$. The algorithm modifies FQI by zeroing out the bootstrapped value for any state-action pair without sufficient data coverage, and the output policy is greedy with respect to the resulting action-values. They show that MBS-QI outputs a near-optimal policy if the data distribution puts sufficient probability on every state-action pair visited under an optimal policy $\pi^*$. Though potentially less stringent than having a small concentration coefficient, this assumption is nonetheless quite strong.
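The masked bootstrap at the heart of these coverage-restricted methods can be sketched as follows; this is a minimal illustration in the spirit of MBS-QI, where `counts` is a stand-in for an estimated data distribution:

```python
import numpy as np

def masked_bellman_backup(q_next, counts, threshold):
    """Bootstrap only from sufficiently covered state-action pairs (a sketch).

    q_next: array (num_states, num_actions) of next-step action values.
    counts: empirical visit counts (a proxy for the data distribution).
    Pairs below the coverage threshold contribute the pessimistic value
    zero; the backup takes the max over the masked values per state.
    """
    mask = (counts >= threshold).astype(float)
    return (mask * q_next).max(axis=1)
```

With a high threshold, every pair is masked out and the backup collapses to zero everywhere, which is why the threshold is a sensitive hyperparameter in practice.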

Beyond the impracticality of the above assumptions, there have been some recent negative results suggesting that a good data distribution alone is not sufficient. In particular, chen2019information showed that, without assumptions on the MDP dynamics, no algorithm can achieve a polynomial sample complexity for returning a near-optimal policy, even when the algorithm can choose any data distribution $\nu$. wang2020statistical provide an exponential lower bound on the sample complexity of off-policy policy evaluation and off-policy policy optimization with a realizable linear function class, even when assuming the data distribution induces a well-conditioned covariance matrix. zanette2020exponential provide an example where offline RL is exponentially harder than online RL, even with the best data distribution, a realizable function class, and exact values observed for each sample in the dataset. xiao2021sample provide an exponential lower bound on the sample complexity of obtaining near-optimal policies when the data is obtained by following a data collection policy. These results are consistent with the above, since achieving a small concentration coefficient implicitly places assumptions on the MDP.

The main message from these negative results is that assuming a good data distribution alone is not sufficient. We need to investigate realistic problem-dependent assumptions for MDPs. This motivates us to introduce a new MDP regularity in the next section, that is present in real-world problems and for which we can bound suboptimality of the learned policy in terms of the regularity parameter.

## Action Impact Regularity

In this section, we introduce the Action Impact Regularity (AIR) property, a property of the MDP which allows for more effective offline RL. We first provide a motivating example and then give a formal definition of AIR.

### Motivating Example

The basic idea of AIR is that the state is composed of a private component, for which the agent knows the transitions and rewards, and an environment component, on which the agent’s actions have limited influence. In a stock trading example, the environment component is the market information (stock prices and volumes) and the private component is the number of stock shares an agent has. The agent’s actions influence their own number of shares, but as an individual trader, have limited impact on stock prices. Using a dataset of stock prices over time allows the agent to reason counterfactually about the impact of many possible trajectories of actions (buying/selling) on its shares (the private state variable) and profits (the reward for the private state).

In this example, the separation of private and environment variables is known, as are the private models. More generally, however, the separation could be identified or learned by the agent, as has been done for contingency-aware RL agents (bellemare2012investigating). Similarly, the private models can also be learned from data. In effect, the separation allows the agent to more effectively leverage the given data, enabling counterfactual reasoning about parts of the model. In this paper, we focus on the case where the separation and private models are given to us, both because it is a sensible first step and because it is a practical setting for several real-world MDPs, as we further motivate below.

### Formal Definition

Assume the state space is $\mathcal{S} = \mathcal{S}^e \times \mathcal{S}^p$, where $e \in \mathcal{S}^e$ is the environment variable and $z \in \mathcal{S}^p$ is the private variable. The transition dynamics are $P^e$ and $P^p$ for the environment and private variable respectively. The transition probability from a state $s = (e, z)$ to another state $s' = (e', z')$ under action $a$ is $P(s' \mid s, a) = P^e(e' \mid s, a) P^p(z' \mid s, a)$. In the rest of the paper, $e(s)$ and $z(s)$ are used to denote the environment and private variable of a given state $s$.

###### Definition 1 (Action impact regularity).

An MDP $M$ is $(\epsilon, \pi_b)$-AIR if, for a reference policy $\pi_b$ and for any policy $\pi$, the next environment-variable distribution is similar under either $\pi$ or $\pi_b$ for each state $s$, that is,

$$D_{TV}\big( P^e(\cdot \mid s, \pi(s)),\; P^e(\cdot \mid s, \pi_b(s)) \big) \le \epsilon,$$

where $D_{TV}$ is the total variation distance between two probability distributions on $\mathcal{S}^e$. We say an MDP is $\epsilon$-AIR if it is $(\epsilon, \pi)$-AIR for all policies $\pi$. (Note that $(\epsilon, \pi)$-AIR for some $\pi$ implies the MDP is $2\epsilon$-AIR by the triangle inequality.)
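In the tabular case, the AIR parameter of a known model can be measured directly by comparing the next-environment distributions of actions pairwise; the following sketch (our own illustration, assuming an action-indexed environment model `P_env`) computes the smallest $\epsilon$ for which the pairwise comparison holds:

```python
import numpy as np

def air_parameter(P_env):
    """Measure the AIR parameter of a tabular environment model (a sketch).

    P_env[e, a] is the next-environment distribution P^e(. | e, a), so
    P_env has shape (num_env_states, num_actions, num_env_states).
    Returns the largest total-variation distance between the distributions
    of any two actions at the same environment state.
    """
    num_e, num_a, _ = P_env.shape
    eps = 0.0
    for e in range(num_e):
        for a1 in range(num_a):
            for a2 in range(a1 + 1, num_a):
                # TV distance between two discrete distributions.
                tv = 0.5 * np.abs(P_env[e, a1] - P_env[e, a2]).sum()
                eps = max(eps, tv)
    return eps
```

A return value of zero corresponds to a $0$-AIR environment model: the actions have no effect on the environment variable at all.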

Many real-world problems can be formulated as $0$-AIR or $\epsilon$-AIR MDPs. The optimal order execution problem is the task of selling a number of shares of a stock within $H$ steps, with the goal of maximizing profit. The problem can be formulated as an MDP where the environment variable is the stock price and the private variable is the number of shares left to sell. It is common to assume infinite market liquidity (nevmyvaka2006reinforcement) or that actions have a small impact on the stock price (abernethy2013adaptive; bertsimas1998optimal); this corresponds to assuming the AIR property. Another example is the secretary problem (goldstein2020learning), an instance of an optimal stopping problem. The goal for the agent is to hire the best secretary out of a pool of applicants, interviewed in random order. After each interview, they have to decide if they will hire that applicant, or wait to see a potentially better applicant in the future. The problem can be formulated as a $0$-AIR MDP where the private variable is a binary variable indicating whether we have chosen to stop or not. Other examples include those where the agent only influences energy efficiency, such as the hybrid vehicle problem (shahamiri2008reinforcement) and the electric vehicle charging problem (abdullah2021reinforcement). In the former, the agent controls the vehicle to use either the gas engine or the electric motor at each time step, with the goal of minimizing gas consumption; its actions do not impact the driver's behavior. In the latter, the agent controls the charging schedule of an electric vehicle to minimize costs; its actions do not impact the electricity cost.

In some real-world problems, we can even restrict the action set or policy set to make the MDP $\epsilon$-AIR. For example, if we know that selling a small number of shares hardly impacts the market, we can restrict the action space to selling at most that many shares at all time steps. In the hybrid vehicle example, if the driver can see which mode is used, we can restrict the policy set to only switch actions periodically, to minimize distractions for the driver.

In these problems with AIR, we often know the reward and transition models for the private variables. For example, for the optimal order execution problem, the reward is simply the selling price times the number of shares sold (minus a transaction cost, depending on the problem), and the private variable transitions deterministically: the number of shares left to sell decreases by the number of shares sold. For the hybrid vehicle, we know how much gas would be used for a given acceleration. This additional knowledge is critical for exploiting the AIR property, and is a fundamental component of our algorithm. To be precise, we make the following assumption in this paper.

###### Assumption 1 (AIR with a Known Private Model).

We assume that the MDP is $\epsilon$-AIR, and that the reward function $R$ and the private transition $P^p$ are known and deterministic.

## An Algorithm for AIR MDPs

In this section, we propose an algorithm that exploits the AIR property, described in Algorithm 1. The algorithm constructs an empirical MDP $\hat{M}$ from the data and uses a planning algorithm based on value iteration. The key idea is in how the empirical MDP is constructed: it assumes $0$-AIR. This means the transitions for the environment variables are constructed assuming the action has no impact. The utility of this simple approach is that we can then leverage any planning algorithm, because we have a fully specified model.

#### Constructing an MDP from Data.

The first step is to construct an MDP $\hat{M}$ based on the offline dataset $\mathcal{D}$, by assuming the underlying MDP is $0$-AIR. The data is randomly generated by running $\pi_b$ on $M$; that is, it is sampled according to the probability measure $\mathbb{P}_{\pi_b, M}$. The pertinent part is the transitions between environment variables. We define a new dataset $\mathcal{D}^e$ consisting of the observed environment-variable transitions, and construct an empirical MDP $\hat{M}$ assuming these transitions occur for any action.

In the tabular setting, we can use counts to estimate the initial state distribution and the transition probabilities between environment variables, for all actions, since we assume the transition for the private variable is known (Assumption 1). Environment variables not seen in the data are not reachable, and so can either be omitted from $\hat{M}$ or set to self-loop. For large or continuous state spaces, the simplest option is to use the data as a deterministic model, and deterministically transition through an observed trajectory. Alternatively, using distribution learning algorithms, we can perform multiple epochs of learning on pairs of successive environment variables, where we either cycle through all actions for each pair or uniformly randomly sample a subset of possible actions each time.

#### An Efficient Planning Algorithm in $\hat{M}$.

The second step is to use an efficient planning algorithm to return a good policy in the empirical MDP . In general, we can use any planning algorithm; there is a rich literature on efficient algorithms for both small and large state spaces, especially in the literature on approximate dynamic programming (powell2007approximate). For example, in value iteration with a relatively small state space, we can simply sweep over the entire state and action space repeatedly, and learn a table of action-values. To decide which planning algorithm to select, we have to consider (1) how we iterate over states and actions and (2) how we represent action-values.

For iteration, the selection criterion is primarily based on the size of $\hat{M}$. The empirical MDP has one environment variable per observed transition—so at most $nH$ of them, the size of the dataset—and the full space of private variables $\mathcal{S}^p$. For our setting, we might expect $nH \cdot |\mathcal{S}^p|$ to be much smaller than $|\mathcal{S}|$. So even if $\mathcal{S}$ is large, the state space reachable in the empirical MDP will be small, and so permits full sweeps.

For the second choice, we assume a generic function class $\mathcal{F}$ for the action-values. This could be set to the set of tabular functions, if near-zero error is required in the planning procedure. More generally, however, regardless of the size of the state space, it is sensible to use function approximation for the action-values, to enable generalization for faster learning and generalization to states outside the dataset. We summarize the full procedure in Algorithm 1 for tabular state spaces. We present a practical implementation of VI-AIR for continuous state spaces in the appendix.
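To make the planning step concrete, here is a minimal tabular sketch of backward induction on the $0$-AIR empirical MDP built from a single logged trajectory. The function names and the single-trajectory simplification are our own illustration, not the paper's exact Algorithm 1:

```python
def vi_air(env_trajectory, private_states, actions, private_step, reward_fn):
    """Tabular backward induction on the 0-AIR empirical MDP (a sketch).

    env_trajectory: logged environment variables [e_1, ..., e_H].
    private_step(z, a) and reward_fn(e, z, a) form the known, deterministic
    private model (Assumption 1).
    Returns V[t][z] and a greedy policy pi[t][z] for t = 0..H-1.
    """
    H = len(env_trajectory)
    V = {H: {z: 0.0 for z in private_states}}
    pi = {}
    for t in range(H - 1, -1, -1):
        V[t], pi[t] = {}, {}
        e = env_trajectory[t]  # environment evolves regardless of the action
        for z in private_states:
            best_q, best_a = float('-inf'), None
            for a in actions:
                z_next = private_step(z, a)
                if z_next not in V[t + 1]:
                    continue  # action leaves the modeled private space
                q = reward_fn(e, z, a) + V[t + 1][z_next]
                if q > best_q:
                    best_q, best_a = q, a
            V[t][z], pi[t][z] = best_q, best_a
    return V, pi
```

This makes the counterfactual reasoning explicit: every action is evaluated at every logged environment variable, even actions the behavior policy never took.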

We can show that this planner provides a near-optimal solution in the empirical MDP. We rely on this result in the next section. We are unaware of this result for the finite horizon setting, though there is a rich literature on error propagation analysis for AVI in the discounted infinite horizon setting antos2006learning; farahmand2011action; farahmand2010error; munos2005error; munos2007performance. For completeness, we provide the corresponding result for the finite horizon setting in Proposition 1 with proof in the appendix.

For the proposition and later sections, we need the following definitions. For a given MDP $M$, we define the approximation error $\epsilon_{\text{approx}}$ as the worst case, over iterations of the planner, of the error in fitting the regression targets with a function from $\mathcal{F}$, where the expectation defining the error is with respect to the state-action distribution used in planning. Note that we can easily evaluate the errors at each iteration, since both the fitted values and the targets are available during planning. This means we can obtain the constant in the proof, and so actually compute this worst-case error bound for the policy outputted by Algorithm 1. The quadratic dependence on $H$ is unavoidable, except in the tabular case, where the planner returns an optimal policy in $\hat{M}$.

###### Proposition 1.

Suppose the fitting error at each iteration is at most $\epsilon_{\text{approx}}$. Then the planner in Algorithm 1 outputs a policy $\hat{\pi}$ such that $V^{*}_{\hat{M}} - V^{\hat{\pi}}_{\hat{M}} = O(H^2 \epsilon_{\text{approx}})$.

The computational cost of the planner is linear in the horizon, the number of reachable states, and the number of actions. The for loop has $H$ iterations. Within each iteration, it needs to compute a regression target for each reachable state-action pair, and computing each target requires looking up the possible next environment variables. The latter factor corresponds to the maximum branching factor when we view the empirical MDP as a tree: the maximal number of states we need to look up is less than or equal to the number of trajectories $n$.

When the state space of $\hat{M}$ is large, we can modify VI-AIR to no longer use full sweeps. Instead, we can randomly sample states from the empirical MDP; to be efficient, such sampling needs to be directed. For extensions to large or continuous state spaces, there are several computationally efficient and theoretically sound approaches recently developed under linear function approximation (lattimore2020learning; shariff2020efficient).

## Theoretical Analysis

In this section, we provide a theoretical analysis of the suboptimality of the output policy obtained from planning in $\hat{M}$, when the true MDP $M$ is $\epsilon$-AIR. The key idea is to introduce a baseline MDP $\bar{M}$ that is $0$-AIR and approximates $M$, which is only $\epsilon$-AIR. The strategy involves two steps. First, in Lemma 1, we show that the baseline MDP can be viewed as the expected version of $\hat{M}$. Then, in Lemma 2, we show that the value difference for any policy between $\bar{M}$ and the original MDP $M$ depends on the AIR parameter $\epsilon$ of the true MDP.

Define a baseline MDP $\bar{M}$ that is identical to $M$ except for the environment transitions, which are replaced by the action-independent dynamics induced by the data collection policy: $\bar{P}^e(e' \mid s) = \sum_{a} \pi_b(a \mid s) P^e(e' \mid s, a)$. That is, the transition probability for the environment variable does not depend on the action taken by the agent, only on the behavior policy $\pi_b$, and so $\bar{M}$ is $0$-AIR.

###### Lemma 1.

Given a deterministic policy $\pi$ and $\delta \in (0, 1)$, for a dataset of $n$ trajectories, the following holds with probability at least $1 - \delta$:

$$\left| V^{\pi}_{\hat{M}} - V^{\pi}_{\bar{M}} \right| = O\!\left( R_{\max} H \sqrt{\frac{\log(1/\delta)}{n}} \right).$$

###### Lemma 2.

If $M$ is an $\epsilon$-AIR MDP, then for any policy $\pi$,

$$\left| V^{\pi}_{M} - V^{\pi}_{\bar{M}} \right| \le \epsilon R_{\max} H^2.$$

We are now ready to present our main result with the proofs in the appendix.

###### Theorem 1.

Let $\hat{\pi}$ be the output policy, which is $\epsilon_{\text{opt}}$-suboptimal in $\hat{M}$. Then, with probability at least $1 - \delta$,

$$V^{\pi^*}_{M} - V^{\hat{\pi}}_{M} \le \epsilon_{\text{opt}} + 2 \epsilon R_{\max} H^2 + O\!\left( R_{\max} H \sqrt{\frac{\log(1/\delta)}{n}} \right),$$

where $\pi^*$ is an optimal policy in $M$.

The bound has three components: (1) a sampling error term, which decreases with more trajectories; (2) a term scaling with the AIR parameter $\epsilon$; and (3) an approximation error term $\epsilon_{\text{opt}}$, which depends on the function approximation used in VI-AIR. If both $\epsilon$ and $\epsilon_{\text{opt}}$ are sufficiently small, we need only a number of trajectories polynomial in $H$ to obtain a constant suboptimality gap.

One additional advantage of an AIR MDP is that we can do hyperparameter selection on $\hat{M}$, since we know with high probability that the value of any policy in $\hat{M}$ is close to its value in $M$. That is, $V^{\pi}_{\hat{M}}$ is close to $V^{\pi}_{M}$ when $n$ is large and $\epsilon$ is small.
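This selection criterion is cheap to compute: evaluating a candidate policy in the $0$-AIR empirical MDP only requires replaying the logged environment trajectories against the known private model. A minimal sketch (the function and argument names are our own):

```python
def evaluate_in_empirical_mdp(policy, trajectories, private_step, reward_fn, z0):
    """Average return of a policy in the 0-AIR empirical MDP (a sketch).

    policy(t, e, z) -> action; trajectories: logged environment variables;
    private_step and reward_fn: the known, deterministic private model;
    z0: the initial private variable.
    """
    total = 0.0
    for traj in trajectories:
        z, ret = z0, 0.0
        for t, e in enumerate(traj):
            a = policy(t, e, z)
            ret += reward_fn(e, z, a)   # reward from the known private model
            z = private_step(z, a)      # environment replays regardless of a
        total += ret
    return total / len(trajectories)
```

Hyperparameters can then be chosen to maximize this value, using only the dataset and no interaction with the true environment.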

## Simulation Experiments

In this section, we evaluate the performance of our algorithm in two simulated environments with the AIR property: an optimal order execution problem and an inventory management problem. We compare VI-AIR to FQI, MBS-QI (liu2020provably), CQL (kumar2020conservative) and a model-based VI baseline. As discussed in the previous sections, FQI is expected to work well when the concentration coefficient is small. MBS-QI is expected to perform well when the data covers an optimal policy. CQL is a strong baseline which has been shown to be effective empirically for discrete-action environments such as Atari games. The model-based VI algorithm, which we call VI-MB, has full knowledge of the transitions for the private state variable and the reward, and learns the transitions for the environment variables from the data. It is similar to Algorithm 1, but replaces the action-independent empirical environment model with a learned, action-conditioned environment model.

The first goal of these experiments is to demonstrate that existing algorithms fail to learn a good policy for some data collection policies, while our proposed algorithm returns a near-optimal policy by utilizing AIR. To demonstrate this, we test three data collection policies: (1) a random policy, designed to give a small concentration coefficient, (2) a policy learned by DQN with online interaction, which hopefully covers a near-optimal policy, and (3) a constant policy, which neither gives a small concentration coefficient nor covers an optimal policy. The second goal is to validate our theoretical analysis with a varying number of trajectories $n$ and horizon $H$.

### Algorithm Details

We had several choices to make for MBS-QI, CQL and VI-MB. MBS-QI requires density estimation for the data distribution. For the optimal order execution problem, we use state discretization and empirical counts to estimate the data distribution, as in the original paper. For the inventory problem, the state space is already discrete, so there is no need for discretization. We show the results with the best threshold from a set of candidate values. Note that it is possible that there is no data for some states (or state discretizations) visited by the output policy, and for these states all action values are zero. To break ties, we allow MBS-QI to choose an action uniformly at random. For CQL, we add the CQL regularization loss with a weighting $\alpha$ when updating the action values. We show the results with the best $\alpha$ from the set of values suggested in the original paper. For VI-MB, the model is parameterized by a neural network and learned by minimizing the L2 distance between predicted and true next environment variables in the data.

We use the same value function approximator for all algorithms in our experiments: two-layer neural networks with hidden size 128. The neural networks are optimized by Adam (kingma2014adam) or RMSprop, with the learning rate chosen from a small candidate set. All algorithms are trained for 50 epochs. We also tried training the comparator algorithms for longer, but it did not improve their performance. The hyperparameters for VI-AIR are selected based on the policy's value in the empirical MDP $\hat{M}$, which only depends on the dataset. The hyperparameters for comparator algorithms are selected based on the policy's value in the true environment—which should be a large advantage—estimated with rollouts.

### Results in the Optimal Order Execution Problem

We design a simple optimal order execution problem. The task is to sell a fixed number of shares within $H$ steps. The stock prices are generated by an ARMA process and scaled to a bounded interval. The private variable is the number of shares left to sell. To construct the state, we concatenate the most recent prices with the most recent private variable. The action is the number of shares to sell. The reward is the stock price multiplied by the number of shares sold. When the number of shares sold is greater than a threshold, the stock price drops by a fixed amount with some probability. The random policy used in the environment chooses one designated action with probability 75% and each of the remaining actions with probability 5%. The constant policy always chooses the same fixed action. We provide additional details about this problem in the appendix.
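As an illustration of the price process, a synthetic ARMA path scaled to $[0, 1]$ can be generated as follows; the ARMA(1, 1) order and the parameter values here are illustrative placeholders, not the exact configuration used in our experiments:

```python
import numpy as np

def arma_prices(horizon, phi=0.6, theta=0.3, sigma=0.1, rng=None):
    """Generate a synthetic price path from an ARMA(1, 1) process and
    min-max scale it to [0, 1] (illustrative parameters).
    """
    rng = np.random.default_rng(rng)
    eps = rng.normal(0.0, sigma, size=horizon + 1)
    x = np.zeros(horizon)
    for t in range(horizon):
        prev_x = x[t - 1] if t > 0 else 0.0
        # AR term on the previous value, MA term on the previous noise.
        x[t] = phi * prev_x + eps[t + 1] + theta * eps[t]
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.full(horizon, 0.5)
```

Under the AIR assumption, such a logged price path can be replayed unchanged while counterfactually varying the selling actions.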

Figure 1 shows the performance of our algorithm and the comparator algorithms with a varying number of trajectories $n$ and horizon $H$. Our algorithm outperforms the other algorithms for all data collection policies. This result is not too surprising, as VI-AIR is the only algorithm that exploits this important regularity in the environment; but it nonetheless shows how useful it is to exploit the AIR property when possible. MBS performs slightly better than FQI; however, we found this is largely because ties are broken with a uniform random policy, especially on the constant-policy dataset. CQL fails when the data collection policy is far from optimal. VI-MB does not work well with data collected under the constant policy or the learned policy. These data collection policies do not provide good coverage for all environment-variable and action pairs, so VI-MB fails to learn an accurate model and suffers worse performance. VI-AIR is otherwise the same as VI-MB, except that it exploits the AIR property, and so is robust to different data collection policies.

### Results in the Inventory Management Problem

We design an inventory management problem based on existing literature (kunnumkal2008using; van1997neuro). The task is to control the inventory of a product over $H$ stages. At each stage, we observe the inventory level and the previous demand, and choose an order quantity. The demand is stochastic, following a normal distribution whose mean depends on the demand at the previous stage. At the beginning of each episode, the mean demand is sampled from a uniform distribution over a fixed interval. The reward is the negative of the total cost, which is composed of an order cost, a holding cost, and a cost for lost sales. The inventory level evolves by adding the order and subtracting the realized demand, truncated at zero. When the order is greater than a threshold, the mean of the demand distribution decreases or increases by a fixed amount, each with some probability. The random policy used in the environment chooses an order quantity uniformly at random. The constant policy always chooses the same fixed order. The private variable in the problem is the inventory level, which can be large, so we restrict VI-AIR to plan only over a bounded set of inventory levels.

Figure 2 shows the performance of VI-AIR and the comparator algorithms. Again, VI-AIR outperforms the other algorithms for all data collection policies. CQL and VI-MB perform well in this environment. MBS outperforms FQI under the learned policy, but FQI outperforms MBS under the random policy.
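One stage of the inventory dynamics described above can be sketched as follows; the cost constants and demand parameters are illustrative placeholders, not the values used in the experiments:

```python
def inventory_step(inventory, order, mean_demand, rng,
                   order_cost=1.0, holding_cost=0.1, lost_sale_cost=2.0,
                   demand_std=1.0):
    """One stage of a generic inventory-control MDP (illustrative constants).

    Demand is a truncated normal draw; the inventory adds the order and
    subtracts realized demand, truncated at zero; the reward is the negative
    total cost (ordering + holding + lost sales).
    """
    demand = max(0.0, rng.normal(mean_demand, demand_std))
    next_inventory = max(0.0, inventory + order - demand)
    lost = max(0.0, demand - inventory - order)
    reward = -(order_cost * order + holding_cost * next_inventory
               + lost_sale_cost * lost)
    return next_inventory, reward
```

Here the demand process plays the role of the environment variable and the inventory level is the private variable, so the known private model is exactly this deterministic update given the realized demand.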

### The Gap between the True and Empirical MDPs

By the analysis in the previous section, we know that the output policy $\hat{\pi}$ of VI-AIR satisfies the bound in Theorem 1 with probability at least $1 - \delta$. To validate the theoretical analysis, we investigate the difference $|V^{\hat{\pi}}_{M} - V^{\hat{\pi}}_{\hat{M}}|$ with varying $n$ and $H$, where $\hat{\pi}$ is the output policy of VI-AIR. We show the 90th percentile of the difference for each combination of $n$ and $H$ over 90 data points (30 runs under each data collection policy) in Figure 3. The 90th percentiles scale approximately linearly with $H$ and decrease as $n$ grows. The results suggest that the dependence on $H$ is linear and that the sampling error term goes to zero at a sublinear rate.

## Conclusion

We introduced an MDP property, which we call Action Impact Regularity (AIR), that facilitates learning policies offline. We proposed an efficient algorithm that assumes and exploits the AIR property and bounded the suboptimality of the outputted policy in the real environment. This bound depends on the number of samples and the level to which the true environment satisfies AIR. We showed empirically that the proposed algorithm significantly outperforms existing offline RL algorithms, across two simulated environments and three data collection policies, when the MDP is AIR or nearly AIR. The hardness of offline RL motivates identifying MDP assumptions that make offline RL feasible; this work highlights that AIR is one such property. We motivate throughout that, though a strong assumption, it is nonetheless realistic for several important problems and can help facilitate the real-world use of offline RL in some problems.

This work provides a foundational study for offline RL under AIR, and as such there are many next steps. An important next step is to investigate the method on real-world problems, which will have varying levels of the AIR property. As a part of this, an important direction is to develop high-confidence estimates for , to determine if AIR is satisfied before deploying the proposed offline RL algorithms. Finally, we could consider learning the private variable model and reward model to relax Assumption 1.
