# Batch Policy Learning in Average Reward Markov Decision Processes

We consider the batch (off-line) policy learning problem in infinite-horizon Markov Decision Processes. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency given multiple trajectories collected under some behavior policy. Based on the proposed estimator, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. To the best of our knowledge, this is the first regret bound for batch policy learning in the infinite-horizon setting. The performance of the method is illustrated by simulation studies.


## 1 Introduction

We study the problem of policy optimization in Markov Decision Processes (MDPs) over an infinite time horizon (puterman1994markov). We focus on the batch (i.e., off-line) setting, where historical data consisting of multiple trajectories have been previously collected using some behavior policy. Our goal is to learn a new policy with guaranteed performance when implemented in the future. In this work, we develop a data-efficient method to learn the policy that optimizes the long-term average reward in a pre-specified policy class from a training set composed of multiple trajectories. Furthermore, we establish a finite-sample guarantee on the regret, i.e., the difference between the average reward of the optimal policy in the class and the average reward of the policy estimated by our method. This work is motivated by the development of just-in-time adaptive interventions in mobile health (mHealth) applications (Nahum2017). Our method can be used to learn a treatment policy that maps real-time information about the individual's status and context to a particular treatment at each of many decision times, in order to support health behaviors.

Sequential decision making has been extensively studied in statistics (dynamic treatment regimes, murphy2003optimal), econometrics (welfare maximization, manski2004statistical) and computer science (reinforcement learning, sutton2018reinforcement). Recently, tremendous progress has been made in developing efficient methods that use historical data to learn an optimal policy with guaranteed performance in the finite-time horizon setting; see the recent papers by zhou2017residual; athey2017efficient; kallus2018balanced; zhao2019efficient for the single-decision-point problem, zhao2015new; luedtke2016super; nie2019learning for the multiple-decision-point problem, and the recent review paper by kosorok2019precision for further references. Many mHealth applications are designed for long-term use and often involve a large number of decision time points (e.g., hundreds or thousands). For example, in HeartSteps, a physical activity mHealth study, there are five decision times per day, resulting in thousands of decision times over a year-long study. Many existing methods developed for the finite-time horizon problem are based on the idea of importance sampling (precup2000eligibility), which involves products of importance weights between the behavior policy and the target policy. These methods may suffer from large variance, especially in problems with a large number of time points (voloshin2019empirical), as in the case of mHealth. Methods based on the idea of backward iteration (e.g., Q-learning) also become impractical as the horizon length increases (laber2014dynamic).

In this work, we adopt the infinite-horizon, time-homogeneous MDP framework. Although the training data consist of trajectories of finite length, the Markov and time-stationarity assumptions make it possible to evaluate and optimize policies over an infinite time horizon. In the infinite-horizon setting, the majority of existing methods focus on optimizing the discounted sum of rewards (sutton2018reinforcement); see the recent works in statistics by luckett2019estimating; ertefaie2018constructing; shi2020statistical. The discounted formulation weighs immediate rewards more heavily than rewards further in the future, which is sensible in some applications (e.g., finance). The contraction property of the Bellman operator due to discounting also simplifies the associated analyses (tsitsiklis2002average; sutton2018reinforcement). For mHealth applications, however, choosing an appropriate discount rate can be non-trivial. The rewards (i.e., the health outcomes) in the distant future are as important as the near-term ones, especially when considering the effects of habituation and burden. This suggests using a large discount rate. However, it is well known that algorithms developed in the discounted setting can become increasingly unstable as the discount rate approaches one; see for example naik2019discounted.

We propose using the long-term average reward as the criterion in optimizing the policy. The average reward formulation has a long history in dynamic programming (howard1960dynamic) and reinforcement learning (mahadevan1996average). In fact, the long-term average reward can be viewed as the limiting version of the discounted sum of rewards as the discount rate approaches one (bertsekas1995dynamic). We believe that the average reward framework provides a good approximation to the long-term performance of a desired treatment policy in mHealth. Indeed, it can be shown that under regularity conditions the average of the expected rewards collected over finite time horizon converges sublinearly to the average reward as time goes to infinity (hernandez2012further). Therefore, a policy that optimizes the average reward would approximately maximize the sum of the rewards over the long time horizon.

In the three settings discussed above, i.e., finite horizon, infinite-horizon discounted reward, and infinite-horizon average reward, many methods aim to find the (unrestricted) optimal policy by first estimating the optimal value function and then recovering the optimal policy from it; see for example ormoneit_kernel-based_2003; lagoudakis2003least; ernst_tree-based_2005; munos2008finite; antos_learning_2008; antos2008fitted; ertefaie2018constructing; yang2019sample. A critical assumption behind these methods is correct modeling of the possibly non-smooth optimal value function, which can be highly complex in practice and thus requires the use of a very flexible function class. The use of a flexible function class usually results in a learned policy that is also complex; if interpretability is important, this is problematic. Furthermore, when the training data are limited, a flexible function class can overfit the data, so the variance of the estimated value function and the corresponding policy can be high.

We instead aim to learn the optimal policy in a pre-specified policy class; see for example zhang2012robust; zhang2013robust; zhou2017residual; zhao2015new; zhao2019efficient; athey2017efficient for finite-time horizon problems and luckett2019estimating; murphy2016batch; liu2019off for infinite-time horizon problems. One can use prior knowledge to design the policy class (e.g., the selection of variables into the policy) and thus ensure the interpretability of the learned policy. The restriction to a parsimonious policy class reduces the variance of the learned policy, although it induces bias when the optimal policy is not in the class (i.e., it trades off bias and variance). We consider a class of parametric, stochastic (i.e., randomized) policies. Recall that the motivation of this work is to construct a good treatment policy for use in a future study. To facilitate the analysis after the study is over (e.g., causal inference or off-policy evaluation/learning), we focus on stochastic policies. Furthermore, it is important to ensure sufficient exploration, which can be controlled by restricting the policy class, i.e., putting constraints on the parameter space (see Section 6 for an example). Similar to ours, murphy2016batch considered the average reward formulation and developed a "batch, off-policy actor-critic" algorithm to learn the optimal policy in a class; unfortunately, they did not provide performance guarantees. luckett2019estimating considered the infinite-horizon discounted reward setting and developed an interesting method to estimate the optimal policy in a parameterized policy class. They evaluated each policy by the discounted sum of rewards with the initial state averaged over some reference distribution. Under a parametric assumption on the value function, they showed that the estimated optimal value converges to a Gaussian distribution and the estimated policy parameters converge in probability. However, they did not provide a regret guarantee for the learned policy.

In order to efficiently learn the policy, the main challenge is to construct an estimator for evaluating policies that is both data-efficient and performs uniformly well when optimizing over the policy class. For this purpose, we develop a novel doubly robust estimator for the average reward of a given policy and show that the proposed estimator achieves the semiparametric efficiency bound under certain conditions on the estimation error of the nuisance functions (see Section 5). Estimating the value of a policy is known as the off-policy policy evaluation (OPE) problem in the computer science literature. Doubly robust estimators have been developed for the finite-time horizon problem (robins1994estimation; murphy2001marginal; dudik2014doubly; thomas2016data) and recently for the discounted reward, infinite-horizon setting (kallus2019double; Tang2020Doubly). Our estimator involves two policy-dependent nuisance functions: the relative state-action value function and the ratio between the stationary state-action distribution under the target policy and the marginal state-action distribution in the training data. The estimator is doubly robust in the sense that the estimated average reward is consistent when either of these nuisance function estimators is consistent. Recently, liao2019off developed an estimator for the average reward based on minimizing the projected Bellman error (bradtke_linear_1996), which involves only a single nuisance function, the relative value function. They showed that their estimator is asymptotically normal under certain conditions; when the state-action value function is incorrectly modeled, however, their estimator may incur a large bias. liu2018breaking also developed an estimator for the average reward based on the ratio function. Although they did not provide a theoretical guarantee for their estimator, it can be seen that consistency of the estimator may require correct modeling of the ratio function. In contrast to these two methods, our estimator is doubly robust, thus providing additional protection against model mis-specification. As will be seen in Section 5, the double robustness property ensures that the estimation errors of the nuisance functions have only a minor impact on the estimation of the average reward, which ultimately leads to the semiparametric efficiency and the optimality of the regret bound.

To use the doubly robust estimator, we need to estimate the two nuisance functions, for which we use a coupled estimation framework. The estimation procedure is coupled in the sense that the estimator is obtained by minimizing an objective function which itself involves the minimizer of another optimization problem. In contrast to supervised learning problems (e.g., regression), coupled estimation is used to resolve the issue that the outcome variable depends on the estimation target (e.g., the value function), an issue that commonly arises in reinforcement learning (antos_learning_2008). The idea of coupled estimation was previously used in antos_learning_2008; farahmand2016regularized to estimate the value function in the discounted reward setting. Recently, liao2019off developed a coupled estimator for the relative value function in the average reward setting and derived a finite-sample bound for a fixed policy. While the ratio function is defined very differently from the value function, it turns out that, similar to the value function, the ratio function can also be characterized as the minimizer of some objective function (see Section 4.3). liao2019off used this characterization to derive a coupled estimator for the ratio function, but they did not provide a finite-sample analysis for the ratio estimator.

As a prerequisite to guaranteeing the semiparametric efficiency of the doubly robust estimator and, more importantly, to establishing the regret bound, we derive finite-sample error bounds for both nuisance function estimators, and the obtained error bounds are shown to hold uniformly over the pre-specified class of policies. Although the relative value and ratio estimators are both derived from the same principle (i.e., coupled estimation), it is much harder to bound the estimation error of the ratio function. This is mainly because, in the case of the value function, the fact that the Bellman error is zero at the true value function greatly simplifies the analysis (see Section 5 for details). We use an iterative procedure to handle this and obtain a near-optimal error bound for the ratio estimator. To the best of our knowledge, this is the first theoretical result characterizing the ratio estimation error, which might be of independent interest. Recently, researchers have begun to recognize the important role of the ratio function in OPE problems and have designed various estimators (liu2018breaking; uehara2019minimax; nachum2019dualdice; zhang2020gendice). Different from the coupled formulation used in this work, these methods use a min-max loss formulation; moreover, unlike our work, they do not provide theoretical guarantees on the estimation error.

We learn the optimal policy by maximizing the estimated average reward over the policy class and derive a finite-sample upper bound on the regret. The bound scales with the number of parameters in the policy class, decays with the number of trajectories in the training data, and depends on the complexity of the function classes in which the nuisance functions are assumed to lie. The use of doubly robust estimation ensures that the estimation error of the nuisance functions contributes only lower-order terms to the regret. Unlike in the finite-horizon setting (athey2017efficient), the regret analysis in the infinite-horizon setting requires a uniform control of the estimation error of the policy-dependent nuisance functions over the policy class, which makes our analysis much more involved. We believe this is the first regret bound for infinite-time horizon problems in the batch setting.

The rest of the article is organized as follows. Section 2 formalizes the decision making problem and introduces the average reward MDP. Section 3 presents the proposed method of learning the in-class optimal policy, including the doubly robust estimator for average reward (Sec. 3.3). In Section 4, the coupled estimators of the policy-dependent nuisance functions are introduced. Section 5 provides a thorough theoretical analysis on the regret bound of our proposed method. In Section 6, we describe a practical optimization algorithm when Reproducing Kernel Hilbert Spaces (RKHSs) are used to model the nuisance functions. We further conduct several simulation studies to demonstrate the promising performance of our method in Section 7. All the technical proofs are postponed to the supplementary material.

## 2 Problem Setup

Suppose we observe a training dataset that consists of $n$ independent, identically distributed (i.i.d.) observations of the trajectory

$$D = \{S_1, A_1, S_2, \ldots, S_T, A_T, S_{T+1}\}.$$

We use $t$ to index the decision time. The length of the trajectory, $T$, is assumed non-random. $S_t$ is the state at time $t$ and $A_t$ is the action (treatment) selected at time $t$. We assume the action space, $\mathcal{A}$, is finite. To eliminate unnecessary technical distractions, we assume that the state space, $\mathcal{S}$, is finite; this assumption imposes no practical limitation and the results can be extended to general state spaces.

The states evolve according to a time-homogeneous Markov process: for $t \geq 1$, $S_{t+1}$ is conditionally independent of the past given $(S_t, A_t)$, and the conditional distribution of $S_{t+1}$ given $(S_t, A_t)$ does not depend on $t$. Denote this conditional distribution by $P$, i.e., $P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$. The reward (i.e., outcome) is denoted by $R_{t+1}$, which is assumed to be a known function of $(S_t, A_t, S_{t+1})$. We assume the reward is bounded, i.e., $|R_{t+1}| \leq R_{\max}$. We use $r(s, a)$ to denote the conditional expectation of the reward given state and action, i.e., $r(s, a) = \mathbb{E}(R_{t+1} \mid S_t = s, A_t = a)$.

Let $H_t = \{S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t\}$ be the history up to time $t$, including the current state, $S_t$. Denote the conditional distribution of $A_t$ given $H_t$ by $\pi_b(\cdot \mid H_t)$; this is often called the behavior policy in the literature. In this work we do not require knowledge of the behavior policy. Throughout this paper, the expectation, $\mathbb{E}$, without any subscript is taken with respect to the distribution of the trajectory, $D$, with the actions selected by the behavior policy $\pi_b$.

Consider a time-stationary, Markovian policy, $\pi$, that takes the state as input and outputs a probability distribution on the action space; that is, $\pi(a \mid s)$ is the probability of selecting action, $a$, at state, $s$. The average reward of the policy, $\pi$, is defined as

$$V(s \mid \pi) := \lim_{t^* \to \infty} \mathbb{E}_\pi \Big( \frac{1}{t^*} \sum_{t=1}^{t^*} R_{t+1} \,\Big|\, S_1 = s \Big), \tag{2.1}$$

where the expectation, $\mathbb{E}_\pi$, is with respect to the distribution of the trajectory in which the states evolve according to $P$ and the actions are chosen by $\pi$. Note that the limit in (2.1) always exists since $\mathcal{S}$ is finite (puterman1994markov). The policy, $\pi$, induces a Markov chain of states with transition probabilities $P_\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s, a)$. When the induced Markov chain is irreducible, it can be shown (e.g., in puterman1994markov) that its stationary distribution exists and is unique (denoted by $d_\pi$), and the average reward in (2.1) is independent of the initial state (denoted by $V(\pi)$) and equal to

$$V(s \mid \pi) = V(\pi) = \sum_{s,a} r(s,a)\, \pi(a \mid s)\, d_\pi(s). \tag{2.2}$$
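To make (2.2) concrete, the following sketch computes the average reward of a fixed policy in a small tabular MDP. All numbers (transition probabilities, rewards, policy) are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s'] = p(s'|s, a)
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # r[s, a] = expected reward
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])                # pi[s, a] = pi(a|s)

# Transition matrix of the induced state chain: P_pi[s, s'].
P_pi = np.einsum('sa,sab->sb', pi, P)

# Stationary distribution d_pi via power iteration (chain is irreducible).
d = np.ones(2) / 2
for _ in range(1000):
    d = d @ P_pi
d_pi = d / d.sum()

# Average reward (2.2): V(pi) = sum_{s,a} r(s,a) pi(a|s) d_pi(s).
V = np.sum(r * pi * d_pi[:, None])
print(V)
```

The power iteration converges geometrically for an irreducible, aperiodic chain; any other method of computing the stationary distribution (e.g., an eigenvector solve) would serve equally well here.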

Throughout this paper we consider only time-stationary, Markovian policies. In fact, it can be shown that the maximal average reward among all possible history-dependent policies can be achieved by some time-stationary, Markovian policy (Theorem 8.1.2 in puterman1994markov). Consider a pre-specified class of such policies, $\Pi$, parameterized by a finite-dimensional parameter. Throughout we assume that the induced Markov chain is irreducible for every policy in the class, as summarized below.

###### Assumption 1.

For every $\pi \in \Pi$, the induced Markov chain of states is irreducible.

The goal of this paper is to develop a method that can efficiently use the training data to learn the policy that maximizes the average reward over the policy class. We propose to construct $\hat{V}_n(\pi)$, an efficient estimator of the average reward, $V(\pi)$, for each policy $\pi \in \Pi$, and to learn the optimal policy by solving

$$\hat{\pi}_n \in \operatorname*{argmax}_{\pi \in \Pi} \hat{V}_n(\pi). \tag{2.3}$$

The performance of $\hat{\pi}_n$ is measured by its regret, defined as

$$\mathrm{Regret}(\hat{\pi}_n) = \sup_{\pi \in \Pi} V(\pi) - V(\hat{\pi}_n). \tag{2.4}$$

## 3 Doubly Robust Estimator for Average Reward

In this section we present a doubly robust estimator of the average reward for a given policy. The estimator is derived from the efficient influence function (EIF). Below we first introduce two functions that occur in the EIF of the average reward. Throughout this section we fix some time-stationary, Markovian policy, $\pi$, and focus on the setting where the induced Markov chain of states is irreducible (Assumption 1).

### 3.1 Relative value function and ratio function

First, we define the relative value function by

$$Q^\pi(s, a) := \lim_{t^* \to \infty} \frac{1}{t^*} \sum_{t=1}^{t^*} \mathbb{E}_\pi \Big[ \sum_{k=1}^{t} \{R_{k+1} - V(\pi)\} \,\Big|\, S_1 = s, A_1 = a \Big]. \tag{3.1}$$

The above limit is well-defined (puterman1994markov, p. 338). If we further assume the induced Markov chain is aperiodic, then the Cesàro limit in (3.1) can be replaced by the ordinary limit. $Q^\pi$ is called the relative value function in that it represents the expected total difference between the reward and the average reward under the policy, $\pi$, when starting at state, $s$, and action, $a$.

The relative value function, $Q^\pi$, and the average reward, $V(\pi)$, are closely related by the Bellman equation:

$$\mathbb{E}_\pi[R_{t+1} + Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] - Q(s, a) - \eta = 0. \tag{3.2}$$

Note that in the above expectation $A_{t+1}$ is selected by $\pi$. It is known that under the irreducibility assumption, the set of solutions of (3.2) is given by $\{(\eta, Q) : \eta = V(\pi),\; Q(s, a) = Q^\pi(s, a) + c \text{ for all } (s, a), \text{ for some constant } c\}$; see puterman1994markov, p. 343 for details. As we will see in Section 4.2, the Bellman equation provides the foundation for estimating the relative value function. Note that in the Bellman equation (and the efficient influence function presented in the next section), $Q^\pi$ occurs in the form of the difference between the average value of the next state and the value of the current state-action pair. It is therefore convenient to define

$$U^\pi(s, a, s') := \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') - Q^\pi(s, a). \tag{3.3}$$
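In the tabular case, the Bellman equation (3.2) together with a normalization pins down $(V(\pi), Q)$ as the solution of a small linear system. A minimal sketch on a toy 2-state, 2-action MDP (illustrative numbers of our own choosing), with the normalization $Q(0, 0) = 0$:

```python
import numpy as np

# Toy MDP; all numbers are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # r[s, a]
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])                # pi[s, a] = pi(a|s)

nS, nA = 2, 2
# Unknowns x = [Q(0,0), Q(0,1), Q(1,0), Q(1,1), eta].
A = np.zeros((5, 5))
b = np.zeros(5)
row = 0
for s in range(nS):
    for a in range(nA):
        # r(s,a) + sum_{s',a'} P(s'|s,a) pi(a'|s') Q(s',a') - Q(s,a) - eta = 0
        A[row, s * nA + a] -= 1.0
        A[row, 4] = -1.0
        for s2 in range(nS):
            for a2 in range(nA):
                A[row, s2 * nA + a2] += P[s, a, s2] * pi[s2, a2]
        b[row] = -r[s, a]
        row += 1
A[4, 0] = 1.0                              # normalization Q(0,0) = 0
x = np.linalg.solve(A, b)
Q, eta = x[:4].reshape(nS, nA), x[4]
print(Q)
print(eta)
```

The recovered `eta` equals the average reward $V(\pi)$, and `Q` is the relative value function shifted so that the reference pair has value zero, consistent with the solution set described above.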

We now introduce the ratio function. For $1 \leq t \leq T$, let $d_t(s, a)$ be the probability mass of the state-action pair $(s, a)$ at time $t$ in the trajectory $D$ generated by the behavior policy, $\pi_b$. Denote by $d_D(s, a) = (1/T) \sum_{t=1}^T d_t(s, a)$ the average probability mass across the $T$ decision times in the trajectory. Similarly, define $p_t(s)$ as the marginal distribution of $S_t$ and $\bar{p}(s) = (1/T) \sum_{t=1}^T p_t(s)$ as the average distribution of states in the trajectory $D$. Recall that under Assumption 1, the stationary distribution of the induced Markov chain exists and is denoted by $d_\pi$. We assume the following conditions on the data-generating process.

###### Assumption 2.

The data-generating process satisfies:

1. There exists some $p_{\min} > 0$ such that $\pi_b(a \mid H_t) \geq p_{\min}$ for all actions $a$ and histories $H_t$.

2. The average state distribution satisfies $\bar{p}(s) > 0$ for all $s$.

Under Assumption 2, it is easy to see that $d_D(s, a) > 0$ for every state-action pair, $(s, a)$. Now we can define the ratio function:

$$\omega^\pi(s, a) = \frac{d_\pi(s)\, \pi(a \mid s)}{d_D(s, a)}. \tag{3.4}$$

The ratio function plays a role similar to that of the importance weight in finite-horizon problems. While the classic importance weight only corrects the distribution of actions between the behavior policy and the target policy, the ratio here also corrects the distribution of the states. The ratio function is connected with the average reward by

$$V(\pi) = \mathbb{E}\Big\{ \frac{1}{T} \sum_{t=1}^{T} \omega^\pi(S_t, A_t) R_{t+1} \Big\}.$$

An important property of $\omega^\pi$ is that for any state-action function $f$,

$$\mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} \omega^\pi(S_t, A_t) \Big\{ \sum_{a'} \pi(a' \mid S_{t+1}) f(S_{t+1}, a') - f(S_t, A_t) \Big\} \Big] = 0. \tag{3.5}$$

This orthogonality property is the key to developing the estimator of $\omega^\pi$ (see Section 4.3).
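Both the identity connecting $\omega^\pi$ to $V(\pi)$ and the orthogonality property (3.5) can be checked numerically in the stationary regime. A sketch on a toy 2-state, 2-action MDP (illustrative numbers), with a uniform behavior policy and the data assumed to start from its stationary distribution, so that $d_D$ is the stationary state-action law:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy MDP; all numbers are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # target policy
b = np.full((2, 2), 0.5)                  # uniform behavior policy

def stationary(policy):
    P_pol = np.einsum('sa,sab->sb', policy, P)
    d = np.ones(2) / 2
    for _ in range(2000):
        d = d @ P_pol
    return d / d.sum()

d_pi, d_b = stationary(pi), stationary(b)
d_D = d_b[:, None] * b                    # state-action law of the data
omega = d_pi[:, None] * pi / d_D          # ratio function, equation (3.4)

# Identity: V(pi) = E_D[omega(S, A) R], matching the direct formula (2.2).
V_pi = np.sum(d_pi[:, None] * pi * r)
V_check = np.sum(d_D * omega * r)

# Orthogonality (3.5), checked for a random test function f(s, a).
f = rng.standard_normal((2, 2))
g = np.sum(pi * f, axis=1)                # g(s') = sum_a' pi(a'|s') f(s', a')
fwd = np.einsum('sab,b->sa', P, g)        # E[g(S') | S=s, A=a]
orth = np.sum(d_D * omega * (fwd - f))
print(V_pi, V_check, orth)
```

Note also that $\mathbb{E}_D[\omega^\pi(S, A)] = 1$ in this regime, a fact used later when scaling the ratio estimator.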

### 3.2 Efficient influence function

In this subsection, we derive the EIF of $V(\pi)$ for a fixed policy under the time-homogeneous Markov Decision Process described in Section 2. Recall that the semiparametric efficiency bound is the supremum of the Cramér-Rao bounds over all parametric submodels (newey1990semiparametric). The EIF is defined as the influence function of a regular estimator that achieves the semiparametric efficiency bound; for more details, refer to bickel1993efficient and van2000asymptotic. The EIF of $V(\pi)$ is given by the following theorem.

###### Theorem 3.1.

Suppose the states in the trajectory, $D$, evolve according to a time-homogeneous Markov process and Assumption 2 holds. Consider a policy, $\pi$, such that Assumption 1 holds. Then the EIF of the average reward, $V(\pi)$, is

$$\phi^\pi(D) = \frac{1}{T} \sum_{t=1}^{T} \omega^\pi(S_t, A_t) \{ R_{t+1} + U^\pi(S_t, A_t, S_{t+1}) - V(\pi) \}.$$

### 3.3 Doubly robust estimator

It is known that the EIF can be used to derive a semiparametric estimator (see, for example, Chap. 25 in van2000asymptotic), and we follow this approach. Specifically, suppose $\hat{\omega}^\pi_n$ and $\hat{U}^\pi_n$ are estimators of $\omega^\pi$ and $U^\pi$, respectively. Then we estimate $V(\pi)$ by solving for $V$ in the plug-in estimating equation, where for any function $f$ of the trajectory, $D$, the sample average is denoted by $\mathbb{P}_n f = n^{-1} \sum_{i=1}^n f(D_i)$. Denote the solution by $\hat{V}_n(\pi)$, which can be expressed as

$$\hat{V}_n(\pi) = \frac{\mathbb{P}_n \big[ (1/T) \sum_{t=1}^{T} \hat{\omega}^\pi_n(S_t, A_t) \{ R_{t+1} + \hat{U}^\pi_n(S_t, A_t, S_{t+1}) \} \big]}{\mathbb{P}_n \big[ (1/T) \sum_{t=1}^{T} \hat{\omega}^\pi_n(S_t, A_t) \big]}. \tag{3.6}$$
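The estimator (3.6) is a simple plug-in computation once $\hat{\omega}^\pi_n$ and $\hat{U}^\pi_n$ are available. The sketch below illustrates the double robustness on simulated data from a toy 2-state, 2-action MDP (illustrative numbers): it pairs the oracle ratio with a deliberately mis-specified value model, $\hat{U} = 0$, and still recovers $V(\pi)$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy MDP; all numbers are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # target policy
b = np.full((2, 2), 0.5)                  # behavior policy

def stationary(policy):
    P_pol = np.einsum('sa,sab->sb', policy, P)
    d = np.ones(2) / 2
    for _ in range(2000):
        d = d @ P_pol
    return d / d.sum()

d_pi, d_b = stationary(pi), stationary(b)
omega = d_pi[:, None] * pi / (d_b[:, None] * b)   # oracle ratio (3.4)

# n trajectories of length T from the behavior policy, started at its
# stationary state distribution.
n, T = 1000, 100
S = np.zeros((n, T + 1), dtype=int)
A = np.zeros((n, T), dtype=int)
R = np.zeros((n, T))
S[:, 0] = rng.choice(2, size=n, p=d_b)
for t in range(T):
    A[:, t] = (rng.random(n) > b[S[:, t], 0]).astype(int)
    R[:, t] = r[S[:, t], A[:, t]]
    S[:, t + 1] = (rng.random(n) > P[S[:, t], A[:, t], 0]).astype(int)

# Doubly robust estimate (3.6) with a deliberately mis-specified value
# model, U-hat = 0; since all trajectories have equal length, the overall
# mean equals the mean of per-trajectory averages.
w = omega[S[:, :-1], A]
V_hat = np.mean(w * R) / np.mean(w)
print(V_hat)
```

The symmetric experiment, a correct $\hat{U}$ paired with a mis-specified $\hat{\omega}$, would likewise recover $V(\pi)$, which is the content of Theorem 3.2 below.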

This estimator has the following double robustness property.

###### Theorem 3.2.

Suppose $\hat{\omega}^\pi_n(s, a)$ and $\hat{U}^\pi_n(s, a, s')$ converge in probability uniformly to deterministic limits $\omega^\dagger(s, a)$ and $U^\dagger(s, a, s')$ for every $(s, a)$ and $s'$. If either $\omega^\dagger = \omega^\pi$ or $U^\dagger = U^\pi$, then $\hat{V}_n(\pi)$ converges to $V(\pi)$ in probability.

###### Remark 1.

The uniform convergence in probability can be relaxed to a weaker mode of convergence by using uniform laws of large numbers. The double robustness protects against potential model mis-specification, since we only require that one of the two models is correct. Moreover, the doubly robust structure relaxes the convergence rate required of each nuisance function estimator for achieving the semiparametric efficiency bound, especially if we use sample-splitting techniques (see the remark below), as discussed in chernozhukov2018double.

###### Remark 2.

An alternative way to construct the estimator of the average reward is based on the idea of double/debiased machine learning (a.k.a. cross-fitting; bickel1993efficient, chernozhukov2018double). There is growing interest in using double machine learning in the causal inference and policy learning literature (zhao2019efficient) in order to relax assumptions on the convergence rates of nuisance parameters. The basic idea is to split the data into several folds; for each fold, one constructs the estimating equation by plugging in the estimated nuisance functions obtained from the remaining folds, and the final estimator is obtained by solving the aggregated estimating equations. While cross-fitting requires weaker conditions on the nuisance function estimation, it incurs additional computational cost, especially in our setting where the nuisance functions are policy-dependent and we aim to search for the in-class optimal policy. Besides, this sample-splitting procedure may not be stable when the sample size is relatively small, as is common in mHealth studies.

## 4 Estimator for Nuisance Functions

Recall that the doubly robust estimator (3.6) requires estimates of two nuisance functions, $\omega^\pi$ and $U^\pi$ (the latter via the relative value function $Q^\pi$). It turns out that although these two nuisance functions are defined from two different perspectives, both can in fact be characterized in a similar way: the estimator is obtained by minimizing an objective function that involves the minimizer of another objective function (hence we call it "coupled"). In what follows we provide a general coupled estimation framework and discuss the motivation for using it. We then review the coupled estimators of the relative value function and the ratio function in liao2019off.

### 4.1 Coupled estimation framework

Consider a setting where the true parameter (or function), $\theta^*$, can be characterized as the minimizer of the following objective function:

$$\theta^* = \operatorname*{argmin}_{\theta} J(\theta), \qquad J(\theta) = \mathbb{E}\{(l_1 \circ f_\theta)(Z)\}, \tag{4.1}$$

where $l_1$ is a loss function composed with $f_\theta$ (e.g., the squared loss $l_1(x) = x^2$ combined with a linear model $f_\theta$), and $Z$ is some random vector. If we can directly evaluate $f_\theta(Z)$ (e.g., in a regression problem where $f_\theta(Z)$ is the residual), then we can estimate $\theta^*$ by the classic M-estimator, $\operatorname{argmin}_\theta \mathbb{P}_n\{(l_1 \circ f_\theta)(Z)\}$.

The setting we encounter when estimating the nuisance functions is one in which $f_\theta$ takes the form $f_\theta(X) = \mathbb{E}\{h_\theta(X, Y) \mid X\}$, where $Y$ is another random vector and $f_\theta$ cannot be directly evaluated because we do not have access to the conditional expectation. A natural idea to remedy this is to replace the unknown $f_\theta(X)$ by $h_\theta(X, Y)$ and estimate $\theta^*$ by $\operatorname{argmin}_\theta \mathbb{P}_n\{(l_1 \circ h_\theta)(X, Y)\}$. Unfortunately, this estimator is biased in general. To see this, suppose $l_1$ is the squared loss. The limit of the new objective function is then $J(\theta) + \mathbb{E}\{\sigma^2_\theta(X)\}$, where $\sigma^2_\theta(X) = \operatorname{Var}\{h_\theta(X, Y) \mid X\}$. The minimizer of this limit is not necessarily $\theta^*$ unless further conditions are imposed (e.g., the conditional variance term does not depend on $\theta$, which is often not the case in our setting).

The high-level idea of coupled estimation is to first estimate $f_\theta$ for each $\theta$, denoted by $\hat{f}_\theta$, and then estimate $\theta^*$ by the plug-in estimator, $\operatorname{argmin}_\theta \mathbb{P}_n\{(l_1 \circ \hat{f}_\theta)(X)\}$. A standard empirical risk minimization can be applied to obtain a consistent estimator of $f_\theta$, e.g., $\hat{f}_\theta = \operatorname{argmin}_{g \in \mathcal{G}} \mathbb{P}_n\{h_\theta(X, Y) - g(X)\}^2$ for a function space $\mathcal{G}$ used to approximate $f_\theta$. We call the estimator coupled because the objective function defining the outer minimizer involves $\hat{f}_\theta$, which is itself the minimizer of another objective function for each $\theta$.

### 4.2 Relative value function estimator

Recall that the doubly robust estimator requires an estimate of $U^\pi$, and hence of $Q^\pi$. Since $Q^\pi$ enters (3.3) only through differences, it is enough to learn one specific version of $Q^\pi$. More specifically, define the shifted value function $\tilde{Q}^\pi(s, a) = Q^\pi(s, a) - Q^\pi(s^*, a^*)$ for some fixed state-action pair $(s^*, a^*)$. By restricting to functions $Q$ with $Q(s^*, a^*) = 0$, the solution of the Bellman equation (3.2) is unique and given by $\{V(\pi), \tilde{Q}^\pi\}$. Below we derive a coupled estimator of the shifted value function, $\tilde{Q}^\pi$, using the coupled estimation framework in Section 4.1.

Let $Z_t = (S_t, A_t, R_{t+1}, S_{t+1})$ be the transition sample at time $t$. For a given pair $(\eta, Q)$, let

$$\delta^\pi(Z_t; \eta, Q) = R_{t+1} + \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t) - \eta$$

be the so-called temporal difference (TD) error. The Bellman equation then states that $\mathbb{E}[\delta^\pi(Z_t; V(\pi), Q^\pi) \mid S_t = s, A_t = a] = 0$ for every state-action pair, $(s, a)$. As a result, we have

$$\{V(\pi), \tilde{Q}^\pi\} \in \operatorname*{argmin}_{\eta, Q} \mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} \big( \mathbb{E}[\delta^\pi(Z_t; \eta, Q) \mid S_t, A_t] \big)^2 \Big].$$

Note that above we choose the squared loss for simplicity; a general loss function can also be applied. We see that this fits the coupled estimation framework presented in the previous section: $\theta$ corresponds to $(\eta, Q)$ and $f_\theta$ becomes the Bellman error, i.e., $\mathbb{E}[\delta^\pi(Z_t; \eta, Q) \mid S_t, A_t]$. The above characterization involves the average reward, $\eta$. Thus we need to jointly estimate both the relative value function and the average reward.

We use $\mathcal{F}$ and $\mathcal{G}$ to denote two classes of functions of the state-action pair. We use $\mathcal{F}$ to model the shifted value function and thus require $Q(s^*, a^*) = 0$ for all $Q \in \mathcal{F}$. We use $\mathcal{G}$ to approximate the Bellman error. In addition, $J_1$ and $J_2$ are two regularizers that measure the complexities of these two function classes, respectively. Given the tuning parameters $(\lambda_n, \mu_n)$, the coupled estimator, denoted by $(\hat{\eta}^\pi_n, \hat{Q}^\pi_n)$, is obtained by solving

$$(\hat{\eta}^\pi_n, \hat{Q}^\pi_n) = \operatorname*{argmin}_{(\eta, Q) \in \mathbb{R} \times \mathcal{F}} \mathbb{P}_n \Big[ \frac{1}{T} \sum_{t=1}^{T} \hat{g}^\pi_n(S_t, A_t; \eta, Q)^2 \Big] + \lambda_n J_1^2(Q), \tag{4.2}$$

where $\hat{g}^\pi_n(\cdot, \cdot; \eta, Q)$ is the estimated projected Bellman error at $(\eta, Q)$:

$$\hat{g}^\pi_n(\cdot, \cdot; \eta, Q) = \operatorname*{argmin}_{g \in \mathcal{G}} \mathbb{P}_n \Big[ \frac{1}{T} \sum_{t=1}^{T} \big( \delta^\pi(Z_t; \eta, Q) - g(S_t, A_t) \big)^2 \Big] + \mu_n J_2^2(g). \tag{4.3}$$

Given the estimated (shifted) relative value function, $\hat{Q}^\pi_n$, we form the estimator of $U^\pi$ by plugging $\hat{Q}^\pi_n$ into (3.3).

Throughout this paper, we assume that the tuning parameters are policy-free, that is, $(\lambda_n, \mu_n)$ do not depend on the policy. In settings where the policy class is highly complex and the corresponding relative value functions are very different, it could be beneficial to select the tuning parameters locally, at the cost of a higher computational burden.

Recall that the goal here is to estimate the relative value function and then plug it into the doubly robust estimator (3.6); the estimated average reward $\hat{\eta}^\pi_n$ above is only used to help estimate the relative value function. In fact, liao2019off proposed using $\hat{\eta}^\pi_n$ itself to estimate the average reward. The advantage of our doubly robust estimator (3.6) is that consistency is guaranteed as long as one of the nuisance functions is estimated consistently (Theorem 3.2).

### 4.3 Ratio function estimator

We now derive the coupled estimator for the ratio function using the coupled estimation framework. We first introduce a scaled version of the ratio function, $e^\pi$, which is then used to define a function $H^\pi$ akin to the relative value function; we then show that the estimation of $H^\pi$ fits in the coupled estimation framework. Specifically, define

$$e^\pi(s,a) = \frac{\omega^\pi(s,a)}{\sum_{\tilde{s},\tilde{a}} \omega^\pi(\tilde{s},\tilde{a})\, d^\pi(\tilde{s})\, \pi(\tilde{a}|\tilde{s})}. \qquad (4.4)$$

By definition, $\sum_{s,a} e^\pi(s,a)\, d^\pi(s)\, \pi(a|s) = 1$. Viewing $1 - e^\pi$ as a "reward function", the "average reward" of $1 - e^\pi$ is constant and equal to zero under Assumption 1. In addition, we can define the "relative value function" of policy $\pi$ under the new MDP:

$$H^\pi(s,a) = \lim_{t^* \to \infty} \frac{1}{t^*}\sum_{t=1}^{t^*} \mathbb{E}^\pi\left[\sum_{k=1}^{t}\{1 - e^\pi(S_k,A_k)\} \,\Big|\, S_1 = s, A_1 = a\right]. \qquad (4.5)$$

Note that $H^\pi$ is well-defined under Assumption 1. Furthermore, consider the following Bellman-like equation:

$$\mathbb{E}^\pi\left\{1 - e^\pi(S_t,A_t) + H(S_{t+1},A_{t+1}) \,\big|\, S_t = s, A_t = a\right\} = H(s,a). \qquad (4.6)$$

Note that since the "average reward" is zero, the above equation involves only $H$. The set of solutions of (4.6) can be shown to be $\{H^\pi + c : c \in \mathbb{R}\}$.
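The definitions (4.4)-(4.6) can be checked numerically on a small MDP. The sketch below (all transition probabilities and both policies are hypothetical numbers chosen only for illustration) computes $\omega^\pi$, the scaled ratio $e^\pi$, verifies that the "average reward" of $1 - e^\pi$ is zero, solves the associated Poisson equation for $H$, and confirms the Bellman-like equation (4.6):

```python
import numpy as np

# Tiny 2-state, 2-action MDP; numbers are hypothetical.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])        # target policy pi(a|s)
b = np.array([[0.5, 0.5], [0.5, 0.5]])         # behavior policy b(a|s)

def stationary(policy):
    # Stationary distribution of the state chain induced by `policy`.
    Ps = np.einsum('sa,sat->st', policy, P)
    vals, vecs = np.linalg.eig(Ps.T)
    d = np.real(vecs[:, np.argmax(np.real(vals))])
    return d / d.sum()

d_pi, d_b = stationary(pi), stationary(b)
omega = (d_pi[:, None] * pi) / (d_b[:, None] * b)   # ratio omega(s, a)
e = omega / np.sum(omega * d_pi[:, None] * pi)      # scaled ratio, as in (4.4)

# "Average reward" of the reward 1 - e under pi: zero by construction,
# since e integrates to one under d_pi x pi.
avg = np.sum(d_pi[:, None] * pi * (1.0 - e))

# Solve the Poisson equation H = (1 - e) + P^pi H, with the convenient
# normalization E_{d_pi x pi}[H] = 0, flattening (s, a) into 4 indices.
n = 4
Ppi = np.einsum('sat,tb->satb', P, pi).reshape(n, n)
r = (1.0 - e).reshape(n)
Asys = np.vstack([np.eye(n) - Ppi, (d_pi[:, None] * pi).reshape(1, n)])
H = np.linalg.lstsq(Asys, np.append(r, 0.0), rcond=None)[0]

# Bellman-like equation (4.6): 1 - e + P^pi H = H at every (s, a).
residual = np.max(np.abs(r + Ppi @ H - H))
print(avg, residual)
```

Both the "average reward" and the Bellman residual come out numerically zero, and one can also check that $\omega^\pi$ has unit mean under the behavior distribution.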

Below we construct a coupled estimator for a shifted version of $H^\pi$, denoted by $\tilde{H}^\pi$. Recall that $Z_t$ is the transition sample at time $t$. For a given state-action function, $H$, let $\Delta^\pi(Z_t;H)$ be the associated TD-type error from (4.6). As a result of the above Bellman-like equation and the orthogonality property (3.5), we know that

$$\tilde{H}^\pi \in \operatorname*{argmin}_{H} \; \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}[\Delta^\pi(Z_t;H) \mid S_t, A_t]\right)^2\right].$$

Now it can be seen that the estimation of $\tilde{H}^\pi$ fits into the coupled estimation framework (4.1): the error term is $\Delta^\pi(Z_t;H)$ and its conditional expectation plays the role of the Bellman error. With a slight abuse of notation, we use $\mathcal{F}$ to approximate $\tilde{H}^\pi$ and $\mathcal{G}$ to form the approximation of the projected error. The coupled estimator, $\hat{H}^\pi_n$, is then found by solving

$$\hat{H}^\pi_n = \operatorname*{argmin}_{H \in \mathcal{F}} \; \mathbb{P}_n\left[\frac{1}{T}\sum_{t=1}^{T} \hat{g}^\pi_n(S_t,A_t;H)^2\right] + \lambda_n' J_1^2(H), \qquad (4.7)$$

where for any $H \in \mathcal{F}$, $\hat{g}^\pi_n(\cdot,\cdot;H)$ solves

$$\hat{g}^\pi_n(\cdot,\cdot;H) = \operatorname*{argmin}_{g \in \mathcal{G}} \; \mathbb{P}_n\left[\frac{1}{T}\sum_{t=1}^{T}\left(\Delta^\pi(Z_t;H) - g(S_t,A_t)\right)^2\right] + \mu_n' J_2^2(g). \qquad (4.8)$$

Recall that $e^\pi$ can be written in terms of $H^\pi$ by (4.6). Given the estimator, $\hat{H}^\pi_n$, we estimate $e^\pi$ by plugging $\hat{H}^\pi_n$ into this relation, yielding $\hat{e}^\pi_n$. Since $e^\pi$ is a scaled version of $\omega^\pi$ up to a constant, we finally construct the estimator for the ratio, $\hat{\omega}^\pi_n$, by rescaling $\hat{e}^\pi_n$, that is,

$$\hat{\omega}^\pi_n(s,a) = \frac{\hat{e}^\pi_n(s,a)}{\mathbb{P}_n\left[\frac{1}{T}\sum_{t=1}^{T}\hat{e}^\pi_n(S_t,A_t)\right]}. \qquad (4.9)$$
###### Remark 3.

The ratio function estimator is the same as the one developed in liao2019off, but here we provide more insight into its connection to the coupled estimation framework. More importantly, in the following section we provide a finite-sample error bound for this ratio function estimator that holds uniformly over the policy class, as an essential step in establishing the regret bound for our learned policy. This ratio function estimator differs from those in most of the existing literature, such as liu2018breaking; uehara2019minimax; nachum2019dualdice; zhang2020gendice, which are obtained by min-max estimating methods. For example, liu2018breaking aimed to estimate the ratio between the stationary distributions induced by the target policy and a known, Markovian, time-stationary behavior policy, which is then used to estimate the average reward of a given policy. This is not suitable for settings, such as observational studies, where the behavior policy is history-dependent. uehara2019minimax estimated the ratio, $\omega^\pi$, based on the observation that for every state-action function $f$,

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\omega^\pi(S_t,A_t)\sum_{a'}\pi(a'|S_{t+1})f(S_{t+1},a') - \omega^\pi(S_t,A_t)f(S_t,A_t)\right)\right] = 0,$$

with the restriction that $\omega^\pi$ has unit mean under the data distribution. They then constructed their estimator by solving the empirical version of the following min-max optimization problem:

$$\min_{\omega \in \Delta} \max_{f \in \mathcal{F}'} \mathbb{E}^2\left[\sum_{t=1}^{T}\left(\omega(S_t,A_t)\sum_{a'}\pi(a'|S_{t+1})f(S_{t+1},a') - \omega(S_t,A_t)f(S_t,A_t)\right)\right],$$

where $\Delta$ is a simplex and $\mathcal{F}'$ is a set of discriminator functions. This method minimizes an upper bound on the bias of their average reward estimator when the state-action value function is contained in $\mathcal{F}'$. They proved consistency of their ratio and average reward estimators in the parametric setting, that is, where $\omega^\pi$ can be modelled parametrically and $\mathcal{F}'$ is a finite-dimensional space. Subsequently, zhang2020gendice developed a general min-max estimator by considering a variational $f$-divergence, which subsumes the case in uehara2019minimax. Unfortunately, there are no error-bound guarantees for the ratio function estimators developed in these two papers. Our ratio estimator appears closely related to the one developed by nachum2019dualdice, as they also formulated the ratio estimator as the minimizer of a loss function. However, relying on Fenchel duality, they still used a min-max method to estimate the ratio; furthermore, their method cannot be applied in the average reward setting. Instead of min-max estimators, we propose coupled estimation in this section. This facilitates the derivation of estimation error bounds, as will be seen below. We derive the estimation error of the ratio function, which enables a strong theoretical guarantee and, ultimately, the efficiency of our average reward estimator without imposing restrictive parametric assumptions on the nuisance function estimators; see Section 5 below.
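For contrast with the coupled estimator, the moment condition above can be turned into a simple tabular sketch of the min-max idea: taking $\omega$ to be one weight per state-action pair and the discriminators to be spanned by one-hot features, the inner maximum over the unit coefficient ball has the closed form $\|Cw\|^2$ for a data matrix $C$, so (dropping the simplex constraint for simplicity) the outer minimum reduces to a ridge-regularized linear solve. The MDP, sample size, and ridge level below are all hypothetical illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s'], hypothetical
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])        # target policy
T = 20000

# One trajectory under a uniform behavior policy.
S = np.empty(T, int); A = np.empty(T, int); S2 = np.empty(T, int)
s = 0
for t in range(T):
    a = rng.integers(2)
    s2 = rng.choice(2, p=P[s, a])
    S[t], A[t], S2[t] = s, a, s2
    s = s2

idx = S * 2 + A                                 # flattened (s, a) index
onehot = np.eye(4)
C = np.zeros((4, 4))
for t in range(T):
    # E_pi[xi(S_{t+1}, .)] - xi(S_t, A_t) for one-hot discriminators xi.
    g = pi[S2[t], 0] * onehot[S2[t] * 2] + pi[S2[t], 1] * onehot[S2[t] * 2 + 1]
    C[:, idx[t]] += (g - onehot[idx[t]]) / T

# Minimize ||C w||^2 subject to the empirical normalization
# (1/T) sum_t w[S_t, A_t] = 1, via a tiny ridge and a final rescaling.
p_hat = np.bincount(idx, minlength=4) / T
w = np.linalg.solve(C.T @ C + 1e-8 * np.eye(4), p_hat)
w = w / (p_hat @ w)
print(np.round(w, 3))        # close to the true stationary ratios
```

In effect the solve picks out the (normalized) null vector of $C$, i.e., the weights satisfying the empirical stationarity equations, which converge to the true ratio $\omega^\pi$ as $T$ grows.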

## 5 Theoretical Results

### 5.1 Regret bound

In this section, we provide a finite-sample bound on the regret of the estimated policy $\hat{\pi}_n$ defined in (2.4), i.e., the difference between the optimal average reward in the policy class and the average reward of the estimated policy.

Consider a state-action function, $f$. Let $I$ be the identity operator, i.e., $If = f$. Denote the conditional expectation operator under $\pi$ by $P^\pi$, i.e., $(P^\pi f)(s,a) = \mathbb{E}^\pi[f(S_{t+1},A_{t+1}) \mid S_t = s, A_t = a]$. Let $\mu^\pi(f)$ be the expectation of $f$ under the stationary distribution induced by $\pi$. Denote by $\|\cdot\|_{tv}$ the total variation distance between two probability measures. For a function $f$, define $\|f\|_\infty = \sup_{s,a}|f(s,a)|$. For a set $\mathcal{S}$ and $M > 0$, let $\mathcal{B}(\mathcal{S}, M)$ be the class of bounded functions on $\mathcal{S}$ such that $\|f\|_\infty \le M$. Denote by $N(\epsilon, \mathcal{F}, \|\cdot\|)$ the $\epsilon$-covering number of a set of functions, $\mathcal{F}$, with respect to the norm, $\|\cdot\|$.

We make use of the following assumption on the parameterized policy class $\Pi = \{\pi_\theta : \theta \in \Theta\}$.

###### Assumption 3.

The policy class, $\Pi = \{\pi_\theta : \theta \in \Theta\}$, satisfies:

1. (3-1) $\Theta$ is compact and the action space $\mathcal{A}$ is finite.

2. (3-2) There exists $L_\Theta < \infty$ such that for all $\theta_1, \theta_2 \in \Theta$ and every $(s, a)$, the following holds:

   $$|\pi_{\theta_1}(a|s) - \pi_{\theta_2}(a|s)| \le L_\Theta \|\theta_1 - \theta_2\|_2.$$

3. (3-3) There exist constants $C_0 > 0$ and $\beta \in (0, 1)$ such that for every $\theta \in \Theta$ (writing $\pi = \pi_\theta$), the following hold for all $t \ge 1$:

   $$\|P^\pi(S_t = \cdot \mid S_1 = s) - d^\pi(\cdot)\|_{tv} \le C_0 \beta^t, \qquad (5.1)$$
   $$\|(P^\pi)^t f - \mu^\pi(f)\| \le C_0 \|f\| \beta^t. \qquad (5.2)$$
###### Remark 4.

The Lipschitz property (3-2) of the policy class is used to control the complexity of the nuisance functions induced by $\Pi$, that is, $\{\tilde{Q}^\pi : \pi \in \Pi\}$ and $\{\omega^\pi : \pi \in \Pi\}$. This is commonly assumed in finite-time horizon problems (e.g., zhou2017residual). Our analysis can be extended to more general policy classes if a similar complexity property holds for these two function classes. Intuitively, the constant $\beta$ in (3-3) relates to the "mixing time" of the Markov chain induced by $\pi_\theta$. A similar assumption was used by van1998learning and liao2019off in the average reward setting.
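The geometric mixing required in (5.1) can be verified directly for a small chain. In the sketch below (transition numbers hypothetical), the total-variation distance between $P^\pi(S_t = \cdot \mid S_1 = s)$ and the stationary distribution decays geometrically, with rate $\beta$ equal to the modulus of the second eigenvalue of the transition matrix:

```python
import numpy as np

# State chain under some target policy; entries are hypothetical but make
# the chain irreducible and aperiodic, so it mixes geometrically as in (5.1).
Ppi = np.array([[0.69, 0.31],
                [0.38, 0.62]])

vals, vecs = np.linalg.eig(Ppi.T)
d = np.real(vecs[:, np.argmax(np.real(vals))])
d /= d.sum()                                   # stationary distribution d_pi

tv = []                                        # max_s TV(P^t(s, .), d_pi)
Pt = np.eye(2)
for t in range(1, 13):
    Pt = Pt @ Ppi
    tv.append(0.5 * np.abs(Pt - d).sum(axis=1).max())

beta = abs(np.sort(np.real(vals))[0])          # modulus of second eigenvalue
ratios = [tv[i + 1] / tv[i] for i in range(len(tv) - 1)]
print(beta, tv[:3])
```

For a two-state chain the decay is exactly geometric, so each successive ratio of total-variation distances equals $\beta$ up to floating-point error.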

Recall that we use the same pair of function classes $(\mathcal{F}, \mathcal{G})$ in the coupled estimation of both $\tilde{Q}^\pi$ and $\tilde{H}^\pi$. We make use of the following assumptions on $(\mathcal{F}, \mathcal{G})$.

###### Assumption 4.

The function classes, $(\mathcal{F}, \mathcal{G})$, satisfy the following:

1. (4-1) $\mathcal{F}$ and $\mathcal{G}$ are uniformly bounded in sup norm.

2. (4-2) Every $f \in \mathcal{F}$ satisfies the mean-zero normalization imposed on the shifted value functions.

3. (4-3) The regularization functionals, $J_1$ and $J_2$, are pseudo norms induced by inner products $\langle \cdot, \cdot \rangle_{J_1}$ and $\langle \cdot, \cdot \rangle_{J_2}$, respectively.

4. (4-4) Let $\mathcal{F}_M = \{f \in \mathcal{F} : J_1(f) \le M\}$ and $\mathcal{G}_M = \{g \in \mathcal{G} : J_2(g) \le M\}$. There exist $C_1 > 0$ and $\alpha \in (0, 1)$ such that for any $\epsilon, M > 0$,

   $$\max\{\log N(\epsilon, \mathcal{G}_M, \|\cdot\|_\infty),\; \log N(\epsilon, \mathcal{F}_M, \|\cdot\|_\infty)\} \le C_1 \left(\frac{M}{\epsilon}\right)^{2\alpha}.$$
###### Remark 5.

The boundedness assumption on $\mathcal{F}$ and $\mathcal{G}$ is used to simplify the analysis and can be relaxed by truncating the estimators. We impose the mean-zero normalization for all $f \in \mathcal{F}$ because $\mathcal{F}$ is used to model $\tilde{Q}^\pi$ and $\tilde{H}^\pi$, which by definition satisfy this normalization. In Section 6, we show how to shape an arbitrary kernel function to ensure this is satisfied automatically when $\mathcal{F}$ is an RKHS. The complexity assumption (4-4) on $\mathcal{F}$ and $\mathcal{G}$ is satisfied by common function classes, for example RKHSs and Sobolev spaces (steinwart2008support; gyorfi2006distribution).

We now introduce the assumption that is used to bound the estimation error of the relative value function uniformly over the policy class. Define the projected Bellman error at $(\eta, Q)$:

$$g^*_\pi(\cdot,\cdot;\eta,Q) := \operatorname*{argmin}_{g \in \mathcal{G}} \; \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\{\delta^\pi(Z_t;\eta,Q) - g(S_t,A_t)\}^2\right].$$
###### Assumption 5.

The triplet, $(\Pi, \mathcal{F}, \mathcal{G})$, satisfies the following:

1. (5-1) $\tilde{Q}^\pi \in \mathcal{F}$ for every $\pi \in \Pi$, and $\sup_{\pi \in \Pi} J_1(\tilde{Q}^\pi) < \infty$.

2. (5-2) $0 \in \mathcal{G}$.

3. (5-3) There exists $c > 0$ such that $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} g^*_\pi(S_t,A_t;\eta,Q)^2\big] \ge c\,\{(\eta - V(\pi))^2 + \|Q - \tilde{Q}^\pi\|^2\}$ for all $\pi \in \Pi$, $\eta$, and $Q \in \mathcal{F}$.

4. (5-4) There exist two constants $c_1, c_2$ such that $J_2(g^*_\pi(\cdot,\cdot;\eta,Q)) \le c_1 J_1(Q) + c_2$ holds for all $\pi \in \Pi$, $\eta$, and $Q \in \mathcal{F}$.

###### Remark 6.

Note that in the coupled estimation of $\tilde{Q}^\pi$, we do not require the much stronger condition that the Bellman error at every tuple $(\pi, \eta, Q)$ is correctly modeled by $\mathcal{G}$. In other words, the Bellman error $\mathbb{E}[\delta^\pi(Z_t;\eta,Q) \mid S_t, A_t]$ does not necessarily stay in $\mathcal{G}$. Instead, the combination of conditions (5-2) and (5-3) is enough to guarantee the consistency of the coupled estimator (recall that the Bellman error is zero at $(V(\pi), \tilde{Q}^\pi)$). The last condition (5-4) essentially requires that the transition matrix is sufficiently smooth so that the complexity of the projected Bellman error, $J_2(g^*_\pi(\cdot,\cdot;\eta,Q))$, can be controlled by $J_1(Q)$, the complexity of $Q$ (see farahmand2016regularized for an example).

A similar set of conditions is employed to bound the estimation error of the ratio function. For $H \in \mathcal{F}$ and $\pi \in \Pi$, define the projected error:

$$g^*_\pi(\cdot,\cdot;H) = \operatorname*{argmin}_{g \in \mathcal{G}} \; \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\{\Delta^\pi(Z_t;H) - g(S_t,A_t)\}^2\right].$$
###### Assumption 6.

The triplet, $(\Pi, \mathcal{F}, \mathcal{G})$, satisfies the following:

1. (6-1) $\tilde{H}^\pi \in \mathcal{F}$ for every $\pi \in \Pi$, and $\sup_{\pi \in \Pi} J_1(\tilde{H}^\pi) < \infty$.

2. (6-2) $0 \in \mathcal{G}$.

3. (6-3) There exists $c' > 0$ such that $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} g^*_\pi(S_t,A_t;H)^2\big] \ge c'\,\|H - \tilde{H}^\pi\|^2$ for all $\pi \in \Pi$ and $H \in \mathcal{F}$.

4. (6-4) There exist two constants $c_1', c_2'$ such that $J_2(g^*_\pi(\cdot,\cdot;H)) \le c_1' J_1(H) + c_2'$ holds for all $\pi \in \Pi$ and $H \in \mathcal{F}$.

###### Remark 7.

As in the case of estimating the relative value function, we do not require the correct modelling of