1 Introduction
We study the problem of policy optimization in Markov Decision Processes over infinite time horizons (puterman1994markov). We focus on the batch (i.e., offline) setting, where historical data consisting of multiple trajectories have been previously collected using some behavior policy. Our goal is to learn a new policy with guaranteed performance when implemented in the future. In this work, we develop a data-efficient method to learn the policy that optimizes the long-term average reward in a pre-specified policy class from a training set composed of multiple trajectories. Furthermore, we establish a finite-sample guarantee on the regret, i.e., the difference between the average reward of the optimal policy in the class and the average reward of the policy estimated by our proposed method. This work is motivated by the development of just-in-time adaptive interventions in mobile health (mHealth) applications (Nahum2017). Our method can be used to learn a treatment policy that maps the real-time collected information about the individual’s status and context to a particular treatment at each of many decision times to support health behaviors.
Sequential decision-making problems have been extensively studied in statistics (dynamic treatment regimes, murphy2003optimal), econometrics (welfare maximization, manski2004statistical) and computer science (reinforcement learning, sutton2018reinforcement). Recently, tremendous progress has been made in developing efficient methods that use historical data to learn the optimal policy with performance guarantees in the finite-time horizon setting; see the recent papers by zhou2017residual; athey2017efficient; kallus2018balanced; zhao2019efficient for the single-decision point problem, zhao2015new; luedtke2016super; nie2019learning for the multiple-decision point problem, and the recent review paper by kosorok2019precision and references therein. Many mHealth applications are designed for long-term use and often involve a large number of decision time points (e.g., hundreds or thousands). For example, in HeartSteps, a physical activity mHealth study, there are five decision times per day, resulting in thousands of decision times over a year-long study. Many existing methods developed for the finite-time horizon problem are based on the idea of importance sampling (precup2000eligibility), which involves products of importance weights between the behavior policy and the target policy. This may suffer from a large variance, especially in problems with a large number of time points (voloshin2019empirical), as in the case of mHealth. Methods that are based on the idea of backward iteration (e.g., Q-learning) also become impractical as the number of time points increases (laber2014dynamic).
In this work, we adopt the infinite time horizon homogeneous MDP framework. Although the training data consist of trajectories of finite length, the Markov and time-stationarity assumptions make it possible to evaluate and optimize the policy over infinite time horizons. In the infinite time horizon setting, the majority of existing methods focus on optimizing the discounted sum of rewards (sutton2018reinforcement); see the recent works in statistics by luckett2019estimating; ertefaie2018constructing; shi2020statistical. The discounted formulation weighs immediate rewards more heavily than rewards further in the future, which is practical in some applications (e.g., finance). The contraction property of the Bellman operator due to discounting also simplifies the associated analyses (tsitsiklis2002average; sutton2018reinforcement). For mHealth applications, choosing an appropriate discount rate could be nontrivial. The rewards (i.e., the health outcomes) in the distant future are as important as the near-term ones, especially when considering the effect of habituation and burden. This suggests using a large discount rate. However, it is well known that algorithms developed in the discounted setting can become increasingly unstable as the discount rate goes to one; see for example naik2019discounted.
We propose using the long-term average reward as the criterion in optimizing the policy. The average reward formulation has a long history in dynamic programming (howard1960dynamic) and reinforcement learning (mahadevan1996average). In fact, the long-term average reward can be viewed as the limiting version of the discounted sum of rewards as the discount rate approaches one (bertsekas1995dynamic). We believe that the average reward framework provides a good approximation to the long-term performance of a desired treatment policy in mHealth. Indeed, it can be shown that under regularity conditions the average of the expected rewards collected over a finite time horizon converges sublinearly to the average reward as time goes to infinity (hernandez2012further). Therefore, a policy that optimizes the average reward would approximately maximize the sum of the rewards over a long time horizon.
In the three settings discussed above, i.e., finite horizon, infinite horizon discounted sum of rewards, and infinite horizon average reward, many methods seek the optimal policy (with no restriction) by first estimating the optimal value function and then recovering the optimal policy; see for example ormoneit_kernelbased_2003; lagoudakis2003least; ernst_treebased_2005; munos2008finite; antos_learning_2008; antos2008fitted; ertefaie2018constructing; yang2019sample. A critical assumption behind these methods is the correct modeling of the possibly nonsmooth optimal value function, which could be highly complex in practice and thus requires the use of a very flexible function class. The use of a flexible function class usually results in a learned policy that is also complex, which is problematic if interpretability is important. Furthermore, when the training data are limited, the flexible function class could overfit the data, and thus the variance of the estimated value function and the corresponding policy could be high.
We instead aim to learn the optimal policy in a pre-specified policy class; see for example zhang2012robust; zhang2013robust; zhou2017residual; zhao2015new; zhao2019efficient; athey2017efficient in finite time horizon problems and luckett2019estimating; murphy2016batch; liu2019off in infinite time horizon problems. One can use prior knowledge to design the policy class (e.g., selection of the variables entering the policy) and thus ensure the interpretability of the learned policy. The restriction to a parsimonious policy class reduces the variance of the learned policy, although it induces bias when the optimal policy is not in the class (i.e., a bias-variance trade-off). We consider a class of parametric, stochastic (i.e., randomized) policies. Recall that the motivation of this work is to construct a good treatment policy for use in a future study. To facilitate the analysis after the study is over (e.g., causal inference or off-policy evaluation/learning), we focus on stochastic policies. Furthermore, it is important to ensure sufficient exploration, which can be controlled by restricting the policy class, i.e., putting constraints on the parameter space (see Section 6 for an example). Similar to our work, murphy2016batch considered the average reward formulation and developed the “batch, off-policy actor–critic” algorithm to learn the optimal policy in a class. Unfortunately, they did not provide any performance guarantees. luckett2019estimating considered the infinite horizon discounted reward setting and also developed an interesting method to estimate the optimal policy in a parameterized policy class. They evaluated each policy by the discounted sum of rewards where the initial state is averaged over some reference distribution. Under a parametric assumption on the value function, they showed that the estimated optimal value converges to a Gaussian distribution and the estimated policy parameters converge in probability. However, they did not provide a regret guarantee for the learned policy.
In order to efficiently learn the policy, the main challenge is to construct a good estimator for evaluating policies that is both data-efficient and performs uniformly well when optimizing over the policy class. For this purpose, we develop a novel doubly robust estimator for the average reward of a given policy and show that the proposed estimator achieves the semiparametric efficiency bound under certain conditions on the estimation error of the nuisance functions (see Section 5). Estimating the value of a policy is known as the off-policy policy evaluation (OPE) problem in the computer science literature. Doubly robust estimators have been developed in the finite time horizon problem (robins1994estimation; murphy2001marginal; dudik2014doubly; thomas2016data) and recently in the discounted reward infinite horizon setting (kallus2019double; Tang2020Doubly). Our estimator involves two policy-dependent nuisance functions: the relative state-action value function and the ratio function between the state-action stationary distribution under the target policy and the marginal state-action distribution in the training data. The estimator is doubly robust in the sense that the estimated average reward is consistent when one of these nuisance function estimators is consistent. Recently, liao2019off developed an estimator for the average reward based on minimizing the projected Bellman error (bradtke_linear_1996), which only involves a single nuisance function, the relative value function. They showed that their estimator is asymptotically normal under certain conditions. When the state-action value function is incorrectly modeled, their estimator may incur a large bias. liu2018breaking also developed an estimator for the average reward based on the ratio function. Although they did not provide a theoretical guarantee for their estimator, it can be seen that the consistency of the estimator may require correct modeling of the ratio function. In contrast to these two methods, our estimator is doubly robust, thus providing additional protection against model misspecification. As will be seen in Section 5, the double robustness property of the estimator ensures that the estimation errors of the nuisance functions have minimal impact on the estimation of the average reward, which ultimately leads to the semiparametric efficiency and the optimality of the regret bound.
To use the doubly robust estimator, we need to estimate the two nuisance functions, for which we use the coupled estimation framework. The estimation procedure is coupled in the sense that the estimator is obtained by minimizing an objective function that involves the minimizer of another optimization problem. As opposed to supervised learning problems (e.g., regression), coupled estimation is used to resolve the issue that the outcome variable depends on the target (e.g., the value function), which commonly arises in reinforcement learning problems (antos_learning_2008). The idea of coupled estimation was previously used in antos_learning_2008; farahmand2016regularized to estimate the value function in the discounted reward setting. Recently, liao2019off developed the coupled estimator for the relative value function in the average reward setting and derived a finite-sample bound for a fixed policy. While the ratio function is defined very differently from the value function, it turns out that, similar to the value function, the ratio function can also be characterized as a minimizer of some objective function (see Section 4.3). liao2019off used this characterization to derive a coupled estimator for the ratio function, but they did not provide a finite-sample analysis for the ratio estimator.
As a prerequisite to guarantee the semiparametric efficiency of the doubly robust estimator and, more importantly, to establish the regret bound, we derive finite-sample error bounds for both of the nuisance function estimators, and the obtained error bounds are shown to hold uniformly over the pre-specified class of policies. Although the relative value and ratio estimators are both derived from the same principle (i.e., coupled estimation), it is much harder to bound the estimation error for the ratio function. This is mainly because, in the case of the value function, the Bellman error being zero at the true value function greatly simplifies the analysis (see Section 5 for details). We use an iterative procedure to handle this and obtain a near-optimal error bound for the ratio estimator. To the best of our knowledge, this is the first theoretical result characterizing the ratio estimation error, which might be of independent interest. Recently, researchers have started to realize the important role of the ratio function in OPE problems and have designed various estimators (liu2018breaking; uehara2019minimax; nachum2019dualdice; zhang2020gendice). Different from the coupled formulation used in this work, these methods use a min-max loss formulation. More importantly, unlike our work, they do not provide a theoretical guarantee on the estimation error.
We learn the optimal policy by maximizing the estimated average reward over a policy class and derive a finite-sample upper bound on the regret. We show that our proposed method achieves regret, where is the number of parameters in the policy, is the number of trajectories in the training data and is a constant that can be chosen arbitrarily close to . Here measures the complexity of the function classes in which the nuisance functions are assumed to lie. The use of doubly robust estimation ensures that the estimation error of the nuisance functions is only of lower order in the regret. Unlike in the finite horizon setting (athey2017efficient), the regret analysis in the infinite time setting requires a uniform control of the estimation error of the policy-dependent nuisance functions over the policy class, which makes our analysis much more involved. We believe this is the first regret bound result for infinite time horizon problems in the batch setting.
The rest of the article is organized as follows. Section 2 formalizes the decision-making problem and introduces the average reward MDP. Section 3 presents the proposed method of learning the in-class optimal policy, including the doubly robust estimator for the average reward (Sec. 3.3). In Section 4, the coupled estimators of the policy-dependent nuisance functions are introduced. Section 5 provides a thorough theoretical analysis of the regret bound of our proposed method. In Section 6, we describe a practical optimization algorithm when Reproducing Kernel Hilbert Spaces (RKHSs) are used to model the nuisance functions. We further conduct several simulation studies to demonstrate the promising performance of our method in Section 7. All technical proofs are postponed to the supplementary material.
2 Problem Setup
Suppose we observe a training dataset, that consists of independent, identically distributed (i.i.d.) observations of :
We use to index the decision time. The length of the trajectory, , is assumed nonrandom. is the state at time and is the action (treatment) selected at time . We assume the action space, , is finite. To eliminate unnecessary technical distractions, we assume that the state space, , is finite; this assumption imposes no practical limitations and can be extended to the general state space.
The states evolve according to a time-homogeneous Markov process. For , , and the conditional distribution does not depend on . Denote the conditional distribution by , i.e., . The reward (i.e., outcome) is denoted by , which is assumed to be a known function of , i.e., . We assume the reward is bounded, i.e., . We use to denote the conditional expectation of reward given state and action, i.e., .
Let be the history up to time and the current state, . Denote the conditional distribution of given by . Let . This is often called the behavior policy in the literature. In this work we do not require knowledge of the behavior policy. Throughout this paper, the expectation, , without any subscript is taken with respect to the distribution of the trajectory, , with the actions selected by the behavior policy .
Consider a time-stationary, Markovian policy,
, that takes the state as input and outputs a probability distribution on the action space,
, that is, is the probability of selecting action, , at state, . The average reward of the policy, , is defined as
(2.1) 
where the expectation, , is with respect to the distribution of the trajectory in which the states evolve according to and the actions are chosen by . Note that the limit in (2.1) always exists as is finite (puterman1994markov). The policy,
, induces a Markov chain of states with the transition as
. When the induced Markov chain, , is irreducible, it can be shown (e.g., in puterman1994markov) that the stationary distribution of exists and is unique (denoted by ) and the average reward, (2.1), is independent of the initial state (denoted by ) and equal to
(2.2) 
Throughout this paper we consider only time-stationary, Markovian policies. In fact, it can be shown that the maximal average reward among all possible history-dependent policies can be achieved by some time-stationary, Markovian policy (Theorem 8.1.2 in puterman1994markov). Consider a pre-specified class of such policies, , that is parameterized by . Throughout we assume that the induced Markov chain is irreducible for every policy in the class, as summarized below.
Assumption 1.
For every , the induced Markov chain, , is irreducible.
The goal of this paper is to develop a method that can efficiently use the training data, , to learn the policy that maximizes the average reward over the policy class. We propose to construct , an efficient estimator for the average reward, , for each policy and learn the optimal policy by solving
(2.3) 
The performance of is measured by its regret, defined as
(2.4) 
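The quantities introduced in this section can be made concrete on a small finite MDP. The following is a minimal sketch, not the estimation method of this paper: it assumes the transition model and mean rewards are known, computes the stationary distribution of the induced chain, evaluates the average reward as the stationary expectation of the reward, and reports the regret of each member of a toy policy class. The arrays `P` and `r`, the grid of policies, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution of an irreducible transition matrix."""
    n = P_pi.shape[0]
    # Stack d^T (P - I) = 0 with the normalization sum(d) = 1.
    lhs = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    rhs = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return d

def average_reward(P, r, pi):
    """P[s, a, s'] transitions, r[s, a] mean rewards, pi[s, a] action probabilities."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state chain induced by pi
    d = stationary_distribution(P_pi)
    return float(np.sum(d[:, None] * pi * r))

# Hypothetical two-state, two-action MDP (numbers chosen for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, 0.2],
              [1.0, 0.5]])

# Toy policy class: choose action 1 with probability p in every state.
grid = np.linspace(0.1, 0.9, 9)
values = np.array([average_reward(P, r, np.array([[1 - p, p], [1 - p, p]]))
                   for p in grid])
best_in_class = values.max()
regret = best_in_class - values   # regret of each candidate relative to the class optimum
```

In the batch setting studied here the model is unknown, so the exact values computed above would be replaced by estimates obtained from the training trajectories.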
3 Doubly Robust Estimator for Average Reward
In this section we present a doubly robust estimator for the average reward of a given policy. The estimator is derived from the efficient influence function (EIF). Below we first introduce two functions that occur in the EIF of the average reward. Throughout this section we fix some time-stationary Markovian policy, , and focus on the setting where the induced Markov chain, , is irreducible (Assumption 1).
3.1 Relative value function and ratio function
First, we define the relative value function by
(3.1) 
The above limit is well-defined (puterman1994markov, p. 338). If we further assume the induced Markov chain is aperiodic, then the Cesàro limit in (3.1) can be replaced by . is often called the relative value function in that represents the expected total difference between the reward and the average reward under the policy, , when starting at state, , and action, .
The relative value function, , and the average reward, , are closely related by the Bellman equation:
(3.2) 
Note that in the above expectation . It is known that under the irreducibility assumption, the set of solutions of (3.2) is given by where for all ; see puterman1994markov, p. 343 for details. As we will see in Section 4.2, the Bellman equation provides the foundation for estimating the relative value function. Note that in the Bellman equation (and the efficient influence function presented in the next section), occurs in the form of the difference between the average value of the next state and the value of the current state-action pair. It would be convenient to define
(3.3) 
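For a finite MDP with a known model, the Bellman equation can be solved directly as a linear system. The sketch below assumes the standard average-reward form, Q(s,a) + eta = r(s,a) + sum over s' of P(s'|s,a) times sum over a' of pi(a'|s') Q(s',a'), and pins down the additive constant by setting Q to zero at a reference pair; the exact normalization used in the text may differ. The example MDP and the names `solve_bellman`, `ref`, `V` and `U` are hypothetical.

```python
import numpy as np

def solve_bellman(P, r, pi, ref=(0, 0)):
    """Solve Q - M Q + eta = r subject to Q[ref] = 0 for a finite MDP.

    P[s, a, s'] transitions, r[s, a] mean rewards, pi[s, a] action probabilities.
    """
    S, A = r.shape
    n = S * A
    # M[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')
    M = np.einsum("sat,tb->satb", P, pi).reshape(n, n)
    lhs = np.zeros((n + 1, n + 1))
    lhs[:n, :n] = np.eye(n) - M
    lhs[:n, n] = 1.0                       # coefficient of eta in every equation
    lhs[n, ref[0] * A + ref[1]] = 1.0      # normalization row: Q[ref] = 0
    rhs = np.concatenate([r.reshape(n), [0.0]])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:n].reshape(S, A), float(sol[n])

# Hypothetical two-state, two-action MDP and a uniformly random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, 0.2],
              [1.0, 0.5]])
pi = np.full((2, 2), 0.5)

Q, eta = solve_bellman(P, r, pi)
V = np.sum(pi * Q, axis=1)                 # policy-averaged value of each next state
U = V[None, None, :] - Q[:, :, None]       # U[s, a, s'] = V(s') - Q(s, a)
bellman_residual = r + np.einsum("sat,sat->sa", P, U) - eta
```

The array `U` is the difference between the policy-averaged value of the next state and the value of the current state-action pair described above, and `bellman_residual` vanishes at the solution.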
We now introduce the ratio function. For , let be the probability mass of the state-action pair at time in generated by the behavior policy, . Denote by the average probability mass across the decision times in of length . Similarly, define as the marginal distribution of and as the average distribution of states in the trajectory . Recall that under Assumption 1, the stationary distribution of exists and is denoted by . We assume the following conditions on the data-generating process.
Assumption 2.
The data-generating process satisfies:

There exists some , such that for all .

The average distribution for all .
Under Assumption 2, it is easy to see that for every state-action pair, . Now we can define the ratio function:
(3.4) 
The ratio function plays a similar role as the importance weight in finite horizon problems. While the classic importance weight only corrects the distribution of actions between behavior policy and target policy, the ratio here also involves the correction of the states’ distribution. The ratio function is connected with the average reward by
An important property of is that for any state-action function ,
(3.5) 
This orthogonality is key to developing the estimator for (see Section 4.3).
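The role of the ratio function can be checked numerically on a finite MDP. The sketch below is an illustration under stated assumptions rather than a restatement of (3.4) and (3.5): it takes the ratio to be the stationary state-action distribution under the target policy divided by a data distribution over state-action pairs, and verifies one orthogonality identity that follows from stationarity, namely that the ratio-weighted expectation of f(S, A) minus the policy-averaged f at the next state is zero for any f. The randomly generated MDP and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'], all entries positive
r = rng.uniform(size=(S, A))                        # mean reward r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)              # target policy pi[s, a]
d_D = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # data distribution over (s, a)

# Stationary distribution of the state chain induced by pi.
P_pi = np.einsum("sa,sat->st", pi, P)
lhs = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
rhs = np.concatenate([np.zeros(S), [1.0]])
d_pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)

# Ratio of the stationary state-action distribution to the data distribution.
omega = d_pi[:, None] * pi / d_D

# The ratio-weighted reward under the data distribution recovers the average reward.
eta = float(np.sum(d_pi[:, None] * pi * r))
eta_weighted = float(np.sum(d_D * omega * r))

# Orthogonality check for an arbitrary state-action function f.
f = rng.normal(size=(S, A))
next_f = np.einsum("sat,tb,tb->sa", P, pi, f)       # E[sum_a' pi(a'|S') f(S', a') | s, a]
orthogonality_gap = float(np.sum(d_D * omega * (f - next_f)))
```

Unlike a classical importance weight, `omega` corrects the distribution of states as well as actions, which is why the stationary distribution `d_pi` enters its numerator.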
3.2 Efficient influence function
In this subsection, we derive the EIF of for a fixed policy under the time-homogeneous Markov Decision Process described in Section 2. Recall that the semiparametric efficiency bound is the supremum of the Cramér-Rao bounds for all parametric submodels (newey1990semiparametric). The EIF is defined as the influence function of a regular estimator that achieves the semiparametric efficiency bound. For more details, refer to bickel1993efficient and van2000asymptotic. The EIF of is given by the following theorem.
3.3 Doubly robust estimator
It is known that the EIF can be used to derive a semiparametric estimator (see, for example, Chap. 25 in van2000asymptotic). We follow this approach. Specifically, suppose and are estimators of and respectively. Then we estimate by solving for in the plug-in estimating equation: , where for any function of the trajectory, , the sample average is denoted as . Denote the solution by , which can be expressed as
(3.6) 
We have the following double robustness of this estimator.
Theorem 3.2.
Suppose and converge in probability uniformly to deterministic limits and for every and . If either or , then converges to in probability.
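The double robustness property can be illustrated at the population level on a finite MDP. The sketch below assumes the estimator has the form of a ratio-weighted average of the reward plus the value difference between the next state and the current state-action pair, normalized by the average ratio; this form is an assumption made for illustration, and expectations replace sample averages. The function `dr_value`, the random MDP, and the deliberately misspecified nuisances `Q_bad` and `omega_bad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
n = S * A
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
r = rng.uniform(size=(S, A))                        # mean reward r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)              # target policy pi[s, a]
d_D = rng.dirichlet(np.ones(n)).reshape(S, A)       # data distribution over (s, a)

# True relative value function and average reward from the Bellman equation.
M = np.einsum("sat,tb->satb", P, pi).reshape(n, n)
lhs = np.zeros((n + 1, n + 1))
lhs[:n, :n] = np.eye(n) - M
lhs[:n, n] = 1.0
lhs[n, 0] = 1.0                                     # normalization Q(0, 0) = 0
sol = np.linalg.solve(lhs, np.append(r.ravel(), 0.0))
Q_true, eta_true = sol[:n].reshape(S, A), float(sol[n])

# True ratio: stationary state-action distribution divided by the data distribution.
stat_lhs = np.vstack([M.T - np.eye(n), np.ones((1, n))])
stat_rhs = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(stat_lhs, stat_rhs, rcond=None)
omega_true = mu.reshape(S, A) / d_D

def dr_value(omega, Q):
    """Population analogue of a normalized, ratio-weighted doubly robust estimate."""
    V = np.sum(pi * Q, axis=1)          # policy-averaged value of each next state
    U = P @ V - Q                       # E[U(s, a, S') | s, a]
    return float(np.sum(d_D * omega * (r + U)) / np.sum(d_D * omega))

# Deliberately wrong nuisance functions.
Q_bad = rng.normal(size=(S, A))
omega_bad = rng.uniform(0.5, 1.5, size=(S, A))

eta_ratio_correct = dr_value(omega_true, Q_bad)     # ratio correct, value wrong
eta_value_correct = dr_value(omega_bad, Q_true)     # value correct, ratio wrong
eta_both_wrong = dr_value(omega_bad, Q_bad)         # no guarantee in this case
```

With the correct ratio the value-difference term averages to zero by stationarity, and with the correct value function the term in parentheses has conditional mean equal to the average reward at every state-action pair, so either case recovers `eta_true`.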
Remark 1.
The uniform convergence in probability can be relaxed into
convergence by using uniform laws of large numbers. The double robustness protects against potential model misspecification since we only require that one of the two models be correct. Moreover, the doubly robust structure can relax the required rate for each of the nuisance function estimators to achieve the semiparametric efficiency bound, especially if we use sample-splitting techniques (see the remark below), as discussed in chernozhukov2018double.
Remark 2.
An alternative way to construct the estimator for the average reward is based on the idea of double/debiased machine learning (a.k.a. cross-fitting, bickel1993efficient and chernozhukov2018double). There is growing interest in using double machine learning in the causal inference and policy learning literature (zhao2019efficient) in order to relax assumptions on the convergence rates of nuisance parameters. The basic idea is to split the data into folds. For each of the folds, construct the estimating equation by plugging in the estimated nuisance functions that are obtained using the remaining folds. The final estimator is obtained by solving the aggregated estimating equations. While cross-fitting requires weaker conditions on the nuisance function estimators, it incurs additional computational cost, especially in our setting where the nuisance functions are policy-dependent and we aim to search for the in-class optimal policy. Besides, this sample-splitting procedure may not be stable when the sample size is relatively small, e.g., in mHealth studies.
4 Estimator for Nuisance Functions
Recall that the doubly robust estimator (3.6) requires the estimation of two nuisance functions, and . It turns out that although these two nuisance functions are defined from two different perspectives, both can in fact be characterized in a similar way. The estimator is obtained by minimizing an objective function that involves a minimizer of another objective function (hence we call it “coupled”). In what follows we provide a general coupled estimation framework and discuss the motivation for using it. We then review the coupled estimators for the relative value function and the ratio function in liao2019off.
4.1 Coupled estimation framework
Consider a setting where the true parameter (or function), , can be characterized as the minimizer of the following objective function:
(4.1) 
where is a loss function composite with (e.g., the squared loss, and the linear model, , where ), and is some random vector. If we can directly evaluate (e.g., in a regression problem where is the residual), then we can estimate by the classic M-estimator, .
The setting that we will encounter when estimating the nuisance functions is one where is of the form , where is another random vector, and cannot be directly evaluated because we do not have access to the conditional expectation. A natural idea to remedy this is to replace the unknown by and estimate by . Unfortunately, this estimator is biased in general. To see this, suppose . We note that the limit of the new objective function, , is then where . The minimizer of is not necessarily unless further conditions are imposed (e.g., is independent of , which is often not the case in our setting).
The high-level idea of coupled estimation is to first estimate for each , denoted by , and then estimate by the plug-in estimator, . A standard empirical risk minimization can be applied to obtain a consistent estimator for , e.g., for some loss function and a function space, to approximate . We call the estimator coupled because the objective function (i.e., ) involves which itself is a minimizer of another objective function (i.e., ) for each .
4.2 Relative value function estimator
Recall that the doubly robust estimator requires an estimate of . It is enough to learn one specific version of . More specifically, define a shifted value function by for some specific state-action pair . By restricting to , the solution of the Bellman equation (3.2) is unique and given by . Below we derive a coupled estimator for the shifted value function, , using the coupled estimation framework in Section 4.1.
Let be the transition sample at time . For a given pair, let
be the so-called temporal difference (TD) error. The Bellman equation then becomes for every state-action pair, . As a result, we have
Note that above we choose the squared loss for simplicity; a general loss function can also be applied. We see that it fits in the coupled estimation framework presented in the previous section. In particular, and becomes the Bellman error, i.e., . The above characterization involves the average reward, . Thus we need to jointly estimate both the relative value function and the average reward.
We use and to denote two classes of state-action functions. We use to model the shifted value function and thus require for all . We use to approximate the Bellman error. In addition, and are two regularizers that measure the complexities of these two function classes respectively. Given the tuning parameters , the coupled estimator, denoted by , is obtained by solving
(4.2) 
where is the projected Bellman error at :
(4.3) 
Given the estimated (shifted) relative value function, , we form the estimation of by
Throughout this paper, we assume that the tuning parameters are policy-free, that is, does not depend on the policy. In the setting where the policy class is highly complex and the corresponding relative value functions are very different, it could be beneficial to select the tuning parameters locally at the cost of a higher computational burden.
Recall that the goal here is to estimate the relative value function and then plug it into the doubly robust estimator (3.6). The above is only used to help estimate the relative value function. In fact, liao2019off proposed using to estimate the average reward. The advantage of our doubly robust estimator (3.6) is that consistency is guaranteed as long as one of the nuisance functions is estimated consistently (Theorem 3.2).
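The two-level structure of the coupled estimator for the relative value function can be sketched in a simple special case. The example below is an assumption-laden illustration, not the estimator in (4.2) and (4.3): both function classes are taken to be linear in one-hot state-action features, both regularizers are ridge penalties, the reward is a deterministic function of the state-action pair, and transitions are sampled independently rather than along trajectories. Under these assumptions the inner problem (estimating the Bellman error) is a ridge regression and the outer problem is a penalized least-squares problem in the average reward and the value coefficients. The tuning parameters `mu` and `lam` and all other names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, n = 3, 2, 50000
K = S * A
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] (hypothetical MDP)
r = rng.uniform(size=(S, A))                   # reward r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)         # target policy pi[s, a]

# Simplification: i.i.d. transitions with (s, a) drawn uniformly, not full trajectories.
s = rng.integers(S, size=n)
a = rng.integers(A, size=n)
cdf = np.cumsum(P[s, a], axis=1)
s_next = np.minimum((rng.uniform(size=n)[:, None] > cdf).sum(axis=1), S - 1)
rew = r[s, a]

# One-hot features of (s, a) and policy-averaged features of the next state.
Phi = np.eye(K)[s * A + a]
Phi_next = np.zeros((n, K))
for b in range(A):
    Phi_next[np.arange(n), s_next * A + b] = pi[s_next, b]

# The TD error is linear in beta = (eta, theta): delta = rew - Z @ beta.
# Dropping the first feature column enforces Q(0, 0) = 0.
Z = np.hstack([np.ones((n, 1)), (Phi - Phi_next)[:, 1:]])

mu, lam = 1e-4, 1e-6                           # hypothetical tuning parameters
G = Phi.T @ Phi + n * mu * np.eye(K)

def project(x):
    """Inner step: ridge regression of x on the one-hot features."""
    return Phi @ np.linalg.solve(G, Phi.T @ x)

# Outer step: minimize the mean squared estimated Bellman error plus a ridge penalty.
HZ, Hy = project(Z), project(rew)
D = np.diag([0.0] + [1.0] * (K - 1))           # penalize theta only, not eta
beta = np.linalg.solve(HZ.T @ HZ / n + lam * D, HZ.T @ Hy / n)
eta_hat = float(beta[0])
Q_hat = np.concatenate([[0.0], beta[1:]]).reshape(S, A)
```

Because the TD error is linear in the unknowns here, the inner minimizer can be applied once to `Z` and `rew` instead of being recomputed for every candidate value function; with richer function classes the nesting is genuine and the two problems must be solved jointly.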
4.3 Ratio function estimator
Below we derive the coupled estimator for the ratio function, using the coupled estimation framework. We first introduce a scaled version of the ratio function (i.e., ), which is then used to define a function (i.e., ), akin to the relative value function. Then we show that the estimation of fits into the coupled estimation framework.
We start with introducing :
(4.4) 
By definition, . Viewing as a “reward function”, the “average reward” of is constant and equal to zero under Assumption 1. In addition, we can define the “relative value function” of policy under the new MDP:
(4.5) 
Note that is well-defined under Assumption 1. Furthermore, consider the following Bellman-like equation:
(4.6) 
Note that since the “average reward” is zero, i.e., , the above equation only involves . The set of solutions of (4.6) can be shown to be .
Below we construct a coupled estimator for a shifted version of , i.e., . Recall is the transition sample at time . For a given state-action function, , let . As a result of the above Bellman-like equation and the orthogonality property (3.5), we know that
Now it can be seen that the estimation of fits into the coupled estimation framework (4.1), i.e., and is . With a slight abuse of notation, we use to approximate and to form the approximation of . The coupled estimator, , is then found by solving
(4.7) 
where for any , solves
(4.8) 
Recall that can be written in terms of by (4.6). Given the estimator, , we estimate by . By the definition of , we have . Since is a scaled version of up to a constant, we finally construct the estimator for ratio, , by scaling , that is,
(4.9) 
Remark 3.
The ratio function estimator is the same as the one developed in liao2019off, while here we provide more insight into its connection to the framework of coupled estimation. More importantly, in the following section, we provide a finite-sample error bound for this ratio function estimator that holds uniformly over the policy class, as an essential step to establish the regret bound for our learned policy. This ratio function estimator is different from those in most of the existing literature, such as liu2018breaking; uehara2019minimax; nachum2019dualdice; zhang2020gendice, which are obtained by min-max based estimating methods. For example, liu2018breaking aimed to estimate the ratio between the stationary distributions induced by a known, Markovian, time-stationary behavior policy and the target policy, which is then used to estimate the average reward of a given policy. This is not suitable for settings where the behavior policy is history-dependent or for observational studies. uehara2019minimax estimated the ratio, , based on the observation that for every state-action function ,
with the restriction that . Then they constructed their estimator by solving the empirical version of the following minmax optimization problem:
where is a simplex space and is a set of discriminator functions. This method minimizes an upper bound on the bias of their average reward estimator if the state-action value function is contained in . They proved consistency of their ratio and average reward estimators in the parametric setting, that is, where can be modelled parametrically and is a finite-dimensional space. Subsequently, zhang2020gendice developed a general min-max based estimator by considering variational divergence, which subsumes the case in uehara2019minimax. Unfortunately, no error bound guarantees are available for the ratio function estimators developed in the two cited papers. Our ratio estimator appears closely related to the estimator developed by nachum2019dualdice, as they also formulated the ratio estimator as a minimizer of a loss function. However, relying on Fenchel’s duality theorem, they still use a min-max based method to estimate the ratio. Furthermore, their method cannot be applied in average reward settings. Instead of using min-max based estimators, we propose in this section using coupled estimation. This facilitates the derivation of estimation error bounds, as will be seen below. We will derive the estimation error of the ratio function, which will enable us to provide a strong theoretical guarantee and finally demonstrate the efficiency of our average reward estimator without imposing restrictive parametric assumptions on the nuisance function estimators; see Section 5 below.
5 Theoretical Results
5.1 Regret bound
In this section, we provide a finite-sample bound on the regret of defined in (2.4), i.e., the difference between the optimal average reward in the policy class and the average reward of the estimated policy, .
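As a concrete reading of this definition, the sketch below computes the regret of a policy within a small class. Everything in it is a hypothetical illustration: the MDP, the finite four-element policy class (standing in for the compact parametric class considered in the paper), the helper `average_reward`, and the choice of `learned` as a placeholder for the output of some learning procedure.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
P = {  # P[(s, a)][s2] = probability of moving from state s to s2 under action a
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}  # mean rewards


def average_reward(policy, iters=2000):
    """Long-run average reward of a stationary policy, via power iteration."""
    d = [0.5, 0.5]  # state distribution, iterated towards stationarity
    for _ in range(iters):
        d = [sum(d[s] * policy[s][a] * P[(s, a)][s2]
                 for s in range(2) for a in range(2))
             for s2 in range(2)]
    return sum(d[s] * policy[s][a] * R[(s, a)]
               for s in range(2) for a in range(2))


# A small finite policy class; policy[s][a] is the probability of action a.
policy_class = {
    "always_0": [[1.0, 0.0], [1.0, 0.0]],
    "always_1": [[0.0, 1.0], [0.0, 1.0]],
    "uniform": [[0.5, 0.5], [0.5, 0.5]],
    "switch": [[0.0, 1.0], [1.0, 0.0]],
}

values = {name: average_reward(pi) for name, pi in policy_class.items()}
best = max(values, key=values.get)  # optimal policy within the class
learned = "uniform"                 # placeholder for an estimated policy
regret = values[best] - values[learned]
```

The regret is nonnegative by construction and is zero exactly when the learned policy attains the best average reward in the class; it is measured against the best policy in the class, not against an unrestricted optimum.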
Consider a state-action function, . Let be the identity operator, i.e., . Denote the conditional expectation operator by . Let the expectation under the stationary distribution induced by be . Denote by the total variation distance between two probability measures. For a function , define . For a set and , let be the class of bounded functions on such that . Denote by the covering number of a set of functions, , with respect to the norm, .
We make use of the following assumption on the policy class .
Assumption 3.
The policy class, , satisfies:


is compact and is finite.

There exists , such that for and , the following holds

There exist constants and , such that for every , the following hold for all :
(5.1) (5.2) 
Remark 4.
The Lipschitz property of the policy class in condition 2 is used to control the complexity of the nuisance functions induced by , that is, and . This is commonly assumed in finite-time horizon problems (e.g., zhou2017residual). Our analysis can be extended to more general policy classes if a similar complexity property holds for these two function classes. Intuitively, the constant in condition 3 relates to the “mixing time” of the Markov chain induced by . A similar assumption was used by van1998learning; liao2019off in the average reward setting.
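The mixing-time intuition can be illustrated numerically. The sketch below tracks the total variation distance between the t-step state distribution and the stationary distribution of a small chain and checks that it decays geometrically. The transition matrix `K`, the helper names, and the convention of taking total variation as half the L1 distance are assumptions made for illustration; whether geometric decay of this form matches the exact statement of (5.1) and (5.2) is likewise an assumption.

```python
# State transition matrix of a hypothetical two-state chain induced by some
# fixed policy; K[s][s2] is the probability of moving from s to s2.
K = [[0.55, 0.45],
     [0.35, 0.65]]


def step(dist):
    """One-step push-forward of a state distribution through K."""
    return [sum(dist[s] * K[s][s2] for s in range(2)) for s2 in range(2)]


def tv(p, q):
    """Total variation distance, using the half-L1 convention."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


stationary = [0.4375, 0.5625]  # solves d = d K for the matrix above
dist = [1.0, 0.0]              # start deterministically in state 0

distances = []
for t in range(1, 9):
    dist = step(dist)
    distances.append(tv(dist, stationary))

# Ratio of successive distances; constant when the decay is exactly geometric.
ratios = [distances[i + 1] / distances[i] for i in range(len(distances) - 1)]
```

For a two-state chain the decay rate equals the modulus of the second eigenvalue of `K`, here 0.55 - 0.35 = 0.2, so every entry of `ratios` is close to 0.2. A rate closer to one corresponds to a chain that mixes more slowly.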
Recall that we use the same pair of function classes in the coupled estimation for both and . We make use of the following assumptions on .
Assumption 4.
The function classes, , satisfy the following:


and

.

The regularization functionals, and , are pseudo-norms induced by the inner products and , respectively.

Let and . There exist and such that for any ,
Remark 5.
The boundedness assumptions on and are used to simplify the analysis and can be relaxed by truncating the estimators. We restrict for all because is used to model and , which by definition satisfy and . In Section 6, we show how to shape an arbitrary kernel function to ensure this is satisfied automatically when is an RKHS. The complexity condition 4 on and is satisfied for common function classes, for example RKHSs and Sobolev spaces (steinwart2008support; gyorfi2006distribution).
We now introduce the assumption used to bound the estimation error of the value function uniformly over the policy class. Define the projected Bellman error operator:
Assumption 5.
The triplet, , satisfies the following:


for and .

.

There exists , such that

There exist two constants such that holds for all , and .
Remark 6.
Note that in the coupled estimator of , we do not require the much stronger condition that the Bellman error for every tuple of is correctly modeled by . In other words, does not necessarily lie in . Instead, the combination of conditions 2 and 3 is enough to guarantee the consistency of the coupled estimator (recall that the Bellman error is zero at ). The last condition, 4, essentially requires that the transition matrix be sufficiently smooth so that the complexity of the projected Bellman error, , can be controlled by , the complexity of (see farahmand2016regularized for an example).
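The fact that the Bellman error vanishes at the truth can be checked directly in a tabular example. The sketch below assumes the standard average-reward Bellman equation for the relative value function, namely Q(s,a) = r(s,a) - eta + sum over (s',a') of P(s'|s,a) pi(a'|s') Q(s',a'); the MDP, the policy and the names `apply_P` and `bellman_error` are hypothetical, and no function class or projection is involved.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
P = {  # P[(s, a)][s2] = probability of moving from state s to s2 under action a
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}  # mean rewards
policy = [[0.2, 0.8], [0.7, 0.3]]  # policy[s][a]
SA = list(product(range(2), range(2)))


def apply_P(f):
    """(P f)(s,a) = sum_{s'} P(s'|s,a) sum_{a'} policy(a'|s') f(s',a')."""
    return {sa: sum(P[sa][s2] * policy[s2][a2] * f[(s2, a2)]
                    for (s2, a2) in SA)
            for sa in SA}


# Stationary state-action distribution and average reward, by power iteration.
d = {sa: 0.25 for sa in SA}
for _ in range(2000):
    d = {(s2, a2): sum(d[sa] * P[sa][s2] for sa in SA) * policy[s2][a2]
         for (s2, a2) in SA}
eta = sum(d[sa] * R[sa] for sa in SA)

# Relative value function as the series sum_{t >= 0} P^t (R - eta).
g = {sa: R[sa] - eta for sa in SA}
Q = {sa: 0.0 for sa in SA}
for _ in range(2000):
    Q = {sa: Q[sa] + g[sa] for sa in SA}
    g = apply_P(g)


def bellman_error(eta_val, q):
    """Pointwise residual of the assumed average-reward Bellman equation."""
    pq = apply_P(q)
    return {sa: R[sa] - eta_val + pq[sa] - q[sa] for sa in SA}


err_true = max(abs(v) for v in bellman_error(eta, Q).values())
err_wrong = max(abs(v) for v in bellman_error(eta + 0.1, Q).values())
```

Here `err_true` is zero up to numerical error, whereas shifting the average reward by 0.1 shifts every residual by the same amount, so `err_wrong` is close to 0.1. Adding a constant to `Q` leaves the residual unchanged, which is why the relative value function is only identified up to an additive constant in this formulation.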
A similar set of conditions is employed to bound the estimation error of the ratio function. For and , define the projected error:
Assumption 6.
The triplet, , satisfies the following:


For , , and .

, for .

There exists , such that .

There exist two constants such that holds for and .
Remark 7.
As in the estimation of the relative value function, we do not require the correct modelling of