Off-Policy Estimation of Long-Term Average Outcomes with Applications to Mobile Health

Peng Liao et al. · December 30, 2019

With the recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives, generally designed to impact a near-time, proximal outcome such as stress or physical activity. The mHealth intervention policies, often called Just-In-Time Adaptive Interventions, are decision rules that map a user's context to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies. In this paper, we provide an approach for conducting inference about the performance of one or more such policies. In particular, we estimate the performance of an mHealth policy using historical data collected under a possibly different policy. Our measure of performance is the average of proximal outcomes (rewards) over a long time period should the particular mHealth policy be followed. We provide a semi-parametric efficient estimator as well as confidence intervals. This work is motivated by HeartSteps, a mobile health physical activity intervention.


1 Introduction

With the recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives, i.e., in vivo. There is a wide variety of such treatments (behavioral, cognitive, motivational, reminders and so on), mostly in the form of messages that appear on a wearable or on a smartphone either as a notification or as a text message. The treatments are generally designed to impact a near-time, proximal outcome such as stress, or behaviors such as physical activity, over some subsequent minutes or hours. The mHealth intervention policies, often called Just-In-Time Adaptive Interventions in the mHealth literature (Nahum-Shani et al., 2017), are decision rules that map a user's context (e.g., the user's past behaviors, location, current time, social activity, stress and urges to smoke) to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies; see, for example, Kizakevich et al. (2014). In this paper we provide an approach for conducting inference about the performance of one or more such policies. In particular, we estimate the performance of an mHealth policy using historical data that are collected under a possibly different policy. Our measure of performance is the average of the proximal outcomes over a long time period should the particular mHealth policy be followed. We provide an estimator as well as confidence intervals.

This paper is motivated by HeartSteps (Klasnja et al., 2015), a mobile health physical activity intervention. To design this intervention we are conducting a series of studies. The first, already completed, study lasted 42 days. The last study will last one year. Here we focus on the intervention component involving contextually-tailored activity suggestion messages. These messages may be delivered at each of five user-specified times per day. While in the first study there are 210 time points per user (five per day over 42 days), in the year-long study there will be over 3,600 time points per user. The proximal outcome is the step count in the 30 minutes following each of the five times per day. Our goal is to use the data collected from the first 42-day study to estimate the long-term (here, approximately 3,600 time points) average of the post-decision 30-minute step counts for a variety of policies that could be used to decide whether or not to send the contextually-tailored activity suggestion. The 42-day study is a Micro-Randomized Trial (MRT) (Liao et al., 2016; Klasnja et al., 2015). In an MRT, a known stochastic policy, also called a behavior policy, is used to decide when and which type of treatment to provide at each time point. A partial list of MRTs in the field or completed can be found at the website of The Methodology Center (https://www.methodology.psu.edu/ra/adap-inter/mrt-projects). From an experimental point of view, the stochastic behavior policy is used to conduct sequential randomizations within each user. That is, the behavior policy sequentially randomizes each user among treatments. In this paper we focus on settings in which the behavior policy is known; this is the case with MRTs by design.

In the following two sections, we review Markov Decision Processes, a probabilistic model for sequential decision making, and related work. In Section 4 we present our method. The main theoretical results are presented in Section 5. A simulation study assessing the performance of the method is presented in Section 6. A case study using data from the 42-day MRT of HeartSteps is presented in Section 7. We end with a discussion of future work in Section 8.

2 Review of Markov Decision Process

We model the sequential decision-making process as a Markov Decision Process (MDP) involving the sequence $\{S_t, A_t\}_{t \geq 1}$, where $t$ indexes time, $S_t \in \mathcal{S}$ is the state at time $t$ and $A_t \in \mathcal{A}$ is the action selected at time $t$. We assume the action space $\mathcal{A}$ is finite and the data-generating process is Markovian, i.e., $\Pr(S_{t+1} \in \cdot \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_1, A_1) = \Pr(S_{t+1} \in \cdot \mid S_t, A_t)$, and this conditional distribution is time-invariant. In mHealth, the action is the treatment and the state, $S_t$, contains the time-varying contextual information (e.g., current activity, stress, location) and summaries of historical data up to and including time $t$ (e.g., a summary of previous physical activity). Denote the transition kernel by $P$; given a measurable subset $B$ of the state space $\mathcal{S}$, $P(s, a; B) = \Pr(S_{t+1} \in B \mid S_t = s, A_t = a)$. Denote by $p(s' \mid s, a)$ the transition density with respect to some reference measure on $\mathcal{S}$ (e.g., counting measure when $\mathcal{S}$ is discrete). The reward or (proximal) outcome, denoted by $R_{t+1}$, is a known function of $(S_t, A_t, S_{t+1})$. In the case of HeartSteps, $R_{t+1}$ is the log of the total step count in the 30 minutes following time $t$. We also use $r(s, a)$ to denote the conditional expectation of the reward given state and action, i.e., $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.

We assume the reward $R_{t+1}$ is bounded for all $t$ and that the action is determined by a stochastic behavior policy $\pi_b$. The behavior policy, $\pi_b$, is a mapping that takes the state as input and outputs a probability distribution on the action space $\mathcal{A}$; $\pi_b(a \mid s)$ is the probability of selecting action $a$ given state $s$. Note that this policy is time-invariant, as is the conditional distribution of $A_t$ given $S_t$. In this paper, we evaluate a policy using the long-term average reward. Specifically, the average reward of a policy $\pi$ is defined as

$$\eta^{\pi}(s) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R_{t+1} \,\middle|\, S_{1} = s\right], \tag{1}$$

where the expectation, $\mathbb{E}_{\pi}$, is taken over the trajectory in which the actions are selected according to the policy $\pi$, that is, the likelihood of the trajectory $\{S_1, A_1, S_2, A_2, \ldots\}$ in the expectation is given by $\prod_{t \geq 1} \pi(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)$, given the initial state $S_1 = s$. The policy $\pi$ induces a Markov chain on the states with transition kernel $P_{\pi}(s; B) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, P(s, a; B)$.

Suppose for now the state space is finite. It is well known (Puterman, 1994) that when the induced Markov chain is irreducible and aperiodic, the average reward in (1) exists and is independent of the initial state, i.e.,

$$\eta^{\pi}(s) = \eta^{\pi} = \sum_{s' \in \mathcal{S}} d^{\pi}(s') \sum_{a \in \mathcal{A}} \pi(a \mid s')\, r(s', a), \tag{2}$$

where $d^{\pi}$ is the stationary distribution of the induced Markov chain (the existence of $d^{\pi}$ is guaranteed by the irreducibility and aperiodicity of $P_{\pi}$ (Puterman, 1994)). Furthermore, we can define the (relative) value function

$$Q^{\pi}(s, a) = \sum_{t=1}^{\infty} \mathbb{E}_{\pi}\!\left[ R_{t+1} - \eta^{\pi} \,\middle|\, S_{1} = s,\, A_{1} = a \right]. \tag{3}$$

It is easy to verify from the definition that $(\eta^{\pi}, Q^{\pi})$ is a solution of the Bellman equations (also known as the Poisson equations) (Puterman, 1994), given as follows: for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,

$$Q(s, a) + \eta = r(s, a) + \mathbb{E}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') \,\middle|\, S_{t} = s,\, A_{t} = a\right]. \tag{4}$$

Furthermore, the Bellman equations uniquely identify the average reward and identify the relative value function up to a constant. That is, the set of solutions of (4) is given by $\{(\eta^{\pi}, Q^{\pi} + c) : c \in \mathbb{R}\}$; see Puterman (1994, p. 343) for details. The above results can be generalized to a general state space, e.g., $\mathcal{S} \subseteq \mathbb{R}^{d}$, with more involved conditions on the transition kernel $P$, analogous to irreducibility and aperiodicity in the finite-state case; see for example Hernández-Lerma & Lasserre (1999, chap. 7). The key requirement for the proposed method presented in Section 4 is that the average reward is a constant and can be uniquely identified by solving the Bellman equations (4). Motivated by our mHealth application, we consider in the Appendix a generalization that allows the average reward to depend on a time-invariant variable. In the case of mHealth, time-invariant variables might be gender, baseline severity, genetics and so on.
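To make these identities concrete, the following minimal Python sketch (not from the paper; the toy MDP, target policy and rewards are arbitrary illustrations) computes the average reward of a target policy for a small finite MDP in two ways: via the stationary distribution as in (2), and by solving the Bellman equations (4) together with a normalization that pins down the additive constant in the relative value function.

```python
import numpy as np

# Toy finite MDP, purely illustrative: 3 states, 2 actions.
rng = np.random.default_rng(0)
nS, nA = 3, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s'] = p(s' | s, a)
r = rng.uniform(0, 1, size=(nS, nA))                # r(s, a) = E[reward | s, a]
pi = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]) # target policy pi(a | s)

# Stationary distribution of the induced chain P_pi(s, s') = sum_a pi(a|s) P[s, a, s'].
P_pi = np.einsum("sa,sat->st", pi, P)
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d = d / d.sum()

# Average reward eta = sum_s d(s) sum_a pi(a|s) r(s, a), as in (2).
eta = float(d @ (pi * r).sum(axis=1))

# Relative value function from the Bellman equations (4), pinned down by Q(s*, a*) = 0
# with (s*, a*) = (0, 0). Unknowns: Q(s, a) for all (s, a), plus eta.
nSA = nS * nA
A_mat = np.zeros((nSA + 1, nSA + 1))
b = np.zeros(nSA + 1)
for s in range(nS):
    for a in range(nA):
        row = s * nA + a
        A_mat[row, row] += 1.0          # Q(s, a)
        A_mat[row, nSA] += 1.0          # + eta
        for s2 in range(nS):
            for a2 in range(nA):
                A_mat[row, s2 * nA + a2] -= P[s, a, s2] * pi[s2, a2]  # - E[sum_a' pi(a'|S') Q(S', a')]
        b[row] = r[s, a]
A_mat[nSA, 0] = 1.0                     # constraint Q(0, 0) = 0
sol = np.linalg.solve(A_mat, b)
print("eta via stationary distribution:", eta)
print("eta via Bellman equations      :", sol[-1])
```

The two printed values agree, illustrating that solving (4) subject to one linear normalization recovers the average reward without explicitly computing the stationary distribution.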

In this paper we develop methods to conduct inference about the average reward for pre-specified target policies, $\pi$'s. An alternate, and more common, evaluation measure is based on an expected discounted sum of rewards, $\mathbb{E}_{\pi}\big[\sum_{t \geq 1} \gamma^{t-1} R_{t+1} \mid S_1 = s\big]$, with discount rate $\gamma \in [0, 1)$. However, note that even with a discount rate as large as $\gamma = 0.99$, the reward at time $t = 200$ receives a weight of approximately $0.99^{200} \approx 0.13$ and the reward at time $t = 500$ receives a weight of less than $0.01$. Recall our motivating mHealth intervention is being designed to be used for one year with over 3,600 time points. With this in mind, we opt for the long-term average of rewards. Since, as the discount rate $\gamma \to 1$, the conditional expectation of the sum of discounted rewards scaled by the normalization constant $(1 - \gamma)$ converges to the average reward (Sutton et al., 1998), and because the Bellman operator in the discounted setting is a contraction (Sutton et al., 1998), researchers often focus on a discounted sum of rewards to ensure online computational stability and to simplify the associated convergence arguments. However, as we shall see below, consideration of the average reward is not problematic in the batch (i.e., off-line) setting.

3 Related Work

The evaluation of a given target policy using data collected under a different policy (the behavior policy) is called off-policy evaluation. This has been widely studied in both the statistical and reinforcement learning (RL) literature. Many authors evaluate and contrast policies in terms of the expected average of rewards over a finite number of time points (Murphy et al., 2001; Chakraborty & Moodie, 2013; Jiang & Li, 2015). However, because these methods use products of weights with probabilities from the behavior policy in the denominator, their extension to problems with a large number of time points often suffers from large variance (Thomas & Brunskill, 2016; Jiang & Li, 2015).

The most common off-policy evaluation methods for infinite-horizon problems (e.g., a large number of time points) focus on a discounted sum of rewards and are thus based in some way on the value function (in the discounted reward setting, the expected discounted sum of rewards considered as a function of the initial state is the value function). For example, in Farahmand et al. (2016), a regularized version of Least Squares Temporal Difference learning (Bradtke et al., 1996) was proposed and its statistical properties studied. They used a non-parametric model to estimate the value function and derived the convergence rate when the training data consist of i.i.d. transition samples, each consisting of a current state, action and next state. From a technical point of view our estimation method is similar to Farahmand et al. (2016), albeit focused on the average reward; most importantly, our method relaxes the assumption that the Bellman operator can be modeled correctly for each candidate relative value function and only assumes the data consist of i.i.d. samples of trajectories.

Luckett et al. (2019) also focused on the discounted reward setting. They evaluated policies based on an average of the discounted value function with respect to a pre-selected reference distribution for the initial state. While the reference distribution can be naturally chosen as the distribution of the initial state (Liu et al., 2018; Luckett et al., 2019; Thomas & Brunskill, 2016), choosing a "right" discount rate, $\gamma$, can be non-trivial, at least in mHealth. They assumed a parametric model for the value function and developed regularized estimating equations.

Closest to the setting of this paper is the recent work by Liu et al. (2018). They proposed an estimator of the average reward for a target policy. This estimator involves estimating the ratio of the stationary distribution under the target policy to the stationary distribution under the behavior policy. However, they do not provide confidence intervals or other inferential methods beyond the point estimator of the average reward. In addition, they assume that the behavior policy is Markovian and time-stationary. In mHealth the behavior policy can be determined by an algorithm based on accruing data and thus violates this assumption (Liao et al., 2018; Dempsey et al., 2017).

4 Estimator for Off-Policy Evaluation

We consider the setting where we have observations of $n$ independent, identically distributed trajectories:

$$\mathcal{D} = \big\{ S_{i,1}, A_{i,1}, S_{i,2}, A_{i,2}, \ldots, S_{i,T}, A_{i,T}, S_{i,T+1} \big\}_{i=1}^{n}.$$

We assume the length of the trajectory, $T$, is non-random and identical for each trajectory for simplicity. Each trajectory is assumed to follow an MDP in which the actions are selected according to a behavior policy $\pi_b$, i.e., $A_t \sim \pi_b(\cdot \mid H_t)$, where $H_t$ is the history collected up to time $t$. Throughout we assume that the behavior policy is known, as would be the case for an MRT, and furthermore that the behavior policy results in strictly positive randomization probabilities, i.e., $\pi_b(a \mid h) > 0$ for all actions $a$ and histories $h$. In the following, an expectation without a subscript is taken with respect to the distribution of the trajectory with the actions selected by the behavior policy $\pi_b$. For a state-action function $f$ and a distribution $\nu$ on the state-action space, denote the $L_2(\nu)$ norm by $\|f\|_{\nu}^{2} = \int f(s, a)^{2}\, d\nu(s, a)$. If the norm does not have a subscript, then the expectation is with respect to the data distribution, that is, $\|f\|^{2} = \mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} f(S_t, A_t)^{2}\big]$.

Let $\pi$ be some target policy, possibly different from the behavior policy in the training data. Throughout we only consider target policies that are Markovian (i.e., $\pi$ depends only on the current state) and time-invariant (i.e., the mapping does not vary with time). Our goal is to learn $\eta^{\pi}$, the average reward of the target policy. We assume the following regularity condition.

Assumption 1.

The average reward of the target policy $\pi$ is a constant $\eta^{\pi}$, independent of the initial state, and satisfies (2). $(\eta^{\pi}, Q^{\pi})$ is the unique solution of the Bellman equations (4) for $\pi$, up to an additive constant on the relative value function. The stationary distribution of the induced transition kernel $P_{\pi}$ exists.

Recall from Section 2 that this assumption holds automatically if the state space is finite and the transition probability matrix induced by the target policy is irreducible and aperiodic. Note that the above assumption does not allow for an average reward that might differ by user demographics (gender and so on); we develop such an extension in the Appendix. We consider a model-free approach to estimating the average reward based on the Bellman equations (4). Recall that the Bellman equations only identify the relative value function up to a constant. As the focus of this paper is to estimate the average reward, we only need to estimate one specific version of $Q^{\pi}$. Define the shifted relative value function by $\tilde{Q}^{\pi}(s, a) = Q^{\pi}(s, a) - Q^{\pi}(s^{*}, a^{*})$ for a specific, fixed state-action pair $(s^{*}, a^{*})$. Obviously $\tilde{Q}^{\pi}(s^{*}, a^{*}) = 0$ and $\tilde{Q}^{\pi}(s, a) - \tilde{Q}^{\pi}(s', a') = Q^{\pi}(s, a) - Q^{\pi}(s', a')$, i.e., differences in the relative value remain the same. By restricting the function class such that every candidate $Q$ satisfies $Q(s^{*}, a^{*}) = 0$, the solution of the Bellman equations (4) is unique and given by $(\eta^{\pi}, \tilde{Q}^{\pi})$.
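As a quick check of the constant-shift invariance underlying this normalization: if $(\eta, Q)$ solves (4), then so does $(\eta, Q + c)$ for any constant $c$, since

$$\{Q(s, a) + c\} + \eta = r(s, a) + \mathbb{E}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, \{Q(S_{t+1}, a') + c\} \,\middle|\, S_{t} = s,\, A_{t} = a\right]$$

reduces to (4) because $\sum_{a'} \pi(a' \mid s')\, c = c$ adds the same constant to both sides. Pinning $Q(s^{*}, a^{*}) = 0$ removes exactly this one degree of freedom.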

In the following, we use $\mathcal{Q}$ to denote a vector space of functions on the state-action space $\mathcal{S} \times \mathcal{A}$ such that $Q(s^{*}, a^{*}) = 0$ for all $Q \in \mathcal{Q}$, and we will assume $\tilde{Q}^{\pi} \in \mathcal{Q}$. The Bellman operator with respect to the target policy $\pi$ is given by

$$(\mathcal{T}^{\pi} Q)(s, a) = r(s, a) + \mathbb{E}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') \,\middle|\, S_{t} = s,\, A_{t} = a\right]. \tag{5}$$

Note that the above expectation does not depend on the behavior policy, as we condition on the current state and action. The Bellman error at $(\eta, Q)$ with respect to $\pi$ is defined as $\mathcal{T}^{\pi} Q - Q - \eta$. This error is zero at $(\eta^{\pi}, \tilde{Q}^{\pi})$: from the Bellman equations (4) we have $\mathcal{T}^{\pi}\tilde{Q}^{\pi}(s, a) - \tilde{Q}^{\pi}(s, a) - \eta^{\pi} = 0$ for all states and actions. Note that, even when $Q \in \mathcal{Q}$, the output of the Bellman operator is not necessarily in the function space $\mathcal{Q}$. To see this, use the fact that $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ to obtain

$$(\mathcal{T}^{\pi} Q)(s, a) = \mathbb{E}\!\left[ R_{t+1} + \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') \,\middle|\, S_{t} = s,\, A_{t} = a \right].$$

Depending on the complexity of the transition kernel, the last term, the conditional expectation of $\sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a')$, is unlikely to be in $\mathcal{Q}$ for every $Q \in \mathcal{Q}$. Assuming that it is in $\mathcal{Q}$ for every $Q \in \mathcal{Q}$ is much stronger than requiring $\tilde{Q}^{\pi} \in \mathcal{Q}$. Instead, as we will see, we only need to form an approximation to the Bellman error. We introduce a second function class $\mathcal{G}$ to form a projected error:

$$g^{*}_{\eta, Q} = \arg\min_{g \in \mathcal{G}} \big\| \mathcal{T}^{\pi} Q - Q - \eta - g \big\|^{2}, \tag{6}$$

where the norm is with respect to the data distribution defined above.

In practice, one can use the same function class for both $\mathcal{Q}$ and $\mathcal{G}$, except that we require members of $\mathcal{Q}$ to satisfy $Q(s^{*}, a^{*}) = 0$; a similar constraint need not be placed on the members of $\mathcal{G}$. In addition, we do not assume that the Bellman error $\mathcal{T}^{\pi} Q - Q - \eta$ is in $\mathcal{G}$. The key reason why the projected Bellman error from (6) allows us to identify $(\eta^{\pi}, \tilde{Q}^{\pi})$ is that $g^{*}_{\eta^{\pi}, \tilde{Q}^{\pi}} = 0$.
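To spell this out: because the Bellman error vanishes identically at $(\eta^{\pi}, \tilde{Q}^{\pi})$, its projection onto $\mathcal{G}$ (which contains the zero function, e.g., when $\mathcal{G}$ is an RKHS) is zero,

$$\mathcal{T}^{\pi}\tilde{Q}^{\pi} - \tilde{Q}^{\pi} - \eta^{\pi} = 0 \quad \Longrightarrow \quad g^{*}_{\eta^{\pi}, \tilde{Q}^{\pi}} = \arg\min_{g \in \mathcal{G}} \|0 - g\|^{2} = 0.$$

The converse direction, that a small projected error forces $(\eta, Q)$ to be close to $(\eta^{\pi}, \tilde{Q}^{\pi})$, is what Assumption 4 in Section 5 is designed to guarantee.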

We now construct an estimator for $\eta^{\pi}$. In what follows, for ease of notation we use $\hat{Q}^{\pi}$ to denote the estimate of $\tilde{Q}^{\pi}$, the shifted relative value function. For an arbitrary function $f$ of a trajectory, let $\mathbb{E}_{n}[f]$ denote the empirical mean of $f$ over the $n$ trajectories in the training data $\mathcal{D}$. Recall that the Bellman operator in (5) is the conditional expectation of $R_{t+1} + \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a')$ given $(S_t, A_t)$. The estimator is found by simultaneously solving the two coupled minimizations

$$\hat{g}_{\eta, Q} = \arg\min_{g \in \mathcal{G}} \left\{ \mathbb{E}_{n}\!\left[ \frac{1}{T}\sum_{t=1}^{T} \Big( R_{t+1} + \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') - Q(S_{t}, A_{t}) - \eta - g(S_{t}, A_{t}) \Big)^{2} \right] + \mu_{n} J_{\mathcal{G}}^{2}(g) \right\} \tag{7}$$

and

$$(\hat{\eta}^{\pi}, \hat{Q}^{\pi}) = \arg\min_{\eta \in \mathbb{R},\, Q \in \mathcal{Q}} \left\{ \mathbb{E}_{n}\!\left[ \frac{1}{T}\sum_{t=1}^{T} \hat{g}_{\eta, Q}(S_{t}, A_{t})^{2} \right] + \lambda_{n} J_{\mathcal{Q}}^{2}(Q) \right\}, \tag{8}$$

where $J_{\mathcal{Q}}$ and $J_{\mathcal{G}}$ are regularizers and $\lambda_{n}$ and $\mu_{n}$ are tuning parameters. The penalty term $\lambda_{n} J_{\mathcal{Q}}^{2}(Q)$ is used to balance between the model fit, i.e., the squared estimated Bellman error, and the complexity of the relative value function, measured by $J_{\mathcal{Q}}(Q)$. Similarly, $\mu_{n} J_{\mathcal{G}}^{2}(g)$ is used to control overfitting in estimating the Bellman error when the space $\mathcal{G}$ is complex. In the case where the function space is an $m$-th order Sobolev space, the regularizer is typically defined via the $m$-th order derivatives to capture the smoothness of the function. In the case where the function space is a Reproducing Kernel Hilbert Space (RKHS), the regularizer is the endowed norm. In Supplement LABEL:appendix:_calculation, we provide a closed-form solution of the estimator when both $\mathcal{Q}$ and $\mathcal{G}$ are RKHSs.
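To make the two coupled minimizations concrete, below is a small, self-contained numerical sketch of one way to carry them out in closed form when $\mathcal{Q}$ and $\mathcal{G}$ are RBF-kernel RKHSs. The synthetic data, bandwidths, the particular zeroed-kernel construction enforcing $Q(s^{*}, a^{*}) = 0$, the expansion over the observed state-action pairs and the jitter term are all illustrative assumptions; the paper's exact closed-form expressions are in its supplement.

```python
import numpy as np

# Sketch of (7)-(8) with RBF-kernel RKHSs. Assumptions: binary action, a representer-style
# expansion of Q over the observed (S_t, A_t) pairs, and an arbitrary synthetic dataset.
rng = np.random.default_rng(1)

def rbf(X, Y, bw=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def featurize(S, A):
    # state concatenated with a one-hot encoding of the binary action
    return np.column_stack([S, A, 1 - A])

N, p = 200, 2
S = rng.normal(size=(N, p))
A = rng.integers(0, 2, size=N)
S_next = 0.5 * S + 0.2 * A[:, None] + 0.3 * rng.normal(size=(N, p))
R = S_next[:, 0] - 0.1 * A + 0.1 * rng.normal(size=N)

pi_treat = lambda s: 1.0                               # target policy: "always treat", pi(1|s) = 1
x_star = featurize(np.zeros((1, p)), np.array([0]))    # reference pair (s*, a*)
X = featurize(S, A)
X1 = featurize(S_next, np.ones(N, dtype=int))          # next state paired with a' = 1
X0 = featurize(S_next, np.zeros(N, dtype=int))         # next state paired with a' = 0

def k_zero(Xa, Xb):
    # kernel whose RKHS members vanish at x*, so Q(s*, a*) = 0 by construction
    return rbf(Xa, Xb) - rbf(Xa, x_star) @ rbf(x_star, Xb) / rbf(x_star, x_star)

KQ = k_zero(X, X)                                      # Q evaluated at (S_t, A_t)
pw = np.array([pi_treat(s) for s in S_next])           # pi(1 | S_{t+1})
M = pw[:, None] * k_zero(X1, X) + (1 - pw)[:, None] * k_zero(X0, X)
KG = rbf(X, X)
lam, mu = 1e-3, 1e-3                                   # tuning parameters (lambda_n, mu_n)

# Inner problem (7): kernel ridge regression of the TD target on (S_t, A_t);
# S_mat maps a TD-target vector to the fitted projected Bellman error g_hat.
S_mat = KG @ np.linalg.solve(KG + N * mu * np.eye(N), np.eye(N))

# Outer problem (8): the TD target is affine in theta = (eta, alpha),
#   y(theta) = R + [-1, M - KQ] theta, so the penalized objective is quadratic in theta.
B = np.column_stack([-np.ones(N), M - KQ])
D = np.zeros((N + 1, N + 1)); D[1:, 1:] = KQ
SB, SR = S_mat @ B, S_mat @ R
theta = np.linalg.solve(SB.T @ SB / N + lam * D + 1e-8 * np.eye(N + 1), -SB.T @ SR / N)
print("estimated average reward eta_hat:", theta[0])
```

The key design point mirrored here is that, for fixed $(\eta, Q)$, the inner problem is ordinary kernel ridge regression, and plugging its linear smoother back into (8) leaves an objective that is quadratic in $\eta$ and in the expansion coefficients of $Q$, hence solvable by one linear system.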

5 Theoretical Results

In this section, we derive the global rate of convergence for $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$ in (8) and derive the asymptotic distribution of $\hat{\eta}^{\pi}$. We start with two standard assumptions used in the non-parametric regression literature (Györfi et al., 2006).

Assumption 2.

The reward is uniformly bounded: $|R_{t+1}| \leq R_{\max}$ almost surely for all $t$. The shifted relative value function is bounded: $|\tilde{Q}^{\pi}(s, a)| \leq Q_{\max}$ for all $(s, a)$.

In the above, recall that the reward $R_{t+1}$ is a known function of $(S_t, A_t, S_{t+1})$.

Assumption 3.

The function class $\mathcal{Q}$ satisfies:

(i) $Q(s^{*}, a^{*}) = 0$ and $\sup_{s, a} |Q(s, a)| \leq Q_{\max}$ for all $Q \in \mathcal{Q}$, and

(ii) $\tilde{Q}^{\pi} \in \mathcal{Q}$, where $\tilde{Q}^{\pi}$ is the shifted relative value function.

The assumption of a bounded reward is mainly to simplify the proof and can be relaxed to the sub-Gaussian case, e.g., the error $R_{t+1} - r(S_t, A_t)$ is sub-Gaussian for all $t$. The boundedness assumption on the relative value function can be ensured by assuming certain smoothness conditions on the transition distribution (Ortner & Ryabko, 2012) or by assuming geometric convergence to the stationary distribution (Hernández-Lerma & Lasserre, 1999). The boundedness assumption on the members of $\mathcal{Q}$ is used to simplify the proof; a truncation argument can be used to avoid this assumption.

Unlike in non-parametric regression, in which the dependent variable is fully observed, the "dependent variable" here is the Bellman operator applied to the unknown relative value function, which involves a conditional expectation of $\tilde{Q}^{\pi}$ at the next state. This motivates the two coupled minimizations in (7) and (8), and we use additional assumptions to manage this. Recall that in (6) the Bellman error is approximated by its projection $g^{*}_{\eta, Q}$ onto $\mathcal{G}$. We make the following assumption about the complexity of $\mathcal{G}$ to ensure that the estimator is consistent.

Assumption 4.

The function class $\mathcal{G}$ satisfies:

(i) $\sup_{s, a} |g(s, a)| \leq G_{\max}$ for all $g \in \mathcal{G}$, and

(ii) $\kappa := \inf \big\{ \|g^{*}_{\eta, Q}\|^{2} / \|\mathcal{T}^{\pi} Q - Q - \eta\|^{2} \big\} > 0$, where $g^{*}_{\eta, Q}$ is given in (6) and the infimum is over $Q \in \mathcal{Q}$ and $|\eta| \leq R_{\max}$ with $\mathcal{T}^{\pi} Q - Q - \eta \neq 0$.

In the above, recall that for a state-action function $f$, $\|f\|^{2} = \mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} f(S_t, A_t)^{2}\big]$. The value of $\kappa$ measures how well the function class $\mathcal{G}$ approximates the Bellman error over all $(\eta, Q)$ with $Q \in \mathcal{Q}$ and $|\eta| \leq R_{\max}$. The condition that $\kappa$ is strictly positive ensures that the estimator (8), based on minimizing the projected Bellman error onto the space $\mathcal{G}$, is able to identify the true parameters $(\eta^{\pi}, \tilde{Q}^{\pi})$. This is similar to the eigenvalue condition (Assumption 5) in Luckett et al. (2019), but they essentially use the same function class for $\mathcal{Q}$ and $\mathcal{G}$. Note that, unlike in Farahmand et al. (2016), here we do not assume $\mathcal{T}^{\pi} Q - Q - \eta \in \mathcal{G}$ for every $Q \in \mathcal{Q}$. If this were the case, then we would have $g^{*}_{\eta, Q} = \mathcal{T}^{\pi} Q - Q - \eta$ and thus $\kappa = 1$.

Below we make assumptions on the complexity of the function classes $\mathcal{Q}$ and $\mathcal{G}$. These assumptions are satisfied for common function classes, for example RKHSs and Sobolev spaces (Van de Geer, 2000; Zhao et al., 2016; Steinwart & Christmann, 2008; Györfi et al., 2006).

Assumption 5.

(i) The regularization functionals $J_{\mathcal{Q}}$ and $J_{\mathcal{G}}$ are pseudo-norms induced by inner products on $\mathcal{Q}$ and $\mathcal{G}$, respectively. For all $Q \in \mathcal{Q}$ and $|\eta| \leq R_{\max}$, there exist two constants $c_1, c_2 > 0$ such that $J_{\mathcal{G}}(g^{*}_{\eta, Q}) \leq c_1 J_{\mathcal{Q}}(Q) + c_2$.

(ii) Let $\mathcal{Q}_M = \{Q \in \mathcal{Q} : J_{\mathcal{Q}}(Q) \leq M\}$ and $\mathcal{G}_M = \{g \in \mathcal{G} : J_{\mathcal{G}}(g) \leq M\}$. There exist constants $C > 0$ and $\alpha \in (0, 1)$ such that, for any $\epsilon, M > 0$, the $\epsilon$-entropy (log covering number) of $\mathcal{Q}_M$ and of $\mathcal{G}_M$ with respect to the supremum norm is bounded by $C (M/\epsilon)^{2\alpha}$.

The upper bound on $J_{\mathcal{G}}(g^{*}_{\eta, Q})$ in (i) is realistic when the transition model is sufficiently smooth; see Farahmand et al. (2016) for an example of an MDP satisfying this type of condition. We use a common entropy exponent $\alpha$ for both $\mathcal{Q}$ and $\mathcal{G}$ in (ii) to simplify the proof.

Now we are ready to state the theorem on the convergence rate for $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$ in terms of the Bellman error.

Theorem 1 (Global Convergence Rate).

Suppose Assumptions 1-5 hold. Let $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$ be the estimator defined in (8). Then, for tuning parameters $(\lambda_n, \mu_n)$ satisfying rate conditions (i) and (ii), we have

$$\big\| \mathcal{T}^{\pi}\hat{Q}^{\pi} - \hat{Q}^{\pi} - \hat{\eta}^{\pi} \big\|^{2} = O_{p}\!\left( \lambda_n + \mu_n \right).$$

In Lemma LABEL:lemma:_upper_bound_of_eta,_Q in Supplement LABEL:appendix:_proof_of_thm_1, we show that $|\hat{\eta}^{\pi} - \eta^{\pi}|^{2}$ is bounded, up to a constant, by the squared Bellman error at $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$, and thus when this error converges to zero we see that $\hat{\eta}^{\pi}$ is a consistent estimate of $\eta^{\pi}$. When the tuning parameters are chosen at appropriate rates (depending on the entropy exponent $\alpha$ in Assumption 5), the Bellman error at $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$, i.e., $\|\mathcal{T}^{\pi}\hat{Q}^{\pi} - \hat{Q}^{\pi} - \hat{\eta}^{\pi}\|$, has the optimal rate of convergence. The proof of Theorem 1 is provided in Supplement LABEL:appendix:_proof_of_thm_1.

In what follows, we derive the asymptotic distribution of the estimator of the average reward. In particular, we show that under certain assumptions the estimator $\hat{\eta}^{\pi}$ from (8) is $\sqrt{nT}$-consistent and asymptotically normal when the tuning parameters are chosen appropriately. Let $d^{\pi}(s, a)$ be the density of the stationary distribution of the state-action pair under the target policy $\pi$, and let $\bar{d}_{b}(s, a)$ be the density of the distribution of the state-action pairs under the behavior policy $\pi_b$, averaged across the decision times $t = 1, \ldots, T$.

We introduce some additional notation. Motivated by the least favorable direction in partially linear regression problems (Van de Geer, 2000; Zhao et al., 2016), we define the direction $\omega^{*}$ by

$$\omega^{*}(s, a) = \frac{ d^{\pi}(s, a) / \bar{d}_{b}(s, a) }{ \mathbb{E}_{d^{\pi}}\!\big[ d^{\pi}(S, A) / \bar{d}_{b}(S, A) \big] }. \tag{9}$$

The above-defined direction is used to control the bias caused by the penalization of the non-parametric relative value function in the estimator (8). This is akin to the partially linear regression problem, in which the analogous direction is built from the conditional expectation of the parametric-component covariate given the non-parametric component; see Van de Geer (2000). In our setting, the direction satisfies the following orthogonality property: for any state-action function $f$,

$$\mathbb{E}\!\left[ \frac{1}{T}\sum_{t=1}^{T} \omega^{*}(S_{t}, A_{t}) \Big\{ f(S_{t}, A_{t}) - \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, f(S_{t+1}, a') \Big\} \right] = 0, \tag{10}$$

by noting that $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T}\omega^{*}(S_t, A_t)\, h(S_t, A_t)\big]$ is proportional to $\mathbb{E}_{d^{\pi}}[h(S, A)]$ for any state-action function $h$ and that $d^{\pi}$ is stationary under the transition kernel induced by $\pi$. Note that $\omega^{*}$ is a scaled version of $d^{\pi}/\bar{d}_{b}$, the ratio between the stationary distribution under the target policy and the state-action distribution in the trajectory data. The denominator in (9) is the expectation of the ratio under the stationary distribution and it is at least 1:

$$\mathbb{E}_{d^{\pi}}\!\left[ \frac{d^{\pi}(S, A)}{\bar{d}_{b}(S, A)} \right] = \mathbb{E}_{\bar{d}_{b}}\!\left[ \left( \frac{d^{\pi}(S, A)}{\bar{d}_{b}(S, A)} \right)^{\!2} \right] = 1 + \mathrm{Var}\!\left( \frac{d^{\pi}(S, A)}{\bar{d}_{b}(S, A)} \right) \geq 1,$$

since in the last equality the variance is taken with respect to the distribution of state-action pairs collected under the behavior policy and $\mathbb{E}_{\bar{d}_{b}}\big[ d^{\pi}(S, A)/\bar{d}_{b}(S, A) \big] = 1$.

Next define $U^{*}(s, a) = \sum_{t=1}^{\infty} \mathbb{E}_{\pi}\big[ \omega^{*}(S_t, A_t) - 1 \mid S_1 = s, A_1 = a \big]$. Note that $U^{*}$ has a structure similar to that of the relative value function in (3), with the reward at time $t$ being $\omega^{*}(S_t, A_t)$ and the average reward being $\mathbb{E}_{d^{\pi}}[\omega^{*}(S, A)]$ (which equals 1 by the definition of $\omega^{*}$). Similarly, $U^{*}$ satisfies a Bellman-like equation:

$$U^{*}(s, a) + 1 = \omega^{*}(s, a) + \mathbb{E}\!\left[ \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, U^{*}(S_{t+1}, a') \,\middle|\, S_{t} = s,\, A_{t} = a \right]. \tag{11}$$

We make the following smoothness assumption about $U^{*}$ and $\omega^{*}$, akin to the assumptions used in the partially linear regression literature (Van de Geer, 2000; Zhao et al., 2016).

Assumption 6.

The shifted function $\tilde{U}^{*} := U^{*} - U^{*}(s^{*}, a^{*})$ belongs to $\mathcal{Q}$, and $\omega^{*} \in \mathcal{G}$.

The condition $\tilde{U}^{*} \in \mathcal{Q}$ is imposed to ensure the rate of convergence and the asymptotic normality of $\hat{\eta}^{\pi}$. On the other hand, unlike in the regression setting, we further assume that $\omega^{*}$ is smooth enough (i.e., $\omega^{*} \in \mathcal{G}$) because the conditional expectation defining the Bellman operator is unknown. Note that this assumption (i.e., $\omega^{*} \in \mathcal{G}$) would hold automatically if we assumed the much stronger condition that $\mathcal{T}^{\pi} Q - Q - \eta \in \mathcal{G}$ for every $Q \in \mathcal{Q}$ (i.e., $\kappa = 1$ in Assumption 4).

Our last assumption is a contraction-type property. This is imposed to control the variance of the remainder term caused by the nuisance function estimation.

Assumption 7.

Denote by $\mathcal{P}^{\pi}$ the conditional expectation operator, $(\mathcal{P}^{\pi} f)(s, a) = \mathbb{E}\big[\sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1}) f(S_{t+1}, a') \mid S_t = s, A_t = a\big]$, and by $\mathbb{E}_{d^{\pi}}[f]$ the expectation of a state-action function $f$ under the stationary distribution induced by $\pi$. There exist constants $k \geq 1$ and $\rho \in (0, 1)$ such that, for every bounded state-action function $f$ and every $m \geq 1$,

$$\big\| (\mathcal{P}^{\pi})^{m} f - \mathbb{E}_{d^{\pi}}[f] \big\|_{\infty} \leq k\, \rho^{m} \big\| f - \mathbb{E}_{d^{\pi}}[f] \big\|_{\infty}. \tag{12}$$

The parameter $\rho$ is akin to the discount factor in the discounted reward setting. Intuitively, it is related to the "mixing rate" of the Markov chain induced by the target policy $\pi$. A similar assumption was considered in Van Roy (1998) (Assumption 7.2 on p. 99). Now we are ready to present the main result of this paper, the asymptotic normality of the proposed estimator.

Theorem 2 (Asymptotic Normality).

Suppose the conditions in Theorem 1 hold. In addition, suppose Assumptions 6 and 7 hold and the tuning parameters are chosen appropriately. Then the estimator $\hat{\eta}^{\pi}$ from (7) and (8) is $\sqrt{nT}$-consistent and asymptotically normal: $\sqrt{nT}\,(\hat{\eta}^{\pi} - \eta^{\pi}) \Rightarrow \mathcal{N}(0, \sigma^{2})$, where

$$\sigma^{2} = \mathbb{E}\!\left[ \frac{1}{T}\sum_{t=1}^{T} \left\{ \frac{d^{\pi}(S_t, A_t)}{\bar{d}_{b}(S_t, A_t)} \Big( R_{t+1} + \sum_{a' \in \mathcal{A}} \pi(a' \mid S_{t+1})\, Q^{\pi}(S_{t+1}, a') - Q^{\pi}(S_t, A_t) - \eta^{\pi} \Big) \right\}^{2} \right].$$

Furthermore, $\sigma^{2}$ matches the semi-parametric efficiency bound for estimating $\eta^{\pi}$.

From Theorem 2, we see that the variance in estimating the average reward depends on the ratio between the stationary state-action distribution induced by the target policy and the average state-action distribution in the training data. The closer these two distributions are, the smaller the variance of the estimated average reward. In the special case where the target policy is the same as the behavior policy (i.e., on-policy evaluation) and the states in the training data follow the stationary distribution (e.g., when the length of the trajectory is sufficiently large), one should expect the smallest variance. A similar density ratio is used in Liu et al. (2018) to build an estimator of the average reward. Although here we only focus on the asymptotic properties for large $n$ (the number of i.i.d. trajectories), one can see that increasing the length of the trajectory, $T$, also reduces the variance, as the variance involves an average over decision time points. So far we have only considered evaluating a single target policy. The method can also be used to evaluate and contrast multiple policies. In the Appendix, we develop such an extension and derive the asymptotic joint distribution of the estimated average rewards. Our proposed estimator also achieves the semi-parametric efficiency bound. To the best of our knowledge, this is the first semi-parametric efficient estimator of the long-term average reward.

To construct a confidence interval for $\eta^{\pi}$, we need to estimate the asymptotic variance $\sigma^{2}$. This can be done by plugging in $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$ and an estimate of the ratio $d^{\pi}/\bar{d}_{b}$. Taking the expectation of both sides of (9) with respect to the data distribution gives $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T}\omega^{*}(S_t, A_t)\big] = 1 \big/ \mathbb{E}_{d^{\pi}}\big[d^{\pi}(S, A)/\bar{d}_{b}(S, A)\big]$. Given an estimator $\hat{\omega}$ of $\omega^{*}$, we then estimate the ratio by $\hat{\omega} \big/ \mathbb{E}_{n}\big[\frac{1}{T}\sum_{t=1}^{T}\hat{\omega}(S_t, A_t)\big]$. Motivated by the orthogonality (10) and expression (11), we estimate $\omega^{*}$ by a penalized minimization analogous to (7) and (8), with its own pair of tuning parameters. It is easy to verify that this yields a consistent estimator of $\sigma^{2}$ under the assumptions listed in Theorem 2. In Supplement LABEL:appendix:_calculation, we provide a closed-form solution for the estimator of the asymptotic variance when $\mathcal{Q}$ and $\mathcal{G}$ are RKHSs.
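Assuming the plug-in form of $\sigma^{2}$ displayed in Theorem 2, a minimal sketch of the resulting confidence interval computation is below. The arrays `ratio_hat` and `td_hat` are hypothetical containers for the estimated density ratio and the estimated temporal-difference error at each decision time of each trajectory.

```python
import numpy as np

# Sketch of a plug-in variance and 95% confidence interval, assuming sigma^2 has the
# form displayed in Theorem 2. ratio_hat[i, t]: estimated d^pi / d_bar_b at decision
# time t of trajectory i; td_hat[i, t]: estimated TD error
#   R + sum_a' pi(a'|S') Q_hat(S', a') - Q_hat(S, A) - eta_hat.
def confidence_interval(eta_hat, ratio_hat, td_hat, z=1.96):
    n, T = td_hat.shape
    sigma2_hat = np.mean(np.mean((ratio_hat * td_hat) ** 2, axis=1))  # E_n[(1/T) sum_t {.}^2]
    half_width = z * np.sqrt(sigma2_hat / (n * T))
    return eta_hat - half_width, eta_hat + half_width
```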

6 Simulation

In this section, we evaluate the performance of the proposed method using simulated data. We first describe the generative model. The state $S_t = (S_{t,1}, S_{t,2})$ is two-dimensional and the action $A_t$ is binary: treat ($A_t = 1$) versus nothing ($A_t = 0$). Given the current state $S_t$ and action $A_t$, the next state $S_{t+1}$ is generated from a Markov transition model with Gaussian noise. The two state components mimic the treatment burden and the user's self-efficacy, in that sending treatment ($A_t = 1$) increases the burden component and decreases the self-efficacy component of the next state. The reward $R_{t+1}$ is a known function of $(S_t, A_t, S_{t+1})$. The behavior policy is to choose $A_t = 1$ with probability 0.5 at every time point. We consider evaluating the "always treat" policy, $\pi(1 \mid s) = 1$ for all $s$.

In the estimation, we use RKHSs with the radial basis function (RBF) kernel to construct the function classes $\mathcal{Q}$ and $\mathcal{G}$; the details of how to construct an appropriate kernel such that $Q(s^{*}, a^{*}) = 0$ for all $Q \in \mathcal{Q}$ can be found in Supplement LABEL:appendix:_calculation. Recall that the estimator in (7) and (8) involves the tuning parameters $(\lambda_n, \mu_n)$. Following the idea in Farahmand & Szepesvári (2011), we select these tuning parameters as follows. We first split the dataset into a training set and a validation set. For each candidate value of the tuning parameters, the training set is used to form the estimates $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$. Then the temporal-difference error is calculated for each transition sample in the validation set. We then use the validation set to fit a model estimating the Bellman error with respect to $(\hat{\eta}^{\pi}, \hat{Q}^{\pi})$, which is the conditional expectation of the temporal-difference error given the state-action pair. We then select the tuning parameters that minimize the average squared estimated Bellman error in the validation set. The final estimate of $\eta^{\pi}$ is re-calculated with the selected tuning parameters using the entire dataset. In the simulation, we use 2/3 of the data for the training set and 1/3 for the validation set, and we use Gaussian Process regression to estimate the Bellman error in the validation step. The 95% confidence interval is then calculated using the estimated asymptotic variance from Section 5.
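As a concrete illustration of this selection procedure, here is a short Python sketch. The helpers `fit_estimator` and `td_error`, and the dictionary keys, are hypothetical stand-ins for the quantities described above (e.g., `fit_estimator` could wrap the closed-form computation sketched in Section 4); the Gaussian Process regression step mirrors the validation-set fit.

```python
import itertools
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical helpers: fit_estimator(train, lam, mu) -> (eta_hat, Q_hat);
# td_error(valid, eta_hat, Q_hat) -> one temporal-difference error per validation transition.
def select_tuning(train, valid, lambdas, mus, fit_estimator, td_error):
    best, best_score = None, np.inf
    for lam, mu in itertools.product(lambdas, mus):
        eta_hat, Q_hat = fit_estimator(train, lam, mu)
        delta = td_error(valid, eta_hat, Q_hat)
        X_valid = valid["state_action"]                      # (S_t, A_t) features on validation set
        gp = GaussianProcessRegressor().fit(X_valid, delta)  # estimate Bellman error E[delta | S, A]
        score = np.mean(gp.predict(X_valid) ** 2)            # squared estimated Bellman error
        if score < best_score:
            best, best_score = (lam, mu), score
    return best
```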

In the simulation, we consider scenarios varying $n$, the number of trajectories, and $T$, the length of each trajectory, and under each scenario the coverage probability is calculated based on 256 repetitions. The simulation results are reported in Table 1. When the number of trajectories is small (i.e., $n = 25$), the simulated coverage probability is slightly smaller than the nominal value, 0.95, especially when the length $T$ is small. It can be seen that the coverage probability slightly improves when $T$ increases. When $n = 40$, the coverage probability becomes close to 0.95. Overall, the simulation results demonstrate the validity of the selection procedure for the tuning parameters. They also suggest that a small-sample correction may be necessary when both $n$ and $T$ are small. This is left for future work.

n    T    Coverage Prob.   MSE      |   n    T    Coverage Prob.   MSE
25   20   0.922            0.0456   |   40   25   0.941            0.0245
25   40   0.941            0.0214   |   40   50   0.941            0.0134
25   80   0.938            0.0128   |   40   75   0.953            0.0070
Table 1: Coverage probability of the 95% confidence interval and mean squared error of the estimated average reward over 256 repetitions.

7 Case Study: HeartSteps

We apply the proposed method to the dataset from the first study in HeartSteps (Klasnja et al., 2015). This is a 42-day micro-randomized trial involving 37 healthy sedentary adults. At each of the five user-specified times in a day, a contextually-tailored activity suggestion message is sent with probability 0.6 if the participant is considered to be available for treatment (e.g., not currently walking or driving). Each participant wears a Jawbone wrist tracker and the step count data are collected. The activity messages are intended to encourage participants to walk shortly after delivery. We construct a target policy based on location. Specifically, we consider the target policy: when the participant is currently available and at home or work, send the activity message; otherwise, do nothing. This policy is of interest because people tend to be more sedentary at home or work, and sending a walking suggestion message is likely to be more effective there than at other locations.

We construct the state based on the participant's step count data (e.g., the 30-minute step count prior to the decision time, yesterday's total step count), location, temperature and the number of notifications received over the last 7 days. We also include in the state the time slot index in the day (1 to 5) and an indicator summarizing how the step count at the current time slot has varied over the last 7 days. The reward is the log-transformed total step count collected in the 30 minutes after the decision time. The log transformation is used because the step count data are highly noisy and positively skewed (Klasnja et al., 2018); the mean post-decision 30-minute step count is 248 across all decision times, with standard deviation 467. In the estimation, we use RKHSs with the radial basis function kernel to form the function classes $\mathcal{Q}$ and $\mathcal{G}$. The tuning parameters are selected using the procedure described in Section 6. The estimated average reward of the target policy is 3.284 (SD = 0.157), and the corresponding 95% confidence interval is approximately (2.98, 3.59). As a reference, the empirical average reward across all participants and study days is 3.075. The p-value for the one-sided test of whether the target policy improves on this reference value is 0.091. Translating back to the raw step count (Klasnja et al., 2018), the difference of about 0.21 on the log scale corresponds to roughly a 23% increase in the average 30-minute step count, on the order of 50 to 60 steps at the observed mean of 248.

8 Discussion

In this work, we developed a flexible non-parametric method to conduct inference about the long-term average outcome for given target policies. Further development is needed. For example, in many mobile health applications non-stationarity occurs due to unobserved aspects of the current context (e.g., the user's engagement and/or burden); for instance, the mapping from states to reward is likely to change over time. It would be interesting to generalize the current long-term average reward framework to the non-stationary setting. Alternatively, one can consider evaluating the treatment policy in the indefinite-horizon setting, where there is an absorbing state (akin to the user disengaging from the mobile app) and the aim is to conduct inference about the expected total reward until the absorbing state is reached. Lastly, while evaluating a given target policy or comparing a small number of policies is an important first step towards improving mobile health, it is crucial to develop methods for estimating the optimal policy, that is, the policy which leads to the largest positive health behavior rewards, as well as inferential methods for assessing the usefulness of particular variables in the estimated optimal policy.

Appendix

In this appendix, we develop two generalizations of the results in the main body of the paper. We first generalize the evaluation of a single policy to a finite class of policies and derive the asymptotic joint distribution of the estimated average rewards. We then consider the setting in which there is a time-invariant variable (e.g., baseline characteristics) and the average reward of the target policy depends on this time-invariant variable.

Off-policy evaluation of multiple policies

Let $\Pi = \{\pi_1, \ldots, \pi_K\}$ be a finite class of target policies. Denote by $\hat{\eta}^{\pi_k}$ the estimated average reward of policy $\pi_k$ from (8). Suppose that for every policy $\pi_k \in \Pi$ the assumptions of Theorems 1 and 2 are satisfied and the tuning parameters are chosen appropriately. The proofs of Theorems 1 and 2 can be easily extended to hold simultaneously for all policies in $\Pi$, and we can show that the estimated average rewards jointly converge in distribution to a multivariate Gaussian distribution:

$$\sqrt{nT}\, \big( \hat{\eta}^{\pi_1} - \eta^{\pi_1}, \ldots, \hat{\eta}^{\pi_K} - \eta^{\pi_K} \big)^{\top} \Rightarrow \mathcal{N}(0, \Sigma),$$

where the $(j, k)$ entry of the covariance matrix $\Sigma$ is the pairwise analog of the asymptotic variance $\sigma^{2}$ in Theorem 2, formed from the density ratios and relative value functions of $\pi_j$ and $\pi_k$.