1 Introduction
Contextual bandits (Auer et al., 2002/03; Langford and Zhang, 2008), sometimes known as associative reinforcement learning (Barto and Anandan, 1985), are a natural generalization of the classic multi-armed bandits introduced by Robbins (1952). In a contextual bandit problem, the decision maker observes contextual information, based on which an action is chosen out of a set of candidates; in return, a numerical “reward” signal is observed for the chosen action, but not for others. The process repeats for multiple steps, and the goal of the decision maker is to maximize the total rewards in this process. Usually, contexts observed by the decision maker provide useful information to infer the expected reward of each action, thus allowing greater rewards to be accumulated, compared to standard multi-armed bandits, which take no account of the context.
Many problems in practice can be modeled by contextual bandits. For example, in one type of Internet advertising, the decision maker (such as a website) dynamically selects which ad to display to a user who visits the page, and receives a payment from the advertiser if the user clicks on the ad (e.g., Chapelle and Li, 2012). In this case, the context can be the user’s geographical information, the action is the displayed ad and the reward is the payment. Importantly, we find out only whether a user clicked on the presented ad, but receive no information about the ads that were not presented.
Another example is content recommendation on Web portals (Agarwal et al., 2013). Here, the decision maker (the web portal) selects, for each user visit, what content (e.g., news, images, videos and music) to display on the page. A natural objective is to “personalize” the recommendations, so that the number of clicks is maximized (Li et al., 2010). In this case, the context is the user’s interests in different topics, either self-reported by the user or inferred from the user browsing history; the action is the recommended item; the reward can be defined as 1 if the user clicks on an item, and 0 otherwise.
Similarly, in health care, we only find out the clinical outcome (the reward) of a patient who received a treatment (action), but not the outcomes for alternative treatments. In general, the treatment strategy may depend on the context of the patient such as her health level and treatment history. Therefore, contextual bandits can also be a natural model to describe personalized treatments.
The behavior of a decision maker in contextual bandits can be described as a policy, to be defined precisely in the next sections. Roughly speaking, a policy is a function that maps the decision maker’s past observations and the contextual information to a distribution over the actions. This paper considers the offline version of contextual bandits: we assume access to historical data, but no ability to gather new data (Langford, Strehl and Wortman, 2008; Strehl et al., 2011). There are two related tasks that arise in this setting: policy evaluation and policy optimization. The goal of policy evaluation is to estimate the expected total reward of a given policy. The goal of policy optimization is to obtain a policy that (approximately) maximizes expected total rewards. The focus of this paper is on policy evaluation, but as we will see in the experiments, the ideas can also be applied to policy optimization. The offline version of contextual bandits is important in practice. For instance, it allows a website to estimate, from historical log data, how much gain in revenue can be achieved by changing the ad-selection policy to a new one (Bottou et al., 2013). Therefore, the website does not have to experiment on real users to test a new policy, which can be very expensive and time-consuming. Finally, we note that this problem is a special case of off-policy reinforcement learning (Precup, Sutton and Singh, 2000).
Two kinds of approaches address offline policy evaluation. The first, called the direct method (DM), estimates the reward function from given data and uses this estimate in place of actual reward to evaluate the policy value on a set of contexts. The second kind, called inverse propensity score (IPS) (Horvitz and Thompson, 1952), uses importance weighting to correct for the incorrect proportions of actions in the historic data. The first approach requires an accurate model of rewards, whereas the second approach requires an accurate model of the past policy. In general, it might be difficult to accurately model rewards, so the first assumption can be too restrictive. On the other hand, in many applications, such as advertising, Web search and content recommendation, the decision maker has substantial, and possibly perfect, knowledge of the past policy, so the second approach can be applied. However, it often suffers from large variance, especially when the past policy differs significantly from the policy being evaluated.
In this paper, we propose to use the technique of doubly robust (DR) estimation to overcome problems with the two existing approaches. Doubly robust (or doubly protected) estimation (Cassel, Särndal and Wretman, 1976; Robins, Rotnitzky and Zhao, 1994; Robins and Rotnitzky, 1995; Lunceford and Davidian, 2004; Kang and Schafer, 2007) is a statistical approach for estimation from incomplete data with an important property: if either one of the two estimators (i.e., DM or IPS) is correct, then the estimation is unbiased. This method thus increases the chances of drawing reliable inference.
We apply the doubly robust technique to policy evaluation and optimization in a contextual bandit setting. The most straightforward policies to consider are stationary policies, whose actions depend on the current, observed context alone. Nonstationary policies, on the other hand, map the current context and a history of past rounds to an action. They are of critical interest because online learning algorithms (also known as adaptive allocation rules), by definition, produce nonstationary policies. We address both stationary and nonstationary policies in this paper.
In Section 2, we describe previous work and connect our setting to the related area of dynamic treatment regimes.
In Section 3, we study stationary policy evaluation, analyzing the bias and variance of our core technique. Unlike previous theoretical analyses, we do not assume that either the reward model or the past policy model is correct. Instead, we show how the deviations of the two models from the truth impact bias and variance of the doubly robust estimator. To our knowledge, this style of analysis is novel and may provide insights into doubly robust estimation beyond the specific setting studied here. In Section 4, we apply this method to both policy evaluation and optimization, finding that this approach can substantially sharpen existing techniques.
In Section 5, we consider nonstationary policy evaluation. The main approach here is to use the historic data to obtain a sample of the run of an evaluated nonstationary policy via rejection sampling (Li et al., 2011). We combine the doubly robust technique with an improved form of rejection sampling that makes better use of data at the cost of small, controllable bias. Experiments in Section 6 suggest the combination is able to extract more information from data than existing approaches.
2 Prior Work
2.1 Doubly Robust Estimation
Doubly robust estimation is widely used in statistical inference (see, e.g., Kang and Schafer, 2007, and the references therein). More recently, it has been used in Internet advertising to estimate the effects of new features for online advertisers (Lambert and Pregibon, 2007; Chan et al., 2010). Most previous analyses of doubly robust estimation focus on asymptotic behavior or rely on various modeling assumptions (e.g., Robins, Rotnitzky and Zhao, 1994; Lunceford and Davidian, 2004; Kang and Schafer, 2007). Our analysis is nonasymptotic and makes no such assumptions.
Several papers in machine learning have used ideas related to the basic technique discussed here, although not with the same language. For benign bandits, Hazan and Kale (2009) construct algorithms which use reward estimators to improve regret bounds when the variance of actual rewards is small. Similarly, the Offset Tree algorithm (Beygelzimer and Langford, 2009) can be thought of as using a crude reward estimate for the “offset.” The algorithms and estimators described here are substantially more sophisticated.
Our nonstationary policy evaluation builds on the rejection sampling approach, which has been previously shown to be effective (Li et al., 2011). Relative to this earlier work, our nonstationary results take advantage of the doubly robust technique and a carefully introduced bias/variance tradeoff to obtain an empirical order-of-magnitude improvement in evaluation quality.
2.2 Dynamic Treatment Regimes
Contextual bandit problems are closely related to dynamic treatment regime (DTR) estimation/optimization in medical research. A DTR is a set of (possibly randomized) rules that specify what treatment to choose, given current characteristics (including past treatment history and outcomes) of a patient. In the terminology of the present paper, the patient’s current characteristics are contextual information, a treatment is an action, and a DTR is a policy. Similar to contextual bandits, the quantity of interest in DTR can be expressed by a numeric reward signal related to the clinical outcome of a treatment. We comment on similarities and differences between DTR and contextual bandits in more detail in later sections of the paper, where we define our setting more formally. Here, we make a few higher-level remarks.
Due to ethical concerns, research in DTR is often performed with observational data rather than on patients. This corresponds to the offline version of contextual bandits, which only has access to past data but no ability to gather new data. Causal inference techniques have been studied to estimate the mean response of a given DTR (e.g., Robins, 1986; Murphy, van der Laan and Robins, 2001), and to optimize DTR (e.g., Murphy, 2003; Orellana, Rotnitzky and Robins, 2010). These two problems correspond to evaluation and optimization of policies in the present paper.
In DTR, however, a treatment typically exhibits a long-term effect on a patient’s future “state,” while in contextual bandits the contexts are drawn IID with no dependence on actions taken previously. Such a difference turns out to enable statistically more efficient estimators, which will be explained in greater detail in Section 5.2.
Despite these differences, as we will see later, contextual bandits and DTR share many similarities, and in some cases are almost identical. For example, analogous to the results introduced in this paper, doubly robust estimators have been applied to DTR estimation (Murphy, van der Laan and Robins, 2001), and also used as a subroutine for optimization in a family of parameterized policies (Zhang et al., 2012). The connection suggests a broader applicability of DTR techniques beyond the medical domain, for instance, to the Internet-motivated problems studied in this paper.
3 Evaluation of Stationary Policies
3.1 Problem Definition
We are interested in the contextual bandit setting where on each round:
1. A vector of covariates (or a context) $x \in \mathcal{X}$ is revealed.
2. An action (or arm) $a$ is chosen from a given set $\mathcal{A}$.
3. A reward $r \in [0,1]$ for the action $a$ is revealed, but the rewards of other actions are not. The reward may depend stochastically on $x$ and $a$.
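We assume that contexts are chosen IID from an unknown distribution $D(x)$, the actions are chosen from a finite (and typically not too large) action set $\mathcal{A}$, and the distribution over rewards $D(r \mid a, x)$ does not change over time (but is unknown).
The input data consists of a finite stream of triples $(x_k, a_k, r_k)$ indexed by $k = 1, 2, \ldots, n$. We assume that the actions $a_k$ are generated by some past (possibly nonstationary) policy, which we refer to as the exploration policy. The exploration history up to round $k$ is denoted
$$z_k = (x_1, a_1, r_1, \ldots, x_k, a_k, r_k).$$
Histories are viewed as samples from a probability measure $\mu$. Our assumptions about data generation then translate into the assumption about factoring of $\mu$ as
$$\mu(x_k, a_k, r_k \mid z_{k-1}) = D(x_k)\,\mu(a_k \mid x_k, z_{k-1})\,D(r_k \mid x_k, a_k)$$
for any $k$. Note that apart from the unknown distribution $D$, the only degree of freedom above is $\mu(a_k \mid x_k, z_{k-1})$, that is, the unknown exploration policy.
When $z_{k-1}$ is clear from the context, we use a shorthand $\mu_k$ for the conditional distribution over the $k$th triple
$$\mu_k(x, a, r) = \mu(x_k = x, a_k = a, r_k = r \mid z_{k-1}).$$
We also write $\mathbf{P}^\mu_k$ and $\mathbf{E}^\mu_k$ for $\mathbf{P}_\mu[\,\cdot \mid z_{k-1}]$ and $\mathbf{E}_\mu[\,\cdot \mid z_{k-1}]$.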
Given input data $z_n$, we study the stationary policy evaluation problem. A stationary randomized policy $\nu$ is described by a conditional distribution $\nu(a \mid x)$ of choosing an action on each context. The goal is to use the history $z_n$ to estimate the value of $\nu$, namely, the expected reward obtained by following $\nu$:
$$V(\nu) = \mathbf{E}_{x \sim D}\,\mathbf{E}_{a \sim \nu(\cdot \mid x)}\,\mathbf{E}_{r \sim D(\cdot \mid x, a)}[r].$$
In content recommendation on Web portals, for example, $V(\nu)$ measures the average click probability per user visit, one of the major metrics with critical business importance.
In order to have unbiased policy evaluation, we make a standard assumption that if $\nu(a \mid x) > 0$ then $\mu_k(a \mid x) > 0$ for all $k$ (and all possible histories $z_{k-1}$). This clearly holds for instance if $\mu_k(a \mid x) > 0$ for all $a$. Since $\nu$ is fixed in our paper, we will write $V$ for $V(\nu)$. To simplify notation, we extend the conditional distribution $\nu$ to a distribution over triples $(x, a, r)$:
$$\nu(x, a, r) = D(x)\,\nu(a \mid x)\,D(r \mid x, a),$$
and hence $V = \mathbf{E}_\nu[r]$.
The problem of stationary policy evaluation, defined above, is slightly more general than DTR analysis in a typical cross-sectional observational study, where the exploration policy (known as “treatment mechanism” in the DTR literature) is stationary; that is, the conditional distribution $\mu(a_k \mid x_k, z_{k-1})$ is independent of $z_{k-1}$ and identical across all $k$, that is, $\mu_k = \mu_1$ for all $k$.
3.2 Existing Approaches
The key challenge in estimating policy value in contextual bandits is that rewards are partially observable: in each round, only the reward for the chosen action is revealed; we do not know what the reward would have been if we chose a different action. Hence, the data collected in a contextual bandit process cannot be used directly to estimate a new policy’s value: if in a context $x$ the new policy selects an action $a'$ different from the action $a$ chosen during data collection, we simply do not have the reward signal for $a'$.
There are two common solutions for overcoming this limitation (see, e.g., Lambert and Pregibon, 2007, for an introduction to these solutions). The first, called the direct method (DM), forms an estimate $\hat r(x, a)$ of the expected reward conditioned on the context and action. The policy value is then estimated by
$$\hat V_{\mathrm{DM}} = \frac{1}{n} \sum_{k=1}^{n} \sum_{a \in \mathcal{A}} \nu(a \mid x_k)\,\hat r(x_k, a).$$
Clearly, if $\hat r(x, a)$ is a good approximation of the true expected reward $\mathbf{E}_D[r \mid x, a]$, then the DM estimate is close to $V$. A problem with this method is that the estimate $\hat r$ is typically formed without the knowledge of $\nu$, and hence might focus on approximating expected reward in the areas that are irrelevant for $\nu$ and not sufficiently in the areas that are important for $\nu$ (see, e.g., the analysis of Beygelzimer and Langford, 2009).
The second approach, called inverse propensity score (IPS), is typically less prone to problems with bias. Instead of approximating the reward, IPS forms an approximation $\hat\mu_k(a \mid x)$ of $\mu_k(a \mid x)$, and uses this estimate to correct for the shift in action proportions between the exploration policy and the new policy:
$$\hat V_{\mathrm{IPS}} = \frac{1}{n} \sum_{k=1}^{n} \frac{\nu(a_k \mid x_k)}{\hat\mu_k(a_k \mid x_k)}\, r_k.$$
If $\hat\mu_k(a \mid x) \approx \mu_k(a \mid x)$, then the IPS estimate above will be, approximately, an unbiased estimate of $V$. Since we typically have a good (or even accurate) understanding of the data-collection policy, it is often easier to obtain good estimates $\hat\mu_k$, and thus the IPS estimator is in practice less susceptible to problems with bias compared with the direct method. However, IPS typically has a much larger variance, due to the increased range of the random variable $\nu(a_k \mid x_k)/\hat\mu_k(a_k \mid x_k)$. The issue becomes more severe when $\hat\mu_k(a_k \mid x_k)$ gets smaller in high probability areas under $\nu$. Our approach alleviates the large variance problem of IPS by taking advantage of the estimate $\hat r$ used by the direct method.
3.3 Doubly Robust Estimator
Doubly robust estimators take advantage of both the estimate of the expected reward $\hat r$ and the estimate of action probabilities $\hat\mu_k(a \mid x)$. A similar idea has been suggested earlier by a number of authors for different estimation problems (Cassel, Särndal and Wretman, 1976; Rotnitzky and Robins, 1995; Robins and Rotnitzky, 1995; Murphy, van der Laan and Robins, 2001; Robins, 1998). For the setting in this section, the estimator of Murphy, van der Laan and Robins (2001) can be reduced to
$$\hat V_{\mathrm{DR}} = \frac{1}{n} \sum_{k=1}^{n} \left[ \hat r(x_k, \nu) + \frac{\nu(a_k \mid x_k)}{\hat\mu_k(a_k \mid x_k)} \bigl(r_k - \hat r(x_k, a_k)\bigr) \right],$$
where
$$\hat r(x, \nu) = \sum_{a \in \mathcal{A}} \nu(a \mid x)\,\hat r(x, a)$$
is the estimate of $\mathbf{E}_\nu[r \mid x]$ derived from $\hat r$. Informally, the doubly robust estimator uses $\hat r$ as a baseline and if there is data available, a correction is applied. We will see that our estimator is unbiased if at least one of the estimators, $\hat r$ and $\hat\mu_k$, is accurate, hence the name doubly robust.
In practice, quite often neither $\hat r$ nor $\hat\mu_k$ is accurate. It should be noted that, although $\mu_k$ tends to be much easier to estimate than $\mathbf{E}_D[r \mid x, a]$ in applications that motivate this study, it is rare to be able to get a perfect estimator, due to engineering constraints in complex applications like Web search and Internet advertising. Thus, a basic question is: How does the estimator $\hat V_{\mathrm{DR}}$ perform as the estimates $\hat r$ and $\hat\mu_k$ deviate from the truth? The following section analyzes bias and variance of the DR estimator as a function of errors in $\hat r$ and $\hat\mu_k$. Note that our DR estimator encompasses DM and IPS as special cases (by respectively setting $\hat\mu_k \equiv \infty$ and $\hat r \equiv 0$), so our analysis also encompasses DM and IPS.
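To make the relationship between the three estimators concrete, the following Python sketch computes DM, IPS and DR from a list of logged rounds. It is only an illustration under stated assumptions: the record layout `(x, a, r, p)`, the function names, and the toy log are hypothetical and not taken from the paper; `p` stands for the estimated probability that the exploration policy assigned to the logged action.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# One logged round (hypothetical layout): context x, logged action a,
# observed reward r, estimated propensity p of the logged action.
Logged = Tuple[Hashable, int, float, float]
Policy = Callable[[int, Hashable], float]        # nu(a, x) -> probability
RewardModel = Callable[[Hashable, int], float]   # r_hat(x, a) -> estimate


def dm_estimate(data: List[Logged], actions: Sequence[int],
                nu: Policy, r_hat: RewardModel) -> float:
    """Direct method: average model-based value of nu on the logged contexts."""
    total = 0.0
    for x, _a, _r, _p in data:
        total += sum(nu(b, x) * r_hat(x, b) for b in actions)
    return total / len(data)


def ips_estimate(data: List[Logged], nu: Policy) -> float:
    """Inverse propensity score: importance-weighted observed rewards."""
    return sum(nu(a, x) / p * r for x, a, r, p in data) / len(data)


def dr_estimate(data: List[Logged], actions: Sequence[int],
                nu: Policy, r_hat: RewardModel) -> float:
    """Doubly robust: model baseline plus importance-weighted residual."""
    total = 0.0
    for x, a, r, p in data:
        baseline = sum(nu(b, x) * r_hat(x, b) for b in actions)
        total += baseline + nu(a, x) / p * (r - r_hat(x, a))
    return total / len(data)


def nu(a: int, x: Hashable) -> float:
    """Toy deterministic target policy: play the action equal to the context."""
    return 1.0 if a == x else 0.0


def r_hat(x: Hashable, a: int) -> float:
    """Toy reward model: a constant guess."""
    return 0.5


actions = [0, 1]
# Toy log collected by uniform exploration over two actions.
log = [(0, 0, 1.0, 0.5), (1, 1, 0.0, 0.5), (0, 1, 1.0, 0.5), (1, 0, 1.0, 0.5)]
print(dm_estimate(log, actions, nu, r_hat),
      ips_estimate(log, nu),
      dr_estimate(log, actions, nu, r_hat))
```

Passing a reward model that is identically zero to `dr_estimate` gives the IPS value, and replacing every propensity by `float("inf")` gives the DM value, which mirrors the special cases noted above.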
3.4 Analysis
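We assume that $\hat r(x, a) \in [0, 1]$ and $\hat\mu_k(a \mid x) \in (0, \infty]$, but in general $\hat\mu_k$ does not need to represent conditional probabilities (our notation is only meant to indicate that $\hat\mu_k$ estimates $\mu_k$, but no probabilistic structure). In general, we allow $\hat r$ and $\hat\mu_k$ to be random variables, as long as they satisfy the following independence assumptions:
- $\hat r$ is independent of $z_n$.
- $\hat\mu_k$ is conditionally independent of $\{(x_\ell, a_\ell, r_\ell)\}_{\ell \ge k}$, conditioned on $z_{k-1}$.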
The first assumption means that $\hat r$ can be assumed fixed and determined before we see the input data $z_n$, for example, by initially splitting the input dataset and using the first part to obtain $\hat r$ and the second part to evaluate the policy. In our analysis, we condition on $\hat r$ and ignore any randomness in its choice.
The second assumption means that $\hat\mu_k$ is not allowed to depend on the future. A simple way to satisfy this assumption is to split the dataset to form an estimator (and potentially also include data $z_{k-1}$). If we have some control over the exploration process, we might also have access to “perfect logging”, that is, recorded probabilities $\mu_k(a_k \mid x_k)$. With perfect logging, we can achieve $\hat\mu_k = \mu_k$, respecting our assumptions. (As we will see later in the paper, in order to reduce the variance of the estimator it might still be advantageous to use a slightly inflated estimator, for example, $\hat\mu_k = c\mu_k$ for $c > 1$, or $\hat\mu_k(a \mid x) = \max\{c, \mu_k(a \mid x)\}$ for some $c > 0$.)
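One simple way to respect both independence assumptions in practice is sketched below, assuming rounds are IID (stationary exploration) and that propensities were logged. The helper names, the tabular reward model, and the record layout `(x, a, r, p)` are hypothetical choices for illustration; the clipping helper corresponds to the inflated estimator of the form max{c, logged probability} mentioned above.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Logged = Tuple[Hashable, int, float, float]  # (context, action, reward, logged propensity)


def split_log(log: List[Logged], frac: float = 0.5) -> Tuple[List[Logged], List[Logged]]:
    """Split the log so the reward model never sees the evaluation rounds."""
    cut = int(len(log) * frac)
    return log[:cut], log[cut:]


def fit_tabular_r_hat(fit_part: List[Logged],
                      default: float = 0.5) -> Callable[[Hashable, int], float]:
    """Fit a per-(context, action) mean reward on the first part of the log only."""
    sums: Dict[Tuple[Hashable, int], float] = {}
    counts: Dict[Tuple[Hashable, int], int] = {}
    for x, a, r, _p in fit_part:
        sums[(x, a)] = sums.get((x, a), 0.0) + r
        counts[(x, a)] = counts.get((x, a), 0) + 1

    def r_hat(x: Hashable, a: int) -> float:
        n = counts.get((x, a), 0)
        # Fall back to a fixed guess for pairs never seen in the fitting part.
        return sums[(x, a)] / n if n else default

    return r_hat


def clipped_propensity(mu_logged: float, c: float) -> float:
    """Inflated propensity estimate max{c, mu}: less variance, some bias."""
    return max(c, mu_logged)


log = [(0, 0, 1.0, 0.5), (0, 0, 1.0, 0.5), (1, 1, 1.0, 0.5), (0, 0, 0.0, 0.5)]
fit_part, eval_part = split_log(log)
r_hat = fit_tabular_r_hat(fit_part)
eval_with_clipping = [(x, a, r, clipped_propensity(p, 0.05)) for x, a, r, p in eval_part]
```

The evaluation part, with its (possibly clipped) propensities, is what would then be passed to a DR estimator together with `r_hat`.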
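Analogous to $\hat r(x, a)$, we define the population quantity
$$r^*(x, a) = \mathbf{E}_D[r \mid x, a],$$
and define $r^*(x, \nu)$ similarly to $\hat r(x, \nu)$:
$$r^*(x, \nu) = \mathbf{E}_\nu[r \mid x].$$
Let $\Delta(x, a)$ and $\varrho_k(x, a)$ denote, respectively, the additive error of $\hat r$ and the multiplicative error of $\hat\mu_k$:
$$\Delta(x, a) = \hat r(x, a) - r^*(x, a), \qquad \varrho_k(x, a) = \mu_k(a \mid x)/\hat\mu_k(a \mid x).$$
We assume that for some $M \ge 0$, with probability one under $\mu$:
$$\nu(a_k \mid x_k)/\hat\mu_k(a_k \mid x_k) \le M,$$
which can always be satisfied by enforcing $\hat\mu_k \ge 1/M$.
To bound the error of $\hat V_{\mathrm{DR}}$, we first analyze a single term:
$$\hat V_k = \hat r(x_k, \nu) + \frac{\nu(a_k \mid x_k)}{\hat\mu_k(a_k \mid x_k)} \bigl(r_k - \hat r(x_k, a_k)\bigr).$$
We bound its range, bias, and conditional variance as follows (for proofs, see Appendix A):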
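Lemma 3.1
The range of $\hat V_k$ is bounded as
$$|\hat V_k| \le 1 + M.$$
Lemma 3.2
The expectation of the term $\hat V_k$ is
$$\mathbf{E}^\mu_k[\hat V_k] = \mathbf{E}_{(x,a) \sim \nu}\bigl[r^*(x, a) + \bigl(1 - \varrho_k(x, a)\bigr)\Delta(x, a)\bigr].$$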
Lemma 3.3
The variance of the term can be decomposed and bounded as follows:
(i)  
(ii)  
The range of $\hat V_k$ is controlled by the worst-case ratio $\nu(a_k \mid x_k)/\hat\mu_k(a_k \mid x_k)$. The bias of $\hat V_k$ gets smaller as $\hat r$ and $\hat\mu_k$ become more accurate, that is, as $\Delta \approx 0$ and $\varrho_k \approx 1$. The expression for variance is more complicated. Lemma 3.3(i) lists four terms. The first term represents the variance component due to the randomness over $x$. The second term can contribute to the decrease in the variance. The final two terms represent the penalty due to the importance weighting. The third term scales with the conditional variance of rewards (given contexts and actions), and it vanishes if rewards are deterministic. The fourth term scales with the magnitude of $\Delta$, and it captures the potential improvement due to the use of a good estimator $\hat r$.
The upper bound on the variance [Lemma 3.3(ii)] is easier to interpret. The first term is the variance of the estimated variable over $x$. The second term measures the quality of the estimators $\hat\mu_k$ and $\hat r$; it equals zero if either of them is perfect (or if the union of regions where they are perfect covers the support of $\nu$ over $x$ and $a$). The final term represents the importance weighting penalty. It vanishes if we do not apply importance weighting (i.e., $\hat\mu_k \equiv \infty$ and $\varrho_k \equiv 0$). With nonzero $\varrho_k$, this term decreases with a better quality of $\hat r$, but it does not disappear even if $\hat r$ is perfect (unless the rewards are deterministic).
3.4.1 Bias analysis
Lemma 3.2 immediately yields a bound on the bias of the doubly robust estimator, as stated in the following theorem. The special case for stationary policies (second part of the theorem) has been shown by Vansteelandt, Bekaert and Claeskens (2012).
Theorem 3.4
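Let $\Delta$ and $\varrho_k$ be defined as above. Then the bias of the doubly robust estimator is
$$\bigl|\mathbf{E}_\mu[\hat V_{\mathrm{DR}}] - V\bigr| = \frac{1}{n} \left| \mathbf{E}_\mu\!\left[ \sum_{k=1}^{n} \mathbf{E}_{(x,a) \sim \nu}\bigl[\bigl(1 - \varrho_k(x, a)\bigr)\Delta(x, a)\bigr] \right] \right|.$$
If the exploration policy $\mu$ and the estimator $\hat\mu_k$ are stationary (i.e., $\mu_k = \mu_1$ and $\hat\mu_k = \hat\mu_1$ for all $k$), the expression simplifies to
$$\bigl|\mathbf{E}_\mu[\hat V_{\mathrm{DR}}] - V\bigr| = \bigl|\mathbf{E}_\nu\bigl[\bigl(1 - \varrho_1(x, a)\bigr)\Delta(x, a)\bigr]\bigr|.$$
The theorem follows immediately from Lemma 3.2. In contrast, we have (for simplicity, assuming stationarity of the exploration policy and its estimate)
$$\bigl|\mathbf{E}_\mu[\hat V_{\mathrm{DM}}] - V\bigr| = \bigl|\mathbf{E}_\nu[\Delta(x, a)]\bigr|, \qquad \bigl|\mathbf{E}_\mu[\hat V_{\mathrm{IPS}}] - V\bigr| = \bigl|\mathbf{E}_\nu\bigl[r^*(x, a)\bigl(1 - \varrho_1(x, a)\bigr)\bigr]\bigr|,$$
where the first equality is based on the observation that DM is a special case of DR with $\hat\mu_k \equiv \infty$ (and hence $\varrho_k \equiv 0$), and the second equality is based on the observation that IPS is a special case of DR with $\hat r \equiv 0$ (and hence $\Delta \equiv -r^*$).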
In general, neither of the estimators dominates the others. However, if either $\Delta \approx 0$, or $\varrho_k \approx 1$, the expected value of the doubly robust estimator will be close to the true value, whereas DM requires $\Delta \approx 0$ and IPS requires $\varrho_k \approx 1$. Also, if $\|\varrho_k - 1\| \ll 1$ [for a suitable norm], we expect that DR will outperform DM. Similarly, if $\varrho_k \approx 1$ but $\|\Delta\| \ll \|r^*\|$, we expect that DR will outperform IPS. Thus, DR can effectively take advantage of both sources of information to lower the bias.
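The double robustness of the bias can be checked exactly on a small tabular problem. The sketch below assumes a stationary setting with finitely many contexts and actions, and compares the exact expectation of a single DR term against the product-form expression, namely the expectation under the target policy of (1 − μ/μ̂)(r̂ − r*). All table values and function names are made up for illustration.

```python
from typing import List

Table = List[List[float]]  # indexed as table[x][a]


def policy_value(D: List[float], nu: Table, r_star: Table) -> float:
    """True value V of the target policy nu."""
    return sum(dx * sum(nu[x][a] * r_star[x][a] for a in range(len(nu[x])))
               for x, dx in enumerate(D))


def expected_dr_term(D: List[float], nu: Table, mu: Table, mu_hat: Table,
                     r_star: Table, r_hat: Table) -> float:
    """Exact expectation of one DR term under x ~ D, a ~ mu(.|x)."""
    total = 0.0
    for x, dx in enumerate(D):
        acts = range(len(nu[x]))
        baseline = sum(nu[x][a] * r_hat[x][a] for a in acts)
        correction = sum(mu[x][a] * nu[x][a] / mu_hat[x][a]
                         * (r_star[x][a] - r_hat[x][a]) for a in acts)
        total += dx * (baseline + correction)
    return total


def bias_formula(D: List[float], nu: Table, mu: Table, mu_hat: Table,
                 r_star: Table, r_hat: Table) -> float:
    """Signed bias: E_nu[(1 - mu/mu_hat) * (r_hat - r_star)]."""
    return sum(dx * sum(nu[x][a] * (1.0 - mu[x][a] / mu_hat[x][a])
                        * (r_hat[x][a] - r_star[x][a])
                        for a in range(len(nu[x])))
               for x, dx in enumerate(D))


# Illustrative tables: two contexts, two actions.
D = [0.5, 0.5]
nu = [[1.0, 0.0], [0.3, 0.7]]
mu = [[0.5, 0.5], [0.2, 0.8]]
mu_hat = [[0.4, 0.6], [0.25, 0.75]]   # imperfect propensity estimate
r_star = [[0.9, 0.1], [0.2, 0.6]]
r_hat = [[0.7, 0.3], [0.4, 0.5]]      # imperfect reward estimate

V = policy_value(D, nu, r_star)
print(expected_dr_term(D, nu, mu, mu_hat, r_star, r_hat) - V,
      bias_formula(D, nu, mu, mu_hat, r_star, r_hat))
```

Replacing `mu_hat` by `mu`, or `r_hat` by `r_star`, drives both quantities to zero, which is the double robustness property in its simplest form.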
3.4.2 Variance analysis
We argued that the expected value of $\hat V_{\mathrm{DR}}$ compares favorably with IPS and DM. We next look at the variance of DR. Since large-deviation bounds have a primary dependence on variance, a lower variance implies a faster convergence rate. To contrast DR with IPS and DM, we study a simpler setting with a stationary exploration policy, and deterministic target policy $\nu$, that is, $\nu(\cdot \mid x)$ puts all the probability on a single action. In the next section, we revisit the fully general setting and derive a finite-sample bound on the error of DR.
Theorem 3.5
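Let $\Delta$ and $\varrho_k$ be defined as above. If the exploration policy $\mu$ and the estimator $\hat\mu_k$ are stationary, and the target policy $\nu$ is deterministic, then, writing $\nu(x)$ for the action that $\nu$ selects on context $x$, the variance of the doubly robust estimator is
$$\mathbf{V}_\mu[\hat V_{\mathrm{DR}}] = \frac{1}{n} \Biggl( \mathbf{V}_{x \sim D}\Bigl[r^*(x, \nu(x)) + \bigl(1 - \varrho_1(x, \nu(x))\bigr)\Delta(x, \nu(x))\Bigr] + \mathbf{E}_{x \sim D}\Bigl[\frac{\varrho_1(x, \nu(x))}{\hat\mu_1(\nu(x) \mid x)}\,\mathbf{V}_{r \sim D(\cdot \mid x, \nu(x))}[r]\Bigr] + \mathbf{E}_{x \sim D}\Bigl[\frac{1 - \mu_1(\nu(x) \mid x)}{\hat\mu_1(\nu(x) \mid x)}\,\varrho_1(x, \nu(x))\,\Delta(x, \nu(x))^2\Bigr] \Biggr).$$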
The theorem follows immediately from Lemma 3.3(i).
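The variance can be decomposed into three terms. The first term accounts for the randomness in $x$ (note that $a$ is deterministic given $x$). The other two terms can be viewed as the importance weighting penalty. These two terms disappear in DM, which does not use rewards $r_k$. The second term accounts for randomness in rewards and disappears when rewards are deterministic functions of $x$ and $a$. However, the last term stays, accounting for the disagreement between actions taken by $\nu$ and $\mu_1$.
Similar expressions can be derived for the DM and IPS estimators. Since IPS is a special case of DR with $\hat r \equiv 0$, we obtain the following equation, with all quantities evaluated at the action $\nu(x)$ selected by the deterministic policy:
$$\mathbf{V}_\mu[\hat V_{\mathrm{IPS}}] = \frac{1}{n} \Biggl( \mathbf{V}_{x \sim D}\bigl[\varrho_1(x, \nu(x))\,r^*(x, \nu(x))\bigr] + \mathbf{E}_{x \sim D}\Bigl[\frac{\varrho_1(x, \nu(x))}{\hat\mu_1(\nu(x) \mid x)}\,\mathbf{V}_{r \sim D(\cdot \mid x, \nu(x))}[r]\Bigr] + \mathbf{E}_{x \sim D}\Bigl[\frac{1 - \mu_1(\nu(x) \mid x)}{\hat\mu_1(\nu(x) \mid x)}\,\varrho_1(x, \nu(x))\,r^*(x, \nu(x))^2\Bigr] \Biggr).$$
The first term will be of similar magnitude as the corresponding term of the DR estimator, provided that $\varrho_1 \approx 1$. The second term is identical to the DR estimator. However, the third term can be much larger for IPS if $\mu_1(\nu(x) \mid x) \ll 1$ and $|\Delta(x, \nu(x))|$ is smaller than $r^*(x, \nu(x))$ for the actions chosen by $\nu$.
In contrast, for the direct method, which is a special case of DR with $\hat\mu_k \equiv \infty$, the following variance is obtained immediately:
$$\mathbf{V}_\mu[\hat V_{\mathrm{DM}}] = \frac{1}{n}\,\mathbf{V}_{x \sim D}\bigl[r^*(x, \nu(x)) + \Delta(x, \nu(x))\bigr].$$
Thus, the variance of the direct method does not have terms depending either on the exploration policy or the randomness in the rewards. This fact usually suffices to ensure that its variance is significantly lower than that of DR or IPS. However, as mentioned in the previous section, when we can estimate $\mu_k$ reasonably well (namely, $\varrho_k \approx 1$), the bias of the direct method is typically much larger, leading to larger errors in estimating policy values.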
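The variance comparison can be illustrated by enumerating all outcomes of a single term in a tiny problem. The sketch below assumes Bernoulli rewards, a deterministic target policy, stationary exploration, and a single context; the function name and all numbers are hypothetical. DM and IPS are obtained as special cases by setting the propensity estimate to infinity and the reward estimate to zero, respectively.

```python
from typing import List, Tuple

Table = List[List[float]]  # indexed as table[x][a]


def dr_term_moments(D: List[float], target: List[int], mu: Table,
                    mu_hat: Table, r_star: Table, r_hat: Table) -> Tuple[float, float]:
    """Exact mean and variance of one DR term.

    Assumes x ~ D, a ~ mu(.|x), r ~ Bernoulli(r_star[x][a]), and a
    deterministic target policy that plays target[x] on context x.
    """
    m1 = m2 = 0.0
    for x, dx in enumerate(D):
        t = target[x]
        for a, pa in enumerate(mu[x]):
            # Importance weight is nonzero only when the logged action matches.
            weight = 1.0 / mu_hat[x][a] if a == t else 0.0
            for r, pr in ((1.0, r_star[x][a]), (0.0, 1.0 - r_star[x][a])):
                v = r_hat[x][t] + weight * (r - r_hat[x][a])
                p = dx * pa * pr
                m1 += p * v
                m2 += p * v * v
    return m1, m2 - m1 * m1


INF = float("inf")
D = [1.0]                 # a single context
target = [0]              # target policy always plays action 0
mu = [[0.5, 0.5]]         # uniform exploration, perfectly logged
r_star = [[0.8, 0.2]]     # true mean rewards
good_r_hat = [[0.7, 0.3]]
zero_r_hat = [[0.0, 0.0]]

dm = dr_term_moments(D, target, mu, [[INF, INF]], r_star, good_r_hat)
ips = dr_term_moments(D, target, mu, mu, r_star, zero_r_hat)
dr = dr_term_moments(D, target, mu, mu, r_star, good_r_hat)
print(dm, ips, dr)
```

With these toy numbers the single-term variances are 0 for DM, 0.33 for DR and 0.96 for IPS, while DM has mean 0.7 against a true value of 0.8 and the other two are unbiased, which matches the qualitative picture above.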
3.4.3 Finite-sample error bound
By combining bias and variance bounds, we now work out a specific finite-sample bound on the error of the estimator $\hat V_{\mathrm{DR}}$. While such an error bound could be used as a conservative confidence bound, we expect it to be too loose in most settings (as is typical for finite-sample bounds). Instead, our main intention is to explicitly highlight how the errors of estimators $\hat r$ and $\hat\mu_k$ contribute to the final error.
To begin, we first quantify magnitudes of the additive error of the estimator , and the relative error of the estimator : Assume there exist such that
and with probability one under :
Recall that . In addition, our analysis depends on the magnitude of the ratio and a term that captures both the variance of the rewards and the error of . Assume there exist such that with probability one under , for all :
With the assumptions above, we can now bound the bias and variance of a single term $\hat V_k$. As in the previous sections, the bias decreases with the quality of $\hat r$ and $\hat\mu_k$, and the variance increases with the variance of the rewards and with the magnitudes of the importance-weight ratios. The analysis below for instance captures the bias-variance tradeoff of using $\hat\mu_k = c\mu_k$ for some $c > 1$: such a strategy can lead to a lower variance (by lowering the magnitudes of these ratios) but incurs some additional bias that is controlled by the quality of $\hat r$.
The bias and variance bound follow from Lemma 3.2 and Lemma 3.3(ii), respectively, by Hölder’s inequality.
Using the above lemma and Freedman’s inequality yields the following theorem.
The proof follows by Freedman’s inequality (Theorem B.1 in Appendix B), applied to random variables , whose range and variance are bounded using Lemmas 3.1 and 3.6.
The theorem is a finite-sample error bound that holds for all sample sizes $n$, and in the limit the error converges to the asymptotic term of the bound. As we mentioned, this result gives a confidence interval for the doubly robust estimate $\hat V_{\mathrm{DR}}$ for any finite sample size $n$. Other authors have used asymptotic theory to derive confidence intervals for policy evaluation by showing that the estimator is asymptotically normal (e.g., Murphy, van der Laan and Robins, 2001; Zhang et al., 2012). When using asymptotic confidence bounds, it can be difficult to know a priori whether the asymptotic distribution has been reached, whereas our bound applies to all finite sample sizes. Although our bound may be conservative for small sample sizes, it provides a “safe” nonasymptotic confidence interval. In certain applications like those on the Internet, the sample size is usually large enough for this kind of nonasymptotic confidence bound to be almost as small as its asymptotic value (the term in Theorem 3.7), as demonstrated by Bottou et al. (2013) for online advertising.
4 Experiments: the Stationary Case
This section provides empirical evidence for the effectiveness of the DR estimator compared to IPS and DM. We study these estimators on several real-world datasets. First, we use public benchmark datasets for multiclass classification to construct contextual bandit data, on which we evaluate both policy evaluation and policy optimization approaches. Second, we use a proprietary dataset to model the pattern of user visits to an Internet portal. We study covariate shift, which can be formalized as a special case of policy evaluation. Our third experiment uses another proprietary dataset to model slotting of various types of search results on a webpage.
4.1 Multiclass Classification with Partial Feedback
We begin with a description of how to turn a $K$-class classification dataset into a $K$-armed contextual bandit dataset. Instead of rewards, we will work with losses, specifically the 0/1 classification error. The actions correspond to predicted classes. In the usual multiclass classification, we can infer the loss of any action on training data (since we know its correct label), so we call this a full feedback setting. On the other hand, in contextual bandits, we only know the loss of the specific action that was taken by the exploration policy, but of no other action, which we call a partial feedback setting. After choosing an exploration policy, our transformation from full to partial feedback simply “hides” the losses of actions that were not picked by the exploration policy.
This protocol gives us two benefits. First, we can carry out comparison using public multiclass classification datasets, which are more common than contextual bandit datasets. Second, fully revealed data can be used to obtain the ground truth value of an arbitrary policy. Note that the original data is real-world, but exploration and partial feedback are simulated.
4.1.1 Data generation
In a classification task, we assume data are drawn IID from a fixed distribution: $(x, c) \sim D$, where $x \in \mathcal{X}$ is a real-valued covariate vector and $c \in \{1, 2, \ldots, K\}$ is a class label. A typical goal is to find a classifier $\nu: \mathcal{X} \to \{1, 2, \ldots, K\}$ minimizing the classification error:
$$e(\nu) = \mathbf{E}_{(x,c) \sim D}\bigl[\mathbf{I}[\nu(x) \ne c]\bigr],$$
where $\mathbf{I}[\cdot]$ is an indicator function, equal to 1 if its argument is true and 0 otherwise.
The classifier $\nu$ can be viewed as a deterministic stationary policy with the action set $\mathcal{A} = \{1, \ldots, K\}$ and the loss function
$$l(a) = \mathbf{I}[a \ne c].$$
Loss minimization is symmetric to the reward maximization (under transformation $r = 1 - l$), but loss minimization is more commonly used in the classification setting, so we work with loss here. Note that the distribution $D(x, c)$, together with the definition of the loss above, induces the conditional probability $D(l \mid x, a)$ in contextual bandits, and minimizing the classification error coincides with policy optimization.
To construct partially labeled data in multiclass classification, it remains to specify the exploration policy. We simulate stationary exploration with $\mu_k(a \mid x) = \mu_1(a \mid x) = 1/K$ for all $a$. Hence, the original example $(x, c)$ is transformed into an example $(x, a, l(a))$ for a randomly selected action $a \sim \mathrm{uniform}(1, 2, \ldots, K)$. We assume perfect logging of the exploration policy and use the estimator $\hat\mu_k = \mu_k$. Below, we describe how we obtained an estimator $\hat l(x, a)$ (the counterpart of $\hat r$).
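The transformation from full to partial feedback can be sketched in a few lines of Python. This is an illustration only: the function name, the 0-based class labels, and the output layout `(x, a, loss, propensity)` are assumptions made here, and uniform exploration with perfectly logged probability 1/K is simulated with the standard library generator.

```python
import random
from typing import Hashable, List, Tuple

FullExample = Tuple[Hashable, int]                      # (covariates, true class in 0..K-1)
PartialExample = Tuple[Hashable, int, float, float]     # (covariates, action, loss, propensity)


def to_partial_feedback(examples: List[FullExample], num_classes: int,
                        rng: random.Random) -> List[PartialExample]:
    """Hide all losses except that of one uniformly sampled action per example."""
    out = []
    for x, c in examples:
        a = rng.randrange(num_classes)        # uniform exploration over K actions
        loss = 0.0 if a == c else 1.0         # 0/1 loss of the sampled action only
        out.append((x, a, loss, 1.0 / num_classes))
    return out


full = [("x1", 0), ("x2", 2), ("x3", 1), ("x4", 2)]
partial = to_partial_feedback(full, num_classes=3, rng=random.Random(0))
```

Because the fully labeled data are kept aside, the true classification error of any policy can still be computed and used as ground truth for the estimates built from `partial`.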
4.1.2 Policy evaluation
We first investigate whether the DR technique indeed gives more accurate estimates of the policy value (or classification error in our context), compared to DM and IPS. For each dataset:
1. We randomly split data into training and evaluation sets of (roughly) the same size;
2. On the training set, we keep full classification feedback of form $(x, c)$ and train the direct loss minimization (DLM) algorithm of McAllester, Hazan and Keshet (2011), based on gradient descent, to obtain a classifier (see Appendix D for details). This classifier constitutes the policy $\nu$ whose value we estimate on evaluation data;
3. We compute the classification error on fully observed evaluation data. This error is treated as the ground truth for comparing various estimates;
4. Finally, we apply the transformation in Section 4.1.1 to the evaluation data to obtain a partially labeled set (exploration history), from which DM, IPS and DR estimates are computed.
Both DM and DR require estimating the expected conditional loss $l(x, a)$ for a given $(x, a)$. We use a linear loss model: $\hat l(x, a) = w_a \cdot x$, parameterized by $K$ weight vectors $\{w_a\}_{a \in \{1, \ldots, K\}}$, and use least-squares ridge regression to fit $w_a$ based on the training set.
Step 4 of the above protocol is repeated multiple times, and the resulting bias and rmse (root mean squared error) are reported in Figure 1.
As predicted by analysis, both IPS and DR are unbiased, since the estimator $\hat\mu_k$ is perfect. In contrast, the linear loss model fails to capture the classification error accurately, and as a result, DM suffers a much larger bias.
While IPS and DR estimators are unbiased, it is apparent from the rmse plot that the DR estimator enjoys a lower variance, which translates into a smaller rmse. As we shall see next, this has a substantial effect on the quality of policy optimization.
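For reference, the two summary statistics over repeated runs can be computed as in the short sketch below. The function name is hypothetical, and the signed definition of bias (mean estimate minus ground truth) is an assumption; the paper's figure may report a different convention such as its absolute value.

```python
import math
from typing import List, Tuple


def bias_and_rmse(estimates: List[float], truth: float) -> Tuple[float, float]:
    """Signed bias and root mean squared error of repeated estimates."""
    n = len(estimates)
    bias = sum(estimates) / n - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    return bias, rmse


# Toy example: two runs scattered symmetrically around the truth.
bias, rmse = bias_and_rmse([0.1, 0.3], truth=0.2)
```

An unbiased estimator can still have a large rmse when its variance is high, which is the distinction between IPS and DR drawn above.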
4.1.3 Policy optimization
This subsection deviates from much of the paper to study policy optimization rather than policy evaluation. Given a space of possible policies, policy optimization is a procedure that searches this space for the policy with the highest value. Since policy values are unknown, the optimization procedure requires access to exploration data and uses a policy evaluator as a subroutine. Given the superiority of DR over DM and IPS for policy evaluation (in the previous subsection), a natural question is whether a similar benefit can be translated into policy optimization as well. Since DM is significantly worse on all datasets, as indicated in Figure 1, we focus on the comparison between IPS and DR.
Here, we apply the data transformation in Section 4.1.1 to the training data, and then learn a classifier based on the loss estimated by IPS and DR, respectively. Specifically, for each dataset, we repeat the following steps times:

We randomly split data into training () and test () sets;

We apply the transformation in Section 4.1.1 to the training data to obtain a partially labeled set (exploration history);

We then use the IPS and DR estimators to impute unrevealed losses in the training data; that is, we transform each partial-feedback example into a cost-sensitive example of the form where is the loss for action , imputed from the partial feedback data as follows: In both cases, (recall that ); in DR we use the loss estimate (described below), in IPS we use ;

Two cost-sensitive multiclass classification algorithms are used to learn a classifier from the losses completed by either IPS or DR: the first is DLM, also used in the previous section (see Appendix D and McAllester, Hazan and Keshet, 2011); the other is the Filter Tree reduction of Beygelzimer, Langford and Ravikumar (2008) applied to a decision-tree base learner (see Appendix E for more details);
Finally, we evaluate the learned classifiers on the test data to obtain classification error.
Again, we use least-squares ridge regression to build a linear loss estimator: . However, since the training data is partially labeled, is fitted only using training data for which . Note that this choice slightly violates our assumptions, because is not independent of the training data . However, we expect the dependence to be rather weak, and we find this approach to be more realistic in practical scenarios where one might want to use all available data to form the reward estimator, for instance due to data scarcity.
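The imputation in Step 3 and the partially labeled ridge fit can be sketched as follows. The displayed imputation formula is not reproduced in this text, so the sketch uses the form implied by the IPS and DR definitions: start from the modeled loss of every action (zero for IPS) and correct the logged action by its importance-weighted residual. The names `fit_ridge_partial` and `impute_losses` and the penalty `reg` are assumptions made for illustration.

```python
import numpy as np

def fit_ridge_partial(X, a, loss, K, reg=1.0):
    """Fit one ridge model per action, using only the examples whose
    logged action equals that action (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    W = np.zeros((K, d))
    for k in range(K):
        Xk, yk = X[a == k], loss[a == k]
        W[k] = np.linalg.solve(Xk.T @ Xk + reg * np.eye(d), Xk.T @ yk)
    return W

def impute_losses(X, a, loss, p, K, W=None):
    """Turn partial-feedback examples into cost-sensitive ones.

    W is None -> IPS imputation (modeled loss is zero everywhere).
    W given   -> DR imputation with modeled losses X @ W.T.
    Returns an (n, K) matrix of imputed losses.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    costs = np.zeros((n, K)) if W is None else X @ W.T
    rows = np.arange(n)
    # Correct only the logged action, weighting the residual by 1 / p.
    costs[rows, a] += (loss - costs[rows, a]) / p
    return costs
```

The resulting matrix can be passed to any cost-sensitive multiclass learner; which learner is used is independent of this step.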
Average classification errors (obtained in Step 5 above) of runs are plotted in Figure 2. Clearly, for policy optimization, the advantage of DR is even greater than for policy evaluation. In all datasets, DR provides substantially more reliable loss estimates than IPS, and results in significantly improved classifiers.
Figure 2 also includes classification error of the Offset Tree reduction (Beygelzimer and Langford, 2009), which is designed specifically for policy optimization with partially labeled data. (We used decision trees as the base learner in Offset Trees to parallel our base learner choice in Filter Trees. The numbers reported here are not identical to those by Beygelzimer and Langford (2009), even though we used a similar protocol on the same datasets, probably because of small differences in the data structures used.) While the IPS versions of DLM and Filter Tree are rather weak, the DR versions are competitive with Offset Tree in all datasets, and in some cases significantly outperform Offset Tree.
Our experiments show that DR provides similar improvements in two very different algorithms, one based on gradient descent, the other based on tree induction, suggesting the DR technique is generally useful when combined with different algorithmic choices.
4.2 Estimating the Average Number of User Visits
The next problem we consider is estimating the average number of user visits to a popular Internet portal. We formulate this as a regression problem and in our evaluation introduce an artificial covariate shift. As in the previous section, the original data is real-world, but the covariate shift is simulated.
Real user visits to the website were recorded for about million bcookies randomly selected from all bcookies during March 2010. (A bcookie is a unique string that identifies a user. Strictly speaking, one user may correspond to multiple bcookies, but for simplicity we equate a bcookie with a user.) Each bcookie is associated with a sparse binary covariate vector in dimensions. These covariates describe browsing behavior as well as other information (such as age, gender and geographical location) of the bcookie. We chose a fixed time window in March 2010 and calculated the number of visits by each selected bcookie during this window. To summarize, the dataset contains data points: , where is the th (unique) bcookie, is the corresponding binary covariate vector, and is the number of visits (the response variable); we treat the empirical distribution over as the ground truth.
If it is possible to sample uniformly at random from and measure the corresponding value , the sample mean of will be an unbiased estimate of the true average number of user visits, which is in this problem. However, in various situations, it may be difficult or impossible to ensure a uniform sampling scheme due to practical constraints. Instead, the best that one can do is to sample from some other distribution (e.g., allowed by the business constraints) and measure the corresponding value . In other words, the sampling distribution of is changed, but the conditional distribution of given remains the same. In this case, the sample average of may be a biased estimate of the true quantity of interest. This setting is known as covariate shift (Shimodaira, 2000), where data are missing at random (see Kang and Schafer, 2007, for related comparisons).
Covariate shift can be modeled as a contextual bandit problem with 2 actions: action corresponding to “conceal the response” and action corresponding to “reveal the response.” Below we specify the stationary exploration policy . The contextual bandit data is generated by first sampling , then choosing an action , and observing the reward (i.e., reward is only revealed if ). The exploration policy determines the covariate shift. The quantity of interest, , corresponds to the value of the constant policy which always chooses “reveal the response.”
To define the exploration sampling probabilities , we adopted an approach similar to Gretton et al. (2008), with a bias toward the smaller values along the first principal component of the distribution over . In particular, we obtained the first principal component (denoted ) of all covariate vectors , and projected all data onto . Let be the density of a univariate normal distribution with mean , where is the minimum and is the mean of the projected values. We set .
To control the size of exploration data, we randomly subsampled a fraction , , , , , from the entire dataset and then chose actions according to the exploration policy. We then calculated the IPS and DR estimates on this subsample, assuming perfect logging, that is, . (Assuming perfect knowledge of exploration probabilities is fair when we compare IPS and DR. However, it does not give implications of how DR compares against DM when there is an estimation error in .) The whole process was repeated times.
The DR estimator required building a reward model , which, for a given covariate vector and , predicted the average number of visits (and for was equal to zero). Again, least-squares ridge regression was used on a separate dataset to fit a linear model from the exploration data.
Figure 3 summarizes the estimation error of the two methods with increasing exploration data size. For both IPS and DR, the estimation error goes down with more data. In terms of rmse, the DR estimator is consistently better than IPS, especially when dataset size is smaller. The DR estimator often reduces the rmse by a fraction between and , and on average by . By comparing to the bias values (which are much smaller), it is clear that DR’s gain of accuracy comes from a lower variance, which accelerates convergence of the estimator to the true value. These results confirm our analysis that DR tends to reduce variance provided that a reasonable reward estimator is available.
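As a rough illustration of the two estimators compared in this subsection, the sketch below computes IPS and DR estimates of the mean response when each response is revealed with a known, context-dependent probability. The function name `estimate_mean_response`, the argument names, and the linear reward model `w` are assumptions introduced here; unrevealed responses are never read.

```python
import numpy as np

def estimate_mean_response(X, y, reveal, p_reveal, w=None):
    """IPS and DR estimates of the average response under covariate shift.

    X        : (n, d) covariates
    y        : (n,) responses (entries with reveal == 0 are ignored)
    reveal   : (n,) 1 if the response was revealed, 0 if concealed
    p_reveal : (n,) probability of the "reveal" action for each covariate
    w        : optional (d,) weights of a linear reward model for "reveal"
    """
    X = np.asarray(X, dtype=float)
    revealed = np.asarray(reveal).astype(bool)
    y_obs = np.where(revealed, y, 0.0)           # mask out concealed responses
    ips = np.mean(y_obs / p_reveal)
    if w is None:
        return ips, None
    r_hat = X @ w                                # modeled response for "reveal"
    residual = np.where(revealed, y_obs - r_hat, 0.0)
    dr = np.mean(r_hat + residual / p_reveal)
    return ips, dr
```

When the reward model is reasonable, the residual term is small, so the importance weights multiply small numbers; this is the mechanism behind the variance reduction reported above.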
4.3 Content Slotting in Response to User Queries
In this section, we compare our estimators on a proprietary real-world dataset consisting of web search queries. In response to a search query, the search engine returns a set of search results. A search result can be of various types such as a weblink, a news snippet or a movie information snippet. We will be evaluating policies that decide which among the different result types to present at the first position. The reward is meant to capture the relevance for the user. It equals if the user clicks on the result at the first position, if the user clicks on some result below the first position, and otherwise (for instance, if the user leaves the search page, or decides to rewrite the query). We call this a click-skip reward.
Our partially labeled dataset consists of tuples of the form , where is the covariate vector (a sparse, high-dimensional representation of the terms of the query as well as other contextual information, such as user information), weblink, news, movie is the type of result at the first position, is the click-skip reward, and is the recorded probability with which the exploration policy chose the given result type. Note that due to practical constraints, the values do not always exactly correspond to and should be really viewed as the “best effort” approximation of perfect logging. We still expect them to be highly accurate, so we use the estimator .
The page views corresponding to these tuples represent a small percentage of user traffic to a major website; any visit to the website had a small chance of being part of this experiment. Data was collected over a span of several days during July 2011. It consists of 1.2 million tuples, out of which the first 1 million were used for estimating (training data) with the remainder used for policy evaluation (evaluation data). The evaluation data was further split into 10 independent subsets of equal size, which were used to estimate variance of the compared estimators.
We estimated the value of two policies: the exploration policy itself, and the argmax policy (described below). Evaluating the exploration policy on its own exploration data (we call this setup self-evaluation) serves as a sanity check. The argmax policy is based on a linear estimator (in general different from ), and chooses the action with the largest predicted reward (hence the name). We fitted on training data by importance-weighted linear regression with importance weights . Note that both and are linear estimators obtained from the same training set, but was computed without importance weights and we therefore expect it to be more biased.
Table 2 contains the comparison of IPS, DM and DR, for both policies under consideration. For business reasons, we do not report the estimated reward directly, but normalize to either the empirical average reward (for self-evaluation) or the IPS estimate (for the argmax policy evaluation).
The experimental results are generally in line with theory. The variance is smallest for DR, although IPS does surprisingly well on this dataset, presumably because is not sufficiently accurate. The Direct Method (DM) has an unsurprisingly large bias. If we divide the listed standard deviations by , we obtain standard errors, suggesting that DR has a slight bias (on self-evaluation where we know the ground truth). We believe that this is due to imperfect logging.
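The importance-weighted fit behind the argmax policy can be illustrated with a generic weighted least-squares sketch. The exact importance weights and the way actions enter the linear model are not reproduced in this text, so both are assumptions here: the weights are passed in by the caller (inverse logged probabilities would be one natural choice), and the policy is assumed to hold one weight vector per result type. The names `fit_weighted_ridge` and `argmax_policy` are hypothetical.

```python
import numpy as np

def fit_weighted_ridge(X, r, weights, reg=0.0):
    """Minimize sum_i weights[i] * (r[i] - w @ X[i])**2 + reg * ||w||**2."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    d = X.shape[1]
    Xw = X * weights[:, None]                 # rows of X scaled by their weight
    return np.linalg.solve(X.T @ Xw + reg * np.eye(d), Xw.T @ r)

def argmax_policy(x, W):
    """Choose the action with the largest predicted reward.

    W : (K, d) matrix with one (assumed) weight vector per action.
    """
    return int(np.argmax(W @ x))
```

Setting all weights to one recovers ordinary least squares, which corresponds to the unweighted estimator that the text expects to be more biased.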
5 Evaluation of Nonstationary Policies
5.1 Problem Definition
The contextual bandit setting can also be used to model a broad class of sequential decision-making problems, where the decision maker adapts her action-selection policy over time, based on her observed history of context-action-reward triples. In contrast to policies studied in the previous two sections, such a policy depends on both the current context and the current history and is therefore nonstationary.
In the personalized news recommendation example (Li et al., 2010), a learning algorithm chooses an article (an action) for the current user (the context), with the need for balancing exploration and exploitation. Exploration corresponds to presenting articles about which the algorithm does not yet have enough data to conclude if they are of interest to a particular type of user. Exploitation corresponds to presenting articles for which the algorithm collected enough data to know that they elicit a positive response. At the beginning, the algorithm may pursue more aggressive exploration since it has a more limited knowledge of what the users like. As more and more data is collected, the algorithm eventually converges to a good recommendation policy and performs more exploitation. Obviously, for the same user, the algorithm may choose different articles in different stages, so the policy is not stationary. In machine learning terminology, such adaptive procedures are called online learning algorithms. Evaluating performance of an online learning algorithm (in terms of average per-step reward when run for steps) is an important problem in practice. Online learning algorithms are specific instances of nonstationary policies.
Formally, a nonstationary randomized policy is described by a conditional distribution of choosing an action on a context , given the history of past observations
We use the index (instead of ), and write (instead of ) to make clear the distinction between the histories experienced by the target policy versus the exploration policy .
A target history of length is denoted . In our analysis, we extend the target policy into a probability distribution over defined by the factoring
Similarly to , we define shorthands , , . The goal of nonstationary policy evaluation is to estimate the expected cumulative reward of policy after rounds:
In the news recommendation example, indicates whether a user clicked on the recommended article, and is the expected number of clicks garnered by an online learning algorithm after serving user visits. A more effective learning algorithm, by definition, will have a higher value (Li et al., 2010).
Again, to have unbiased policy evaluation, we assume that if for any (and some history ) then for all (and all possible histories ). This clearly holds for instance if for all .
In our analysis of nonstationary policy evaluation, we assume perfect logging, that is, we assume access to probabilities
Whereas in general this assumption does not hold, it is realistic in some applications such as those on the Internet. For example, when a website chooses one news article from a pool to recommend to a user, engineers often have full control/knowledge of how to randomize the article selection process (Li et al., 2010; Li et al., 2011).
5.2 Relation to Dynamic Treatment Regimes
The nonstationary policy evaluation problem defined above is closely related to DTR analysis in a longitudinal observational study. Using the same notation, the inference goal in DTR is to estimate the expected sum of rewards by following a possibly randomized rule for steps. (In DTR, the goal is often to estimate the expectation of a composite outcome that depends on the entire length trajectory. However, the objective of composite outcomes can easily be reformulated as a sum of properly redefined rewards.) Unlike contextual bandits, there is no assumption on the distribution from which the data is generated. More precisely, given an exploration policy , the data generation is described by
Compared to the data-generation process in contextual bandits (see Section 3.1), one allows the laws of and to depend on history . The target policy is subject to the same conditional laws. The setting in longitudinal observational studies is therefore more general than contextual bandits.
IPS-style estimators (such as DR of the previous section) can be extended to handle nonstationary policy evaluation, where the likelihood ratios are now the ratios of likelihoods of the whole length trajectories. In DTR analysis, it is often assumed that the number of trajectories is much larger than . Under this assumption and with small, the variance of IPS-style estimates is on the order of , diminishing to as .
In contextual bandits, one similarly assumes . However, the number of steps is often large, ranging from hundreds to millions. The likelihood ratio for a length trajectory can be exponential in , resulting in exponentially large variance. As a concrete example, consider the case where the exploration policy (i.e., the treatment mechanism) chooses actions uniformly at random from possibilities, and where the target policy is a deterministic function of the current history and context. The likelihood ratio of any trajectory is exactly , and there are trajectories (by breaking into pieces of length ). Assuming bounded variance of rewards, the variance of IPS-style estimators given data is , which can be extremely large (or even vacuous) for even moderate values of , such as those in the studies of online learning in Internet applications.
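The growth of the trajectory-level importance weight in this example can be checked with exact arithmetic. In the sketch below, `K` (number of actions) and `T` (trajectory length) are names chosen here for the quantities in the example, and the calculation covers only the importance weight of a single trajectory: it equals `K**T` when all logged actions agree with the deterministic target policy (probability `K**-T` under uniform exploration) and is taken to be zero otherwise.

```python
from fractions import Fraction

def trajectory_weight_moments(K, T):
    """Weight, mean and variance of the importance weight of one
    length-T trajectory under uniform exploration over K actions
    and a deterministic target policy (illustrative calculation)."""
    weight = K ** T                       # likelihood ratio of a matching trajectory
    match_prob = Fraction(1, K ** T)      # chance that all T logged actions match
    mean = match_prob * weight            # the weight is unbiased: mean is 1
    variance = match_prob * weight ** 2 - mean ** 2
    return weight, mean, variance
```

The variance of a single weight is `K**T - 1`, so averaging over the available trajectories cannot compensate unless their number also grows exponentially in `T`.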
In contrast, the “replay” approach of Li et al. (2011) takes advantage of the independence between and history . It has a variance of , ignoring logarithmic terms, when the exploration policy is uniformly random. When the exploration data is generated by a nonuniformly random policy, one may apply rejection sampling to simulate uniformly random exploration, obtaining a subset of the exploration data, which can then be used to run the replay approach. However, this method may discard a large fraction of data, especially when the historical actions in the log are chosen from a highly nonuniform distribution, which can yield an unacceptably large variance. The next subsection describes an improved replay-based estimator that uses doubly robust estimation as well as a variant of rejection sampling.
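A minimal sketch of the plain replay idea with a rejection-sampling preprocessing step is given below. It is not the improved estimator of the next subsection, and several details are assumptions made for illustration: the log is a list of tuples, the target policy is a deterministic callable `policy(history, x)`, and `p_min` is a known lower bound on the logged action probabilities. The names `thin_to_uniform` and `replay` are hypothetical.

```python
import numpy as np

def thin_to_uniform(log, p_min, rng: np.random.Generator):
    """Rejection-sample a log of (x, a, r, p) tuples so that the kept
    actions look uniformly random: keep each event with probability
    p_min / p, where p is the logged probability of the logged action."""
    return [(x, a, r) for (x, a, r, p) in log if rng.random() < p_min / p]

def replay(log, policy, T):
    """Replay a uniformly explored log of (x, a, r) tuples.

    An event is accepted only when the target policy, given its own
    simulated history, picks the logged action. Returns (total reward,
    average reward) after T accepted events, or None if the log runs out.
    """
    history, total = [], 0.0
    for x, a, r in log:
        if policy(history, x) == a:
            history.append((x, a, r))     # the policy sees only accepted events
            total += r
            if len(history) == T:
                return total, total / T
    return None
```

The thinning step makes the data loss explicit: an event whose logged probability is far above `p_min` is discarded most of the time, which is the inefficiency noted above for highly nonuniform exploration.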
5.3 A Nonstationary Policy Evaluator
Our replay-based nonstationary policy evaluator (Algorithm 1) takes advantage of the high accuracy of the DR estimator while tackling nonstationarity via rejection sampling. We substantially improve sample use (i.e., acceptance rate) in rejection sampling while only modestly increasing the bias. This algorithm is referred to as DRns, for “doubly robust nonstationary.” Over the run of the algorithm, we process the exploration history and run rejection sampling (the sampling and acceptance steps of Algorithm 1) to create a simulated history of the interaction between the target policy and the environment. If the algorithm manages to simulate steps of history, it exits and returns an estimate of the cumulative reward , and an estimate of the average reward ; otherwise, it reports failure indicating not enough data is available.
Since we assume , the algorithm fails with a small probability as long as the exploration policy does not assign too small probabilities to actions. Specifically, let be a lower bound on the acceptance probability in the rejection sampling step; that is, the condition in the acceptance step succeeds with probability at least . Then, using Hoeffding’s inequality, one can show that the probability of failure of the algorithm is at most if
Note that the algorithm returns one “sample” of the policy value. In reality, the algorithm continuously consumes a stream of data, outputs a sample of policy value whenever a length history is simulated, and finally returns the average of these samples. Suppose we aim to simulate histories of length . Again, by Hoeffding’s inequality, the probability of failing to obtain trajectories is at most if
Compared with naive rejection sampling, our approach differs in two respects. First, we use not only the accepted samples, but also the rejected ones to estimate the expected reward with a DR estimator (see the DR estimation step of Algorithm 1). As we will see below, the value of is in expectation equal to the total number of exploration samples used while simulating the th action of the target policy. Therefore, in the update step, we effectively take an average of estimates of , decreasing the variance of the final estimator. This is in addition to lower variance due to the use of the doubly robust estimate in the DR estimation step.
The second modification is in the control of the acceptance rate (i.e., the bound above). When simulating the th action of the target policy, we accept exploration samples with a probability where is a multiplier (see the sampling and acceptance steps of Algorithm 1). We will see below that the bias of the estimator is controlled by the probability that exceeds 1, or equivalently, that falls below . As a heuristic toward controlling this probability, we maintain a set consisting of observed density ratios