Off-policy evaluation for slate recommendation

05/16/2016 ∙ by Adith Swaminathan, et al. ∙ Microsoft, Cornell University

This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.




1 Introduction

In recommendation systems for e-commerce, online advertising, search, or news, we would like to use the data collected during operation to test new content-serving algorithms (called policies) along metrics such as revenue and number of clicks Bottou et al. (2013); Li et al. (2010). This task is called off-policy evaluation and standard approaches, namely inverse propensity scores (IPS) Dudík et al. (2014); Horvitz and Thompson (1952), require unrealistically large amounts of past data to evaluate whole-page metrics that depend on multiple recommended items, such as when showing ranked lists. Therefore, the industry standard for evaluating new policies is to simply deploy them in weeks-long A/B tests Kohavi et al. (2009). Replacing or supplementing A/B tests with accurate off-policy evaluation, running in seconds instead of weeks, would revolutionize the process of developing better recommendation systems. For instance, we could perform automatic policy optimization (i.e., learn a policy that scores well on whole-page metrics), a task which is currently plagued with bias and an expensive trial-and-error cycle.

The data we collect in these recommendation applications provides only partial information, which is formalized as contextual bandits Auer et al. (2002); Dudík et al. (2014); Langford and Zhang (2008). We study a combinatorial generalization of contextual bandits, where for each context a policy selects a list, called a slate, consisting of component actions. In web search, the context is the search query augmented with a user profile, the slate is the search results page consisting of a list of retrieved documents, and actions are the individual documents. Example metrics are page-level measures such as time-to-success, NDCG (position-weighted relevance) or more general measures of user satisfaction.

The key challenge in off-policy evaluation and optimization is the fact that a new policy, called the target policy, recommends different slates than those with recorded metrics in our logs. Without structural assumptions on the relationship between slates and observed metrics, we can only hope to evaluate the target policy if its chosen slates occur in the logged past data with a decent probability. Unfortunately, the number of possible slates is combinatorially large, e.g., when recommending $\ell$ of $m$ items, there are $m^{\Omega(\ell)}$ ordered sets, so the likelihood of even one match in past data with a target policy is extremely small, leading to a complete breakdown of fully general techniques such as IPS.

To overcome this limitation, some authors Bottou et al. (2013); Swaminathan and Joachims (2015) restrict their logging and target policies to a parameterized stochastic policy class. Others assume specific parametric (e.g., linear) models relating the observed metrics to the features describing a slate Auer (2002); Rusmevichientong and Tsitsiklis (2010); Filippi et al. (2010); Chu et al. (2011); Qin et al. (2014). Yet another paradigm, called semi-bandits, assumes that the slate-level metric is a sum of observed action-level metrics Kale et al. (2010); Kveton et al. (2015).

Figure 1: Off-policy evaluation of two whole-page user-satisfaction metrics on proprietary search engine data. Average RMSE over 50 runs on a log-log scale. Our method (pseudoinverse or PI) achieves the best performance for moderate data sizes. The unbiased IPS method suffers high variance, and direct modeling (DM) suffers high bias. OnPolicy is the expensive alternative of deploying the policy. Improvements of PI are significant, with $p$-values in text. Details in Sec. 4.3.

We seek to evaluate arbitrary policies, while avoiding strong assumptions about user behavior, as in parametric bandits, or the nature of feedback, as in semi-bandits. We relax restrictions of both parametric and semi-bandits. Like semi-bandits, we assume that the slate-level metric is a sum of action-level metrics that depend on the context, the action, and the position on the slate, but not on the other actions in the slate. Unlike semi-bandits, these per-action metrics are unobserved by the decision maker. This model also means that the slate-level metric is linearly related with the unknown vector listing all the per-action metrics in each position. However, this vector of per-action metric values can depend arbitrarily on each context, which precludes fitting a single linear model of rewards (with dimensionality independent of the number of contexts) as usually done in linear bandits.

This paper makes the following contributions:

  • The additive decomposition assumption (ADA): a realistic assumption about the feedback structure in combinatorial contextual bandits, which generalizes contextual, linear, and semi-bandits.

  • The pseudoinverse estimator (PI) for off-policy evaluation: a general-purpose estimator for any stochastic logging policy, unbiased under ADA. The number of logged samples needed for evaluation with error $\epsilon$ when choosing $\ell$ out of $m$ items is typically $O(\ell m / \epsilon^2)$, an exponential gain over the $m^{\Omega(\ell)}$ complexity of other unbiased estimators. We provide careful distribution-dependent bounds based on the overlap between logging and target policies.

  • Experiments on a real-world search ranking dataset: The strong performance of the PI estimator provides, to our knowledge, the first demonstration of high quality off-policy evaluation of whole-page metrics, comprehensively outperforming prior baselines (see Fig. 1).

  • Off-policy optimization: We provide a simple procedure for learning to rank (L2R) using the PI estimator. Our procedure tunes L2R models directly to online metrics by leveraging pointwise supervised L2R approaches, without requiring pointwise feedback.

Without contexts, several authors have studied a similar linear dependence of the reward on action-level metrics Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010). Their approaches compete with the best fixed slate, whereas we focus on evaluating arbitrary context-dependent policies. While they also use the pseudoinverse estimator in their analysis (see, e.g., Lemma 3.2 of Dani et al. (2008)), its role is different. They construct specific distributions to optimize the explore-exploit trade-off, while we provide guarantees for off-policy evaluation with arbitrary logging distributions, requiring a very different analysis and conclusions.

2 Setting and notation

In combinatorial contextual bandits, a decision maker repeatedly interacts as follows:

  • the decision maker observes a context $x$ drawn from a distribution $D(x)$ over some space $X$;

  • based on the context, the decision maker chooses a slate $s = (s_1, \dots, s_\ell)$ consisting of actions $s_j$, where a position $j$ is called a slot, the number of slots is $\ell$, actions at position $j$ come from some space $A_j(x)$, and the slate $s$ is chosen from a set of allowed slates $S(x) \subseteq A_1(x) \times \dots \times A_\ell(x)$;

  • given the context and slate, the environment draws a reward $r \in [-1, 1]$ from a distribution $D(r \mid x, s)$. Rewards in different rounds are independent, conditioned on contexts and slates.

The context space $X$ can be infinite, but the set of actions is of finite size. For simplicity, we assume $|A_j(x)| = m_j$ for all contexts $x \in X$ and define $m = \max_j m_j$ as the maximum number of actions per slot. The goal of the decision maker is to maximize the reward.

The decision maker is modeled as a stochastic policy $\pi$ that specifies a conditional distribution $\pi(s \mid x)$ (a deterministic policy is a special case). The value of a policy $\pi$, denoted $V(\pi)$, is defined as the expected reward when following $\pi$:

$$V(\pi) = \mathbb{E}_{x \sim D}\, \mathbb{E}_{s \sim \pi(\cdot \mid x)}\, \mathbb{E}_{r \sim D(\cdot \mid x, s)} [r].$$

To simplify derivations, we extend the conditional distribution $\pi$ into a distribution over triples $(x, s, r)$ as $\pi(x, s, r) = D(x)\, \pi(s \mid x)\, D(r \mid x, s)$. With this shorthand, we have $V(\pi) = \mathbb{E}_{\pi}[r]$.

To finish this section, we introduce notation for the expected reward for a given context and slate, which we call the slate value, and denote as $V(x, s) = \mathbb{E}_{r \sim D(\cdot \mid x, s)}[r]$.

Example 1 (Cartesian product). Consider whole-page optimization of a news portal where the reward is the whole-page advertising revenue. The context $x$ is the user profile, the slate is the news-portal page with slots corresponding to news sections or topics (for simplicity, we do not discuss the more general setting of showing multiple articles in each news section), and actions are the news articles. It is natural to assume that each article can only appear in one of the sections, so that $A_j(x) \cap A_k(x) = \emptyset$ if $j \ne k$. The set of valid slates is the Cartesian product $S(x) = \prod_{j \le \ell} A_j(x)$. The number of valid slates is exponential in $\ell$, namely, $|S(x)| = \prod_{j \le \ell} |A_j(x)|$.

Example 2 (Ranking). Consider information retrieval in web search. Here the context $x$ is the user query along with user profile, time of day, etc. Actions correspond to search items (such as webpages). The policy chooses $\ell$ of $m$ items, where the set $A(x)$ of $m$ items for a context $x$ is chosen from a large corpus by a fixed filtering step (e.g., a database query). We have $A_j(x) = A(x)$ for all $j \le \ell$, but the allowed slates $S(x)$ have no repeated actions. The slots correspond to positions on the search results page. The number of valid slates is exponential in $\ell$ since $|S(x)| = m!/(m-\ell)! = m^{\Omega(\ell)}$. A reward could be the negative time-to-success, i.e., negative of the time taken by the user to find a relevant item, typically capped at some threshold if nothing relevant is found.
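To make the size of the two slate spaces concrete, the following minimal sketch counts valid slates for a Cartesian product and for rankings. The function names and the toy sizes are illustrative assumptions, not part of the paper.

```python
import math

def num_cartesian_slates(actions_per_slot):
    """Number of slates when each slot picks independently from its own action set."""
    return math.prod(actions_per_slot)

def num_ranking_slates(num_items, num_slots):
    """Number of ordered selections of num_slots distinct items out of num_items."""
    return math.perm(num_items, num_slots)

# Hypothetical sizes chosen only for illustration.
cartesian_count = num_cartesian_slates([10] * 5)   # 5 slots, 10 actions each
ranking_count = num_ranking_slates(20, 5)          # rank 5 out of 20 documents
print(cartesian_count, ranking_count)
```

Even at these small sizes the counts far exceed the number of distinct slates a typical log would cover for a single context, which is why matching whole slates is impractical.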

2.1 Off-policy evaluation and optimization

In the off-policy setting, we have access to the logged data $(x_1, s_1, r_1), \dots, (x_n, s_n, r_n)$ collected using a past policy $\mu$, called the logging policy. Off-policy evaluation is the task of estimating the value of a new policy $\pi$, called the target policy, using the logged data. Off-policy optimization is the harder task of finding a policy that improves upon the performance of $\mu$ and achieves a large reward. We mostly focus on off-policy evaluation, and show how to use it as a subroutine for off-policy optimization in Sec. 4.2.

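There are two standard approaches for off-policy evaluation. The direct method (DM) uses the logged data to train a (parametric) model $\hat r(x, s)$ to predict the expected reward for a given context and slate. $V(\pi)$ is then estimated as

$$\hat V_{\mathrm{DM}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{s \in S(x_i)} \hat r(x_i, s)\, \pi(s \mid x_i). \tag{1}$$

The direct method is frequently biased because the reward model is typically misspecified.

The second approach, which is provably unbiased (under modest assumptions), is the inverse propensity score (IPS) estimator Horvitz and Thompson (1952). The IPS estimator reweights the logged data according to ratios of slate probabilities under the target and logging policy. It has two common variants:

$$\hat V_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i\, \frac{\pi(s_i \mid x_i)}{\mu(s_i \mid x_i)}, \qquad \hat V_{\mathrm{wIPS}}(\pi) = \sum_{i=1}^{n} r_i\, \frac{\pi(s_i \mid x_i)}{\mu(s_i \mid x_i)} \Bigg/ \sum_{i=1}^{n} \frac{\pi(s_i \mid x_i)}{\mu(s_i \mid x_i)}. \tag{2}$$

The two estimators differ only in their normalizer. The IPS estimator is unbiased, whereas the weighted IPS (wIPS) is only asymptotically unbiased, but usually achieves smaller error due to smaller variance. Unfortunately, the variance of both estimators grows linearly with the magnitude of $\pi(s \mid x)/\mu(s \mid x)$, which can be as bad as $|S(x)|$. This is prohibitive when $|S(x)| = m^{\Omega(\ell)}$.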
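The two IPS variants can be sketched in a few lines. This is a minimal illustration, assuming the log already stores, for each impression, the reward together with the probability of the logged slate under the target and logging policies; the function names and the toy numbers are hypothetical.

```python
def ips(rewards, target_probs, logging_probs):
    """IPS: average of rewards reweighted by pi(s_i|x_i) / mu(s_i|x_i)."""
    weights = [p / q for p, q in zip(target_probs, logging_probs)]
    return sum(r * w for r, w in zip(rewards, weights)) / len(rewards)

def wips(rewards, target_probs, logging_probs):
    """Weighted IPS: same numerator, normalized by the sum of importance weights."""
    weights = [p / q for p, q in zip(target_probs, logging_probs)]
    return sum(r * w for r, w in zip(rewards, weights)) / sum(weights)

# Toy log of four impressions (illustrative values only).
rewards = [1.0, 0.0, 0.5, 1.0]
target_probs = [0.5, 0.1, 0.2, 0.4]         # pi(s_i | x_i)
logging_probs = [0.25, 0.25, 0.25, 0.25]    # mu(s_i | x_i)

ips_value = ips(rewards, target_probs, logging_probs)
wips_value = wips(rewards, target_probs, logging_probs)
print(ips_value, wips_value)
```

Both functions divide by whole-slate probabilities, so a single logged slate that is rare under the logging policy can dominate the estimate.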

3 Our approach

To reason about the slates, we consider vectors in $\mathbb{R}^{\ell m}$ whose components are indexed by pairs $(j, a)$ of slots and possible actions in them. A slate is then described by an indicator vector $\mathbf{1}_s \in \mathbb{R}^{\ell m}$ whose entry at position $(j, a)$ is equal to 1 if the slate $s$ has action $a$ in the slot $j$, i.e., if $s_j = a$. At the foundation of our approach is an assumption relating the slate value to its component actions:

Assumption (ADA). A combinatorial contextual bandit problem satisfies the additive decomposition assumption (ADA) if for each context $x \in X$ there exists a (possibly unknown) intrinsic reward vector $\phi_x \in \mathbb{R}^{\ell m}$ such that the slate value decomposes as $V(x, s) = \mathbf{1}_s^\top \phi_x = \sum_{j=1}^{\ell} \phi_x(j, s_j)$.

ADA only posits the existence of intrinsic rewards, not their observability. This distinguishes it from semi-bandits where $\phi_x(j, s_j)$ can be observed for the $s_j$'s chosen in context $x$. The slate value is described by a linear relationship between $\mathbf{1}_s$ and the unknown “parameters” $\phi_x$, but we do not require that $\phi_x$ be easy to fit from features describing contexts and actions, which is the key departure from the direct method and parametric bandits.

While ADA rules out interactions among different actions on a slate (we discuss limitations of ADA and directions to overcome them in Sec. 5), its ability to vary intrinsic rewards arbitrarily across contexts can capture many common metrics in information retrieval, such as the normalized discounted cumulative gain (NDCG) Burges et al. (2005), a common reward metric in web ranking:

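Example 3 (NDCG). For a given slate $s$ we first define a discounted cumulative gain value:

$$\mathrm{DCG}(x, s) = \sum_{j=1}^{\ell} \frac{2^{\mathrm{rel}(x, s_j)} - 1}{\log_2(j + 1)},$$

where $\mathrm{rel}(x, a)$ is the relevance of document $a$ on query $x$. We define $\mathrm{NDCG}(x, s) = \mathrm{DCG}(x, s) / \mathrm{DCG}^\star(x)$ where $\mathrm{DCG}^\star(x) = \max_{s \in S(x)} \mathrm{DCG}(x, s)$, so NDCG takes values in $[0, 1]$. Thus, NDCG satisfies ADA with $\phi_x(j, a) = \dfrac{2^{\mathrm{rel}(x, a)} - 1}{\log_2(j + 1)\, \mathrm{DCG}^\star(x)}$.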
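The additive structure of NDCG can be checked numerically. The sketch below assumes the standard gain $2^{\mathrm{rel}} - 1$ with a $\log_2(j+1)$ position discount (slots numbered from 1) and uses made-up relevance labels; it builds the per-slot intrinsic rewards and confirms that they sum to the NDCG of every ranking.

```python
import math
from itertools import permutations

def dcg(relevance, slate):
    """Discounted cumulative gain of a slate; slot j (0-based) is discounted by log2(j + 2)."""
    return sum((2 ** relevance[a] - 1) / math.log2(j + 2) for j, a in enumerate(slate))

def ideal_dcg(relevance, num_slots):
    """Best achievable DCG: place the most relevant documents first."""
    best = sorted(relevance.values(), reverse=True)[:num_slots]
    return sum((2 ** rel - 1) / math.log2(j + 2) for j, rel in enumerate(best))

def intrinsic_rewards(relevance, num_slots):
    """phi_x(j, a): the contribution of document a in slot j to NDCG."""
    norm = ideal_dcg(relevance, num_slots)
    return {(j, a): (2 ** rel - 1) / (math.log2(j + 2) * norm)
            for j in range(num_slots) for a, rel in relevance.items()}

# Hypothetical relevance labels for one query.
relevance = {"d1": 2, "d2": 0, "d3": 1, "d4": 1}
num_slots = 3
phi = intrinsic_rewards(relevance, num_slots)

max_gap = 0.0
for slate in permutations(relevance, num_slots):
    ndcg = dcg(relevance, slate) / ideal_dcg(relevance, num_slots)
    additive = sum(phi[(j, a)] for j, a in enumerate(slate))
    max_gap = max(max_gap, abs(ndcg - additive))
print(max_gap)
```

Note that `phi` depends on the query through both the relevance labels and the normalizer, which is exactly the per-context freedom that ADA allows.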

In addition to ADA, we also make the standard assumption that the logging policy puts non-zero probability on all slates that can be potentially chosen by the target policy. This assumption is also required for the unbiasedness of IPS, otherwise off-policy evaluation is impossible (Langford et al., 2008).

Assumption (ABS). The off-policy evaluation problem satisfies the absolute continuity assumption if $\mu(s \mid x) > 0$ whenever $\pi(s \mid x) > 0$ with probability one over $x \sim D$.

3.1 The pseudoinverse estimator

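Our estimator uses certain moments of the logging policy $\mu$, called marginal values and denoted $\theta_{\mu,x} \in \mathbb{R}^{\ell m}$, and their empirical estimates, called marginal rewards and denoted $\hat\theta_i \in \mathbb{R}^{\ell m}$ for $i \le n$:

$$\theta_{\mu,x}(j, a) = \mathbb{E}_{\mu}\big[ r\, \mathbf{1}\{s_j = a\} \mid x \big], \qquad \hat\theta_i(j, a) = r_i\, \mathbf{1}\{s_{ij} = a\}.$$

Recall that $\mu$ is viewed here as a distribution over triples $(x, s, r)$. In words, the components $\theta_{\mu,x}(j, a)$ accumulate the rewards only when the policy $\mu$ chooses a slate $s$ with $s_j = a$. The random variable $\hat\theta_i$ estimates $\theta_{\mu,x}$ at $x = x_i$ by the observed reward for the slate $s_i$ displayed for $x_i$ in our logs. The key insight is that the marginal value $\theta_{\mu,x}(j, a)$ provides an indirect view of $\phi_x(j, a)$, occluded by the effect of actions in slots $k \ne j$. Specifically, from ADA and the definition of $\theta_{\mu,x}$, we obtain

$$\theta_{\mu,x}(j, a) = \mathbb{E}_{\mu}\big[ V(x, s)\, \mathbf{1}\{s_j = a\} \mid x \big] = \sum_{k, a'} \mu(s_j = a, s_k = a' \mid x)\, \phi_x(k, a'). \tag{3}$$

Eq. (3) represents a linear relationship between $\theta_{\mu,x}$ and $\phi_x$, which is concisely described by a matrix $\Gamma_{\mu,x} \in \mathbb{R}^{\ell m \times \ell m}$, with

$$\Gamma_{\mu,x}(j, a;\, k, a') = \mu(s_j = a, s_k = a' \mid x).$$

Thus, $\theta_{\mu,x} = \Gamma_{\mu,x}\, \phi_x$. If $\Gamma_{\mu,x}$ were invertible, we could write $\phi_x = \Gamma_{\mu,x}^{-1} \theta_{\mu,x}$ and use ADA to obtain $V(x, s) = \mathbf{1}_s^\top \phi_x = \mathbf{1}_s^\top \Gamma_{\mu,x}^{-1} \theta_{\mu,x}$. We could then replace $\theta_{\mu,x}$ by its unbiased estimate $\hat\theta_i$ to get an unbiased estimate of $V(x, s)$. In reality, $\Gamma_{\mu,x}$ is not invertible. However, it turns out that the above strategy still works; we just need to replace the inverse by the pseudoinverse. (A variant of Theorem 3 is proved in a different context by Dani et al. (2008). Our proof, alongside proofs of all other statements in the paper, is in the Appendix.)

Theorem 3. If ADA holds and $\mu(s \mid x) > 0$, then $V(x, s) = \mathbf{1}_s^\top \Gamma_{\mu,x}^{\dagger} \theta_{\mu,x}$.

This gives rise to the value estimator, which we call the pseudoinverse estimator or PI for short:

$$\hat V_{\mathrm{PI}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{s \in S(x_i)} \pi(s \mid x_i)\, \mathbf{1}_s^\top \Gamma_{\mu,x_i}^{\dagger} \hat\theta_i = \frac{1}{n} \sum_{i=1}^{n} r_i\, q_{\pi,x_i}^\top \Gamma_{\mu,x_i}^{\dagger} \mathbf{1}_{s_i}, \tag{4}$$

where in Eq. (4), we have expanded the definition of $\hat\theta_i$ and introduced the notation $q_{\pi,x}$ for the expected slate indicator under $\pi$ conditional on $x$, $q_{\pi,x} = \mathbb{E}_{\pi}[\mathbf{1}_s \mid x]$. The sum over $s$ required to obtain $q_{\pi,x_i}$ in Eq. (4) can be estimated with a small sample.

Theorem 3 immediately yields the unbiasedness of $\hat V_{\mathrm{PI}}$: If ADA and ABS hold, then the estimator $\hat V_{\mathrm{PI}}$ is unbiased, i.e., $\mathbb{E}[\hat V_{\mathrm{PI}}(\pi)] = V(\pi)$, where the expectation is over the $n$ logged examples sampled i.i.d. from $\mu$.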
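The pseudoinverse weights can be illustrated for a single context by enumerating all slates. The sketch below is an assumption-laden toy, not the authors' implementation: it builds the matrix of pairwise logging marginals (written `gamma`) and the expected target indicator (`q`) by brute force, uses `numpy.linalg.pinv` with an explicit cutoff chosen here for numerical safety, and checks unbiasedness exactly on a random instance where the slate value is additive by construction.

```python
import itertools
import numpy as np

def slate_indicators(slates, num_actions):
    """Rows are indicator vectors 1_s with a one at index (slot j, action a)."""
    ind = np.zeros((len(slates), len(slates[0]) * num_actions))
    for row, slate in enumerate(slates):
        for j, a in enumerate(slate):
            ind[row, j * num_actions + a] = 1.0
    return ind

def pi_weights(ind, mu, pi):
    """Per-slate weight q^T Gamma^+ 1_s used by the pseudoinverse estimator."""
    gamma = ind.T @ (mu[:, None] * ind)   # pairwise marginals under the logging policy
    q = ind.T @ pi                        # slot-action marginals under the target policy
    gamma_pinv = np.linalg.pinv(gamma, rcond=1e-10, hermitian=True)
    return ind @ gamma_pinv @ q

# Toy instance: rankings of 2 out of 3 documents, one context.
num_actions, num_slots = 3, 2
slates = list(itertools.permutations(range(num_actions), num_slots))
ind = slate_indicators(slates, num_actions)

rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(len(slates)))   # logging policy with full support
pi = rng.dirichlet(np.ones(len(slates)))   # target policy
phi = rng.uniform(-1, 1, size=num_slots * num_actions) / num_slots  # hidden intrinsic rewards
slate_values = ind @ phi                   # additive slate values

weights = pi_weights(ind, mu, pi)
expected_pi_estimate = float(np.sum(mu * slate_values * weights))  # expectation over logged slates
true_value = float(pi @ slate_values)
print(expected_pi_estimate, true_value)
```

On logged data, the estimate would be the average of `r_i * weights[index of s_i]` over impressions; the exact expectation is used here only so that the check does not depend on sampling noise.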

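Example 4 (PI when $\ell = 1$). When the slate consists of a single slot, the policies recommend a single action chosen from some set $A(x)$ for a context $x$. In this case PI coincides with IPS since $\Gamma_{\mu,x}$ is a diagonal matrix with entries $\mu(a \mid x)$ and $q_{\pi,x}(a) = \pi(a \mid x)$, so that for every action with $\mu(a \mid x) > 0$

$$q_{\pi,x}^\top \Gamma_{\mu,x}^{\dagger} \mathbf{1}_a = \frac{\pi(a \mid x)}{\mu(a \mid x)}.$$

Example 5 (PI when $\pi = \mu$). When the target policy coincides with logging, the estimator simplifies to the average of rewards: $\hat V_{\mathrm{PI}}(\mu) = \frac{1}{n} \sum_{i=1}^{n} r_i$ (see Appendix C). For $\ell = 1$, this follows from the previous example, but it is non-trivial to show for $\ell \ge 2$.

Example 6 (PI for a Cartesian product with uniform logging). The PI estimator for the Cartesian product slate space when $\mu$ is uniform over slates simplifies to

$$\hat V_{\mathrm{PI}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \left( \sum_{j=1}^{\ell} \frac{\pi(s_j = s_{ij} \mid x_i)}{1/m_j} - \ell + 1 \right)$$

by Prop. D.1 in Appendix D.1. Note that unlike IPS, which divides by probabilities of whole slates, the PI estimator only divides by probabilities of actions appearing in individual slots. Thus, the magnitude of each term of the outer summation is only $O(\ell m)$, whereas the IPS terms are $m^{\Omega(\ell)}$.

Example 7 (PI for rankings with $\ell = m$ and uniform logging). In this case, the PI estimator equals

$$\hat V_{\mathrm{PI}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \left( \sum_{j=1}^{\ell} \frac{\pi(s_j = s_{ij} \mid x_i)}{1/(m-1)} - m + 2 \right)$$

by Prop. D.1 in Appendix D.1. The magnitude of individual terms is again $O(\ell m)$.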
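The closed forms under uniform logging can be compared against the generic pseudoinverse weights by brute force. The sketch assumes the per-slate weight has the form "(1 / per-slot uniform factor) times the sum of the target's slot-action marginals, minus a constant", namely $m \sum_j \pi(s_j \mid x) - \ell + 1$ for a Cartesian product with $m$ actions in every slot and $(m-1) \sum_j \pi(s_j \mid x) - m + 2$ for full rankings. The helper name and toy sizes are illustrative.

```python
import itertools
import numpy as np

def pi_weights_and_hits(slates, mu, pi, num_actions):
    """Return (q^T Gamma^+ 1_s, sum_j pi(s_j | x)) for every slate s of one context."""
    ind = np.zeros((len(slates), len(slates[0]) * num_actions))
    for row, slate in enumerate(slates):
        for j, a in enumerate(slate):
            ind[row, j * num_actions + a] = 1.0
    gamma = ind.T @ (mu[:, None] * ind)
    q = ind.T @ pi
    gamma_pinv = np.linalg.pinv(gamma, rcond=1e-10, hermitian=True)
    return ind @ gamma_pinv @ q, ind @ q

rng = np.random.default_rng(1)
m = 3

# Cartesian product: 2 slots, m actions in each slot, uniform logging.
cart = list(itertools.product(range(m), repeat=2))
uniform_cart = np.full(len(cart), 1.0 / len(cart))
w_cart, hits_cart = pi_weights_and_hits(cart, uniform_cart, rng.dirichlet(np.ones(len(cart))), m)
closed_cart = m * hits_cart - 2 + 1

# Full rankings: m slots, m documents, uniform logging.
rank = list(itertools.permutations(range(m)))
uniform_rank = np.full(len(rank), 1.0 / len(rank))
w_rank, hits_rank = pi_weights_and_hits(rank, uniform_rank, rng.dirichlet(np.ones(len(rank))), m)
closed_rank = (m - 1) * hits_rank - m + 2

print(np.max(np.abs(w_cart - closed_cart)), np.max(np.abs(w_rank - closed_rank)))
```

The closed forms avoid forming or inverting any matrix, which is convenient when logging is uniform; for other logging policies the generic pseudoinverse computation is still needed.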

3.2 Deviation analysis

We have shown that the pseudoinverse estimator is unbiased given ADA and have also given examples when it improves exponentially over IPS, the existing state-of-the-art. We next derive a distribution-dependent bound on finite-sample error and use it to obtain an exponential improvement over IPS for a broader class of logging distributions.

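Our deviation bound is obtained by an application of Bernstein’s inequality, which requires bounding the variance and range of the terms appearing in Eq. (4), namely $r_i\, q_{\pi,x_i}^\top \Gamma_{\mu,x_i}^{\dagger} \mathbf{1}_{s_i}$. We bound their variance and range, respectively, by the following distribution-dependent quantities:

$$\sigma^2 = \mathbb{E}_{x \sim D}\big[ q_{\pi,x}^\top \Gamma_{\mu,x}^{\dagger} q_{\pi,x} \big], \qquad \rho = \sup_{x} \sup_{s:\, \mu(s \mid x) > 0} \big| q_{\pi,x}^\top \Gamma_{\mu,x}^{\dagger} \mathbf{1}_s \big|. \tag{5}$$

They capture the “average” and “worst-case” mismatch between the logging and target policy. They equal one when $\pi = \mu$ (see Appendix C), and in general yield the following deviation bound:

Theorem 3.2. Assume that ADA and ABS hold, and let $\sigma^2$ and $\rho$ be defined as in Eq. (5). Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\big| \hat V_{\mathrm{PI}}(\pi) - V(\pi) \big| \le \sqrt{\frac{2 \sigma^2 \ln(2/\delta)}{n}} + \frac{2 (\rho + 1) \ln(2/\delta)}{3n}.$$

In Appendix D, Prop. D, we show that $\sigma^2 \le \rho$, so to bound $\sigma^2$, it suffices to bound $\rho$. We next show such a bound for a broad class of logging policies defined as follows: Let $\nu$ denote the uniform policy, that is, $\nu(s \mid x) = 1/|S(x)|$. We say that a policy $\mu$ is pairwise $\kappa$-uniform for some $\kappa \in (0, 1]$ if for all contexts $x$, actions $a, a'$, and slots $j, k$, we have

$$\mu(s_j = a, s_k = a' \mid x) \ge \kappa\, \nu(s_j = a, s_k = a' \mid x).$$

For the Cartesian product slate space, this means that $\mu(s_j = a, s_k = a' \mid x) \ge \kappa / (m_j m_k)$ for $j \ne k$. For rankings, $\mu(s_j = a, s_k = a' \mid x) \ge \kappa / (m(m-1))$ for $j \ne k$ and $a \ne a'$. Given any policy, we can obtain a pairwise $\kappa$-uniform policy by mixing in the uniform distribution with the weight $\kappa$.

Proposition 3.2. Assume that valid slates form a Cartesian product space as in Example 1 or are rankings as in Example 2. Then for any pairwise $\kappa$-uniform logging policy, we have $\rho \le \kappa^{-1} \ell m$.

Thus, using the fact that $\sigma^2 \le \rho$, Prop. 3.2 and Theorem 3.2 yield $|\hat V_{\mathrm{PI}}(\pi) - V(\pi)| \le O\big(\sqrt{\kappa^{-1} \ell m / n}\big)$, or equivalently $O(\kappa^{-1} \ell m / \epsilon^2)$ logging samples are needed to achieve accuracy $\epsilon$.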
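As a worked step, the sample-size claim can be unpacked as follows. This is only a sketch under stated assumptions: it takes the Bernstein-type bound to have the form $\sqrt{2\sigma^2 \ln(2/\delta)/n} + 2(\rho+1)\ln(2/\delta)/(3n)$ and uses $\sigma^2 \le \rho \le \kappa^{-1}\ell m$; the explicit constants below are illustrative and are not claimed by the paper.

```latex
\begin{align*}
\big|\hat V_{\mathrm{PI}}(\pi) - V(\pi)\big|
  &\le \sqrt{\frac{2\sigma^2 \ln(2/\delta)}{n}} + \frac{2(\rho+1)\ln(2/\delta)}{3n} \\
  &\le \sqrt{\frac{2\kappa^{-1}\ell m \,\ln(2/\delta)}{n}}
     + \frac{2(\kappa^{-1}\ell m + 1)\ln(2/\delta)}{3n}. \\[4pt]
\text{Each term is at most } \epsilon/2 \text{ once}\quad
n &\ge \max\left\{ \frac{8\,\kappa^{-1}\ell m \,\ln(2/\delta)}{\epsilon^2},\;
                   \frac{4\,(\kappa^{-1}\ell m + 1)\ln(2/\delta)}{3\epsilon} \right\}.
\end{align*}
```

For $\epsilon \le 1$ the first term dominates up to a constant factor, which gives the stated $O(\kappa^{-1}\ell m/\epsilon^2)$ dependence, with an additional $\ln(1/\delta)$ factor for the confidence level.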

4 Experiments

We now empirically evaluate the performance of the pseudoinverse estimator in the ranking scenario of Example 2. We first show that our approach compares favorably to baselines in a semi-synthetic evaluation on a public data set under the NDCG metric, which satisfies ADA as discussed in Example 3. On the same data, we further use the pseudoinverse estimator for off-policy optimization, that is, to learn ranking policies, competitively with a supervised baseline that uses more information. Finally, we demonstrate substantial improvements on proprietary data from search engine logs for two user-satisfaction metrics used in practice: time-to-success and utility rate, which are a priori unlikely to (exactly) satisfy ADA. More detailed results are deferred to Appendices E, F and G.

4.1 Semi-synthetic evaluation

Our semi-synthetic evaluation uses labeled data from the LETOR4.0 MQ2008 dataset Qin and Liu (2013) to create a contextual bandit instance. Queries $x$ form the contexts and actions $a$ are the available documents. The dataset contains 784 queries, 5–121 documents per query and relevance labels for each query-document pair. Each pair $(x, a)$ has a 47-dimensional feature vector $f(x, a)$, which can be partitioned into title features $f_{\mathrm{title}}$ and body features $f_{\mathrm{body}}$.

To derive a logging policy and a distinct target policy, we first train two lasso regression models, called $\mathrm{lasso}_{\mathrm{title}}$ and $\mathrm{lasso}_{\mathrm{body}}$, to predict relevances from the title features and the body features, respectively. To create the logs, queries $x$ are sampled uniformly, and the set $A(x)$ consists of the top $m$ documents according to $\mathrm{lasso}_{\mathrm{title}}$. The logging policy samples from a multinomial distribution over documents in $A(x)$, parameterized by $\alpha \ge 0$. Slates are constructed slot-by-slot, sampling without replacement according to this distribution. Choosing $\alpha \in [0, \infty)$ interpolates between uniformly random and deterministic logging. Our target policy selects the slate of top $\ell$ documents according to $\mathrm{lasso}_{\mathrm{body}}$. The slate reward is the NDCG metric defined in Example 3.
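Slot-by-slot sampling without replacement can be sketched as below. The per-document probabilities are made up for illustration (the paper's exact multinomial is not reproduced here), and both function names are hypothetical. The second function computes the probability of a whole slate under this sampling scheme, which is the quantity an IPS-style estimator would need.

```python
import random
from itertools import permutations

def sample_slate(doc_probs, num_slots, rng):
    """Fill slots one at a time, drawing from the remaining documents in proportion to their weight."""
    remaining = dict(doc_probs)
    slate = []
    for _ in range(num_slots):
        docs = list(remaining)
        choice = rng.choices(docs, weights=[remaining[d] for d in docs], k=1)[0]
        slate.append(choice)
        del remaining[choice]
    return slate

def slate_probability(slate, doc_probs):
    """Probability that sample_slate produces exactly this ordered slate."""
    prob, remaining_mass = 1.0, sum(doc_probs.values())
    for doc in slate:
        prob *= doc_probs[doc] / remaining_mass
        remaining_mass -= doc_probs[doc]
    return prob

# Hypothetical per-document probabilities for one query.
doc_probs = {"d1": 0.5, "d2": 0.25, "d3": 0.15, "d4": 0.10}
slate = sample_slate(doc_probs, 3, random.Random(0))
total_mass = sum(slate_probability(s, doc_probs) for s in permutations(doc_probs, 3))
print(slate, total_mass)
```

A uniform choice of `doc_probs` recovers uniformly random logging, while concentrating the mass on the top-ranked documents makes the logging policy close to deterministic.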

Figure 2: RMSE under uniform logging ($\alpha = 0$) and non-uniform logging.

We compare our estimator PI with the direct method (DM) and weighted IPS (wIPS, see Eq. 2), which outperformed IPS. Our implementation of DM concatenates per-slot features into a slate-level feature vector, training a reward predictor on the first part of the logged examples and evaluating using Eq. (1) on the remaining examples. We experimented with regression trees, ridge and lasso regression for DM, and always report results for the choice with the smallest RMSE at a fixed sample size. We also include an aspirational baseline, OnPolicy. This corresponds to deploying the target policy as in an A/B test and returning the average of observed rewards. This is the expensive alternative we wish to avoid.

We plot the root mean square error (RMSE) of the estimators as a function of increasing data size over at least 20 independent runs.

In Fig. 2, the first two plots study the RMSE of estimators for two choices of $m$ and $\ell$, given the uniform logging policy (i.e., $\alpha = 0$). In both cases, the pseudoinverse estimator outperforms wIPS by a factor of 10 or more with high statistical significance, for both plots and for all sample sizes $n$. The pseudoinverse estimator eventually also outperforms the biased DM with statistical significance for both plots. The cross-over point occurs fairly early for the smaller slate space, but is one order of magnitude larger for the larger slate space. Note that DM’s performance can deteriorate with more data, likely because it optimizes the fit to the reward distribution of $\mu$, which is different from that of $\pi$.

As expected, OnPolicy performs the best, requiring between 10x and 100x less data. However, OnPolicy requires fixing the target policy for each data collection, while off-policy methods like PI take advantage of massive amounts of logged data to evaluate arbitrary policies. As an aside, since the user feedback in these experiments is simulated, we can also simulate semi-bandit feedback which reveals the intrinsic reward of each shown action, and use it directly for off-policy evaluation. This is a purely hypothetical baseline: with only page-level feedback, one cannot implement a semi-bandit solution. We compare against this hypothetical baseline in Appendix F.

In Fig. 2 (right panel), we study the effect of the overlap between the logging and target policies by making the logging policy non-uniform, which results in a better alignment between the logging and target policies. While the RMSE of the pseudoinverse estimator is largely unchanged, both wIPS and DM exhibit some improvement. wIPS enjoys a smaller variance, while DM enjoys a smaller bias due to closer training and target distributions. PI continues to be statistically better than wIPS for all sample sizes, and eventually also better than DM. See Appendices E and F for more results and the complete set of $p$-values.

4.2 Semi-synthetic policy optimization

We now show how to use the pseudoinverse estimator for off-policy optimization. We leverage pointwise learning to rank (L2R) algorithms, which learn a scoring function for query-document pairs by fitting to relevance labels. We call this the supervised approach, as it requires relevance labels.

Instead of requiring relevance labels, we use the pseudoinverse estimator to convert page-level reward into per-slot reward components (the estimates of $\phi_x(j, a)$), and these become targets for regression. Thus, the pseudoinverse estimator enables pointwise L2R even without relevance labels. Given a contextual bandit dataset collected by the logging policy $\mu$, we begin by creating the estimates of $\phi_{x_i}$: $\hat\phi_i = \Gamma_{\mu,x_i}^{\dagger} \hat\theta_i$, turning the $i$-th contextual bandit example into $\ell m$ regression examples. The trained regression model is used to create a slate, starting with the highest scoring slot-action pair, and continuing greedily (excluding the pairs with the already chosen slots or actions).
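The greedy slate construction described above can be sketched as follows. The scores stand in for the predictions of a trained regression model on slot-action pairs; the function name, the dictionary layout, and the toy scores are assumptions made for illustration.

```python
def greedy_slate(scores, num_slots):
    """Build a slate from predicted per-slot rewards.

    scores maps (slot, action) -> predicted reward. Pairs are visited from the
    highest score down; a pair is accepted only if its slot is still empty and
    its action has not been placed elsewhere.
    """
    slate = [None] * num_slots
    used_actions = set()
    for (slot, action), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if slate[slot] is None and action not in used_actions:
            slate[slot] = action
            used_actions.add(action)
    return slate

# Hypothetical model predictions for 2 slots and 3 documents.
scores = {
    (0, "a"): 0.90, (0, "b"): 0.80,
    (1, "a"): 0.85, (1, "b"): 0.30, (1, "c"): 0.20,
}
chosen = greedy_slate(scores, num_slots=2)
print(chosen)
```

Sorting once and scanning is equivalent to repeatedly taking the best remaining valid pair, because accepting a pair only ever removes candidates and never changes their scores.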

We used the MQ2008 dataset from the previous section and created a contextual bandit problem with 5 slots and 20 documents per slot, with a uniformly random logging policy. We chose a standard 5-fold split and always trained on bandit data from 4 folds and evaluated using the supervised data on the fifth. We compare our approach, titled PI-OPT, against the supervised approach (SUP), trained to predict the gains, equal to $2^{\mathrm{rel}(x, a)} - 1$, computed using annotated relevance judgements in the training fold (predicting raw relevances was inferior). Both PI-OPT and SUP train regression trees. We find that PI-OPT is consistently competitive with SUP after seeing about 1K samples containing slate-level feedback, and gets a test NDCG of 0.450 at 1K samples, 0.451 at 10K samples, and 0.456 at 100K samples. SUP achieves a test NDCG of 0.453 by using approximately 12K annotated relevance judgements. We posit that PI-OPT is competitive with SUP because it optimizes the target metric directly, while SUP uses a surrogate (imperfect) regression loss. See Appendix G for detailed results.

4.3 Real-world experiments

We finally evaluate all methods using logs collected from a popular search engine. The dataset consists of search queries, for which the logging policy randomly (non-uniformly) chooses a slate of size $\ell$ from a small pre-filtered set of documents of size $m$. After preprocessing, there are 77 unique queries and 22K total examples, meaning that for each query, we have logged impressions for many of the available slates. To control the query distribution in our experiment, we generate a larger dataset by bootstrap sampling, repeatedly choosing a query uniformly at random and a slate uniformly at random from those shown for this query. Hence, the conditional probability of any slate for a given query matches the frequencies in the original data.

We consider two page-level metrics: time-to-success (TTS) and UtilityRate. TTS measures the number of seconds between presenting the results and the first satisfied click from the user, defined as any click for which the user stays on the linked page for sufficiently long. The TTS value is capped and rescaled to a bounded range. UtilityRate is a more complex page-level metric of user satisfaction. It captures the interaction of a user with the page as a timeline of events (such as clicks) and their durations. The events are classified as revealing a positive or negative utility to the user and their contribution is proportional to their duration. UtilityRate also takes values in a bounded range.

We evaluate a target policy based on a logistic regression classifier trained to predict clicks and using the predicted probabilities to score slates. We restrict the target policy to pick among the slates in our logs, so we know the ground truth slate-level reward. Since we know the query distribution, we can calculate the target policy’s value exactly, and measure RMSE relative to this true value.

We compare our estimator (PI) with three baselines similar to those from Sec. 4.1: DM, IPS and OnPolicy. DM uses regression trees over roughly 20,000 slate-level features.

Fig. 1 from the introduction shows that PI provides a consistent multiplicative improvement in RMSE over IPS, which suffers due to high variance. Starting at moderate sample sizes, PI also outperforms DM, which suffers due to substantial bias. For TTS, the gains over IPS are significant after 2K samples, and the gains over DM after 20K samples. For UtilityRate, the improvements over IPS are significant at 60K examples, and over DM after 20K examples. The complete set of $p$-values is in Appendix E.

5 Discussion

In this paper we have introduced a new assumption (ADA) and a new estimator (PI) that exploits this assumption, and we have demonstrated their significant theoretical and practical merits.

In our experiments, we saw examples of bias-variance trade-off with off-policy estimators. At small sample sizes, the pseudoinverse estimator still has a non-trivial variance. In these regimes, the biased direct method can often be practically useful due to its small variance (if its bias is sufficiently small). Such well-performing albeit biased estimators can be incorporated into the pseudoinverse estimator via the doubly-robust approach (Dudík et al., 2011).

Experiments with real-world data in Sec. 4.3 demonstrate that even when ADA does not hold, the estimators based on ADA can still be applied and tend to be superior to alternatives. We view ADA similarly to the IID assumption: while it is probably often violated in practice, it leads to practical algorithms that remain robust under misspecification. Similarly to the IID assumption, we are not aware of ways for easily testing whether ADA holds.

One promising approach to relax ADA is to posit a decomposition over pairs (or tuples) of slots to capture higher-order interactions such as diversity. More generally, one could replace slate spaces by arbitrary compact convex sets, as done in linear bandits. In these settings, the pseudoinverse estimator could still be applied, but tight sample-complexity analysis is open for future research.


References

  • Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2002.
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
  • Boman and Hendrickson (2003) Erik G Boman and Bruce Hendrickson. Support theory for preconditioning. SIAM Journal on Matrix Analysis and Applications, 2003.
  • Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 2013.
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning, 2005.
  • Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandits with linear payoff functions. In Artificial Intelligence and Statistics, 2011.
  • Dani et al. (2008) Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008.
  • Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.
  • Dudík et al. (2014) Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
  • Filippi et al. (2010) Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 2010.
  • Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952.
  • Kale et al. (2010) Satyen Kale, Lev Reyzin, and Robert E Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, 2010.
  • Kohavi et al. (2009) Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. Controlled experiments on the web: survey and practical guide. Knowledge Discovery and Data mining, 2009.
  • Kveton et al. (2015) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, 2015.
  • Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
  • Langford et al. (2008) John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. In International Conference on Machine Learning, 2008.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, 2010.
  • Petersen et al. (2008) Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
  • Qin et al. (2014) Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In International Conference on Data Mining, 2014.
  • Qin and Liu (2013) Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. arXiv:1306.2597, 2013.
  • Rusmevichientong and Tsitsiklis (2010) Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 2010.
  • Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015.

Appendix A Proofs of Theorems 3 and 3.1



Consider the matrix . Its element in the row indexed and column indexed equals

The claim follows by taking a conditional expectation with respect to . ∎

Proof of Theorem 3.

Fix one for the entirety of the proof. Recall from Sec. 3.1 that

Let be the size of the support of and let denote the binary matrix with rows for each . Thus is the vector enumerating over for which . Let denote the null space of and be the projection on . Let . Then clearly, , and hence, for any ,


We will now show that , which will complete the proof.

Recall from Sec. 3.1 that


Next note that is symmetric positive semidefinite by Claim A, so

where the first step follows by positive semidefiniteness of , the second step is from the expansion of as in Claim A, and the final step from the definition of . Since , we have from Eq. (7) that , but, importantly, this also implies , so by the definition of the pseudoinverse,

This proves Theorem 3, since for any with , we argued that . ∎
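The argument above can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: slates are encoded as concatenated one-hot vectors (one block per slot), the slate value is assumed to be linear in that encoding, and all names (`slate_indicator`, `gamma`, `theta`, `phi`) are hypothetical. It builds the second-moment matrix and the reward-weighted first-moment vector under a logging distribution with partial support, applies the pseudoinverse, and checks that slate values are recovered on the support.

```python
import itertools
import numpy as np


def slate_indicator(slate, num_actions):
    """Concatenate one one-hot block per slot into a single 0/1 vector."""
    vec = np.zeros(len(slate) * num_actions)
    for slot, action in enumerate(slate):
        vec[slot * num_actions + action] = 1.0
    return vec


rng = np.random.default_rng(0)
num_slots, num_actions = 2, 3

# Hypothetical slate space: ordered pairs of distinct actions (6 slates).
slates = list(itertools.permutations(range(num_actions), num_slots))

# Logging distribution with partial support: the last two slates are never shown.
probs = np.array([0.4, 0.3, 0.2, 0.1, 0.0, 0.0])

# Assumed linear slate values: value(s) = indicator(s) . phi for an unknown phi.
phi = rng.normal(size=num_slots * num_actions)
indicators = np.array([slate_indicator(s, num_actions) for s in slates])
values = indicators @ phi

# Second-moment matrix and reward-weighted first-moment vector under logging.
gamma = (indicators * probs[:, None]).T @ indicators
theta = indicators.T @ (probs * values)

# Pseudoinverse solution: the component of phi orthogonal to the null space.
phi_hat = np.linalg.pinv(gamma, rcond=1e-10) @ theta

on_support = probs > 0
recovered = indicators[on_support] @ phi_hat
max_error = float(np.max(np.abs(recovered - values[on_support])))
print("max error on support:", max_error)
```

In general `phi_hat` differs from `phi`, since the two can differ by any vector in the null space; the point of the check is only that the predicted values agree on slates the logging distribution can actually produce.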

Proof of Theorem 3.1.

Note that it suffices to analyze the expectation of a single term in the estimator, that is

First note that , because

The remainder follows by Theorem 3:

Appendix B Proof of Theorem 3.2


The proof is based on an application of Bernstein’s inequality to the centered sum

The fact that this quantity is centered follows directly from Theorem 3.1. We must compute both the second moment and the range to apply Bernstein’s inequality. By independence, we can focus on just one term, so we will drop the subscript . First, bound the variance:

Thus the per-term variance is at most . We now bound the range, again focusing on one term,

The first line here is the triangle inequality, coupled with the fact that since rewards are bounded in , so is . The second line is from the definition of , while the third follows because . The final line follows from the definition of .

Now, we may apply Bernstein’s inequality, which says that for any , with probability at least ,

The theorem follows by dividing by . ∎
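As a rough illustration of how a Bernstein-type bound behaves, the sketch below uses one common form of the inequality: for n independent centered terms with variance at most `variance` and absolute value at most `value_range`, the absolute value of the sum is at most sqrt(2 n variance ln(2/δ)) + (2/3) value_range ln(2/δ) with probability at least 1 − δ. The function name, the uniform toy distribution, and the parameter values are assumptions made for this example; they are not the quantities bounded in the proof.

```python
import math
import numpy as np


def bernstein_radius(n, variance, value_range, delta):
    """Deviation radius for a sum of n independent centered terms."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * n * variance * log_term) + 2.0 * value_range * log_term / 3.0


rng = np.random.default_rng(0)
n, delta, trials = 200, 0.05, 2000

# Toy centered terms: uniform on [-1, 1], so variance = 1/3 and range = 1.
samples = rng.uniform(-1.0, 1.0, size=(trials, n))
sums = samples.sum(axis=1)

radius = bernstein_radius(n, variance=1.0 / 3.0, value_range=1.0, delta=delta)
violation_rate = float(np.mean(np.abs(sums) > radius))

# Dividing the radius by n gives the corresponding bound on the empirical mean.
print("radius for the sum:", radius)
print("radius for the mean:", radius / n)
print("empirical violation rate:", violation_rate)
```

The empirical violation rate should sit well below `delta`, since the bound is conservative for a distribution this well behaved.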

Appendix C Pseudo-inverse estimator when the target policy coincides with the logging policy

In this section we show that when the target policy coincides with logging (i.e., ), we have , i.e., the bound of Theorem 3.2 is independent of the number of actions and slots. Indeed, in Claim C we will see that the estimator actually simplifies to taking an empirical average of rewards which are bounded in . Before proving Claim C we prove one supporting claim:

For any policy and context , we have for all .
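The simplification described in this section can be checked numerically before following the proof. The sketch below is a toy example under assumptions: slates are encoded as concatenated one-hot vectors, the per-slate weight is taken to be the first-moment vector times the pseudoinverse of the second-moment matrix times the slate indicator, and all names are hypothetical. When the target and logging distributions are the same, every weight on the support comes out equal to one, so the estimator reduces to a plain average of logged rewards.

```python
import itertools
import numpy as np


def slate_indicator(slate, num_actions):
    """Concatenate one one-hot block per slot into a single 0/1 vector."""
    vec = np.zeros(len(slate) * num_actions)
    for slot, action in enumerate(slate):
        vec[slot * num_actions + action] = 1.0
    return vec


num_slots, num_actions = 2, 3
slates = list(itertools.permutations(range(num_actions), num_slots))

# One distribution plays both roles (target = logging); support is partial.
probs = np.array([0.4, 0.3, 0.2, 0.1, 0.0, 0.0])
indicators = np.array([slate_indicator(s, num_actions) for s in slates])

gamma = (indicators * probs[:, None]).T @ indicators  # second moments
q = probs @ indicators                                # first moments

# Each valid slate has exactly one action per slot, so gamma @ ones = num_slots * q.
ones = np.ones(num_slots * num_actions)
moment_identity_holds = bool(np.allclose(gamma @ ones, num_slots * q))

# Per-slate weights on the support of the logging distribution.
weights = indicators[probs > 0] @ np.linalg.pinv(gamma, rcond=1e-10) @ q
print("weights on support:", weights)
```

The identity checked in `moment_identity_holds` is what drives the result in this toy setup: it places the first-moment vector in the range of the second-moment matrix, scaled by the number of slots.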


To simplify the exposition, write and instead of a more verbose and .

The bulk of the proof is in deriving an explicit expression for . We begin by expressing in a suitable basis. Since is the matrix of second moments and is the vector of first moments of , the matrix can be written as

where is the covariance matrix of , i.e., . Assume that the rank of is

and consider the eigenvalue decomposition of

where and vectors are orthonormal; we have grouped the eigenvalues into the diagonal matrix

and eigenvectors into the matrix


We next argue that . To see this, note that the all-ones vector is in the null space of because, for any valid slate , we have and thus also for the convex combination we have , which means that

Now, since and , we have that . In particular, we can write in the form


where and is a unit vector. Note that since . Thus, the second moment matrix can be written as


Let denote the middle matrix in the factorization of Eq. (9):


This matrix is a representation of with respect to the basis . Since , the rank of and that of is . Thus, is invertible and


To obtain , we use the following identity (see Petersen et al. (2008)):


where is the Schur complement of . The identity of Eq. (12) holds whenever and its Schur complement are both invertible. In the block representation of Eq. (10), we have and

so Eq. (12) can be applied to obtain :


Next, we will evaluate , using the factorizations in Eqs. (11) and (8), and substituting Eq. (13) for :