1 Introduction
Effective learning methods for optimizing policies based on logged user interaction data have the potential to revolutionize the process of building better interactive systems. Unlike the industry standard of using expert judgments for training, such learning methods could directly optimize user-centric performance measures; they would not require interactive experimental control like online algorithms; and they would not be subject to the data bottlenecks and latency inherent in A/B testing.
Recent approaches for off-policy evaluation and learning in these settings appear promising [1, 2, 4], but highlight the need for accurately logging propensities of the logged actions. With this paper, we provide the first public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF). We use data from Criteo, a leader in the display advertising space. In addition to providing the data, we propose an evaluation methodology for running BLBF learning experiments and a standardized testbed that allows the research community to systematically investigate BLBF algorithms.
At a high level, a BLBF algorithm operates in the contextual bandit setting and solves the following learning task:

Take as input: $\mathcal{D} = \{(x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n)\}$ collected from a logging policy $\pi_0$. Here, $\pi_0$ encodes the system from which the logs were collected, $x_i$ denotes the input to the system, $y_i$ denotes the output predicted by the system, $\delta_i$ is a number encoding the observed online metric for the output that was predicted, and $p_i$ is the logged propensity $\pi_0(y_i \mid x_i)$;

Produce as output: $\pi$, a new policy that maps $x \mapsto y$; and

Such that $\pi$ will perform well (according to the metric $\delta$) if it were deployed online.
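To make the shape of the logged data concrete, the following minimal Python sketch shows one logged record; the class and field names are our own illustration, not the dataset's actual schema.

import dataclasses
from typing import List

@dataclasses.dataclass
class LoggedImpression:
    """One record (x, y, delta, p) collected from the logging policy pi_0.

    Field names are illustrative; they are not the dataset's schema.
    """
    x: List[float]   # context features describing the input to the system
    y: int           # action chosen by pi_0 (e.g., index of the displayed slate)
    delta: float     # observed online metric (e.g., click = 1.0, no click = 0.0)
    p: float         # logged propensity pi_0(y | x)

# A BLBF algorithm consumes a list of LoggedImpression records and returns
# a new policy pi mapping contexts x to actions y.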
We elaborate on the definitions of $x$, $y$, $\delta$, and $p$ as logged in our dataset in the next section. Since past research on BLBF was limited by the availability of an appropriate dataset, we hope that our testbed will spur research on several aspects of BLBF and off-policy evaluation, including the following:

New training objectives, learning algorithms, and regularization mechanisms for BLBF;

Improved model selection procedures (analogous to cross-validation for supervised learning);

Effective and tractable policy classes for the specified task; and

Algorithms that can scale to massive amounts of data.
The rest of this paper is organized as follows. In Section 2, we describe our standardized testbed for the evaluation of off-policy learning methods. Then, in Section 3, we describe a set of sanity checks that we used on our dataset to ensure its validity and that can be applied generally when gathering data for off-policy learning and evaluation. Finally, in Section 4, we show results comparing state-of-the-art off-policy learning methods like doubly robust optimization [3], POEM [2], and reductions to supervised learning using regression baselines. Our results show, for the first time, experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.
2 Dataset
We create our testbed using data from display advertising, similar to the Kaggle challenge hosted by Criteo in 2014 to compare CTR prediction algorithms (https://www.kaggle.com/c/criteo-display-ad-challenge). However, in this paper, we do not aim to build click-through or conversion prediction models for bidding in real-time auctions [5, 6]. Instead, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This part of the system takes place after the bidding agent has won the auction. In this context, each ad has one of many banner types, which differ in the number of products they contain and in their layout, as shown in Figure 1. The task is to choose the products to display in the ad, knowing the banner type, in order to maximize the number of clicks. This task is thus very different from the Kaggle challenge.
In this setting of choosing the best products to fill the banner ad, we can easily gather exploration data where the placement of the products in the banner ad is randomized, without incurring a prohibitive cost, unlike in Web search, where such exploration is much more costly (see, e.g., [7, 8]). Our logging policy uses randomization aggressively, while being very different from a uniformly random policy.
Each banner type corresponds to a different look & feel of the banner ad. Banner ads can differ in the number of products, size, geometry (vertical, horizontal, …), background color and in the data shown (with or without a product description or a call to action); these we call the fixed attributes. Banner types may also have dynamic aspects such as some form of pagination (multiple pages of products) or an animation. Some examples are shown in Figure 1. Throughout the paper, we label positions in each banner type from 1 to the number of slots, going from left to right and from top to bottom. Thus 1 is the top-left position.
For each user impression, we denote the user context by $x$, the number of slots in the banner type by $m$, and the candidate pool of products by $\mathcal{A}$. Each context and product pair is described by features $\phi(x, a)$. The input to the system encodes $(x, m, \mathcal{A})$. The logging policy $\pi_0$ stochastically selects products to construct a banner by first computing non-negative scores $f(x, a)$ for all candidate products $a \in \mathcal{A}$, and then using a Plackett-Luce ranking model (i.e., sampling without replacement from the multinomial distribution defined by the scores):

$\pi_0\big(y = (a_1, \dots, a_m) \mid x\big) = \prod_{k=1}^{m} \frac{f(x, a_k)}{\sum_{a \in \mathcal{A} \setminus \{a_1, \dots, a_{k-1}\}} f(x, a)}.$ (1)
The propensity of a chosen banner ad is thus $p = \pi_0(y \mid x)$. With these propensities in hand, we can counterfactually evaluate any banner-filling policy in an unbiased way using inverse propensity scoring [9].
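To make the sampling scheme of Eq. (1) concrete, here is a minimal Python sketch; the function name and the use of NumPy are our own choices, not Criteo's production code.

import numpy as np

def sample_banner(scores: np.ndarray, m: int, rng: np.random.Generator):
    """Sample m products without replacement via the Plackett-Luce model
    of Eq. (1), and return the slate together with its propensity."""
    remaining = list(range(len(scores)))  # candidate indices still available
    slate, propensity = [], 1.0
    for _ in range(m):
        probs = scores[remaining] / scores[remaining].sum()
        pick = rng.choice(len(remaining), p=probs)  # multinomial draw over leftovers
        propensity *= probs[pick]                   # chain rule over slots
        slate.append(remaining.pop(pick))
    return slate, propensity

# Example: 4 candidate products, 2 slots.
slate, p = sample_banner(np.array([3.0, 1.0, 1.0, 5.0]), m=2,
                         rng=np.random.default_rng(0))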
The following was logged, committing to a single feature encoding $\phi$ and a single scoring function $f$ for the entire duration of data collection.

Record the selected products $y$, sampled from $\mathcal{A}$ via the Plackett-Luce model, together with the propensity $\pi_0(y \mid x)$;

Record the click/no-click feedback and the location(s) of any clicks in the banner.
The format of this data is:
example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} …
${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
...
${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...
Each impression is represented by $M+1$ lines, where $M$ is the cardinality of the candidate pool $\mathcal{A}$ and the first line is a header containing summary information. Note that the first ${nbSlots} candidates correspond to the displayed products, ordered by position (consequently, the ${wasProductMClicked} information for all other candidates is irrelevant). Display features are context features or banner type features, which are constant for all candidate products in a given impression. A unique quadruplet of display feature IDs corresponds to a unique banner type. Product features are based on the similarity and/or complementarity of the candidate products with historical products seen by the user on the advertiser’s website. We also included interaction terms between some of these features directly in the dataset to limit the amount of feature engineering required to get a good policy. Features 1 and 2 are numerical, while all other features are categorical. Some categorical features are multi-valued, which means that they can take more than one value for the same product (order does not matter). Note that the example ID increases with time, allowing temporal slices for evaluation [10], although we do not enforce this for our testbed. Importantly, non-clicked examples were subsampled aggressively to reduce the dataset size, and we kept only a random subsample of them. One therefore needs to account for this subsampling during learning and evaluation; the evaluator we provide with the testbed accounts for it.
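A best-effort parsing sketch of this format follows; it is our own illustration, not the official reader, so consult the evaluation software for the authoritative parser.

def parse_impression(block):
    """Parse one impression: a header line followed by one line per candidate.

    `block` is a list of raw text lines for a single impression.
    """
    head = block[0].split()
    # head: ['example', '${exID}:', hashID, wasAdClicked, propensity,
    #        nbSlots, nbCandidates, 'feat:val', ...]
    header = {
        "ex_id": head[1].rstrip(":"),
        "hash_id": head[2],
        "clicked": int(head[3]),
        "propensity": float(head[4]),
        "nb_slots": int(head[5]),
        "nb_candidates": int(head[6]),
        # note: multi-valued categorical features would overwrite each other here;
        # a real parser needs a multimap for those.
        "display_features": dict(f.split(":") for f in head[7:]),
    }
    candidates = []
    for line in block[1:]:
        toks = line.split()
        candidates.append({
            "clicked": int(toks[0]),                           # per-product click flag
            "features": dict(f.split(":") for f in toks[2:]),  # skip 'exid:${exID}'
        })
    return header, candidates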
The result is a dataset of many millions of ad impressions. In this dataset, we have:

Many distinct banner types, with a heavily skewed distribution: the top few banner types account for the large majority of the total number of ad impressions.

The number of displayed products is between 1 and 6, inclusive.

The number of impressions for each banner size, from 1-slot to 6-slot banners, is reported in Table 1.

The size of the candidate pool is several times larger than the number of products to display in the ad, with the ratio bounded above.
This dataset is hosted on Amazon AWS (35GB gzipped / 256GB raw). Details for accessing and processing the data are available at http://www.cs.cornell.edu/~adith/Criteo/.
3 Sanity Checks
The workhorse of counterfactual evaluation is Inverse Propensity Scoring (IPS) [11, 9]. IPS requires accurate propensities and, to a crude approximation, produces estimates whose variance scales roughly with the range of the inverse propensities. In Table 1, we report the number of impressions and the average and largest inverse propensities, partitioned by ${nbSlots}. When constructing confidence intervals for importance-weighted estimates like IPS, we often appeal to the asymptotic normality of large sample averages [12]. However, if the inverse propensities are very large relative to the number of samples (as we can see for banners with many slots), the asymptotic normality assumption will probably be violated.
Table 1: Number of impressions (#Impressions), average inverse propensity (Avg(InvPropensity)), and maximum inverse propensity (Max(InvPropensity)), partitioned by number of slots (1 through 6).
There are some simple statistical tests that can be run to detect issues with inaccurately logged propensities [13]. These arithmetic and harmonic tests, however, require that the candidate actions available for each impression are fixed a priori. In our scenario, we have a context-dependent candidate set, which precludes running these tests, so we propose a more general class of diagnostics that can detect some systematic biases and issues in propensity-logged datasets.
Some notation: let $n$ denote the number of logged impressions, and let $(x_i, y_i, \delta_i)$ be the context, the chosen banner, and the click indicator for impression $i$. The propensity for the logging policy to take the logged action $y_i$ in context $x_i$ is denoted $p_i = \pi_0(y_i \mid x_i)$. If the propensities are correctly logged, then the expected importance weight should be 1 for any new banner-filling policy $\pi$. Formally, we have the following:

$\mathbb{E}[\bar{C}(\pi)] = 1, \quad \text{where } \bar{C}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{p_i}.$ (2)

The IPS estimate for a new policy $\pi$ is simply:

$\hat{R}^{\mathrm{IPS}}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \delta_i \, \frac{\pi(y_i \mid x_i)}{p_i}.$ (3)

These equations are valid when $\pi_0$ has full support, as our logging system does: $\pi_0(y \mid x) > 0$ for every context $x$ and every feasible banner $y$. The self-normalized estimator [14, 4] is:

$\hat{R}^{\mathrm{SNIPS}}(\pi) := \frac{\hat{R}^{\mathrm{IPS}}(\pi)}{\bar{C}(\pi)}.$ (4)
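For concreteness, a sketch of Eqs. (2)-(4) over arrays of logged and new-policy propensities; variable names are ours, and the sketch assumes full (unsubsampled) logs.

import numpy as np

def ips_diagnostics(new_prop, log_prop, delta):
    """Eqs. (2)-(4): control variate, IPS, and SNIPS estimates.

    new_prop[i] = pi(y_i | x_i), log_prop[i] = p_i, delta[i] = click metric.
    """
    w = new_prop / log_prop            # importance weights
    c_bar = w.mean()                   # Eq. (2): should concentrate around 1
    r_ips = (delta * w).mean()         # Eq. (3): IPS estimate
    r_snips = r_ips / c_bar            # Eq. (4): self-normalized IPS
    return c_bar, r_ips, r_snips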
Remember that we subsampled non-clicked impressions. Subsampling is indicated by the binary random variable $o_i$, which equals 1 if impression $i$ was kept and 0 otherwise:

$\Pr(o_i = 1) = \begin{cases} 1 & \text{if } \delta_i = 1, \\ \rho & \text{if } \delta_i = 0, \end{cases}$ (5)

where $\rho < 1$ is the retention rate of non-clicked impressions. The IPS estimate and the diagnostic above are not computable in our case, since they require all datapoints before subsampling. So, we use the following straightforward modification that uses only our subsampled datapoints instead.

First, we estimate the number of datapoints before subsampling, using only samples where $o_i = 1$:

$\hat{n} := \sum_{i : o_i = 1} \left( \delta_i + \frac{1 - \delta_i}{\rho} \right).$ (6)

$\hat{n}$ is an unbiased estimate of $n$, since $\mathbb{E}[o_i] = \delta_i + \rho (1 - \delta_i)$. Next, consider estimating $\bar{C}(\pi)$ as:

$\hat{C}(\pi) := \frac{1}{\hat{n}} \sum_{i : o_i = 1} \left( \delta_i + \frac{1 - \delta_i}{\rho} \right) \frac{\pi(y_i \mid x_i)}{p_i}.$ (7)

Again, each kept term is reweighted by the inverse of its retention probability. Hence, the sum in the numerator of $\hat{C}(\pi)$ is, in expectation, $n$, while the normalizing constant $\hat{n}$ is also, in expectation, $n$. Ratios of expectations are not equal to the expectation of a ratio, so we expect a small bias in this estimate, but it is easy to show that it is asymptotically consistent.

Finally, consider estimating $\hat{R}^{\mathrm{IPS}}(\pi)$ as:

$\hat{R}(\pi) := \frac{1}{\hat{n}} \sum_{i : o_i = 1} \delta_i \, \frac{\pi(y_i \mid x_i)}{p_i}.$ (8)

Since clicked impressions are never dropped, the sum in the numerator of $\hat{R}(\pi)$ is, in expectation, $n$ times the expected value of the IPS estimate, as the denominator is, in expectation, $n$. Again, we expect this estimate to have a small bias but to remain asymptotically consistent. The computable variant of the self-normalized IPS estimator simply uses the computable $\hat{R}(\pi)$ and $\hat{C}(\pi)$ in its definition: $\hat{R}^{\mathrm{SNIPS}}(\pi) := \hat{R}(\pi) / \hat{C}(\pi)$.
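The subsampling-corrected estimators translate directly into code. A sketch, assuming $\delta$ is the binary click indicator and $\rho$ is the known retention rate; names are ours.

import numpy as np

def corrected_estimates(new_prop, log_prop, delta, rho):
    """Eqs. (6)-(8), computed on the *kept* records only (those with o_i = 1)."""
    w = new_prop / log_prop
    keep_inv = np.where(delta > 0, 1.0, 1.0 / rho)  # inverse retention probability
    n_hat = keep_inv.sum()                 # Eq. (6): estimate of pre-subsampling n
    c_hat = (keep_inv * w).sum() / n_hat   # Eq. (7): control variate estimate
    r_hat = (delta * w).sum() / n_hat      # Eq. (8): clicks were never dropped
    return n_hat, c_hat, r_hat, r_hat / c_hat  # last value: computable SNIPS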
We use a family of new policies $\pi_\epsilon$, parametrized by $\epsilon \in [0, 1]$, to diagnose $\hat{C}(\pi)$ and the expected behavior of the IPS estimates $\hat{R}(\pi)$. The policy $\pi_\epsilon$ behaves like a uniformly random ranking policy with probability $\epsilon$ and, with probability $1 - \epsilon$, behaves like the logging policy. Formally, for an impression with context $x$, possible actions $\mathcal{Y}(x)$ (e.g., rankings of candidate products), and logged action $y$, the probability of choosing $y$ under the new policy is:

$\pi_\epsilon(y \mid x) = (1 - \epsilon) \, \pi_0(y \mid x) + \frac{\epsilon}{|\mathcal{Y}(x)|}.$ (9)
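Computing the propensity of the logged action under $\pi_\epsilon$ requires only the logged propensity and the size of the action set, as in this one-line sketch (the function name is ours):

def mixture_propensity(eps: float, log_prop: float, num_actions: int) -> float:
    """Eq. (9): propensity of the logged action under the eps-mixture of the
    logging policy and a uniform policy over the |Y(x)| possible actions."""
    return (1.0 - eps) * log_prop + eps / num_actions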
As we increase $\epsilon$ away from 0, the new policy looks increasingly different from the logging policy on the logged impressions. In Tables 2, 3, and 4 we report $\hat{C}(\pi_\epsilon)$ with a confidence interval assuming asymptotic normality, for different choices of $\epsilon$. We also report the IPS-estimated click-through rates $\hat{R}(\pi_\epsilon)$ for these policies, their standard errors (as confidence intervals under a normal distribution), and finally their self-normalized IPS estimates [14, 4].

Table 2: Diagnostics for 1- and 2-slot banners; confidence intervals under a normal distribution.

Table 3: Diagnostics for 3- and 4-slot banners.

Table 4: Diagnostics for 5- and 6-slot banners.
As we pick policies that differ more from the logging policy, we see that the estimated variance of the IPS estimates (as reflected in their approximate confidence intervals) increases. Moreover, the control variate $\hat{C}$ is systematically underestimated. This should caution us against relying on a single point estimate (e.g., only IPS or SNIPS). SNIPS can often provide a better bias-variance tradeoff in these estimates, but can fail catastrophically when the variance is very high, due to the systematic underestimation of $\hat{C}$. Moreover, in these very high-variance situations (e.g., large $\epsilon$ combined with many slots), the constructed confidence intervals are not reliable: the expected value $\mathbb{E}[\hat{C}] = 1$ clearly does not lie in the computed intervals. Based on these sanity checks, we focus the evaluation setup in Section 4 on the 1-slot case.
4 Benchmarking Learning Algorithms
4.1 Evaluation
Estimates based on importance sampling have considerable variance when the number of slots increases. We would thus need tens of millions of impressions to estimate the CTR of multi-slot banner-filling policies with high precision. To limit the risk of people “overfitting to the variance” by querying far away from our logging policy, we propose the following estimates for any policy:

Compute the IPS estimate of the policy’s CTR together with the standard error of that estimate (appealing to asymptotic normality), and report this error as an “approximate confidence interval”.

This is provided in our evaluation software, available alongside the dataset online. In this way, learning algorithms must reason explicitly about bias and variance to reliably achieve better estimated CTR.
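For illustration, the reported quantities could be computed as follows; this is a sketch assuming a normal approximation with $z = 1.96$ (the exact level used by the released evaluation software may differ).

import numpy as np

def ips_with_interval(new_prop, log_prop, delta, z=1.96):
    """IPS estimate plus an approximate confidence interval via the CLT."""
    terms = delta * new_prop / log_prop      # per-impression IPS terms
    est = terms.mean()
    stderr = terms.std(ddof=1) / np.sqrt(len(terms))
    return est, (est - z * stderr, est + z * stderr)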
4.2 Methods
Consider the 1-slot banner filling task defined using our dataset. This slice of traffic can be modeled as a logged contextual bandit problem with a small number of arms. This slice is further randomly divided into a train-validate-test split. The following methods are benchmarked in the code accompanying this dataset release. All these methods use a linear policy class to map $x \mapsto y$ (i.e., score candidates using a linear scorer $\theta^\top \phi(x, a)$), but differ in their training objectives. Their hyperparameters are chosen to maximize the estimated CTR on the validation set, and their test-set estimates are reported in Table 5.

Random: A policy that picks the product to display uniformly at random.

Regression: A reduction to supervised learning that predicts the observed click feedback $\delta$ for every candidate action. The number of training epochs, the Lasso regularization strength, and the SGD learning rate are the hyperparameters.
IPS: Directly optimizes the IPS estimate $\hat{R}(\pi)$ evaluated on the training split. This implementation uses a reduction to weighted one-against-all multi-class classification, as employed in [3] (a sketch of this reduction follows the list below). The hyperparameters are the same as in the Regression approach.

DRO [3]: Combines the Regression method with IPS using the doubly robust estimator to perform policy optimization. Again uses a reduction to weighted one-against-all multi-class classification, with the same set of hyperparameters.

POEM [2]: Directly trains a stochastic policy following the counterfactual risk minimization principle, thus reasoning about differences in the variances of the IPS estimates $\hat{R}(\pi)$. Hyperparameters are the variance regularization strength, the $L_2$ regularization, the propensity clipping threshold, and the number of training epochs.
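To illustrate the weighted one-against-all reduction used by the IPS (and DRO) baselines: maximizing $\hat{R}(\pi)$ over deterministic policies amounts to cost-sensitive classification, which can be approximated by importance-weighted multi-class classification. The sketch below makes assumptions of its own (the use of scikit-learn, all names, the choice of learner); the code accompanying the dataset may differ.

import numpy as np
from sklearn.linear_model import SGDClassifier

def train_ips_policy(X, y_logged, delta, log_prop):
    """Weighted one-against-all reduction for IPS policy optimization.

    Each logged (x, y, delta, p) becomes a multi-class example with label y
    and importance weight delta / p, so the classifier is rewarded for
    re-selecting logged actions that led to clicks, up-weighted when the
    logging policy was unlikely to choose them.
    """
    weights = delta / log_prop
    mask = weights > 0               # zero-weight examples cannot affect the fit
    clf = SGDClassifier(loss="log_loss", alpha=1e-4)
    clf.fit(X[mask], y_logged[mask], sample_weight=weights[mask])
    return clf                       # clf.predict(x) acts as the learned policy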
Table 5: Test-set estimates for each approach: Random, Regression, IPS, DRO, and POEM.
The results of the learning experiments are summarized in Table 5. For more details and the specifics of the experiment setup, visit the dataset website. Differences between the numbers for Random here and in Table 2 arise because they are computed on a subset of the data; we do expect their confidence intervals to overlap. We see that the Regression approach, which loosely corresponds to predicting the CTR for each candidate using supervised machine learning, can be substantially improved upon by recent off-policy learning algorithms that effectively use the logged propensities. We also note that very limited hyperparameter tuning was performed for methods like POEM and DRO; for instance, POEM can conceivably be improved by employing the doubly robust estimator. We leave such algorithm tuning to future work.

5 Conclusions
In this paper, we have introduced a standardized testbed to systematically investigate off-policy learning algorithms using real-world data. We presented this testbed and the sanity checks we ran to ensure its validity, and showed results comparing state-of-the-art off-policy learning methods (doubly robust optimization [3] and POEM [2]) to regression baselines on a 1-slot banner filling task. Our results provide experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.
The results we presented are for the 1-slot banner filling task. The collected data supports several dimensions along which challenging, interesting, and relevant off-policy learning problems can be set up in future work:
Size of the action space: increase the number of slots in the banner, and hence the number of possible banners.

Feedback granularity: use global feedback (was there a click somewhere in the banner?) or per-item feedback (which item in the banner was clicked?).

Contextualization: learn a separate model for each banner type, or learn a contextualized model across multiple banner types.
Acknowledgments
We thank Alexandre Gilotte and Thomas Nedelec at Criteo for their help in creating the dataset. This work was funded in part through NSF Awards IIS-1247637, IIS-1615706, and IIS-1513692.
References
 [1] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson, “Counterfactual reasoning and learning systems: The example of computational advertising,” Journal of Machine Learning Research, pp. 3207–3260, 2013.
 [2] A. Swaminathan and T. Joachims, “Batch learning from logged bandit feedback through counterfactual risk minimization,” Journal of Machine Learning Research, pp. 1731–1755, 2015.
 [3] M. Dudík, J. Langford, and L. Li, “Doubly robust policy evaluation and learning,” in ICML, pp. 1097–1104, 2011.
 [4] A. Swaminathan and T. Joachims, “The self-normalized estimator for counterfactual learning,” in NIPS, pp. 3231–3239, 2015.
 [5] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” Transactions on Intelligent Systems and Technology, p. Article 61, 2014.
 [6] F. Vasile, D. Lefortier, and O. Chapelle, “Costsensitive learning for utility optimization in online advertising auctions,” arXiv preprint arXiv:1603.03713, 2016.
 [7] A. Vorobev, D. Lefortier, G. Gusev, and P. Serdyukov, “Gathering additional feedback on search results by multi-armed bandits with respect to production ranking,” in WWW, pp. 1177–1187, 2015.
 [8] D. Lefortier, P. Serdyukov, and M. de Rijke, “Online exploration for detecting shifts in fresh intent,” in CIKM, pp. 589–598, 2014.
 [9] L. Li, W. Chu, J. Langford, and X. Wang, “Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,” in WSDM, pp. 297–306, 2011.
 [10] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., “Ad click prediction: a view from the trenches,” in KDD, pp. 1222–1230, 2013.
 [11] P. R. Rosenbaum and D. B. Rubin, “The central role of the propensity score in observational studies for causal effects,” Biometrika, pp. 41–55, 1983.
 [12] A. B. Owen, Monte Carlo theory, methods and examples. 2013.
 [13] L. Li, S. Chen, J. Kleban, and A. Gupta, “Counterfactual estimation and optimization of click metrics in search engines: A case study,” in WWW, pp. 929–934, 2015.
 [14] T. Hesterberg, “Weighted average importance sampling and defensive mixture distributions,” Technometrics, pp. 185–194, 1995.