
Large-scale Validation of Counterfactual Learning Methods: A Test-Bed

The ability to perform effective off-policy learning would revolutionize the process of building better interactive systems, such as search engines and recommendation systems for e-commerce, computational advertising and news. Recent approaches for off-policy evaluation and learning in these settings appear promising. With this paper, we provide real-world data and a standardized test-bed to systematically investigate these algorithms using data from display advertising. In particular, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This paper presents our test-bed and the sanity checks we ran to ensure its validity, and shows results comparing state-of-the-art off-policy learning methods like doubly robust optimization, POEM, and reductions to supervised learning using regression baselines. Our results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.



1 Introduction

Effective learning methods for optimizing policies based on logged user-interaction data have the potential to revolutionize the process of building better interactive systems. Unlike the industry standard of using expert judgments for training, such learning methods could directly optimize user-centric performance measures, they would not require interactive experimental control like online algorithms, and they would not be subject to the data bottlenecks and latency inherent in A/B testing.

Recent approaches for off-policy evaluation and learning in these settings appear promising [1, 2, 4], but highlight the need for accurately logging propensities of the logged actions. With this paper, we provide the first public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF). We use data from Criteo, a leader in the display advertising space. In addition to providing the data, we propose an evaluation methodology for running BLBF learning experiments and a standardized test-bed that allows the research community to systematically investigate BLBF algorithms.

At a high level, a BLBF algorithm operates in the contextual bandit setting and solves the following learning task:

  1. Take as input: $\{\pi_0, (x_i, y_i, \delta_i)_{i=1}^{n}\}$. $\pi_0$ encodes the system from which the logs were collected, $x$ denotes the input to the system, $y$ denotes the output predicted by the system, and $\delta$ is a number encoding the observed online metric for the output that was predicted;

  2. Produce as output: $\pi$, a new policy that maps $x \mapsto y$; and

  3. Such that $\pi$ will perform well (according to the metric $\delta$) if it were deployed online.

We elaborate on the definitions of the input, the output, the metric, and the logging system as logged in our dataset in the next section. Since past research on BLBF was limited by the lack of an appropriate dataset, we hope that our test-bed will spur research on several aspects of BLBF and off-policy evaluation, including the following:

  1. New training objectives, learning algorithms, and regularization mechanisms for BLBF;

  2. Improved model selection procedures (analogous to cross-validation for supervised learning);

  3. Effective and tractable policy classes for the specified task; and

  4. Algorithms that can scale to massive amounts of data.

The rest of this paper is organized as follows. In Section 2, we describe our standardized test-bed for the evaluation of off-policy learning methods. Then, in Section 3, we describe a set of sanity checks that we used on our dataset to ensure its validity and that can be applied generally when gathering data for off-policy learning and evaluation. Finally, in Section 4, we show results comparing state-of-the-art off-policy learning methods like doubly robust optimization [3], POEM [2], and reductions to supervised learning using regression baselines. Our results show, for the first time, experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.

2 Dataset

We create our test-bed using data from display advertising, similar to the Kaggle challenge hosted by Criteo in 2014 to compare CTR prediction algorithms. However, in this paper, we do not aim to build clickthrough or conversion prediction models for bidding in real-time auctions [5, 6]. Instead, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This part of the system takes place after the bidding agent has won the auction. In this context, each ad has one of many banner types, which differ in the number of products they contain and in their layout as shown in Figure 1. The task is to choose the products to display in the ad knowing the banner type in order to maximize the number of clicks. This task is thus very different from the Kaggle challenge.

In this setting of choosing the best products to fill the banner ad, we can easily gather exploration data where the placement of the products in the banner ad is randomized, without incurring a prohibitive cost unlike in Web search for which such exploration is much more costly (see, e.g., [7, 8]). Our logging policy uses randomization aggressively, while being very different from a uniformly random policy.

Figure 1: Four examples of ads used in display advertising: a vertical ad, a grid, and two horizontal ads (mock-ups).

Each banner type corresponds to a different look & feel of the banner ad. Banner ads can differ in the number of products, size, geometry (vertical, horizontal, …), background color and in the data shown (with or without a product description or a call to action); these we call the fixed attributes. Banner types may also have dynamic aspects such as some form of pagination (multiple pages of products) or an animation. Some examples are shown in Figure 1. Throughout the paper, we label positions in each banner type from 1 to the number of slots, from left to right and from top to bottom. Thus 1 is the top left position.

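For each user impression, we denote a user context by $c$, the number of slots in the banner type by $\ell_c$, and the candidate pool of products by $P_c$. Each context $c$ and product $p \in P_c$ pair is described by features $\phi(c, p)$. The input $x$ to the system encodes $c$, $P_c$, and $\{\phi(c, p) : p \in P_c\}$. The logging policy $\pi_0$ stochastically selects products to construct a banner by first computing non-negative scores $f_p$ for all candidate products $p \in P_c$, and using a Plackett-Luce ranking model (i.e., sampling without replacement from the multinomial distribution defined by the scores):

$$P(\text{slot}_1 = p) = \frac{f_p}{\sum_{p' \in P_c} f_{p'}}, \qquad P(\text{slot}_2 = p' \mid \text{slot}_1 = p) = \frac{f_{p'}}{\sum_{p'' \in P_c \setminus \{p\}} f_{p''}}, \quad \text{etc.}$$

The propensity of a chosen banner ad $\langle p_1, p_2, \ldots \rangle$ is $P(\text{slot}_1 = p_1) \cdot P(\text{slot}_2 = p_2 \mid \text{slot}_1 = p_1) \cdots$. With these propensities in hand, we can counterfactually evaluate any banner-filling policy in an unbiased way using inverse propensity scoring [9].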
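The sampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the production logging system: the function names, the dictionary-of-scores representation, and the requirement that at least as many candidates as slots have a strictly positive score are all assumptions made for the example.

```python
import random


def sample_banner(scores, n_slots, rng):
    """Sample n_slots distinct products without replacement (Plackett-Luce).

    scores: dict mapping a product id to its non-negative score.
    Returns the ordered banner and the probability of having drawn it.
    Assumes at least n_slots products have a strictly positive score.
    """
    remaining = dict(scores)
    banner = []
    propensity = 1.0
    for _ in range(n_slots):
        total = sum(remaining.values())
        products = list(remaining)
        weights = [remaining[p] for p in products]
        chosen = rng.choices(products, weights=weights, k=1)[0]
        # Probability of this pick given the products already placed.
        propensity *= remaining[chosen] / total
        banner.append(chosen)
        del remaining[chosen]
    return banner, propensity


def banner_propensity(scores, banner):
    """Probability that Plackett-Luce sampling produces this exact ordered banner."""
    remaining_total = sum(scores.values())
    propensity = 1.0
    for product in banner:
        propensity *= scores[product] / remaining_total
        remaining_total -= scores[product]
    return propensity
```

With scores `{"a": 1.0, "b": 2.0, "c": 1.0}`, the banner `["b", "a"]` has propensity (2/4) · (1/2) = 0.25, and the propensities of all six ordered two-product banners sum to one.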

The following was logged, committing to a single feature encoding and a single logging policy that produces the scores for the entire duration of data collection.

  • Record the feature vector for all products in the candidate set;

  • Record the selected products sampled from the logging policy via the Plackett-Luce model, and the propensity of that selection;

  • Record the click/no click feedback and the location(s) of the clicks in the banner.

The format of this data is:

example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} …

${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...


${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...

Each impression is represented by $M + 1$ lines, where $M$ is the cardinality of the candidate pool and the first line is a header containing summary information. Note that the first ${nbSlots} candidates correspond to the displayed products ordered by position (consequently, the ${wasProductMClicked} information for all other candidates is irrelevant). There are features. Display features are context features or banner type features, which are constant for all candidate products in a given impression. Each unique quadruplet of feature IDs corresponds to a unique banner type. Product features are based on the similarity and/or complementarity of the candidate products with historical products seen by the user on the advertiser’s website. We also included interaction terms between some of these features directly in the dataset to limit the amount of feature engineering required to get a good policy. Features 1 and 2 are numerical, while all other features are categorical. Some categorical features are multi-valued, which means that they can take more than one value for the same product (order does not matter). Note that the example ID increases with time, allowing temporal slices for evaluation [10], although we do not enforce this for our test-bed. Importantly, non-clicked examples were sub-sampled aggressively to reduce the dataset size and we kept only a random sub-sample of them. One therefore needs to account for this during learning and evaluation; the evaluator we provide with the test-bed accounts for this sub-sampling.
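As an illustration of the layout just described, the following Python sketch groups a header line and its candidate lines into one record per impression. It is not the reader shipped with the test-bed. The function names and dictionary keys are hypothetical, click flags are assumed to be written as `0`/`1`, and feature values are kept as strings because most features are categorical and may be multi-valued.

```python
def parse_features(tokens):
    """Map 'featID:value' tokens to {featID: [values]}; repeated IDs are multi-valued."""
    features = {}
    for token in tokens:
        feat_id, _, value = token.partition(":")
        features.setdefault(feat_id, []).append(value)
    return features


def parse_impressions(lines):
    """Yield one dict per impression from an iterable of text lines."""
    current = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "example":
            # Header line: a new impression starts, so emit the previous one.
            if current is not None:
                yield current
            current = {
                "ex_id": tokens[1].rstrip(":"),
                "hash_id": tokens[2],
                "ad_clicked": tokens[3] == "1",
                "propensity": float(tokens[4]),
                "nb_slots": int(tokens[5]),
                "nb_candidates": int(tokens[6]),
                "display_features": parse_features(tokens[7:]),
                "candidates": [],
            }
        elif current is not None:
            # Candidate line: click flag, then 'exid:<id>', then product features.
            current["candidates"].append({
                "clicked": tokens[0] == "1",
                "features": parse_features(tokens[2:]),
            })
    if current is not None:
        yield current
```

Because the displayed products come first, the banner shown for a parsed record `imp` is `imp["candidates"][:imp["nb_slots"]]`.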

The result is a dataset of over million ad impressions. In this dataset, we have:

  • banner types with the top banner types representing of the total number of ad impressions, the top about , and the top about .

  • The number of displayed products is between and included.

  • There are over impressions for -slot banners, over for -slot, almost for -slot, for -slot, for -slot and over for -slot banners.

  • The size of the candidate pool is about times (upper bound) larger than the number of products to display in the ad.

This dataset is hosted on Amazon AWS (35GB gzipped / 256GB raw). Details for accessing and processing the data are available at

3 Sanity Checks

The work-horse of counterfactual evaluation is Inverse Propensity Scoring (IPS) [11, 9]. IPS requires accurate propensities and, to a crude approximation, produces estimates with variance that scales roughly with the range of the inverse propensities. In Table 1, we report the number of impressions and the average and largest inverse propensities, partitioned by ${nbSlots}. When constructing confidence intervals for importance weighted estimates like IPS, we often appeal to asymptotic normality of large sample averages [12]. However, if the inverse propensities are very large relative to the number of samples (as we can see for the larger values of ${nbSlots}), the asymptotic normality assumption will probably be violated.

#Slots 1 2 3 4 5 6
Table 1: Number of impressions and propensity statistics computed for slices of traffic with $\ell$-slot banners, $1 \le \ell \le 6$. Estimated sample size ($\hat{N}$) corrects for sub-sampling of unclicked impressions.

There are some simple statistical tests that can be run to detect some issues with inaccurately logged propensities [13]. These arithmetic and harmonic tests, however, require that the candidate actions available for each impression are fixed a priori. In our scenario, we have a context-dependent candidate set that precludes running these tests, so we propose a more general class of diagnostics that can detect some systematic biases and issues in propensity-logged datasets.

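Some notation: $x_i$ is the context of impression $i$, $y_i$ the logged action (the displayed banner), and $\delta_i$ the observed click feedback, for $i = 1, \ldots, N$. The propensity for the logging policy $\pi_0$ to take the logged action $y_i$ in context $x_i$ is denoted $q_i \equiv \pi_0(y_i \mid x_i)$. If the propensities are correctly logged, then the expected importance weight should be $1$ for any new banner-filling policy $\pi$. Formally, we have the following:

$$\hat{C}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i \mid x_i)}{q_i} \simeq 1.$$

The IPS estimate for a new policy $\pi$ is simply:

$$\hat{R}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \delta_i \, \frac{\pi(y_i \mid x_i)}{q_i}.$$

These equations are valid when $\pi_0$ has full support, as our logging system does: $\pi_0(y \mid x) > 0$ for all $x, y$. The self-normalized estimator [14, 4] is:

$$\hat{R}_{\text{snips}}(\pi) = \frac{\hat{R}(\pi)}{\hat{C}(\pi)}.$$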
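A minimal Python sketch of the three quantities just introduced (the average importance weight used as a diagnostic, the IPS estimate, and its self-normalized variant) may help fix ideas. It assumes that every impression is available, so it ignores sub-sampling; the function name and argument layout are illustrative only.

```python
def ips_diagnostics(rewards, new_probs, logged_propensities):
    """Return (average importance weight, IPS estimate, self-normalized IPS estimate).

    rewards[i]             observed feedback (e.g. 1 for a click, 0 otherwise)
    new_probs[i]           probability the new policy assigns to the logged action
    logged_propensities[i] probability the logging policy assigned to the logged action
    """
    n = len(rewards)
    weights = [p / q for p, q in zip(new_probs, logged_propensities)]
    avg_weight = sum(weights) / n  # should be close to 1 if propensities are correct
    ips = sum(d * w for d, w in zip(rewards, weights)) / n
    snips = ips / avg_weight
    return avg_weight, ips, snips
```

When the new policy equals the logging policy, every importance weight is exactly 1, so the diagnostic equals 1 and both estimates reduce to the empirical average reward.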


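Remember that we sub-sampled non-clicked impressions. Sub-sampling is indicated by the binary random variable $o_i$, which equals $1$ when impression $i$ is kept: clicked impressions are always kept, $\Pr(o_i = 1 \mid \delta_i = 1) = 1$, while non-clicked impressions are kept with a fixed, known probability $\Pr(o_i = 1 \mid \delta_i = 0) < 1$.

The IPS estimate and the diagnostic above are not computable in our case since they require all data-points before sub-sampling. So, we use the following straightforward modification to use only our sub-sampled data-points instead.

First, we estimate the number of data-points before sub-sampling, $N$, only using samples where $o_i = 1$:

$$\hat{N} = \sum_{i} \frac{o_i}{\Pr(o_i = 1 \mid \delta_i)}.$$

$\hat{N}$ is an unbiased estimate of $N$ since $\mathbb{E}\left[ o_i / \Pr(o_i = 1 \mid \delta_i) \right] = 1$. Next, consider estimating $\hat{C}(\pi)$ as:

$$\hat{C}(\pi) = \frac{1}{\hat{N}} \sum_{i} \frac{o_i}{\Pr(o_i = 1 \mid \delta_i)} \, \frac{\pi(y_i \mid x_i)}{q_i},$$

where $q_i$ is the logged propensity of the logged action $y_i$ in context $x_i$ and $\pi$ is the new policy. Again, $\mathbb{E}\left[ o_i / \Pr(o_i = 1 \mid \delta_i) \right] = 1$. Hence, the sum in the numerator of $\hat{C}(\pi)$ is, in expectation, $\sum_{i} \pi(y_i \mid x_i) / q_i$, while the normalizing constant is, in expectation, $N$. Ratios of expectations are not equal to the expectation of a ratio, so we expect a small bias in this estimate, but it is easy to show that this estimate is asymptotically consistent.

Finally, consider estimating $\hat{R}(\pi)$ as:

$$\hat{R}(\pi) = \frac{1}{\hat{N}} \sum_{i} \frac{o_i}{\Pr(o_i = 1 \mid \delta_i)} \, \delta_i \, \frac{\pi(y_i \mid x_i)}{q_i}.$$

Again, $\mathbb{E}\left[ o_i / \Pr(o_i = 1 \mid \delta_i) \right] = 1$. The sum in the numerator of $\hat{R}(\pi)$ is, in expectation, $\sum_{i} \delta_i \, \pi(y_i \mid x_i) / q_i$, and the denominator is, in expectation, $N$. Again, we expect this estimate to have a small bias but to remain asymptotically consistent. The computable variant of the self-normalized IPS estimator simply uses the computable $\hat{R}(\pi)$ and $\hat{C}(\pi)$ in its definition: $\hat{R}_{\text{snips}}(\pi) = \hat{R}(\pi) / \hat{C}(\pi)$.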
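The correction for sub-sampling can be sketched as follows. This is an illustration under stated assumptions, not the evaluator provided with the test-bed: `keep_prob_unclicked` is a hypothetical parameter standing for the probability with which a non-clicked impression was retained, clicked impressions are assumed to be always retained, and the input contains only the retained impressions.

```python
def subsampled_estimates(kept, keep_prob_unclicked):
    """Estimate sample size, average importance weight, IPS and SNIPS from retained data.

    kept: list of (delta, new_prob, logged_propensity) tuples, one per retained impression.
    keep_prob_unclicked: hypothetical retention probability for non-clicked impressions.
    """
    n_hat = 0.0
    weight_sum = 0.0
    reward_sum = 0.0
    for delta, new_prob, propensity in kept:
        # Each retained non-click stands in for 1 / keep_prob_unclicked impressions.
        inv_keep = 1.0 if delta > 0 else 1.0 / keep_prob_unclicked
        weight = new_prob / propensity
        n_hat += inv_keep
        weight_sum += inv_keep * weight
        reward_sum += inv_keep * delta * weight
    avg_weight = weight_sum / n_hat
    ips = reward_sum / n_hat
    return n_hat, avg_weight, ips, ips / avg_weight
```

For instance, with one clicked and two non-clicked retained impressions and a retention probability of 0.5, the estimated sample size is 1 + 2 + 2 = 5.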

We use a family of new policies $\pi_\epsilon$, parametrized by $0 \le \epsilon \le 1$, to diagnose the average importance weight $\hat{C}(\pi_\epsilon)$ and the expected behavior of the IPS estimates $\hat{R}(\pi_\epsilon)$. The policy $\pi_\epsilon$ behaves like a uniformly random ranking policy with probability $\epsilon$, and with probability $1 - \epsilon$ behaves like the logging policy $\pi_0$. Formally, for an impression with context $x$, $m$ possible actions (e.g., rankings of candidate products), and logged action $y$, the probability for choosing $y$ under the new policy $\pi_\epsilon$ is:

$$\pi_\epsilon(y \mid x) = \frac{\epsilon}{m} + (1 - \epsilon) \, \pi_0(y \mid x).$$

As we vary $\epsilon$ away from $0$, the new policy differs increasingly from the logging policy on the logged impressions. In Tables 2, 3, and 4 we report $\hat{C}(\pi_\epsilon)$ and a confidence interval assuming asymptotic normality, for different choices of $\epsilon$. We also report the IPS-estimated clickthrough rates for these policies, $\hat{R}(\pi_\epsilon)$, their standard error (as a confidence interval), and finally, their self-normalized IPS-estimates [14, 4].
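The mixture policy used for these diagnostics is simple to compute from the logged propensity alone. The sketch below is illustrative: the function names are hypothetical, and counting the possible actions as ordered selections of slots from the candidate pool is an assumption consistent with the ranking interpretation given above.

```python
from math import perm


def count_rankings(n_candidates, n_slots):
    """Number of ordered ways to fill n_slots from n_candidates distinct products."""
    return perm(n_candidates, n_slots)


def mixture_probability(epsilon, logged_propensity, n_actions):
    """Probability of the logged action under the epsilon-mixture policy."""
    return epsilon / n_actions + (1.0 - epsilon) * logged_propensity


def mixture_weight(epsilon, logged_propensity, n_actions):
    """Importance weight of the mixture policy relative to the logging policy."""
    return mixture_probability(epsilon, logged_propensity, n_actions) / logged_propensity
```

At `epsilon = 0` the weight is exactly 1 for every impression. At `epsilon = 1` the weight becomes `1 / (n_actions * logged_propensity)`, which is large whenever the logging policy was unlikely to pick the logged action; this is the regime in which the diagnostics are most demanding.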

#Slots 1 2

Table 2: Diagnostics and IPS-estimated clickthrough rates for different policies evaluated on slices of traffic with $\ell$-slot banners, $\ell \in \{1, 2\}$. $\pi_\epsilon$ interpolates between the logging policy ($\epsilon = 0$) and the uniform random policy ($\epsilon = 1$). Error bars are confidence intervals under a normal distribution.

#Slots 3 4

Table 3: Diagnostics for different policies evaluated on slices of traffic with $\ell$-slot banners, $\ell \in \{3, 4\}$. Error bars are confidence intervals under a normal distribution.

#Slots 5 6

Table 4: Diagnostics for different policies evaluated on slices of traffic with $\ell$-slot banners, $\ell \in \{5, 6\}$. Error bars are confidence intervals under a normal distribution.

As we pick policies that differ from the logging policy, we see that the estimated variance of the IPS estimates (as reflected in their approximate confidence intervals) increases. Moreover, the control variate (the average importance weight, whose expected value is $1$) is systematically under-estimated. This should caution us not to rely on a single point-estimate (e.g. only IPS or SNIPS). SNIPS can often provide a better bias-variance trade-off in these estimates, but can fail catastrophically when the variance is very high due to systematic under-estimation of the control variate. Moreover, in these very high-variance situations (e.g. when both the mixing parameter $\epsilon$ and the number of slots are large), the constructed confidence intervals are not reliable: clearly, the expected value $1$ does not lie in the computed intervals. Based on these sanity checks, we focus the evaluation set-up in Section 4 on the 1-slot case.

4 Benchmarking Learning Algorithms

4.1 Evaluation

Estimates based on importance sampling have considerable variance when the number of slots increases. We would thus need tens of millions of impressions to estimate the CTR of slot-filling policies with high precision. To limit the risks of people “over-fitting to the variance” by querying far away from our logging policy, we propose the following estimates for any policy:

  • Report the inverse propensity scoring (IPS) [9] as well as the self-normalized (SN) estimate [4] for the new policy (self-normalized, so that learnt policies cannot cheat by not having their importance weights sum to 1);

  • Compute the standard error of the IPS estimate (appealing to asymptotic normality), and report this error as an “approximate confidence interval”.

This is provided in our evaluation software alongside the dataset online. In this way, learning algorithms must reason about bias/variance explicitly to reliably achieve better estimated CTR.
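The reporting protocol above can be illustrated with a short Python sketch that computes the IPS point estimate, its standard error, and a normal-approximation interval. It is not the evaluation software distributed with the dataset: the function name is hypothetical, the default `confidence` level is an arbitrary choice for the example, and the correction for sub-sampled non-clicks is omitted for brevity.

```python
from math import sqrt
from statistics import NormalDist


def ips_with_interval(rewards, weights, confidence=0.95):
    """Return (IPS estimate, standard error, (lower, upper)) under asymptotic normality.

    rewards[i] is the observed feedback and weights[i] the importance weight
    of impression i. Requires at least two impressions.
    """
    n = len(rewards)
    terms = [d * w for d, w in zip(rewards, weights)]
    mean = sum(terms) / n
    # Unbiased sample variance of the per-impression terms.
    variance = sum((t - mean) ** 2 for t in terms) / (n - 1)
    stderr = sqrt(variance / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, stderr, (mean - z * stderr, mean + z * stderr)
```

As the sanity checks in Section 3 indicate, such an interval is only approximate and should not be trusted when a few very large importance weights dominate the sum.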

4.2 Methods

Consider a 1-slot banner filling task defined using our dataset. This slice of traffic can be modeled as a logged contextual bandit problem with a small number of arms. This slice is further randomly divided into a train-validate-test split. The following methods are benchmarked in the code accompanying this dataset release. All these methods use a linear policy class to map $x \mapsto y$ (i.e., they score candidates using a linear scorer over the candidate features), but differ in their training objectives. Their hyper-parameters are chosen to maximize the IPS estimate $\hat{R}(\pi)$ on the validation set and their test-set estimates are reported in Table 5.

  1. Random: A policy that picks a candidate product uniformly at random to display.

  2. Regression: A reduction to supervised learning that predicts the click feedback $\delta$ for every candidate action. The number of training epochs, the regularization for Lasso, and the learning rate for SGD are the hyper-parameters.

  3. IPS: Directly optimizes the IPS estimate $\hat{R}(\pi)$ evaluated on the training split. This implementation uses a reduction to weighted one-against-all multi-class classification as employed in [3]. The hyper-parameters are the same as in the Regression approach.

  4. DRO [3]: Combines the Regression method with IPS using the doubly robust estimator to perform policy optimization. Again uses a reduction to weighted one-against-all multi-class classification, and uses the same set of hyper-parameters.

  5. POEM [2]: Directly trains a stochastic policy following the counterfactual risk minimization principle, thus reasoning about differences in the variance of the IPS estimate $\hat{R}(\pi)$. Hyper-parameters are variance regularization, $L_2$ regularization, propensity clipping and number of training epochs.
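To make the combination behind DRO concrete, here is a minimal Python sketch of the standard doubly robust value estimate in the sense of [3]: a reward model supplies a baseline prediction, and an importance-weighted term corrects the model's error on the logged action. The sketch only evaluates a given policy; it does not perform the reduction to weighted classification used in the benchmark, it omits the sub-sampling correction, and the record layout and names are hypothetical.

```python
def doubly_robust_value(logs):
    """Doubly robust estimate of a policy's expected reward.

    Each record in logs is a dict with:
      'delta'             observed feedback for the logged action
      'logged_action'     identifier of the action that was displayed
      'propensity'        logging-policy probability of the logged action
      'policy_probs'      {action: probability under the new policy}
      'predicted_rewards' {action: reward predicted by a regression model}
    """
    total = 0.0
    for record in logs:
        probs = record["policy_probs"]
        predicted = record["predicted_rewards"]
        logged = record["logged_action"]
        # Model-based term: expected predicted reward under the new policy.
        model_term = sum(probs[a] * predicted[a] for a in probs)
        # Correction term: importance-weighted residual on the logged action.
        weight = probs.get(logged, 0.0) / record["propensity"]
        correction = weight * (record["delta"] - predicted[logged])
        total += model_term + correction
    return total / len(logs)
```

If the reward model is exact the correction has zero mean and only adds noise, and if the model predicts zero everywhere the estimate reduces to plain IPS.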

Test set estimates
Table 5: Test set performance of policies learnt using different counterfactual learning baselines. Error bars are confidence intervals under a normal distribution. The confidence interval for SNIPS is constructed using the delta method [12].

The results of the learning experiments are summarized in Table 5. For more details and the specifics of the experiment setup, visit the dataset website. Differences in the Random and logging-policy numbers compared to Table 2 arise because they are computed on a subset; we do expect their confidence intervals to overlap. We see that the Regression approach, which loosely corresponds to predicting CTR for each candidate using supervised machine learning, can be substantially improved using many recent off-policy learning algorithms that effectively use the logged propensities. We also note that very limited hyper-parameter tuning was performed for methods like POEM and DRO; for instance, POEM can conceivably be improved by employing the doubly robust estimator. We leave such algorithm-tuning to future work.

5 Conclusions

In this paper, we have introduced a standardized test-bed to systematically investigate off-policy learning algorithms using real-world data. We presented this test-bed, the sanity checks we ran to ensure its validity, and showed results comparing state-of-the-art off-policy learning methods (doubly robust optimization [3] and POEM [2]) to regression baselines on a 1-slot banner filling task. Our results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.

The results we presented are for the 1-slot banner filling task. For future work, the collected data allows challenging, interesting, and relevant off-policy learning problems to be set up along several dimensions.

Size of the action space:

Increase the size of the action space, i.e. of the number of slots in the banner.

Feedback granularity:

We can use global feedback (was there a click somewhere in the banner), or per item feedback (which item in the banner was clicked).


We can learn a separate model for each banner type or learn a contextualized model across multiple banner types.


We thank Alexandre Gilotte and Thomas Nedelec at Criteo for their help in creating the dataset. This work was funded in part through NSF Awards IIS-1247637, IIS-1615706, IIS-1513692.


  • [1] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson, “Counterfactual reasoning and learning systems: the example of computational advertising,” Journal of Machine Learning Research, pp. 3207–3260, 2013.
  • [2] A. Swaminathan and T. Joachims, “Batch learning from logged bandit feedback through counterfactual risk minimization,” Journal of Machine Learning Research, pp. 1731–1755, 2015.
  • [3] M. Dudík, J. Langford, and L. Li, “Doubly robust policy evaluation and learning,” in ICML, pp. 1097–1104, 2011.
  • [4] A. Swaminathan and T. Joachims, “The self-normalized estimator for counterfactual learning,” in NIPS, pp. 3231–3239, 2015.
  • [5] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” Transactions on Intelligent Systems and Technology, p. Article 61, 2014.
  • [6] F. Vasile, D. Lefortier, and O. Chapelle, “Cost-sensitive learning for utility optimization in online advertising auctions,” arXiv preprint arXiv:1603.03713, 2016.
  • [7] A. Vorobev, D. Lefortier, G. Gusev, and P. Serdyukov, “Gathering additional feedback on search results by multi-armed bandits with respect to production ranking,” in WWW, pp. 1177–1187, 2015.
  • [8] D. Lefortier, P. Serdyukov, and M. de Rijke, “Online exploration for detecting shifts in fresh intent,” in CIKM, pp. 589–598, 2014.
  • [9] L. Li, W. Chu, J. Langford, and X. Wang, “Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms,” in WSDM, pp. 297–306, 2011.
  • [10] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., “Ad click prediction: a view from the trenches,” in KDD, pp. 1222–1230, 2013.
  • [11] P. R. Rosenbaum and D. B. Rubin, “The central role of the propensity score in observational studies for causal effects,” Biometrika, pp. 41–55, 1983.
  • [12] A. B. Owen, Monte Carlo theory, methods and examples. 2013.
  • [13] L. Li, S. Chen, J. Kleban, and A. Gupta, “Counterfactual estimation and optimization of click metrics in search engines: A case study,” in WWW, pp. 929–934, 2015.
  • [14] T. Hesterberg, “Weighted average importance sampling and defensive mixture distributions,” Technometrics, pp. 185–194, 1995.