1 Introduction
We consider learning a classifier from logged data. Here, the learner has access to a logged labeled dataset that has been collected according to a known predetermined policy, and his goal is to learn a classifier that predicts the labels accurately over the entire population, not just conditioned on the logging policy.
This problem arises frequently in many natural settings. An example is predicting the efficacy of a treatment as a function of patient characteristics based on observed data. Doctors may assign the treatment to patients based on some predetermined rule; recording these patient outcomes produces a logged dataset where outcomes are observed conditioned on the doctors’ assignment. A second example is recidivism prediction, where the goal is to predict whether a convict will reoffend. Judges use their own predefined policy to grant parole, and if parole is granted, then an outcome (reoffense or not) is observed. Thus the observed data records outcomes conditioned on the judges’ parole policy, while the learner’s goal is to learn a predictor over the entire population.
A major challenge in learning from logged data is that the logging policy may leave large areas of the data distribution underexplored. Consequently, empirical risk minimization (ERM) on the logged data leads to classifiers that may be highly suboptimal on the population. When the logging policy is known, a second option is to use a weighted ERM, that reweighs each observed labeled data point to ensure that it reflects the underlying population. However, this may lead to sample inefficiency if the logging policy does not adequately explore essential regions of the population. A final approach, typically used in clinical trials, is controlled random experimentation – essentially, ignore the logged data, and record outcomes for fresh examples drawn from the population. This approach is expensive due to the high cost of trials, and wasteful since it ignores the observed data.
Motivated by these challenges, we propose active learning to combine logged data with a small amount of strategically chosen labeled data that can be used to correct the bias in the logging policy. This solution has the potential to achieve the best of both worlds by limiting experimentation to achieve higher sample efficiency, and by making the most of the logged data. Specifically, we assume that in addition to the logged data, the learner has some additional unlabeled data that he can selectively ask an annotator to label. The learner’s goal is to learn a highly accurate classifier over the entire population by using a combination of the logged data and with as few label queries to the annotator as possible.
How can we utilize logged data for better active learning? This problem has not been studied to the best of our knowledge. A naive approach is to use the logged data to come up with a warm start
and then do standard active learning. In this work, we show that we can do even better. In addition to the warm start, we show how to use multiple importance sampling estimators to utilize the logged data more efficiently. Additionally, we introduce a novel debiasing policy that selectively avoids label queries for those examples that are highly represented in the logged data.
Combining these three approaches, we provide a new algorithm. We prove that our algorithm is statistically consistent, and has a lower label requirement than simple active learning that uses the logged data as a warm start. Finally, we evaluate our algorithm experimentally on various datasets and logging policies. Our experiments show that the performance of our method is either the best or close to the best for a variety of datasets and logging policies. This confirms that active learning to combine logged data with carefully chosen labeled data may indeed yield performance gains.
2 Preliminaries
2.1 Problem Setup
Instances are drawn from an instance space and a label space . There is an underlying data distribution over that describes the population. There is a hypothesis space . For simplicity, we assume is a finite set, but our results can be generalized to VCclasses by standard arguments (Vapnik & Chervonenkis, 1971).
The learning algorithm has access to two sources of data: logged data, and online data. The logged data are generated from examples drawn i.i.d. from , and a logging policy
that determines the probability of observing the label. For each example
(), an independent Bernoulli random variable
is drawn with expectation , and then the label is revealed to the learning algorithm if ^{1}^{1}1Note that this generating process implies the standard unconfoundedness assumption in the counterfactual inference literature: , that is, given the instance , its label is conditionally independent with the action (whether the label is observed).. We call the logged dataset. From the algorithm’s perspective, we assume it knows the logging policy , and only observes instances , decisions of the policy , and revealed labels .The online data are generated as follows. Suppose there is a stream of another examples drawn i.i.d. from distribution . At time (), the algorithm uses its query policy to compute a bit , and then the label is revealed to the algorithm if . The computation of may in general be randomized, and is based on the observed logged data , observed instances , previous decisions, and observed labels .
The goal of the algorithm is to learn a classifier from observed logged data and online data. Fixing , , , , the performance measures are: (1) the error rate of the output classifier, and (2) the number of label queries on the online data. Note that the error rate is over the entire population instead of conditioned on the logging policy, and that we assume the logged data come at no cost. In this work, we are interested in the situation where is about the same as or less than .
2.2 Background on DisagreementBased Active Learning
Our algorithm is based on DisagreementBased Active Learning (DBAL) which has rigorous theoretical guarantees and can be implemented practically (see (Hanneke et al., 2014) for a survey, and (Hanneke & Yang, 2015; Huang et al., 2015) for some recent developments). DBAL iteratively maintains a candidate set of classifiers that contains the optimal classifier with high probability. At the th iteration, the candidate set is constructed as all classifiers which have low estimated error on examples observed up to round . Based on , the algorithm constructs a disagreement set to be a set of instances on which there are at least two classifiers in that predict different labels. Then the algorithm draws a set of unlabeled examples, where the size of is a parameter of the algorithm. For each instance , if it falls into the disagreement region , then the algorithm queries for its label; otherwise, observing that all classifiers in have the same prediction on , its label is not queried. The queried labels are then used to update future candidate sets.
2.3 Background on Error Estimators
Most learning algorithms, including DBAL, require estimating the error rate of a classifier. A good error estimator should be unbiased and of low variance. When instances are observed with different probabilities, a commonly used error estimator is the standard importance sampling estimator that reweighs each observed labeled example according to the inverse probability of observing it.
Consider a simplified setting where the logged dataset and . On the online dataset , the algorithm uses a fixed query policy to determine whether to query for labels, that is, for . Let .
In this setting, the standard importance sampling (IS) error estimator for a classifier is:
(1) 
is unbiased, and its variance is proportional to . Although the learning algorithm can choose its query policy to avoid to be too small for , is the logging policy that cannot be changed. When is small for some , the estimator in (1) have a high variance such that it may be even better to just ignore the logged dataset .
An alternative is the multiple importance sampling (MIS) estimator with balanced heuristic
(Veach & Guibas, 1995):(2) 
It can be proved that
is indeed an unbiased estimator for
. Moreover, as proved in (Owen & Zhou, 2000; Agarwal et al., 2017), (2) always has a lower variance than both (1) and the standard importance sampling estimator that ignores the logged data.In this paper, we use multiple importance sampling estimators, and write as .
Additional Notations
In this paper, unless otherwise specified, all probabilities and expectations are over the distribution , and we drop from subscripts henceforth.
Let be the disagreement mass between and , and for be the empirical disagreement mass between and on .
For any , , define to be ball around . For any , define the disagreement region .
3 Algorithm
3.1 Main Ideas
Our algorithm employs the disagreementbased active learning framework, but modifies the main DBAL algorithm in three key ways.
Key Idea 1: WarmStart
Our algorithm applies a straightforward way of making use of the logged data inside the DBAL framework: to set the initial candidate set to be the set of classifiers that have a low empirical error on .
Key Idea 2: Multiple Importance Sampling
Our algorithm uses multiple importance sampling estimators instead of standard importance sampling estimators. As noted in the previous section, in our setting, multiple importance sampling estimators are unbiased and have lower variance, which results in a better performance guarantee.
We remark that the main purpose of using multiple importance sampling estimators here is to control the variance due to the predetermined logging policy. In the classical active learning setting without logged data, standard importance sampling can give satisfactory performance guarantees (Beygelzimer et al., 2009, 2010; Huang et al., 2015).
Key Idea 3: A Debiasing Query Strategy
The logging policy introduces bias into the logged data: some examples may be underrepresented since chooses to reveal their labels with lower probability. Our algorithm employs a debiasing query strategy to neutralize this effect. For any instance in the online data, the algorithm would query for its label with a lower probability if is relatively large.
It is clear that a lower query probability leads to fewer label queries. Moreover, we claim that our debiasing strategy, though queries for less labels, does not deteriorate our theoretical guarantee on the error rate of the final output classifier. To see this, we note that we can establish a concentration bound for multiple importance sampling estimators that with probability at least , for all ,
(3) 
where are sizes of logged data and online data respectively, and are query policy during the logging phase and the online phase respectively, and is an absolute constant (see Corollary 15 in Appendix for proof).
This concentration bound implies that for any , if is large, we can set to be relatively small (as long as ) while achieving the same concentration bound. Consequently, the upper bound on the final error rate that we can establish from this concentration bound would not be impacted by the debiasing querying strategy.
One technical difficulty of applying both multiple importance sampling and the debiasing strategy to the DBAL framework is adaptivity. Applying both methods requires that the query policy and consequently the importance weights in the error estimator are updated with observed examples in each iteration. In this case, the summands of the error estimator are not independent, and the estimator becomes an adaptive multiple importance sampling estimator whose convergence property is still an open problem (Cornuet et al., 2012).
To circumvent this convergence issue and establish rigorous theoretical guarantees, in each iteration, we compute the error estimator from a fresh sample set. In particular, we partition the logged data and the online data stream into disjoint subsets, and we use one logged subset and one online subset for each iteration.
3.2 Details of the Algorithm
The Algorithm is shown as Algorithm 1. Algorithm 1 runs in iterations where (recall is the size of the online data stream). For simplicity, we assume .
As noted in the previous subsection, we require the algorithm to use a disjoint sample set for each iteration. Thus, we partition the data as follows. The online data stream is partitioned into parts of sizes . We define for completeness. The logged data is partitioned into parts of sizes (where and we assume is an integer for simplicity. can take other values as long as it is a constant factor of ). The algorithm uses to construct an initial candidate set, and uses in iteration .
Algorithm 1 uses the disagreementbased active learning framework. At iteration (), it first constructs a candidate set which is the set of classifiers whose training error (using the multiple importance sampling estimator) on is small, and its disagreement region . At the end of the th iteration, it receives the th part of the online data stream from which it can query for labels. It only queries for labels inside the disagreement region . For any example outside the disagreement region, Algorithm 1 infers its label . Throughout this paper, we denote by , the set of examples with original labels, and by , the set of examples with inferred labels. The algorithm only observes and .
Algorithm 1 uses aforementioned debiasing query strategy, which leads to fewer label queries than the standard disagreementbased algorithms. To simplify our analysis, we round the query probability to be 0 or 1.
4 Analysis
4.1 Consistency
We first introduce some additional quantities.
Define to be the best classifier in , and to be its error rate. Let to be an absolute constant to be specified in Lemma 17 in Appendix.
We introduce some definitions that will be used to upperbound the size of the disagreement sets in our algorithm. Let . Recall . For , let , , . Let .
The following theorem gives statistical consistency of our algorithm.
Theorem 1.
There is an absolute constant such that for any , with probability at least ,
4.2 Label Complexity
We first introduce the adjusted disagreement coefficient, which characterizes the rate of decrease of the query region as the candidate set shrinks.
Definition 2.
For any measurable set , define to be
For any , , define the adjusted disagreement coefficient to be
The adjusted disagreement coefficient is a generalization of the standard disagreement coefficient (Hanneke, 2007) which has been widely used for analyzing active learning algorithms. The standard disagreement coefficient can be written as , and clearly for all .
We can upperbound the number of labels queried by our algorithm using the adjusted disagreement coefficient. (Recall that we only count labels queried during the online phase, and that )
Theorem 3.
There is an absolute constant such that for any , with probability at least , the number of labels queried by Algorithm 1 is at most:
4.3 Remarks
As a sanity check, note that when (i.e., all labels in the logged data are shown), our results reduce to the classical bounds for disagreementbased active learning with a warmstart.
Next, we compare the theoretical guarantees of our algorithm with some alternatives. We fix the target error rate to be , assume we are given logged data, and compare upper bounds on the number of labels required in the online phase to achieve the target error rate. Recall . Define , , .
The first alternative is passive learning that requests all labels for and finds an empirical risk minimizer using both logged data and online data. If standard importance sampling is used, the upper bound is . If multiple importance sampling is used, the upper bound is . Both bounds are worse than ours since and .
A second alternative is standard disagreementbased active learning with naive warmstart where the logged data is only used to construct an initial candidate set. For standard importance sampling, the upper bound is . For multiple importance sampling (i.e., out algorithm without the debiasing step), the upper bound is . Both bounds are worse than ours since and .
A third alternative is to merely use past policy to label data – that is, query on with probability in the online phase. The upper bound here is . This is worse than ours since and .
5 Experiments
We now empirically validate our theoretical results by comparing our algorithm with a few alternatives on several datasets and logging policies. In particular, we confirm that the test error of our classifier drops faster than several alternatives as the expected number of label queries increases. Furthermore, we investigate the effectiveness of two key components of our algorithm: multiple importance sampling and the debiasing query strategy.
5.1 Methodology
5.1.1 Algorithms and Implementations
To the best of our knowledge, no algorithms with theoretical guarantees have been proposed in the literature. We consider the overall performance of our algorithm against two natural baselines: standard passive learning (Passive) and the disagreementbased active learning algorithm with warm start (DBALw). To understand the contribution of multiple importance sampling and the debiasing query strategy, we also compare the results with the disagreementbased active learning with warm start that uses multiple importance sampling (DBALwm). We do not compare with the standard disagreementbased active learning that ignores the logged data since the contribution of warm start is clear: it always results in a smaller initial candidate set, and thus leads to less label queries.
Precisely, the algorithms we implement are:

Passive: A passive learning algorithm that queries labels for all examples in the online sequence and uses the standard importance sampling estimator to combine logged data and online data.

DBALw: A disagreementbased active learning algorithm that uses the standard importance sampling estimator, and constructs the initial candidate set with logged data. This algorithm only uses only our first key idea – warm start.

DBALwm: A disagreementbased active learning algorithm that uses the multiple importance sampling estimator, and constructs the initial candidate set with logged data. This algorithm uses our first and second key ideas, but not the debiasing query strategy. In other words, this method sets in Algorithm 1.

IDBAL: The method proposed in this paper: improved disagreementbased active learning algorithm with warm start that uses the multiple importance sampling estimator and the debiasing query strategy.
Our implementation of above algorithms follows Vowpal Wabbit (vw, ). Details can be found in Appendix.
5.1.2 Data
Due to lack of public datasets for learning with logged data, we convert datasets for standard binary classification into our setting. Specifically, we first randomly select 80% of the whole dataset as training data and the remaining 20% is test data. We randomly select 50% of the training set as logged data, and the remaining 50% is online data. We then run an artificial logging policy (to be specified later) on the logged data to determine whether each label should be revealed to the learning algorithm or not.
Experiments are conducted on synthetic data and 11 datasets from UCI datasets (Lichman, 2013) and LIBSVM datasets (Chang & Lin, 2011). The synthetic data is generated as follows: we generate 6000 30dimensional points uniformly from hypercube , and labels are assigned by a random linear classifier and then flipped with probability 0.1 independently.
We use the following four logging policies:

Identical: Each label is revealed with probability 0.005.

Uniform: We first assign each instance in the instance space to three groups with (approximately) equal probability. Then the labels in each group are revealed with probability 0.005, 0.05, and 0.5 respectively.

Uncertainty: We first train a coarse linear classifier using 10% of the data. Then, for an instance at distance to the decision boundary, we reveal its label with probability where is some constant. This policy is intended to simulate uncertainty sampling used in active learning.

Certainty: We first train a coarse linear classifier using 10% of the data. Then, for an instance at distance to the decision boundary, we reveal its label with probability where is some constant. This policy is intended to simulate a scenario where an action (i.e. querying for labels in our setting) is taken only if the current model is certain about its consequence.
5.1.3 Metrics and Parameter Tuning
The experiments are conducted as follows. For a fixed policy, for each dataset , we repeat the following process 10 times. At time , we first randomly generate a simulated logged dataset, an online dataset, and a test dataset as stated above. Then for , we set the horizon of the online data stream (in other words, we only allow the algorithm to use first examples in the online dataset), and run algorithm with parameter set (to be specified later) using the logged dataset and first examples in the online dataset. We record to be the number of label queries, and to be the test error of the learned linear classifier.
Let , . To evaluate the overall performance of algorithm with parameter set , we use the following area under the curve metric (see also (Huang et al., 2015)):
A small value of AUC means that the test error decays fast as the number of label queries increases.
The parameter set consists of two parameters:
For each policy, we report , the AUC under the parameter set that minimizes AUC for dataset and algorithm .
5.2 Results and Discussion
Dataset  Passive  DBALw  DBALwm  IDBAL 

synthetic  121.77  123.61  111.16  106.66 
letter  4.40  3.65  3.82  3.48 
skin  27.53  27.29  21.48  21.44 
magic  109.46  101.77  89.95  83.82 
covtype  228.04  209.56  208.82  220.27 
mushrooms  19.22  25.29  18.54  23.67 
phishing  78.49  73.40  70.54  71.68 
splice  65.97  67.54  65.73  65.66 
svmguide1  59.36  55.78  46.79  48.04 
a5a  53.34  50.8  51.10  51.21 
codrna  175.88  176.42  167.42  164.96 
german  65.76  68.68  59.31  61.54 
Dataset  Passive  DBALw  DBALwm  IDBAL 

synthetic  113.49  106.24  92.67  88.38 
letter  1.68  1.29  1.45  1.59 
skin  23.76  21.42  20.67  19.58 
magic  53.63  51.43  51.78  50.19 
covtype  262.34  287.40  274.81  263.82 
mushrooms  7.31  6.81  6.51  6.90 
phishing  42.53  39.56  39.19  37.02 
splice  88.61  89.61  90.98  87.75 
svmguide1  110.06  105.63  98.41  96.46 
a5a  46.96  48.79  49.50  47.60 
codrna  63.39  63.30  66.32  58.48 
german  63.60  55.87  56.22  55.79 
Dataset  Passive  DBALw  DBALwm  IDBAL 

synthetic  117.86  113.34  100.82  99.1 
letter  0.65  0.70  0.71  1.07 
skin  20.19  21.91  18.89  19.10 
magic  106.48  101.90  99.44  90.05 
covtype  272.48  274.53  271.37  251.56 
mushrooms  4.93  4.64  3.77  2.87 
phishing  52.96  48.62  46.55  46.59 
splice  62.94  63.49  60.00  58.56 
svmguide1  117.59  111.58  98.88  100.44 
a5a  70.97  72.15  65.37  69.54 
codrna  60.12  61.66  64.48  53.38 
german  62.64  58.87  56.91  56.67 
Dataset  Passive  DBALw  DBALwm  IDBAL 

synthetic  114.86  111.02  92.39  88.82 
letter  2.02  1.43  2.46  1.87 
skin  22.89  17.92  18.17  18.11 
magic  231.64  225.59  205.95  202.29 
covtype  235.68  240.86  228.94  216.57 
mushrooms  16.53  14.62  17.97  11.65 
phishing  34.70  37.83  35.28  33.73 
splice  125.32  129.46  122.74  122.26 
svmguide1  94.77  91.99  92.57  84.86 
a5a  119.51  132.27  138.48  125.53 
codrna  98.39  98.87  90.76  90.2 
german  63.47  58.05  61.16  59.12 
We report the AUCs for each algorithm under each policy and each dataset in Tables 2 to 4. The test error curves can be found in Appendix.
Overall Performance
The results confirm that the test error of the classifier output by our algorithm (IDBAL) drops faster than the baselines passive and DBALw: as demonstrated in Tables 2 to 4, IDBAL achieves lower AUC than both Passive and DBALw for a majority of datasets under all policies. We also see that IDBAL performs better than or close to DBALwm for all policies other than Identical. This confirms that among our two key novel ideas, using multiple importance sampling consistently results in a performance gain. Using the debiasing query strategy over multiple importance sampling also leads to performance gains, but these are less consistent.
The Effectiveness of Multiple Importance Sampling
As noted in Section 2.3, multiple importance sampling estimators have lower variance than standard importance sampling estimators, and thus can lead to a lower label complexity. This is verified in our experiments that DBALwm (DBAL with multiple importance sampling estimators) has a lower AUC than DBALw (DBAL with standard importance sampling estimator) on a majority of datasets under all policies.
The Effectiveness of the Debiasing Query Strategy
Under Identical policy, all labels in the logged data are revealed with equal probability. In this case, our algorithm IDBAL queries all examples in the disagreement region as DBALwm does. As shown in Table 2, IDBAL and DBALwm achieves the best AUC on similar number of datasets, and both methods outperform DBALw over most datasets.
Under Uniform, Uncertainty, and Certainty policies, labels in the logged data are revealed with different probabilities. In this case, IDBAL’s debiasing query strategy takes effect: it queries less frequently the instances that are wellrepresented in the logged data, and we show that this could lead to a lower label complexity theoretically. In our experiments, as shown in Tables 2 to 4, IDBAL does indeed outperform DBALwm on these policies empirically.
6 Related Work
Learning from logged observational data is a fundamental problem in machine learning with applications to causal inference
(Shalit et al., 2017), information retrieval (Strehl et al., 2010; Li et al., 2015; Hofmann et al., 2016), recommender systems (Li et al., 2010; Schnabel et al., 2016), online learning (Agarwal et al., 2014; Wang et al., 2017), and reinforcement learning
(Thomas, 2015; Thomas et al., 2015; Mandel et al., 2016). This problem is also closely related to covariate shift (Zadrozny, 2004; Sugiyama et al., 2007; BenDavid et al., 2010). Two variants are widely studied – first, when the logging policy is known, a problem known as learning from logged data (Li et al., 2015; Thomas et al., 2015; Swaminathan & Joachims, 2015a, b), and second, when this policy is unknown (Johansson et al., 2016; Athey & Imbens, 2016; Kallus, 2017; Shalit et al., 2017), a problem known as learning from observational data. Our work addresses the first problem.When the logging policy is unknown, the direct method (Dudík et al., 2011) finds a classifier using observed data. This method, however, is vulnerable to selection bias (Hofmann et al., 2016; Johansson et al., 2016). Existing debiasing procedures include (Athey & Imbens, 2016; Kallus, 2017), which proposes a treebased method to partition the data space, and (Johansson et al., 2016; Shalit et al., 2017)
, which proposes to use deep neural networks to learn a good representation for both the logged and population data.
When the logging policy is known
, we can learn a classifier by optimizing a loss function that is an unbiased estimator of the expected error rate. Even in this case, however, estimating the expected error rate of a classifier is not completely straightforward and has been one of the central problems in contextual bandit
(Wang et al., 2017), offpolicy evaluation (Jiang & Li, 2016), and other related fields. The most common solution is to use importance sampling according to the inverse propensity scores (Rosenbaum & Rubin, 1983). This method is unbiased when propensity scores are accurate, but may have high variance when some propensity scores are close to zero. To resolve this, (Bottou et al., 2013; Strehl et al., 2010; Swaminathan & Joachims, 2015a) propose to truncate the inverse propensity score, (Swaminathan & Joachims, 2015b) proposes to use normalized importance sampling, and (Jiang & Li, 2016; Dudík et al., 2011; Thomas & Brunskill, 2016; Wang et al., 2017) propose doubly robust estimators. Recently, (Thomas et al., 2015) and (Agarwal et al., 2017) suggest adjusting the importance weights according to data to further reduce the variance. We use the multiple importance sampling estimator (which have also been recently studied in (Agarwal et al., 2017) for policy evaluation), and we prove this estimator concentrates around the true expected loss tightly.Most existing work on learning with logged data falls into the passive learning paradigm, that is, they first collect the observational data and then train a classifier. In this work, we allow for active learning, that is, the algorithm could adaptively collect some labeled data. It has been shown in the active learning literature that adaptively selecting data to label can achieve high accuracy at low labeling cost (Balcan et al., 2009; Beygelzimer et al., 2010; Hanneke et al., 2014; Zhang & Chaudhuri, 2014; Huang et al., 2015). Krishnamurthy et al. (2017) study active learning with bandit feedback and give a disagreementbased learning algorithm.
To the best of our knowledge, there is no prior work with theoretical guarantees that combines passive and active learning with a logged observational dataset. Beygelzimer et al. (2009) consider active learning with warmstart where the algorithm is presented with a labeled dataset prior to active learning, but the labeled dataset is not observational: it is assumed to be drawn from the same distribution for the entire population, while in our work, we assume the logged dataset is in general drawn from a different distribution by a logging policy.
7 Conclusion and Future Work
We consider active learning with logged data. The logged data are collected by a predetermined logging policy while the learner’s goal is to learn a classifier over the entire population. We propose a new disagreementbased active learning algorithm that makes use of warm start, multiple importance sampling, and a debiasing query strategy. We show that theoretically our algorithm achieves better label complexity than alternative methods. Our theoretical results are further validated by empirical experiments on different datasets and logging policies.
This work can be extended in several ways. First, the derivation and analysis of the debiasing strategy are based on a variant of the concentration inequality (3) in subsection 3.1. The inequality relates the generalization error with the best error rate , but has a looser variance term than some existing bounds (for example (Cortes et al., 2010)). A more refined analysis on the concentration of weighted estimators could better characterize the performance of the proposed algorithm, and might also improve the debiasing strategy. Second, due to the dependency of multiple importance sampling, in Algorithm 1, the candidate set is constructed with only the th segment of data instead of all data collected so far . One future direction is to investigate how to utilize all collected data while provably controlling the variance of the weighted estimator. Finally, it would be interesting to investigate how to perform active learning from logged observational data without knowing the logging policy.
Acknowledgements
We thank NSF under CCF 1719133 for support. We thank Chris Meek, Adith Swaminathan, and Chicheng Zhang for helpful discussions. We also thank anonymous reviewers for constructive comments.
References
 (1) Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/.
 Agarwal et al. (2014) Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.
 Agarwal et al. (2017) Agarwal, A., Basu, S., Schnabel, T., and Joachims, T. Effective evaluation using logged bandit feedback from multiple loggers. arXiv preprint arXiv:1703.06180, 2017.
 Athey & Imbens (2016) Athey, S. and Imbens, G. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
 Balcan et al. (2009) Balcan, M.F., Beygelzimer, A., and Langford, J. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.
 BenDavid et al. (2010) BenDavid, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(12):151–175, 2010.
 Beygelzimer et al. (2009) Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In ICML, 2009.
 Beygelzimer et al. (2010) Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. Agnostic active learning without constraints. In NIPS, 2010.
 Bottou et al. (2013) Bottou, L., Peters, J., QuiñoneroCandela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Chang & Lin (2011)
Chang, C.C. and Lin, C.J.
Libsvm: a library for support vector machines.
ACM transactions on intelligent systems and technology (TIST), 2(3):27, 2011.  Cornuet et al. (2012) Cornuet, J., MARIN, J.M., Mira, A., and Robert, C. P. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.
 Cortes et al. (2010) Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450, 2010.
 Dudík et al. (2011) Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 1097–1104. Omnipress, 2011.
 Hanneke (2007) Hanneke, S. A bound on the label complexity of agnostic active learning. In ICML, 2007.
 Hanneke & Yang (2015) Hanneke, S. and Yang, L. Minimax analysis of active learning. Journal of Machine Learning Research, 16(12):3487–3602, 2015.
 Hanneke et al. (2014) Hanneke, S. et al. Theory of disagreementbased active learning. Foundations and Trends® in Machine Learning, 7(23):131–309, 2014.
 Hofmann et al. (2016) Hofmann, K., Li, L., Radlinski, F., et al. Online evaluation for information retrieval. Foundations and Trends® in Information Retrieval, 10(1):1–117, 2016.
 Hsu (2010) Hsu, D. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.
 Huang et al. (2015) Huang, T.K., Agarwal, A., Hsu, D. J., Langford, J., and Schapire, R. E. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pp. 2755–2763, 2015.
 Jiang & Li (2016) Jiang, N. and Li, L. Doubly robust offpolicy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine LearningVolume 48, pp. 652–661. JMLR. org, 2016.
 Johansson et al. (2016) Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020–3029, 2016.
 Kallus (2017) Kallus, N. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning, pp. 1789–1798, 2017.

Karampatziakis & Langford (2011)
Karampatziakis, N. and Langford, J.
Online importance weight aware updates.
In
Proceedings of the TwentySeventh Conference on Uncertainty in Artificial Intelligence
, pp. 392–399. AUAI Press, 2011.  Krishnamurthy et al. (2017) Krishnamurthy, A., Agarwal, A., Huang, T.K., Daumé, III, H., and Langford, J. Active learning for costsensitive classification. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1915–1924, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
 Li et al. (2015) Li, L., Chen, S., Kleban, J., and Gupta, A. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pp. 929–934. ACM, 2015.
 Lichman (2013) Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
 Mandel et al. (2016) Mandel, T., Liu, Y.E., Brunskill, E., and Popovic, Z. Offline evaluation of online reinforcement learning algorithms. 2016.
 Owen & Zhou (2000) Owen, A. and Zhou, Y. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
 Rosenbaum & Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 Schnabel et al. (2016) Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., and Joachims, T. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352, 2016.
 Shalit et al. (2017) Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3076–3085, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Strehl et al. (2010) Strehl, A., Langford, J., Li, L., and Kakade, S. M. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pp. 2217–2225, 2010.
 Sugiyama et al. (2007) Sugiyama, M., Krauledat, M., and MÃžller, K.R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
 Swaminathan & Joachims (2015a) Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814–823, 2015a.
 Swaminathan & Joachims (2015b) Swaminathan, A. and Joachims, T. The selfnormalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pp. 3231–3239, 2015b.
 Thomas & Brunskill (2016) Thomas, P. and Brunskill, E. Dataefficient offpolicy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
 Thomas (2015) Thomas, P. S. Safe reinforcement learning. 2015.
 Thomas et al. (2015) Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. Highconfidence offpolicy evaluation. In AAAI, pp. 3000–3006, 2015.
 Vapnik & Chervonenkis (1971) Vapnik, V. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.
 Veach & Guibas (1995) Veach, E. and Guibas, L. J. Optimally combining sampling techniques for monte carlo rendering. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pp. 419–428. ACM, 1995.
 Wang et al. (2017) Wang, Y.X., Agarwal, A., and Dudik, M. Optimal and adaptive offpolicy evaluation in contextual bandits. In International Conference on Machine Learning, pp. 3589–3597, 2017.
 Zadrozny (2004) Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twentyfirst international conference on Machine learning, pp. 114. ACM, 2004.
 Zhang & Chaudhuri (2014) Zhang, C. and Chaudhuri, K. Beyond disagreementbased agnostic active learning. In Advances in Neural Information Processing Systems, pp. 442–450, 2014.
Appendix A Preliminaries
a.1 Summary of Key Notations
Data Partitions
() is the online data collected in th iteration of size . , . We define . is the logged data and is partitioned into parts of sizes . .
Recall that and contain inferred labels while and are sets of examples with original labels. The algorithm only observes and .
For (, .
Disagreement Regions
The candidate set and its disagreement region are defined in Algorithm 1. . .
, . . .
. For , , .
Other Notations
, .
For , , . . .
a.2 Elementary Facts
Proposition 4.
Suppose ,. If , then .
Proof.
Since , where the second inequality follows from the RootMean SquareArithmetic Mean inequality. Thus, . ∎
a.3 Facts on Disagreement Regions and Candidate Sets
Lemma 5.
For any , any , any , .
Proof.
The case is obvious since and .
For , since , , and thus .
For any , if , then , so .
If , then , so where the first inequality follows from the fact that implies ∎
Lemma 6.
For any , if , then .
Proof.
For any that , if , then , so . If , then , so . ∎
The following lemma is immediate from definition.
Lemma 7.
For any , any , .
a.4 Facts on Multiple Importance Sampling Estimators
We recall that is an i.i.d. sequence. Moreover, the following fact is immediate by our construction that are disjoint and that is determined by .
Fact 8.
For any , conditioned on , examples in are independent, and examples in are i.i.d.. Besides, for any , , are independent.
Unless otherwise specified, all probabilities and expectations are over the random draw of all random variables (including