1 Introduction
Counterfactual learning from observational data is an emerging problem that arises naturally in many applications. In this problem, the learner is given observational data – a set of examples selected according to some policy along with their labels – as well as access to the policy that selects the examples, and the goal is to construct a classifier with high performance on an entire population, not just the observational data distribution. An example is learning to predict if a treatment will be effective based on features of a patient. Here, we have some observational data on how the treatment works for patients that were assigned to it, but if the treatment is given only to a certain category of patients, then the data is not reflective of the population. Thus the main challenge in counterfactual learning is how to counteract the effect of the observation policy and build a classifier that applies more widely.
This work considers counterfactual learning in the active setting, which has received very recent attention in a few different contexts [22, 18, 2]. In addition to observational data, the learner has an online stream of unlabeled examples drawn from the underlying population distribution, and the ability to selectively label a subset of these in an interactive manner. The learner's goal is again to build a classifier while using as few label queries as possible. The advantage of the active setting over the passive one is its potential for more label-efficient solutions; the question, however, is how to achieve this algorithmically.
Prior work on this problem has looked at both probabilistic inference [18, 2] and standard classification [22], which is the setting of our work. [22] uses a modified version of disagreement-based active learning [6, 8, 3, 10], along with an importance weighted empirical risk to account for the population. However, a problem with this approach is that the importance weighted risk estimator can have extremely high variance when the importance weights, which reflect the inverse of how frequently an instance in the population is selected by the policy, are high; this may happen if, for example, certain patients are rarely given the treatment. This high variance in turn results in a high label requirement for the learner.
The problem of high variance in the loss estimator is addressed in the passive case by minimizing a form of counterfactual risk [19], an importance weighted loss that combines a variance regularizer and importance weight clipping or truncation to achieve low generalization error. A plausible solution is to use this risk for active learning as well. However, this cannot be readily achieved for two reasons. The first is that the variance regularizer itself is a function of the entire dataset, and is therefore challenging to use in interactive learning where data arrives sequentially. The second reason is that the minimizer of the (expected) counterfactual risk depends on the data size, which again is inconvenient for learning in an interactive manner.
In this work, we address both challenges. To address the first, we use, instead of a variance regularizer, a novel regularizer based on the second moment; the advantage is that it decomposes across multiple segments of the data set, which makes it amenable to active learning. We provide generalization bounds for this modified counterfactual risk minimizer, and show that it has almost the same performance as counterfactual risk minimization with a variance regularizer [19]. The second challenge arises because disagreement-based active learning ensures statistical consistency by maintaining a set of plausible minimizers of the expected risk. This is problematic when the minimizer of the expected risk itself changes between iterations, as is the case with our modified regularizer. We address this challenge by introducing a novel variant of disagreement-based active learning which is always guaranteed to maintain the population error minimizer in its plausible set.
Additionally, to improve sample efficiency, we propose a third novel component: a new sampling algorithm for correcting sample selection bias that selectively queries labels of those examples which are underrepresented in the observational data. Combining these three components gives us a new algorithm. We prove that this algorithm is statistically consistent, in the sense that it converges to the true minimizer of the population risk given enough data. We also analyze its label complexity, show it is better than prior work [22], and demonstrate the contribution of each component of the algorithm to the label complexity bound.
2 Related Work
We consider learning with logged observational data where the logging policy that selects the samples to be observed is known to the learner. The standard approach is importance sampling to derive an unbiased loss estimator [16], but this is known to suffer from high variance. One common approach for reducing variance is to clip or truncate the importance weights [5, 19], and we provide a new principled method for choosing the clipping threshold with theoretical guarantees. Another approach is to add a regularizer based on empirical variance to the loss function to favor models with low loss variance [14, 19, 15]. Our second moment regularizer achieves a similar effect, but has the advantage of being applicable to active learning with theoretical guarantees.
In this work, in addition to logged observational data, we allow the learner to actively acquire additional labeled examples. The closest to our work is [22], the only known work in the same setting. [22] and our work both use the disagreement-based active learning (DBAL) framework [6, 8, 3, 10] and multiple importance sampling [21] for combining actively acquired examples with logged observational data. [22] uses an importance weighted loss estimator which leads to high variance and hence high sample complexity. In our work, we incorporate a more efficient variance-controlled importance sampling into active learning and show that it leads to a better label complexity.
consider active learning for predicting individual treatment effect, which is similar to our task. They take a Bayesian approach which does not need to know the logging policy, but assumes the true model is from a known distribution family. Additionally, they do not provide label complexity bounds. A related line of research considers active learning for domain adaptation, and their methods are mostly based on heuristics [17, 24], utilizing a clustering structure [11], or non-parametric methods [12]. In other related settings, [23] considers warm-starting contextual bandits with the aim of minimizing the cumulative regret instead of the final prediction error; [13] studies active learning with bandit feedback without any logged observational data.
3 Problem Setup
We are given an instance space, a label space, and a hypothesis class, together with an underlying data distribution over instance-label pairs. For simplicity, we assume the hypothesis class is a finite set, but our results can be generalized to VC classes by standard arguments [20, 15].
In the passive setting for learning with observational data, the learner has access to a logged observational dataset generated from the following process. First, examples are drawn i.i.d. from the data distribution. Then a logging policy that describes the probability of observing the label is applied. In particular, for each example, an independent Bernoulli random variable with expectation given by the logging policy at that instance is drawn, and the label is revealed to the learner if this variable equals 1. (This generating process implies the standard unconfoundedness assumption in the counterfactual inference literature: the label is conditionally independent of the action, which indicates whether the label is observed, given the instance.) We call the resulting collection the logged dataset. We assume the learner knows the logging policy, and only observes instances, indicators, and revealed labels.
In the active learning setting, in addition to the logged dataset, the learner has access to a stream of online data. In particular, there is a stream of additional examples drawn i.i.d. from the data distribution. At each time step, the learner applies a query policy to compute an indicator, and the label is revealed if the indicator equals 1. The computation of the indicator may in general be randomized, and is based on the observed logged data, previously observed instances, decisions, and observed labels.
We focus on the active learning setting, and the goal of the learner is to learn a classifier from observed logged data and online data. For fixed problem parameters, the performance is measured by: (1) the error rate of the output classifier, and (2) the number of label queries on the online data. Note that the error rate is over the entire population instead of conditioned on the logging policy, and that we assume the labels of the logged data come at no cost. In this work, we are interested in the situation where the size of the online stream is smaller than the size of the logged dataset.
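The logging process described in this section can be simulated as follows. This is a minimal sketch under stated assumptions: the names `draw_logged_data`, `population`, and `logging_policy` are hypothetical, and a finite list of (instance, label) pairs stands in for the underlying data distribution.

```python
import random

def draw_logged_data(population, logging_policy, m, seed=0):
    """Simulate a logged dataset of size m.

    population: list of (x, y) pairs standing in for the data distribution
                (an assumption made for illustration).
    logging_policy: function mapping x to the probability of revealing y.
    Returns a list of (x, y_or_None, z, q) tuples, where z is the indicator
    and q is the logging probability that was applied to x.
    """
    rng = random.Random(seed)
    logged = []
    for _ in range(m):
        x, y = rng.choice(population)       # i.i.d. draw from the population
        q = logging_policy(x)
        z = 1 if rng.random() < q else 0    # Bernoulli indicator with mean q
        logged.append((x, y if z == 1 else None, z, q))
    return logged
```

The learner in this sketch sees the instance, the indicator, and the label only when the indicator is 1, which mirrors the observation model above.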
Notation
Unless otherwise specified, all probabilities and expectations are over the draw of all random variables . Define . Define the optimal classifier , . For any , define the ball around as . For any , define the disagreement region .
Due to space limits, all proofs are deferred to the Appendix.
4 Variance-Controlled Importance Sampling for Passive Learning with Observational Data
In the passive setting, the standard method to overcome sample selection bias is to optimize the importance weighted (IW) loss. This loss is an unbiased estimator of the population error, but its variance can be high, leading to poor solutions. Previous work addresses this issue by adding a variance regularizer [14, 19, 15] and clipping/truncating the importance weight [5, 19]. However, the variance regularizer is challenging to use in interactive learning when data arrives sequentially, and it is unclear how the clipping/truncating threshold should be chosen to yield good theoretical guarantees.
In this paper, as an alternative to the variance regularizer, we propose a novel second moment regularizer which achieves a similar error bound to the variance regularizer [15]; this in turn motivates a principled choice of the clipping threshold.
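As a rough illustration of the importance weighted loss discussed at the start of this section, the sketch below averages the 0-1 mistakes on observed labels, each weighted by the inverse of its logging probability. The function name `iw_loss` and the tuple layout `(x, y, z, q)` are assumptions made for this example, not notation from the paper.

```python
def iw_loss(h, logged):
    """Importance weighted 0-1 loss of classifier h.

    logged: list of (x, y, z, q) tuples, where z = 1 if the label y was
            revealed and q is the logging probability for x (assumed to be
            positive whenever z = 1).
    """
    total = 0.0
    for x, y, z, q in logged:
        if z == 1 and h(x) != y:
            total += 1.0 / q        # each observed mistake is weighted by 1/q
    return total / len(logged)

# Toy data: a single rarely logged example (q = 0.1) dominates the estimate.
logged = [(0, 1, 1, 0.5), (1, 0, 1, 1.0), (2, None, 0, 0.25), (3, 1, 1, 0.1)]
always_zero = lambda x: 0
estimate = iw_loss(always_zero, logged)   # (1/0.5 + 1/0.1) / 4
```

In the toy data, one mistake on an example with logging probability 0.1 contributes five times as much as a mistake on an example with probability 0.5, which is the source of the high variance the text describes.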
4.1 Second-Moment-Regularized Empirical Risk Minimization
Intuitively, between two classifiers with similarly small training loss, the one with lower variance should be preferred, since its population error would be small with a higher probability than the one with higher variance. Existing work encourages low variance by regularizing the loss with the estimated variance. Here, we propose to regularize with the estimated second moment, an upper bound on the estimated variance. We have the following generalization error bound for regularized ERM.
Theorem 1.
Let . For any , then with probability at least ,
Theorem 1 shows an error rate similar to the one for the variance regularizer [15]. However, the advantage of using the second moment is its decomposability across segments of the data set. This makes it easier to analyze for active learning, which we discuss later.
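The decomposability property can be checked numerically. The sketch below is illustrative only: `second_moment` and `regularized_objective` are hypothetical names, the square-root form of the penalty is an assumption modeled on variance-regularized bounds, and `lam` is a placeholder constant rather than the value used in Theorem 1.

```python
import math

def second_moment(h, data):
    """Estimated second moment: average of 1[h(x) != y] * z / q**2."""
    total = 0.0
    for x, y, z, q in data:
        if z == 1 and h(x) != y:
            total += 1.0 / (q * q)
    return total / len(data)

def regularized_objective(h, data, lam):
    """IW loss plus a second-moment penalty (lam is a hypothetical constant)."""
    loss = sum(1.0 / q for x, y, z, q in data if z == 1 and h(x) != y) / len(data)
    return loss + math.sqrt(lam * second_moment(h, data) / len(data))

# Decomposability: the size-weighted second moments of two segments add up
# to the size-weighted second moment of their union.
s1 = [(0, 1, 1, 0.5), (1, 0, 1, 1.0)]
s2 = [(2, None, 0, 0.25), (3, 1, 1, 0.1), (4, 1, 1, 0.2)]
h = lambda x: 0
lhs = len(s1) * second_moment(h, s1) + len(s2) * second_moment(h, s2)
rhs = len(s1 + s2) * second_moment(h, s1 + s2)
```

An empirical variance does not split this way, because it depends on a mean computed over the whole data set.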
Recall for the unregularized importance sampling loss minimizer , the error bound is [7, 22]. In Theorem 1, the extra term is due to the deviation of around , and is negligible when is large. In this case, learning with a second moment regularizer gives a better generalization bound.
This improvement in generalization error is due to the regularizer rather than to a tighter analysis. Similar to [14, 15], we show in Theorem 2 that for some distributions, the error bound in Theorem 1 cannot be achieved by any algorithm that simply optimizes the unregularized empirical loss.
Theorem 2.
For any , , there is a sample space , a hypothesis class , a distribution , and a logging policy such that , and that with probability at least over the draw of , if , then .
4.2 Clipped Importance Sampling
The variance, and hence the error bound, for second-moment regularized ERM can still be high if the importance weight is large. This factor arises inevitably to guarantee the importance weighted estimator is unbiased. Existing work alleviates the variance issue at the cost of some bias by clipping or truncating the importance weight. In this paper, we focus on clipping, where the loss estimator becomes . This estimator is no longer unbiased, but as the weight is clipped at a threshold, so is the variance. Although clipping has been studied previously [5, 19], to the best of our knowledge, it remains unclear how the clipping threshold can be chosen in a principled way.
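A clipped estimator can be sketched as follows. This example assumes that clipping means discarding observed examples whose importance weight exceeds the threshold (capping the weight at the threshold would be the truncation variant); the name `clipped_iw_loss` and the parameter `clip` are hypothetical.

```python
def clipped_iw_loss(h, logged, clip):
    """IW 0-1 loss that ignores examples whose weight 1/q exceeds `clip`.

    logged: list of (x, y, z, q) tuples as in the unclipped estimator.
    clip:   clipping threshold (assumed >= 1).
    """
    total = 0.0
    for x, y, z, q in logged:
        if z == 1 and h(x) != y:
            weight = 1.0 / q
            if weight <= clip:      # drop contributions with very large weights
                total += weight
    return total / len(logged)

logged = [(0, 1, 1, 0.5), (1, 0, 1, 1.0), (2, None, 0, 0.25), (3, 1, 1, 0.1)]
always_zero = lambda x: 0
small_clip = clipped_iw_loss(always_zero, logged, clip=5.0)    # weight 10 dropped
large_clip = clipped_iw_loss(always_zero, logged, clip=20.0)   # nothing dropped
```

With the smaller threshold the estimate loses the contribution of the rarely logged example, trading some bias for a bounded weight.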
We propose to choose . This choice minimizes an error bound for the clipped second-moment regularized ERM, and we formally show this in Appendix E. Example 30 in Appendix E shows that this clipping threshold avoids outputting suboptimal classifiers. This choice implies that the clipping threshold should be larger as the sample size increases, which confirms the intuition that with a larger sample size the variance becomes less of an issue than the bias. We have the following generalization error bound.
Theorem 3.
Let . For any , with probability at least ,
We always have as . Thus, this error bound is always no worse than that without clipping asymptotically.
5 Active Learning with Observational Data
Next, we consider active learning where, in addition to a logged observational dataset, the learner has access to a stream of unlabeled samples from which it can actively query for labels. The main challenges are how to control the variance due to the observational data with active learning, and how to leverage the logged observational data to reduce the number of label queries beyond simply using them for warm-start.
To address these challenges, we first propose a nontrivial change to Disagreement-Based Active Learning (DBAL) so that the variance-controlled importance sampling objective can be incorporated. This modified algorithm also works in a general cost-sensitive active learning setting, which we believe is of independent interest. Second, we show how to combine logged observational data with active learning through multiple importance sampling (MIS). Finally, we propose a novel sample selection bias correction technique to query regions under-explored in the observational data more frequently. We provide theoretical analysis demonstrating that the proposed method gives better label complexity guarantees than previous work [22] and other alternative methods.
Key Technique 1: Disagreement-Based Active Learning with Variance-Controlled Importance Sampling
The DBAL framework is a widely used general framework for active learning [6, 8, 3, 10]. This framework iteratively maintains a candidate set that serves as a confidence set for the optimal classifier. A disagreement region is then defined accordingly to be the set of instances on which there are two classifiers in the candidate set that predict labels differently. At each iteration, it draws a set of unlabeled instances. The labels for instances falling inside the disagreement region are queried; otherwise, the labels are inferred according to the unanimous prediction of the candidate set. These instances with inferred or queried labels are then used to shrink the candidate set.
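The query-or-infer rule of the classical DBAL framework can be sketched for a finite candidate set. This is a toy illustration rather than the algorithm proposed in this paper: the threshold classifiers, the `oracle`, and all function names are hypothetical.

```python
def disagree(candidates, x):
    """True if at least two classifiers in `candidates` predict differently on x."""
    return len({h(x) for h in candidates}) > 1

def label_example(candidates, x, query_label):
    """Return (label, was_queried) for x following the DBAL rule."""
    if disagree(candidates, x):
        return query_label(x), True      # inside the disagreement region: query
    return candidates[0](x), False       # unanimous prediction: infer the label

def make_threshold(t):
    """One-dimensional threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0

candidates = [make_threshold(t) for t in (0.3, 0.5, 0.7)]
oracle = make_threshold(0.45)            # hypothetical source of true labels
```

Here only instances between 0.3 and 0.7 fall in the disagreement region, so labels outside that interval are inferred at no query cost.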
The classical DBAL framework only considers the unregularized 0-1 loss. As discussed in the previous section, with observational data, unregularized loss leads to suboptimal label complexity. However, directly adding a regularizer breaks the statistical consistency of DBAL, since the proof of its consistency is contingent on two properties: (1) the minimizer of the population loss stays in all candidate sets with high probability; (2) the loss difference between any two classifiers in the candidate set does not change no matter how examples outside the disagreement region are labeled.
Unfortunately, if we add a variance-based regularizer (either estimated variance or second moment), the objective function has to change as the sample size increases, and so does the optimal classifier w.r.t. the regularized population loss. Consequently, this classifier may not stay in all candidate sets. Besides, the difference of the regularized loss changes if labels of examples outside the disagreement region are modified, breaking the second property.
To resolve the consistency issues, we first carefully choose the definition of the candidate set and guarantee that the optimal classifier w.r.t. the prediction error, instead of the regularized loss, stays in the candidate sets with high probability. Moreover, instead of the plain variance regularizer, we apply the second moment regularizer and exploit its decomposability property to bound the difference of the regularized loss for ensuring consistency.
Key Technique 2: Multiple Importance Sampling
MIS addresses how to combine logged observational data with actively collected data for training classifiers [1, 22]. To illustrate this, for simplicity, we assume a fixed query policy is used for active learning. To make use of both the logged data collected by the logging policy and the online data collected by the query policy, one could optimize the unbiased importance weighted error estimator, which can have high variance and lead to poor generalization error. Here, we apply the MIS estimator, which effectively treats the data as drawn from a mixture policy. The MIS estimator is also unbiased, but has lower variance than the importance weighted estimator and thus gives better error bounds.
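A minimal sketch of an MIS-style estimator with a fixed query policy is given below. It assumes the mixture weights are proportional to the two sample sizes (the balance heuristic of [21]); the names `mis_loss`, `q0`, and `q1` are introduced here for illustration.

```python
def mis_loss(h, logged, online, q0, q1):
    """MIS estimate of the 0-1 error from logged and online data.

    logged, online: lists of (x, y, z) tuples, z = 1 if label y was revealed.
    q0, q1: logging and query policies, each mapping x to a probability.
    Every observed mistake is weighted by the inverse of the mixture
    probability (m*q0(x) + n*q1(x)) / (m + n), regardless of its source.
    """
    m, n = len(logged), len(online)
    total = 0.0
    for x, y, z in logged + online:
        if z == 1 and h(x) != y:
            mixture = (m * q0(x) + n * q1(x)) / (m + n)
            total += 1.0 / mixture
    return total / (m + n)

q0 = lambda x: 0.1                      # made-up logging policy
q1 = lambda x: 1.0                      # query policy that always queries
logged = [(0, 1, 1), (1, None, 0)]
online = [(2, 1, 1), (3, 1, 1)]
always_zero = lambda x: 0
estimate = mis_loss(always_zero, logged, online, q0, q1)
```

In this toy setup the largest weight under the mixture is 1/0.55, whereas weighting the logged example by its own policy alone would give a weight of 10.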
Key Technique 3: Active Sample Selection Bias Correction
Another advantage of active learning is that the learner can apply a strategy to correct the sample selection bias, which improves label efficiency further. This strategy is inspired by the following intuition: due to sample selection bias caused by the logging policy, labels for some regions of the sample space may be less likely to be observed in the logged data, thus increasing the uncertainty in these regions. To counter this effect, during active learning, the learner should query more labels from such regions.
We formalize this intuition as follows. Suppose we would like to design a single query strategy that determines the probability of querying the label for an instance during the active learning phase. For any , we have the following generalization error bound for learning with logged examples and unlabeled examples from which the learner can select and query for labels (for simplicity of illustration, we use the unclipped estimator here)
We propose a query strategy that only queries instances whose propensity under the logging policy is small. This leads to fewer queries while guaranteeing an error bound close to the one achieved by querying every instance. In Appendix E we give an example, Example 31, showing the reduction of queries due to this strategy.
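The intuition behind this strategy can be sketched as a simple thresholding rule. The threshold value and the propensity table below are made up for illustration, and `make_query_policy` is a hypothetical helper; the paper's actual threshold is not reproduced here.

```python
def make_query_policy(q0, threshold):
    """Build a query policy that queries with probability 1 where the logging
    propensity q0(x) is at most `threshold`, and never queries elsewhere."""
    def q1(x):
        return 1.0 if q0(x) <= threshold else 0.0
    return q1

# Made-up propensities: the "old" group is rarely observed in the logged data.
propensity = {"young": 0.9, "middle": 0.4, "old": 0.05}
q1 = make_query_policy(propensity.get, threshold=0.1)

stream = ["old", "young", "old", "middle"]
queried = [x for x in stream if q1(x) == 1.0]
```

Only the under-represented group is queried in this toy stream, so two of the four online examples cost a label.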
The sample selection bias correction strategy is complementary to the DBAL technique. We note that a similar query strategy is proposed in [22], but the strategy here stems from a tighter analysis and can be applied with the variance control techniques discussed in Section 4, and thus gives better label complexity guarantees, as discussed in the analysis section.
5.1 Algorithm
Putting things together, our proposed algorithm is shown as Algorithm 1. It takes the logged data and an epoch schedule as input. It assumes the logging policy and its distribution are known (otherwise, these quantities can be estimated with unlabeled data).
Algorithm 1 uses the DBAL framework that recursively shrinks a candidate set and its corresponding disagreement region to save label queries by not querying examples outside the disagreement region. In particular, at each iteration, it computes a clipping threshold (step 5) and MIS weights, which are used to define the clipped MIS error estimator and two second moment estimators.
The algorithm shrinks the candidate set by eliminating classifiers whose estimated error is larger than a threshold that takes the minimum empirical error and the second moment into account (step 7), and defines a corresponding disagreement region as the set of all instances on which there are two classifiers in the candidate set that predict labels differently. It derives a query policy with the sample selection bias correction strategy (step 9). At the end of each iteration, it draws unlabeled examples. For each example with , if , the algorithm queries for the actual label and sets , otherwise it infers the label and sets . These examples and their inferred or queried labels are then used in subsequent iterations. In the last step of the algorithm, a classifier that minimizes the clipped MIS error with the second moment regularizer over all received data is returned.
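The candidate-set elimination step can be sketched schematically. The sketch below only shows the shape of the rule (keep classifiers whose estimated error is within a margin of the best one); the function `shrink_candidates`, the error table, and the constant margin are placeholders, whereas the margin in Algorithm 1 depends on second moment estimates.

```python
def shrink_candidates(candidates, est_error, slack):
    """Keep classifiers whose estimated error is within a margin of the best.

    candidates: list of classifier identifiers.
    est_error:  dict mapping each identifier to its estimated error.
    slack:      function of an identifier returning the allowed margin
                (a placeholder for the second-moment-dependent threshold).
    """
    best = min(est_error[h] for h in candidates)
    return [h for h in candidates if est_error[h] <= best + slack(h)]

est_error = {"a": 0.10, "b": 0.15, "c": 0.40}   # made-up estimates
survivors = shrink_candidates(["a", "b", "c"], est_error, lambda h: 0.1)
```

Classifiers removed at this step no longer contribute to the disagreement region, which is how the query region shrinks over iterations.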
5.2 Analysis
We have the following generalization error bound for Algorithm 1. Despite not querying for all labels, our algorithm achieves the same asymptotic bound as the one that queries labels for all online data.
Theorem 4.
Let be the final clipping threshold used in step 20. There is an absolute constant such that for any , with probability at least ,
Next, we analyze the number of labels queried by Algorithm 1 with the help of the following definitions.
Definition 5.
For any , define the modified disagreement coefficient . Define .
The modified disagreement coefficient measures the probability of the intersection of two sets: the disagreement region for the ball around and where the propensity score is smaller than . It characterizes the size of the querying region of Algorithm 1. Note that the standard disagreement coefficient [9], which is widely used for analyzing DBAL in the classical active learning setting, can be written as . Here, the modified disagreement coefficient modifies the standard definition to account for the reduction of the number of label queries due to the sample selection bias correction strategy: Algorithm 1 only queries examples on which is lower than some threshold, hence . Moreover, our modified disagreement coefficient is always smaller than the modified disagreement coefficient of [22] (denoted by ) which is used to analyze their algorithm.
Additionally, define to be the size ratio of logged and online data, let , define to be the minimum ratio between the clipping threshold and maximum MIS weight ( since by the choice of ), and define to be the maximum clipping threshold. Recall .
The following theorem upper-bounds the number of label queries made by Algorithm 1.
Theorem 6.
There is an absolute constant such that for any , with probability at least , the number of labels queried by Algorithm 1 is at most:
5.3 Discussion
In this subsection, we compare the performance of the proposed algorithm and some alternatives to understand the effect of the proposed techniques. The theoretical performance of learning algorithms is captured by label complexity, which is defined as the number of label queries required during the active learning phase to guarantee the test error of the output classifier to be at most (here is the optimal error , and is the target excess error). This can be derived by combining the upper bounds on the error (Theorem 4) and the number of queries (Theorem 6).

The label complexity is without clipping. This is derived by setting the final clipping threshold . It is worse since .

The label complexity is if regularizers are removed further. This is worse since .

The label complexity is if we further remove the sample selection bias correction strategy. Here the standard disagreement coefficient is used ().

The label complexity is if we further remove the MIS technique. It can be shown , so MIS gives a better label complexity bound.

The label complexity is if DBAL is further removed. Here, all online examples are queried. This demonstrates that DBAL decreases the label complexity bound by a factor of which is at most 1 by definition.

Finally, the label complexity is for [22], the only known algorithm in our setting. Here, , , and . Thus, the label complexity of the proposed algorithm is better than [22]. This improvement is made possible by the second moment regularizer, the principled clipping technique, and thereby the improved sample selection bias correction strategy.
6 Conclusion
We consider active learning with logged observational data where the learner is given an observational data set selected according to some logging policy, and can actively query for additional labels from an online data stream. Previous work applies disagreement-based active learning with an importance weighted loss estimator to account for counterfactuals, which has high variance and leads to a high label complexity. In this work, we utilize variance control techniques for importance weighted estimators, and propose a novel variant of DBAL to make it amenable to variance-controlled importance sampling. Based on these improvements, a new sample selection bias correction strategy is proposed to further boost label efficiency. Our theoretical analysis shows that the proposed algorithm is statistically consistent and more label-efficient than prior work and alternative methods.
Acknowledgements
We thank NSF under CCF 1719133 for support.
References
[1] Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. Effective evaluation using logged bandit feedback from multiple loggers. arXiv preprint arXiv:1703.06180, 2017.
[2] Onur Atan, William R. Zame, and Mihaela van der Schaar. Sequential patient recruitment and allocation for adaptive clinical trials. In Kamalika Chaudhuri and Masashi Sugiyama, editors, volume 89 of Proceedings of Machine Learning Research, pages 1891–1900. PMLR, 16–18 Apr 2019.
[3] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.
[4] P Borjesson and CE Sundberg. Simple approximations of the error function Q(x) for communications applications. IEEE Transactions on Communications, 27(3):639–643, 1979.
[5] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
[6] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.
[7] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.
[8] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.
[9] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[10] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.
[11] David Kale, Marjan Ghazvininejad, Anil Ramakrishna, Jingrui He, and Yan Liu. Hierarchical active transfer learning. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 514–522. SIAM, 2015.
[12] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, pages 1882–1886, 2018.
[13] Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1915–1924, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[14] A Maurer and M Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.
[15] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
[16] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[17] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.
[18] Iiris Sundin, Peter Schulam, Eero Siivola, Aki Vehtari, Suchi Saria, and Samuel Kaski. Active learning for decision-making from imbalanced observational data. arXiv preprint arXiv:1904.05268, 2019.
[19] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
[20] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.
[21] Eric Veach and Leonidas J Guibas. Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 419–428. ACM, 1995.
[22] Songbai Yan, Kamalika Chaudhuri, and Tara Javidi. Active learning with logged data. In International Conference on Machine Learning, pages 5517–5526, 2018.
[23] Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, and Sahand N Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301, 2019.
[24] Zihan Zhang, Xiaoming Jin, Lianghao Li, Guiguang Ding, and Qiang Yang. Multi-domain active learning for recommendation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Andre M Zubkov and Aleksandr A Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013.
Appendix A Preliminaries
A.1 Summary of Key Notations
Data
is the logged data. () is the online data collected in the th iteration of size , and equals either the actual label drawn from the data distribution or the inferred label according to the candidate set at iteration . .
For convenience, we additionally define to be the data set with the actual labels drawn from the data distribution, and . The algorithm only observes and , and are used for analysis only.
For ,, and we define , , . We assume for .
Recall that is an independent sequence, and furthermore is an i.i.d. sequence drawn from . For (, . Unless otherwise specified, all probabilities and expectations are over the random draw of all random variables .
Loss and Second Moment
The test error , the optimal classifier , and the optimal error . At the th iteration, the Multiple Importance Sampling (MIS) weight . The clipped MIS loss estimator . The (unclipped) MIS loss estimator .
The clipped second moment , . The clipped second-moment estimators , . The unclipped second moments (,) and second-moment estimators (,) are defined similarly.
Disagreement Regions
The ball around is defined as , and the disagreement region of is