Issues of representation in sampling have plagued data analysis ever since the first survey was taken. A typical situation is when we sample a variable from a population, and wish to calculate , but we cannot trust that the average is a consistent estimator because we know that our sampling procedure was not uniform but rather biased. Horvitz-Thompson estimation , also known as importance weighting, inverse probability weighting, or propensity scoring, is a method by which the sampling probability of sample , call this , can be used to correct for sampling bias by inverse probability weighting:
. In the event that these sampling probabilities are all positive, then this estimate is unbiased for the population mean. Unfortunately, this requires us to know the sampling probabilities which is often unknown in problems with partially missing data including covariate shift in machine learning and causal inference.
Partial missingness is a prevalent feature of real world datasets. Often there are missing response variables, but the probability of sample being missing is unknown. This problem is made tractable if we have covariates, , that are non-missing, and we believe that the missingness is only dependent on . Many methods have been proposed that estimate the sampling probability
from the covariates using density estimation, or will estimate the missingness probability with a soft classifier. We will revisit a variant of an old method, called nearest neighbor hot-deck imputation (which we call simply 1NN imputation), in which the missing observations are replaced with one of their nearest neighbors in the covariate space.
The analysis of 1NN imputation can be reduced to studying the 1NN measure, which is an approximation of a desired population measure, (over the Borel sigma field in ). Suppose that we are given data points drawn iid from , which is our biased sampling procedure. For an integrable function , we would like to approximate , but only with evaluations over the data, and without access to the density (we assume that ). Let be the nearest neighbor of in Euclidean distance within . We will approximate the action of by the 1NN measure, , which we define as,
We will see that in probability, which will imply consistency of 1NN imputation. This limit point can be written as where which is also the limit of the ideal Horvitz-Thompson estimator.
Contribution: In this paper, we establish that 1NN imputation is the Horvitz-Thompson estimator in the large data limit. To demonstrate this we prove that the 1NN measure, , is weakly consistent for the desired population measure.
1.1 Imputation of missing data
Throughout, we will assume that there are fully observed covariates for
which are continuous random variables. Letbe a non-missingness indicator such that when then is observed for . Let , such that the observed data is . The goal of imputation is to obtain surrogates, , for the missing data. In this work, we will consider a variant of imputation where we augment as well, which we will see is also an importance weighting scheme. The imputed values will be used in downstream computations, and the goal of imputation is for the resulting statistics to be similar (in distribution) to what would have been obtained if the true values were observed. Throughout, we assume that the triplets
are drawn iid from some joint distribution. (Also, we will useto denote a dummy draw from this distribution.)
If the missingness is completely at random (MCAR) then this means that there is no dependence between the missingness and the data . In this case, nothing needs to be done in order to consistently estimate moments of the populations, we can just compute their empirical counterparts on the non-missing data. Instead, we will assume we are in the more realistic situation, that our data is missing at random (MAR), which means that the missingness is independent of the (possibly unobserved) response conditional on the covariate . The directed Markov graph consistent with this independence assumption is . While this assumption adds significant challenges to any estimation, it is not intractable, because we have the chance to understand the missingness mechanism and then correct for it.
We will consider statistics that are aggregates of non-linear elementwise operations (i.e. empirical moments). Most common operations on data-frames, such as sum, mean, variance, covariance, and count along with grouping operations and filters can be expressed in this way. Specifically, letbe a possibly non-linear integrable function then we will focus on estimating the following functional,
which is the expectation of for the missing population. We devote our attention to this because we already have consistent estimators for and .
For example, suppose we would like to express the following query, select mean(log(Y)) where X < 1 and M = 0, we could use the function (in this example, ). In words, this is to take the mean of the log of ’s such that is less than 1, but only for the missing data. Similarly, grouping can be accomplished by using multiple
’s for each filter corresponding to that group (equivalently, using a vector
). In addition to these queries, other common linear predictors such as kernel smoothers and local linear regression can be expressed with a dictionary of functions.
Of course, we are not able to make such a query because it is based on unobserved data. We will consider single imputation where we replace the missing datum , with a non-missing datum, . By imputing the missing data, we can then make the following query,
So, we can think of this as a weighted sum of the non-missing samples , where the weight is the proportion of times is the imputed value for a missing entry. In this way, that proportion acts as a surrogate to in a Horvitz-Thompson estimator. One important distinction between this method of imputation and standard imputation is that we not only fill in the missing data , but also augment the covariates . If is only dependent on then these methods result in the same estimate , and only when the query is dependent on the covariates does this matter.
Regression-based imputation constructs to be an estimate of
, such as the K-nearest neighbors, linear regression, or random forests prediction. This can result in an imputed functionalthat is severely biased for because in general. For example, suppose that we are estimating the second moment of the missing ’s, so that . Even if we had access to the regression function , if we imputed with this value then
if has non-zero conditional variance. Hot-deck imputation, selects from the non-missing data , and ideally, is distributionally similar to the unobserved (when we condition on ). This has the potential to alleviate our biasedness problem because non-linear moments are unbiased if is drawn iid to , so this will mean that may be a good point estimate for . Hot-deck imputation has been used in various forms [21, 11, 1] but is typically less commonly used, partly due to our heretofore lack of theoretical understanding.
We do not know the exact distribution to impute from, but here we can turn to nearest neighbors to find approximate samples. The idea behind nearest neighbor methods is that with enough data we can think of the closest points in the covariate space as having a nearly identically distributed response variable. So imputing with a nearest neighbor is a reasonable choice, and our method goes as follows.
Nearest neighbor imputation: Let be the th nearest neighbor of amoungst the non-missing data, , in the Euclidean metric. Then impute the missing value for as such that and compute according to (2).
1.2 Covariate shift and domain adaptation
Typically, in statistical learning we assume that our test and training sets are random draws from the same population, but there are situations in which this may be violated. When the training and test datasets have different distributions, then we have covariate shift . We think of our training data as our non-missing entries, , and our test data as our missing entries, , we will assume that the joint distribution is missing at random in this context as well.
Non-representative training data is a serious concern when we consider possible policy effects of our predictors. Several questions naturally arise when faced with possible covariate shift. First, can we detect covariate shift so that we can correct for it, or even resample the training data? Second, if we fit a classifier to our training data, can we estimate the test error, thereby assessing the impact of covariate shift without explicitly taking the shift into account? Third, can we adapt our classifiers to target test error given that we have observed covariate shift?
Detecting covariate shift is essentially a two sample testing problem, where we are determining if the covariates come from the sample population between training and test samples. Two-sample tests have been used for covariate shift in , and we will not address this problem. With that said, a condition that we require is that the training and test distributions are not too dissimilar in terms of Chi-squared divergence. It would be best to use these methods in concert with a covariate shift detector, to assess the existence and severity of covariate shift.
In order to assess test error, we will consider splitting our training set into training () and validation () sets. By training a classifier to the smaller training set , we can then obtain the validation losses for each . Then we can impute the losses, , for the test set, , with nearest neighbor imputation where the non-missing losses are restricted to the validation set. By conditioning on the new training data , we fix , and thereby this setting is the same as the standard imputation setting of the previous subsection, where . The test error is then estimated according to (2), and denote this estimator, .
Adapting one’s classifiers to the test population is called domain adaptation. Instead of relying on a classifier that is unaware of the covariate shift, we will consider modifying the classifier in order to ensure that the result is adapted to the test population. We would like to select a classifier that minimizes the true risk for the test population,
for where is a set of classifiers.  proposes estimating a surrogate for the density ratio , and then use inverse probability weighting with a plug-in estimate of the density ratio. Instead, we propose to estimate the risk by using 1NN imputation on the test set to obtain a shift-aware training set, and use this to train the classifier by minimizing the empirical risk. We will see that in the limit as , 1NN imputation is in fact implicit inverse probability weighting, making a pleasing connection to these methods.
1.3 Related work
 outlines many of the common methods for dealing with missing data, including imputation, importance weighting, and likelihood based methods (such as the EM algorithm). Importance weighting, or propensity scoring, is commonly used for correcting sampling selection bias , which arises in survey sampling and causal inference. Tree-based methods and SVMs have been used to obtain estimates of the ideal importance weights in [19, 16]. Hot-deck imputation has been studied empirically in [21, 11, 1] but is not a commonly used as model-based approaches.
Since the seminal work of , researchers have developed a rich understanding of -nearest neighbor methods for classification and regression , and density estimation [18, 10].  showed that the risk of the 1NN predictor bounds the true risk by a constant factor, but to achieve diminishing excess risk one requires that . Much of this theory is based on Stone’s lemma which states that, in Euclidean space, the number of data points of which a single point can be a nearest neighbor is bounded by a universal constant. In general metric spaces, it has been shown that we only require Lebesgue differentiability 
for KNN methods to be risk consistent. More recently, proposed a penalized variant of the the 1NN classifier in order to obtain a risk consistent classifier over metric spaces with more user friendly risk bounds. These settings differ from ours because they are done in the absence of covariate shift and do not imply weak consistency of the simple 1NN method. Recently, other uses of nearest neighbor methods, such as entropy estimation , and estimating residual variance  have stretched the theoretical limits of Stone’s lemma, and have required a more nuanced understanding of the asymptotic properties of 1NN methods. In , they show that in Euclidean space the -measure of a Voronoi cell, for data points drawn from , is approximately . It is reasonable to guess from this result that the -measure of a Voronoi cell, for data points drawn from approaches their relative density divided by , which is our main result.
Domain adaptive methods can be divided into two broad camps: specialized learning algorithms that are designed to target the test distribution, and weighting based modifications to existing learning algorithms. If the testing and training distributions are known, then it is natural to use importance weighted losses in training .  demonstrated excess risk bounds in this setting for hypothesis classes that are finite or have bounded pseudo-dimension. That work also considers importance weighting with imperfect weights, but those bounds leave the quality of the weights unknown (which this paper addresses for 1NN).  considers estimating the importance weights via kernel mean matching, which we can think of as a kernelized version of the 1NN method that we consider.  considers a modified objective for discriminative learning that simultaneously learns the weights and classifier, but little is known about its theoretical performance. 
considers an active KNN algorithm for covariate shift problems, and proves a risk bound in this setting, but this result is not directly comparable to ours due to the active learning setting, and the fact that they are able to use standard theoretical tools for KNN because they let.
2 Weak consistency of the 1-nearest neighbor measure
We will revisit the issue of 1NN imputation in the following section, but for now we will establish some key results that may be of independent interest. For the remainder of the section, we will, without loss of generality, assume that we are given data drawn iid from . We will consider measureable functions of the covariates , , and would like to estimate the expectation with respect to the measure (the missing population) but only with samples from (non-missing population) using , (1). We do this with the understanding that in the imputation context , (by reordering the samples), but the following results hold more generally, and they will be stated as such.
Another interpretation of the 1NN measure is by the -measure of Voronoi cells. Consider the Voronoi cells, which are the points in such that is the closest element of ,
Thus, we can rewrite the 1NN measure in terms of the Voronoi cells,
What follows is the main conclusion of this section, the weak consistency of the 1NN measure. First, let us establish some assumptions, which require that is doubling, and are not too dissimilar, and that have finite second and fourth moment.
We will assume that have densities with respect to the Lebesgue measure (), respectively, and that . This means that without loss of generality, although there are situations in which this could be unbounded at an accumulation point. Also, assume that is a doubling measures almost surely in the sense that there exists Borel set and constant such that and for any , ,
The Chi-squared divergence between and is finite, . Also, the test function is measurable and has finite second and fourth moments, and .
Notice that a bounded is much less restrictive then assuming that the density ratio is bounded uniformly.
Theorem 1 is the main result of this section, and we will divide the proof into two thrusts: demonstrating asymptotic unbiasedness and diminishing variance.
2.1 Implicit Horvitz-Thompson weighting and asymptotic unbiasedness
It has been shown by  that the consistency of nearest neighbor classifiers is reliant on the existence of a Lebesgue differentiation theorem (LDT) over the covariate space. Our theory diverges from the classification theory, and it is not enough to have the LDT over Euclidean space. We will require Hardy-Littlewood maximal inequalities, from which one can derive an LDT.
For a measurable function over measure , define the maximal function,
where is the ball of radius in around . The following fundamental theorem bounds the maximal function by various norms of , and it follows from the fact that is doubling.
Theorem 2 (Hardy-Littlewood Maximal Inequality ).
For doubling measure, , measurable , and all ,
and for ,
almost everywhere where is dependent on and the doubling constant of .
Hardy-Littlewood maximal inequalities imply a LDT for measure (see Theorem 5 in the Appendix), which we use in our proof as well as the standard LDT for the Lebesgue measure. The following theorem makes precise our claim that the -measure of Voronoi cells approaches the density ratio in the limit.
Under Assumption 1, in expectation, the -measure of a Voronoi cell around conditional on converges to the density ratio in the limit, namely,
for -almost all . Furthermore, we have the following bounds for all ,
where are constants depending on the doubling constant only.
Theorem 3 gives credence to the claim that the 1NN measure is performing Horvitz-Thompson estimation. We can now use this to show that the 1NN measure is (weakly) asymptotically unbiased for . The expected 1NN measure is,
We have pointwise convergence by (3),
and the RHS has expectation . We can establish dominated convergence by
where is the Chi-squared divergence. We come to our conclusion.
We conclude this subsection with our proof sketch of Theorem 3. This result, (3), for the case when can be found in . The proof (at the end of this subsection) borrows heavily from the corresponding result in that work, although we take an approach that relies on the doubling property of the Lebesgue measure, , and the sampling measure, . I had some difficulty in understanding the validity of the proof of Theorem 2.1(i), particularly how the Lebesgue differentiation theorem applied to fixed points and can then translate to the similar result uniformly over . This result seems to reduce to their case when although our proof requires the additional doubling assumption on . We also note that we still require that have densities with respect to the Lebesgue measure, and this seems to be an insurmountable requirement, although it will likely hold if we replace Lebesgue with a Haar measure over another topological vector space.
Proof sketch of Theorem 3. Throughout, let be some constant. Let and notice that
where . By integration by parts, where . Notice that by the doubling property of the measure (since it has a density with respect to Lebesgue),
Notice that in the following display is distributed uniform,
Hence, And so we have that
where . This is lower bounded by (a constant times) the maximum of independent uniform random variables, and so the maximum has a distribution for uniform . We can see that , and (5) follows.
Let . By the same reasoning,
which we can make arbitrarily small by choosing small. Considering the other component of this integral, for , . On the event that , we have that for any for large enough, . We can show that converges regularly to and use the LDT (see Theorem 5) for regularly convergent sets. By several applications of the LDT,
Thus, by similar calculations as above,
2.2 1NN measure variance bound
We have established that the 1NN measure is asymptotically unbiased, but consistency remains to be shown. Our main tool for showing this consistency is the following variance bound, which holds without any additional assumptions then those stated within.
Let and (). Let denote the -nearest neighbor of within , then for any -measurable function ,
Theorem 4 demonstrates that as long as and
are not too dissimilar, the variance of the 1NN measure is bounded by the discrepancy between the first and second nearest neighbor interpolants.
We will appeal to the Efron-Stein inequality, which states the following: Let be an iid copy of and , then for any function
Let and denote as the 1NN measure formed from the data, . Due to exchangeability,
Let and denote the 1NN within and the 1NN measure formed from the reduced data . We have that
In order for to differ from it must be that and . Thus,
Let be the Radon-Nikodym derivative. Considering this term,
We conclude by noting that the intergral in the left term is the Chi-squared divergence. ∎
Corollary 2 (1NN measure variance).
Under Assumption 2, we have that
3.1 Partially missing data
Recall the missing data setting, where we estimate the unobserved functional with . Throughout this section, we will focus on the convergence of our functional in the limit as . Recall that and any assumptions that are made for must hold for this function.
This result suggests that one can approximate any valid query with a 1NN weighted version of the non-missing data, and have the result be consistent. When the query is only a function of the response variable then (2) is equivalent to imputing the response variable with its 1NN. Despite these promising results they are restricted to functionals that are of this form. Also, using 1NN imputation when the missing data is very distributionally dissimilar to the non-missing data can invalidate the asympototic results.
3.2 Dealing with covariate shift
Let’s apply these results to assessing and adapting to covariate shift. While detecting covariate shift with a two sample test is an important first step to addressing bias in training, it would perhaps be more informative to estimate the resulting test error for the shifted dataset. A change in covariate distribution will not result in poor prediction performance if the shift occurs in a way that does not change the predictions or response distribution. In classification and regression under covariate shift, we will consider the outlined two objectives: assessing the impact of covariate shift with test loss imputation, and domain adaptation with a 1NN imputed training set.
We fit on a training set, compute the validation losses, and estimate the test error with 1NN imputation obtaining . Because is fixed, we can apply Corollary 3 to this setting. For a predictor, , we let then the main condition is . When is bounded, then this holds in the classification setting, and for squared error loss with noise that has finite fourth moment.
Suppose that the losses have finite conditional variance, for almost all , and that the test set and training set covariate measures, , satisfy Assumptions 1, 2. The 1NN imputation of the test losses from the validation set is consistent for test risk (recall that is a test point),
in probability as (the validation and test set increase).
Domain adaptation requires us to augment the training process when we have detected that covariate shift has taken place. We naturally, replace the empirical measure of the training set with the 1NN measure. To prove that this process is correct, we will require a uniform law of large numbers for the 1NN measure. It is certain that such a uniform law will not hold for all function classes, as is the case in empirical process theory. A (weak)1NN Glivenko-Cantelli class (1NN-GC) (for measures ) is a function class such that
in probability as the number of samples from increase.
For domain adapted training, we impute the test points () with the training points (), then we can compute a 1NN empirical risk
As in empirical risk minimization, we select the classifier that minimizes this functional, . As before, we use , and denote as the class of generated by . In this paper we will not address the portion of the risk that is dependent on . We anticipate that this will require uniform entropy conditions on the class of losses, which would arise from VC theory, for example. Instead, let’s focus on the excess risk that is a direct result of the 1NN imputation,
The RHS is bounded by the quantity in (6), so if is a 1NN-GC class then this component of the excess risk is diminishing.
We can see that classes that are bracketing are 1NN-GC classes. Specifically, let an -bracket be a set of -measurable functions such that and for all we have that almost surely for some . Let be the minimal such that there exists an -bracket.
Suppose that for any , then is (weak) 1NN-Glivenko Cantelli.
The 1NN measure is a natural object to study as it can be applied to any situation in which we want to approximate the action of one distribution with samples from another.
We anticipate that this result will be useful in other contexts.
We have proven that 1NN imputation, as described in (2), is a consistent estimator for moments over the missing population.
Our imputation method however, is designed to be used as an importance weighting scheme for computing moments, and is not necessarily applicable to other population quantities (such as the mode).
Imputation can be a fraught activity, because releasing imputed data may lead researchers to compute statistics that are not necessarily valid.
We also do not estimate the imputation variance in any way in this paper, and we reserve this for future work.
We have only touched on the use of 1NN imputation for domain adaptation, and a complete theoretical and empirical analysis of this method remains to be done.
Acknowledgements: JS is supported by NSF DMS 1712996.
-  Rebecca R Andridge and Roderick JA Little. A review of hot deck imputation for survey non-response. International statistical review, 78(1):40–64, 2010.
-  Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870–1879, 2015.
-  Gérard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.
-  Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.
-  Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
-  Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.
-  Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.
-  Luc Devroye, László Györfi, Gábor Lugosi, and Harro Walk. On the measure of voronoi cells. Journal of Applied Probability, 54(2):394–408, 2017.
-  Luc Devroye, Lászlo Györfi, Gábor Lugosi, Harro Walk, et al. A nearest neighbor estimate of the residual variance. Electronic Journal of Statistics, 12(1):1752–1778, 2018.
-  Luc P Devroye and Terry J Wagner. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pages 536–540, 1977.
-  Wayne A Fuller and Jae Kwang Kim. Hot deck imputation for the response model. Survey Methodology, 31(2):139, 2005.
-  Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.
-  Juha Heinonen. Lectures on analysis on metric spaces. Springer Science & Business Media, 2012.
-  Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
-  Aryeh Kontorovich and Roi Weiss. A bayes consistent 1-nn classifier. In Artificial Intelligence and Statistics, pages 480–488, 2015.
-  Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting using machine learning. Statistics in medicine, 29(3):337–346, 2010.
-  Roderick JA Little and Donald B Rubin. Statistical analysis with missing data, volume 333. John Wiley & Sons, 2014.
YP Mack and Murray Rosenblatt.
Multivariate k-nearest neighbor density estimates.
Journal of Multivariate Analysis, 9(1):1–15, 1979.
-  Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological methods, 9(4):403, 2004.
-  Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849–1857, 2010.
-  Marie Reilly. Data analysis using hot deck multiple imputation. The Statistician, pages 307–313, 1993.
-  Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
-  Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
-  Charles J Stone. Consistent nonparametric regression. The annals of statistics, pages 595–620, 1977.
-  Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
-  Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
-  Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827, 2013.
Proof of Cor. 2.
Let be the following weight for point :
We have that
Notice that is the weight for the 2-nearest neighbors regressor. And that
by Lemma 10.2 since conditions (i)-(iii) are regarding which is the same for -nearest neighbors. ∎
Proof of Cor. 3.
We can take the limit as and apply the law of large numbers,
where are fresh draws from the joint distribution. The right hand side has conditional variance,
Then we have that
By Theorem 3 we have that this expectation is bounded by
and by Theorem 2. This term vanishes. Hence, almost everywhere. What remains is to show that the conditional expectation is converging to the desired limit. But this follows immediately from the fact that , and Theorem 1. ∎
Proof of Cor. 5.
Let , then for fixed ,
By letting we have that . Thus, because was arbitrary, we have our result. ∎
Appendix B Proof of Theorem 3
Theorem 5 (Lebesgue Differentiation Theorem for Doubling Measures ).
Let be a doubling measure and be a measurable function. For -almost all points ,
Proof of Theorem 3.
Throughout, let be some constant. Let and notice that