Design-unbiased statistical learning in survey sampling

by   Luis Sanguiao Sande, et al.

Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, a general theory is lacking so far, which allows one to incorporate modern machine-learning techniques that can lead to potentially much more powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population, which is robust against potential mis-specifications of the assisting model at the individual level.



There are no comments yet.


page 1

page 2

page 3

page 4


Shapley regressions: A framework for statistical inference on machine learning models

Machine learning models often excel in the accuracy of their predictions...

Incidence weighting estimation under bipartite incidence graph sampling

Bipartite incidence graph sampling provides a unified representation of ...

Automated Selection of Post-Strata using a Model-Assisted Regression Tree Estimator

Auxiliary information can increase the efficiency of survey estimators t...

A Conditional Empirical Likelihood Based Method for Model Parameter Estimation from Complex survey Datasets

We consider an empirical likelihood framework for inference for a statis...

Bayesian Mixed Effects Model Estimation under Informative Sampling

When random effects are correlated with the response variable of interes...

Model-assisted estimation in high-dimensional settings for survey data

Model-assisted estimators have attracted a lot of attention in the last ...

The ALAMO approach to machine learning

ALAMO is a computational methodology for leaning algebraic functions fro...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Approximately design-unbiased model-assisted estimation is not new. It has become the standard practice in survey sampling, following many influential works such as Särndal et al. (1992), Deville and Särndal (1992). However, there lacks so far a theory, which allows one to generally incorporate the many common machine-learning (ML) techniques. For instance, according to Breit and Opsomer (2017, p. 203), they “are not aware of direct uses of random forests in a model-assisted survey estimator”. Since modern ML techniques can often generate more flexible and powerful prediction models, when rich auxiliary feature data are available, the potentials are worth exploring, in any situation where the practical advantages of linear weighting are not essential compared to the efficiency gains that can be achieved by alternative non-linear ML techniques.

We propose a subsampling Rao-Blackwell (SRB) method, which enables exactly design-unbiased estimation with the help of linear or non-linear prediction models. Monte Carlo (MC) versions of the proposed method can be used in cases where exact RB method is computationally too costly. The MC-SRB method is still exactly design-unbiased, despite it is somewhat less efficient due to the additional MC error. In practice, though, one can easily balance between the numerical efficiency of the MC-SRB method against the statistical efficiency of the corresponding exact RB method.

The SRB method makes use of three classic ideas from Statistical Science and Machine Learning. On the one hand, the training-test split of the sample of observations in ML generates errors in the test set rather than residuals, conditional on the training dataset, which as we shall explain is the key to achieving exact design-unbiasedness. For model-assisted survey estimation we use this idea to remove the finite-sample bias. On the other hand, Rao-Blackwellisation (Rao, 1945; Blackwell, 1947) and model-assisted estimation (Cassel et al., 1976) are powerful ideas in Statistics and survey sampling, which we apply to ML techniques to obtain design-unbiased survey estimators at the population level.

We shall refer to the amalgamation as statistical learning

, since the term model-assisted estimation is entrenched with the property of approximate design-unbiasedness (e.g. Särndal 2010; Breit and Opsomer, 2017), whereas the focus of population-level estimation and associated variance estimation is unusual in the ML literature.

In applications one needs to ensure design-consistency of the proposed SRB method, in addition to exact design-unbiasedness. The property can readily be established for parametric or many semi-parametric assisting models. But the conditions required for non-parametric algorithmic ML prediction models have so far eluded a treatment in the literature. Indeed, this has been a main reason preventing the incorporation of such ML techniques in model-assisted estimation from survey sampling. We shall develop general stability conditions for design-consistency under both simple random sampling and arbitrary unequal probability sampling designs.

For the first time, design-unbiased model-assisted estimation can thereby be achieved generally in survey sampling. Wherever rich feature data are available, the approach of statistical learning developed in this paper enables one to adopt suitable ML techniques, which can make much more efficient use of the available auxiliary information.

The rest of the paper is organised as follows. In Section 2, we describe the SRB method that uses an assisting linear model. The underlying ideas of design-unbiased statistical learning are explained, as well as the differences to the standard model-assisted generalised regression estimation. Some basic methods of variance estimation are outlined, where a novel jackknife variance estimator is developed for the SRB method. We move on to non-linear ML techniques in Section (3). The similarity and difference to the bootstrap aggregating (Breiman, 1996b) approach are explored. Moreover, we investigate and prove the stability conditions for design-consistency of SRB method that uses non-parametric algorithmic prediction models. Two simulation studies are presented in Section 4, which illustrate the potential gains of the proposed unbiased statistical learning approach, compared to standard linear model-assisted or model-based approaches. A brief summary and topics for future research will be given in Section 5.

2 Unbiased linear estimation

In this section we consider unbiased linear estimation in survey sampling, which builds on generalised regression (GREG) estimation (Särndal et al. 1992). The GREG estimator is the most common estimation method in practical survey sampling. It is consistent under mild regularity conditions, and is often more efficient than exactly unbiased Horvitz-Thompson (HT) estimation (Horvitz and Thompson, 1952). The proposes subsampling Rao-Blackwellisation (SRB) method removes the finite-sample bias of GREG generally, whose relative efficiency is comparable to the standard GREG estimator.

2.1 Bias correction by subsampling

Let be a sample (of size ) selected from the population of size , with probability , where over all possible samples under a given sampling design. Let be the sample inclusion probability, for each . Let be a survey variable, for , with unknown population total .

Let the assisting linear model expectation of be given by , where

is the vector of covariates for each

. Let be the estimator of , where is a weighted least squares (WLS) estimator of

. It is possible to attach additional heteroscedasticity weights in the WLS; but the development below is invariant to such variations, so that it is more convenient to simply ignore it in the notation. Let

. The GREG estimator of is given as

While is design-consistent under mild regularity conditions (e.g. Särndal et al. 1992), as , it is usually biased given finite sample size , except in special cases such as when and , where and .

To remove the potential finite-sample bias of , consider subsampling of , with known probability , such as SRS with fixed , where . The induced probability of selecting from is given by

where is the corresponding inclusion probability for . Let be the complement of in . Let the conditional sampling probability of given be

and let be the corresponding conditional inclusion probability in for . Let be the estimate of based on the sub-sample , where . Let


In other words, it is the sum of in and a difference estimator of the remaining population total based on , via that does not depend on the observations in .


The estimator is conditionally unbiased for over given , denoted by , as well as unconditionally over , denoted by .

Proof: As is fixed for any given , the last two terms on the right-hand side of (1) is unbiased for given . It follows that is conditionally unbiased for given ; hence, design-unbiased over unconditionally as well.

Example: Simple random sampling (SRS)

Suppose SRS without replacement of from , and from with fixed size , such that and . In the special case of , is the sample mean in , and

which amounts to using the sample mean in to estimate the population mean outside of the given , i.e., instead of using the sample mean in for the whole population mean. Thus, achieves unbiasedness generally, but at a cost of increased variance.

2.2 Rao-Blackwellisation

One can reduce the variance of by the Rao-Blackwell method (Rao, 1945; Blackwell, 1947). The minimal sufficient statistic in the finite population sampling setting is simply . Applying the RB method to by (1) yields , which is given by the conditional expectation of given , i.e.


where the expectation is evaluated with respect to , and the second expression is leaner as long as one keeps in mind that are treated as fixed constants associated with the distinctive units.


The estimator is design-unbiased for , denoted by .

Proof: By construction, the combined randomisation distribution induced by and is the same as that induced by and , for any and . Thus,

Next, for the variance of over , i.e. , we notice

since . Juxtaposing the two expressions of above, we obtain


where is the variance reduction compared to .


Provided unbiased variance estimator with respect to , i.e. , a design-unbiased variance estimator for is given by

Proof: By stipulation, we have , which is the first term on the right-hand side of (3). The result follows immediately.

Example: SRS, cont’d

In the special case of and , we have

if and denotes the mean in . The RB estimator follows as

which is the usual unbiased full-sample expansion estimator in this case. The RB method thus recovers the lost efficiency of any on its own.

Let , and . To express as a linear combination of , we rewrite as


It follows that the RB estimator (2) can be given as a linear estimator


This has an important practical advantage that can be applied to produce numerically consistent cross-tabulation of multiple survey variables of interest.

In the case of SRS of with , the RB weight in (4) is the average of ’s over possible subsamples , for a given unit , where when does not include the unit , otherwise is the corresponding GREG weight for , which is different for each of the rest subsamples that includes the unit .

2.3 Relative efficiency to GREG

Let and for . Expanding the GREG estimator around yields

For , the first two terms on the right-hand side of (1) becomes if there exists a vector such that , in which case is a function of , i.e.

where is conditionally unbiased for given , and similarly for . Let and . We have , since and aim at the same population parameter, especially if is close to . In any case, expanding around yields


where if and 0 of . Thus, we obtain


Notice that is a constant. Thus, compared to , the variance of involves that of in addition. As , the first term on the right-hand side of (5) is provided , whereas the second term is if provided the usual regularity conditions for GREG. As long as the sampling fraction is small, the first term will dominate, in which case the variance of the RB estimator is of the same order as that of the GREG estimator .

Example: SRS, cont’d

Let , where . We have

Let be the population variance of . The variance of the first-term in (5) is

which is actually smaller than the approximate variance of the GREG estimator under SRS, although the difference will not be noteworthy in practical terms, if the sampling fraction is small, since . Meanwhile, due to the additional variance of , the estimator by unbiased RB method can possibly have a larger variance than the biased GREG (with general ). It seems that one should use large if possible, to keep the additional variance due to small.

2.4 Delete-one RB method

The largest possible size of is . We refer to Rao-Blackwellisation based on SRS of with as the delete-one (or leave-one-out, LOO) RB method. The conditional sampling design is not measurable in this case, in that one cannot have an unbiased variance estimator based on a single observation in . For an approximate variance estimator, we reconsider the basic case where form a sample of independent and identically distributed (IID) observations, in order to develop an analogy to the classic jackknife variance estimation (Tukey, 1958).

Denote by the population mean that is also the expectation of each , for . As before, let denote the mean in the subsample . Following (1), let

be the delete- estimator of , where acts as an unbiased estimator of the population mean outside . The RB method yields the whole sample mean, denoted by

Observe that we have , where


Thus, the RB estimator is the mean of an IID sample of observations , for , as in the development of classic jackknife variance estimation, so that we obtain

Notice that, in this case, the IID observations used for the classic development of jackknife method are given by instead of (6), where .

For the delete-one RB method based on (1) and (2) given auxiliary , we have , such that the estimator can be denoted by , based on , where it is simply the delete- jackknife regression coefficients estimator. Rewrite the corresponding population total estimator by (1) as

such that the RB method yields by (2), as the mean of over . We propose a jackknife variance estimator for , given by



Notice that it may be the case under general unequal probability sampling that the conditional inclusion probability given is not exactly known. However, in many situations where the sampling fraction is low, it is reasonable that

An approximate delete-one RB estimator following (2) can then be given as


with for jackknife variance estimation on replacing by . Meanwhile, the delete-one jackknife replicates of GREG can be written as

The estimator is quite close to the approximate RB-estimator (8); indeed, identical apart from in the special case of . This is not surprising, since the jackknife-based is an alternative for reducing the bias of the GREG estimator. The difference is that, provided is known, the proposed RB method will be exactly design-unbiased, but not the jackknife-based . Finally, the resemblance between and is another indication that the relative efficiency of the delete-one RB method is usually not a concern compared to the standard GREG estimator .

2.5 Monte Carlo RB

Exact Rao-Blackwellisation can be computationally expensive, when the cardinality of the subsample space (of ) is large. Instead of calculating the RB estimator exactly, consider the Monte Carlo (MC) RB estimator given as follows:


where is the estimator based on the th subsample, for , which are realisations of from , such that is a Monte Carlo approximation of .


The estimator is design-unbiased for , denoted by .

Proof: The result follows from .

Adopting a computationally manageable entails an increase of variance, i.e. , compared to , so that the variance of is given by


Due to the IID construction of , an unbiased estimator of is given by

This allows one to control the statistical efficiency of the MC-RB method, i.e. the choice of is acceptable when is deemed small enough in practical terms.


Provided unbiased variance estimator with respect to , i.e. , a design-unbiased variance estimator for is given by

Proof: Due to the IID construction of , is an unbiased estimator of the first term on the right-hand side of (10), while is an unbiased estimator of the second term. The result follows.

Finally, for the delete-one RB method, where unbiased variance estimator is not available now that , a practical option is to first apply the jackknife variance estimator (7) to the samples, as if where the exact RB estimator , and then add to it the extra term for the additional Monte Carlo error. This would allow one to use the Monte Carlo delete-one RB method in general.

3 Unbiased non-linear learning

In this section we consider design-unbiased estimation in survey sampling, which builds on arbitrary ML technique that can be non-linear as well as non-parametric.

3.1 Design-unbiased ML for survey sampling

Denote by the model or algorithm that aims to predict given . Let be the training set, and the test set. Let be the trained model based on , yielding as the corresponding -predictor of given . Apply the trained model to yields the prediction errors of conditional on , denoted by . In contrast, the same discrepancy is referred to as the residuals of , when it is calculated for , denoted by , including when the training set is equal to . In standard ML, the errors in the test set are used to select different trained algorithms, or to assess how well a trained algorithm can be expected to perform when applied to the units with unknown ’s.

From an inference point of view, a basic problem with the standard ML approach above arises because one needs to be able to ‘extrapolate’ the information in to the units outside

, in order for supervised learning to have any value at all. This is simply because

are all observed and prediction in any form is unnecessary for . No matter how the training-test split is carried out, one cannot ensure valid for , unless is selected from the entire reference set of units, i.e. the population , in some non-informative (or representative) manner. This is the well-known problem of observational studies in statistical science, which is sometimes recast as the problem of concept drift in the ML literature (e.g. Tsymbal, 2004).

A design -unbiased approach to M-assisted estimation of population total can be achieved with respective to

  • a probability sample from , with probability , and

  • a probabilistic scheme for the training-test split given .

Explicitly, let be the estimator of obtained from the realised sample and subsample given the model . It is said to be design -unbiased for , provided