# Design-unbiased statistical learning in survey sampling

Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, a general theory is lacking so far, which allows one to incorporate modern machine-learning techniques that can lead to potentially much more powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population, which is robust against potential mis-specifications of the assisting model at the individual level.


## 1 Introduction

Approximately design-unbiased model-assisted estimation is not new. It has become the standard practice in survey sampling, following many influential works such as Särndal et al. (1992) and Deville and Särndal (1992). However, a general theory that allows one to incorporate the many common machine-learning (ML) techniques has so far been lacking. For instance, according to Breidt and Opsomer (2017, p. 203), they “are not aware of direct uses of random forests in a model-assisted survey estimator”. Since modern ML techniques can often generate more flexible and powerful prediction models when rich auxiliary feature data are available, the potential is worth exploring in any situation where the practical advantages of linear weighting are not essential compared to the efficiency gains that can be achieved by alternative non-linear ML techniques.

We propose a subsampling Rao-Blackwell (SRB) method, which enables exactly design-unbiased estimation with the help of linear or non-linear prediction models. Monte Carlo (MC) versions of the proposed method can be used in cases where the exact RB method is computationally too costly. The MC-SRB method is still exactly design-unbiased, although it is somewhat less efficient due to the additional MC error. In practice, though, one can easily balance the numerical efficiency of the MC-SRB method against the statistical efficiency of the corresponding exact RB method.

The SRB method makes use of three classic ideas from Statistical Science and Machine Learning. On the one hand, the training-test split of the sample of observations in ML generates errors in the test set rather than residuals, conditional on the training dataset, which as we shall explain is the key to achieving exact design-unbiasedness. For model-assisted survey estimation we use this idea to remove the finite-sample bias. On the other hand, Rao-Blackwellisation (Rao, 1945; Blackwell, 1947) and model-assisted estimation (Cassel et al., 1976) are powerful ideas in Statistics and survey sampling, which we apply to ML techniques to obtain design-unbiased survey estimators at the population level.

We shall refer to the amalgamation as statistical learning, since the term model-assisted estimation is entrenched with the property of approximate design-unbiasedness (e.g. Särndal, 2010; Breidt and Opsomer, 2017), whereas the focus on population-level estimation and associated variance estimation is unusual in the ML literature.

In applications one needs to ensure design-consistency of the proposed SRB method, in addition to exact design-unbiasedness. The property can readily be established for parametric or many semi-parametric assisting models. But the conditions required for non-parametric algorithmic ML prediction models have so far eluded a treatment in the literature. Indeed, this has been a main reason preventing the incorporation of such ML techniques in model-assisted estimation from survey sampling. We shall develop general stability conditions for design-consistency under both simple random sampling and arbitrary unequal probability sampling designs.

For the first time, design-unbiased model-assisted estimation can thereby be achieved generally in survey sampling. Wherever rich feature data are available, the approach of statistical learning developed in this paper enables one to adopt suitable ML techniques, which can make much more efficient use of the available auxiliary information.

The rest of the paper is organised as follows. In Section 2, we describe the SRB method that uses an assisting linear model. The underlying ideas of design-unbiased statistical learning are explained, as well as the differences to standard model-assisted generalised regression estimation. Some basic methods of variance estimation are outlined, where a novel jackknife variance estimator is developed for the SRB method. We move on to non-linear ML techniques in Section 3. The similarities and differences to the bootstrap aggregating (Breiman, 1996b) approach are explored. Moreover, we investigate and prove stability conditions for design-consistency of the SRB method that uses non-parametric algorithmic prediction models. Two simulation studies are presented in Section 4, which illustrate the potential gains of the proposed unbiased statistical learning approach, compared to standard linear model-assisted or model-based approaches. A brief summary and topics for future research are given in Section 5.

## 2 Unbiased linear estimation

In this section we consider unbiased linear estimation in survey sampling, which builds on generalised regression (GREG) estimation (Särndal et al., 1992). The GREG estimator is the most common estimation method in practical survey sampling. It is consistent under mild regularity conditions, and is often more efficient than the exactly unbiased Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952). The proposed subsampling Rao-Blackwellisation (SRB) method removes the finite-sample bias of GREG generally, while its relative efficiency remains comparable to that of the standard GREG estimator.

### 2.1 Bias correction by subsampling

Let $s$ be a sample (of size $n$) selected from the population $U$ of size $N$, with probability $p(s)$, where $\sum_s p(s) = 1$ over all possible samples under a given sampling design. Let $\pi_i = \Pr(i \in s)$ be the sample inclusion probability, for each $i \in U$. Let $y_i$ be a survey variable, for $i \in U$, with unknown population total $Y = \sum_{i \in U} y_i$.

Let the assisting linear model expectation of $y_i$ be given by $x_i^\top \beta$, where $x_i$ is the vector of covariates for each $i \in U$. Let $x_i^\top b$ be the estimator of $x_i^\top \beta$, where $b$ is a weighted least squares (WLS) estimator of $\beta$. It is possible to attach additional heteroscedasticity weights in the WLS; but the development below is invariant to such variations, so that it is more convenient to simply ignore them in the notation. Let $X = \sum_{i \in U} x_i$. The GREG estimator of $Y$ is given as

$$\hat{Y}_{GR} = X^\top b + \sum_{i \in s} (y_i - x_i^\top b)/\pi_i$$

While $\hat{Y}_{GR}$ is design-consistent under mild regularity conditions (e.g. Särndal et al., 1992), as $n, N \to \infty$, it is usually biased given finite sample size $n$, except in special cases, such as when the design is self-weighting with $\pi_i \equiv n/N$ and $x_i \equiv 1$, where $b = \bar{y}_s$ and $\hat{Y}_{GR} = N \bar{y}_s$ is the unbiased expansion estimator.
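As a concrete illustration, the following sketch computes $\hat{Y}_{GR}$ on simulated data (population size, sample size, seed and coefficients are all illustrative, not taken from the paper). When $y_i$ is exactly linear in $x_i$, the sample residuals vanish and the GREG estimate hits the population total exactly, which serves as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population: N units, covariate vector x_i = (1, z_i)
N, n = 1000, 100
z = rng.uniform(0, 10, N)
X = np.column_stack([np.ones(N), z])      # population covariate matrix
beta = np.array([2.0, 3.0])
y = X @ beta                              # exactly linear, for the check below
Y = y.sum()                               # population total (known here only for checking)

s = rng.choice(N, size=n, replace=False)  # SRS without replacement
pi = np.full(N, n / N)                    # inclusion probabilities

# WLS fit on the sample, with weights 1/pi_i (constant under SRS)
W = np.diag(1 / pi[s])
b = np.linalg.solve(X[s].T @ W @ X[s], X[s].T @ W @ y[s])

# GREG estimator: X_tot' b + sum_{i in s} (y_i - x_i' b) / pi_i
X_tot = X.sum(axis=0)
Y_greg = X_tot @ b + np.sum((y[s] - X[s] @ b) / pi[s])

# With y exactly linear in x, the residuals vanish and GREG recovers Y
assert abs(Y_greg - Y) < 1e-6 * abs(Y)
```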

To remove the potential finite-sample bias of $\hat{Y}_{GR}$, consider subsampling of $s_1$ from $s$, with known probability $q(s_1 | s)$, such as SRS with fixed size $n_1 = |s_1|$, where $n_1 < n$. The induced probability of selecting $s_1$ from $U$ is given by

$$p_1(s_1) = \sum_{s : s_1 \subset s} q(s_1 | s)\, p(s)$$

where $\pi_{1i}$ is the corresponding inclusion probability for $i \in U$. Let $s_2 = s \setminus s_1$ be the complement of $s_1$ in $s$. Let the conditional sampling probability of $s_2$ given $s_1$ be

$$p_2(s_2 | s_1) = p(s)\, q(s_1 | s)/p_1(s_1)$$

and let $\pi_{2i}$ be the corresponding conditional inclusion probability in $s_2$ for $i \in U \setminus s_1$. Let $b_1$ be the estimate of $\beta$ based on the sub-sample $s_1$, obtained in the same manner as $b$ is based on $s$. Let

$$\hat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in U \setminus s_1} x_i^\top b_1 + \sum_{i \in s_2} (y_i - x_i^\top b_1)/\pi_{2i} \qquad (1)$$

In other words, it is the sum of $y_i$ over $s_1$ and a difference estimator of the remaining population total based on $s_2$, via $b_1$ that does not depend on the observations in $s_2$.

##### Proposition

The estimator $\hat{Y}_1$ is conditionally unbiased for $Y$ over $s_2$ given $s_1$, denoted by $E_2(\hat{Y}_1 | s_1) = Y$, as well as unconditionally over $s_1$, denoted by $E_1\big(E_2(\hat{Y}_1 | s_1)\big) = Y$.

Proof: As $b_1$ is fixed for any given $s_1$, the sum of the last two terms on the right-hand side of (1) is unbiased for $\sum_{i \in U \setminus s_1} y_i$ given $s_1$. It follows that $\hat{Y}_1$ is conditionally unbiased for $Y$ given $s_1$; hence, design-unbiased over $s_1$ unconditionally as well.
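The Proposition can be verified numerically by exhaustive enumeration on a toy population (all sizes and values below are illustrative): averaging $\hat{Y}_1$ of (1) over every sample $s$ and subsample $s_1$ under SRS recovers the population total exactly, since all $(s, s_1)$ pairs are equally likely under this design.

```python
import numpy as np
from itertools import combinations

# Tiny population so the design can be enumerated exhaustively
N, n, n1 = 6, 4, 2
x = np.column_stack([np.ones(N), np.arange(1.0, N + 1)])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.3, 12.7])
Y = y.sum()

pi2 = (n - n1) / (N - n1)   # conditional inclusion probability under SRS subsampling

est = []
for s in combinations(range(N), n):        # p(s) = 1 / C(N, n), SRS
    for s1 in combinations(s, n1):         # q(s1|s) = 1 / C(n, n1), SRS
        s1 = list(s1)
        s2 = [i for i in s if i not in s1]
        rest = [i for i in range(N) if i not in s1]   # U \ s1
        b1 = np.linalg.lstsq(x[s1], y[s1], rcond=None)[0]
        Y1 = (y[s1].sum()
              + (x[rest] @ b1).sum()
              + ((y[s2] - x[s2] @ b1) / pi2).sum())
        est.append(Y1)

# All (s, s1) pairs are equally likely, so the design expectation is the plain mean
assert abs(np.mean(est) - Y) < 1e-8
```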

##### Example: Simple random sampling (SRS)

Suppose SRS without replacement of $s$ from $U$, and of $s_1$ from $s$ with fixed size $n_1$, such that $\pi_i = n/N$ and $\pi_{2i} = (n - n_1)/(N - n_1)$. In the special case of $x_i \equiv 1$, $b_1$ is the sample mean in $s_1$, and

$$\hat{Y}_1 = n_1 b_1 + (N - n_1) \sum_{i \in s_2} y_i/(n - n_1)$$

which amounts to using the sample mean in $s_2$ to estimate the population mean outside of the given $s_1$, instead of using the sample mean in $s$ for the whole population mean. Thus, $\hat{Y}_1$ achieves unbiasedness generally, but at a cost of increased variance.

### 2.2 Rao-Blackwellisation

One can reduce the variance of $\hat{Y}_1$ by the Rao-Blackwell method (Rao, 1945; Blackwell, 1947). The minimal sufficient statistic in the finite population sampling setting is simply $D_s = \{(i, y_i) : i \in s\}$. Applying the RB method to $\hat{Y}_1$ by (1) yields $\hat{Y}^*_{GR}$, which is given by the conditional expectation of $\hat{Y}_1$ given $D_s$, i.e.

$$\hat{Y}^*_{GR} = E_q(\hat{Y}_1 | D_s) = E_q(\hat{Y}_1 | s) \qquad (2)$$

where the expectation is evaluated with respect to $q(s_1 | s)$, and the second expression is leaner as long as one keeps in mind that $\{y_i : i \in s\}$ are treated as fixed constants associated with the distinct units.

##### Proposition

The estimator $\hat{Y}^*_{GR}$ is design-unbiased for $Y$, denoted by $E(\hat{Y}^*_{GR}) = Y$.

Proof: By construction, the combined randomisation distribution induced by $p(s)$ and $q(s_1 | s)$ is the same as that induced by $p_1(s_1)$ and $p_2(s_2 | s_1)$, for any $s$ and $s_1 \subset s$. Thus,

$$E(\hat{Y}^*_{GR}) = E_p(\hat{Y}^*_{GR}) = E_p\big(E_q(\hat{Y}_1 | s)\big) = E_1\big(E_2(\hat{Y}_1 | s_1)\big) = E_1(Y) = Y \qquad \Box$$

Next, for the variance of $\hat{Y}^*_{GR}$ over $p(s)$, i.e. $V_p(\hat{Y}^*_{GR})$, we notice

$$V(\hat{Y}_1) = E_p\big(V_q(\hat{Y}_1 | s)\big) + V_p\big(E_q(\hat{Y}_1 | s)\big) = E_p\big(V_q(\hat{Y}_1 | s)\big) + V_p(\hat{Y}^*_{GR})$$
$$V(\hat{Y}_1) = E_1\big(V_2(\hat{Y}_1 | s_1)\big) + V_1\big(E_2(\hat{Y}_1 | s_1)\big) = E_1\big(V_2(\hat{Y}_1 | s_1)\big)$$

since $E_2(\hat{Y}_1 | s_1) = Y$. Juxtaposing the two expressions of $V(\hat{Y}_1)$ above, we obtain

$$V(\hat{Y}^*_{GR}) = V_p(\hat{Y}^*_{GR}) = E_1\big(V_2(\hat{Y}_1 | s_1)\big) - E_p\big(V_q(\hat{Y}_1 | s)\big) \qquad (3)$$

where $E_p\big(V_q(\hat{Y}_1 | s)\big)$ is the variance reduction compared to $\hat{Y}_1$.

##### Proposition

Provided an unbiased variance estimator $\hat{V}(\hat{Y}_1)$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big(\hat{V}(\hat{Y}_1) \,\big|\, s_1\big) = V_2(\hat{Y}_1 | s_1)$, a design-unbiased variance estimator for $\hat{Y}^*_{GR}$ is given by

$$\hat{V}(\hat{Y}^*_{GR}) = \hat{V}(\hat{Y}_1) - V_q(\hat{Y}_1 | s)$$

Proof: By stipulation, we have $E\big(\hat{V}(\hat{Y}_1)\big) = E_1\big(V_2(\hat{Y}_1 | s_1)\big)$, which is the first term on the right-hand side of (3). The result follows immediately.

##### Example: SRS, cont’d

In the special case of $x_i \equiv 1$ and $n_1 = n - 1$, we have

$$\hat{Y}_1^{(i)} = n_1 \bar{y}^{(i)} + (N - n_1)\, y_i$$

if $s_2 = \{i\}$ and $\bar{y}^{(i)}$ denotes the mean in $s_1 = s \setminus \{i\}$. The RB estimator follows as

$$\hat{Y}^*_{GR} = \frac{1}{n} \sum_{i \in s} \hat{Y}_1^{(i)} = \frac{n_1}{n} \sum_{i \in s} \bar{y}^{(i)} + \frac{N}{n} \sum_{i \in s} y_i - \frac{n_1}{n} \sum_{i \in s} y_i = \frac{N}{n} \sum_{i \in s} y_i$$

which is the usual unbiased full-sample expansion estimator in this case. The RB method thus recovers the efficiency lost by any single $\hat{Y}_1$ on its own.
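A quick numerical check of this identity, with illustrative sample values: the mean of the $n$ delete-one estimators equals the expansion estimator $\frac{N}{n} \sum_{i \in s} y_i$ exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 500, 20
y_s = rng.normal(50, 10, n)          # sample values under SRS (illustrative)

n1 = n - 1
# Delete-one estimators: Y1(i) = n1 * mean(y without i) + (N - n1) * y_i
Y1 = np.array([n1 * np.delete(y_s, i).mean() + (N - n1) * y_s[i]
               for i in range(n)])

# Rao-Blackwellisation averages over the n equally likely subsamples...
Y_rb = Y1.mean()

# ...and recovers the expansion estimator N * ybar exactly
assert abs(Y_rb - N / n * y_s.sum()) < 1e-6
```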

Let $X_1 = \sum_{i \in s_1} x_i$, $X_1^c = X - X_1$ and $\hat{X}_1^c = \sum_{i \in s_2} x_i/\pi_{2i}$. To express $\hat{Y}_1$ as a linear combination of $\{y_i : i \in s\}$, we rewrite $\hat{Y}_1$ as

$$\hat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X - X_1)^\top b_1 - \sum_{i \in s_2} \frac{x_i^\top b_1}{\pi_{2i}} = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X_1^c - \hat{X}_1^c)^\top b_1 = \sum_{i \in s} w_i y_i$$

where

$$w_i = \begin{cases} 1 + (X_1^c - \hat{X}_1^c)^\top \Big(\sum_{j \in s_1} x_j x_j^\top/\pi_{1j}\Big)^{-1} x_i/\pi_{1i} & \text{if } i \in s_1 \\ 1/\pi_{2i} & \text{if } i \in s_2 \end{cases}$$

It follows that the RB estimator (2) can be given as a linear estimator

$$\hat{Y}^*_{GR} = \sum_{i \in s} w_i^* y_i \quad \text{where} \quad w_i^* = E_q(w_i | s) \qquad (4)$$

This has the important practical advantage that linear weighting can be applied to produce numerically consistent cross-tabulations of multiple survey variables of interest.

In the case of SRS of $s_1$ from $s$ with fixed size $n_1$, the RB weight $w_i^*$ in (4) is the average of the $w_i$'s over the $\binom{n}{n_1}$ possible subsamples $s_1$, for a given unit $i \in s$, where $w_i = 1/\pi_{2i}$ whenever $s_1$ does not include the unit $i$, and otherwise $w_i$ is the corresponding GREG-type weight for $i \in s_1$, which differs over the subsamples that include the unit $i$.
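The weight representation can be checked numerically for one realised split (all simulated values illustrative): the $w_i$'s computed from the case formula above reproduce the directly evaluated $\hat{Y}_1$ of (1) exactly, which confirms that the weights do not depend on which $y$-variable they are applied to.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, n1 = 200, 10, 6
x = np.column_stack([np.ones(N), rng.uniform(0, 5, N)])
y = 1.0 + 2.0 * x[:, 1] + rng.normal(0, 1, N)

s = rng.choice(N, size=n, replace=False)
s1, s2 = s[:n1], s[n1:]                    # one realised training/test split
pi1 = n1 / N                               # pi_{1i} (constant under SRS)
pi2 = (n - n1) / (N - n1)                  # conditional pi_{2i}

X_tot = x.sum(axis=0)
Xc1 = X_tot - x[s1].sum(axis=0)            # X_1^c = X - X_1
Xc1_hat = x[s2].sum(axis=0) / pi2          # HT estimate of X_1^c from s2
A = (x[s1].T / pi1) @ x[s1]                # sum_{j in s1} x_j x_j' / pi_{1j}

# Weights: GREG-type on s1, Horvitz-Thompson on s2
w = np.empty(n)
w[:n1] = 1.0 + (Xc1 - Xc1_hat) @ np.linalg.solve(A, x[s1].T) / pi1
w[n1:] = 1.0 / pi2

# Direct evaluation of (1), with b1 the WLS fit on s1
b1 = np.linalg.solve(A, (x[s1].T / pi1) @ y[s1])
rest = np.setdiff1d(np.arange(N), s1)
Y1 = y[s1].sum() + (x[rest] @ b1).sum() + ((y[s2] - x[s2] @ b1) / pi2).sum()

# The linear representation reproduces the same estimate
assert abs(w @ y[s] - Y1) < 1e-6
```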

### 2.3 Relative efficiency to GREG

Let $B$ be the population WLS regression coefficient and $e_i = y_i - x_i^\top B$, for $i \in U$. Expanding the GREG estimator around $B$ yields

$$\hat{Y}_{GR} \approx \sum_{i \in s} \frac{e_i}{\pi_i} + B^\top X$$

For $\hat{Y}_1$, the first two terms on the right-hand side of (1) become $X^\top b_1$ if there exists a vector $\lambda$ such that $\lambda^\top x_i = \pi_{1i}$ for all $i \in s_1$, in which case the sample residuals of $b_1$ sum to zero over $s_1$, and $\hat{Y}_1$ is a function of $(b_1, \hat{Y}_1^c, \hat{X}_1^c)$, i.e.

$$\hat{Y}_1 = X^\top b_1 + \hat{Y}_1^c - b_1^\top \hat{X}_1^c = \hat{Y}_1^c + b_1^\top (X - \hat{X}_1^c)$$

where $\hat{Y}_1^c = \sum_{i \in s_2} y_i/\pi_{2i}$ is conditionally unbiased for $Y_1^c = \sum_{i \in U \setminus s_1} y_i$ given $s_1$, and similarly for $\hat{X}_1^c$. Let $Y_+^c = E_1(Y_1^c)$ and $X_+^c = E_1(X_1^c)$. We have $b_1 \approx B$, since $b_1$ and $B$ aim at the same population parameter, especially if $n_1$ is close to $n$. In any case, expanding $\hat{Y}_1$ around $(Y_+^c, X_+^c, B)$ yields

$$\hat{Y}_1 \approx \big(Y_+^c + B^\top (X - X_+^c)\big) + \big((\hat{Y}_1^c - Y_+^c) + (b_1 - B)^\top (X - X_+^c) - B^\top (\hat{X}_1^c - X_+^c)\big) = (\hat{Y}_1^c - B^\top \hat{X}_1^c) + b_1^\top (X - X_+^c) + B^\top X_+^c$$

and

$$\hat{Y}_1^c - B^\top \hat{X}_1^c = \sum_{i \in s_2} e_i/\pi_{2i} = \sum_{i \in s} \delta^*_{2i}\, e_i/\pi_{2i}$$

where $\delta^*_{2i} = 1$ if $i \in s_2$ and $0$ if $i \in s_1$. Thus, we obtain

$$\hat{Y}^*_{GR} = E_q(\hat{Y}_1 | s) \approx \sum_{i \in s} E_q\Big(\frac{\delta^*_{2i}}{\pi_{2i}} \,\Big|\, s\Big)\, \pi_i\, \frac{e_i}{\pi_i} + E_q(b_1 | s)^\top (X - X_+^c) + B^\top X_+^c \qquad (5)$$

Notice that $B^\top X_+^c$ is a constant. Thus, compared to $\hat{Y}_{GR}$, the variance of $\hat{Y}^*_{GR}$ involves that of $E_q(b_1 | s)$ in addition. As $n, N \to \infty$, the first term on the right-hand side of (5) is of the same order as the leading term of $\hat{Y}_{GR}$ provided $E_q(\delta^*_{2i}/\pi_{2i} \,|\, s)\, \pi_i$ remains bounded, whereas the second term is of lower order provided the usual regularity conditions for GREG. As long as the sampling fraction is small, the first term will dominate, in which case the variance of the RB estimator $\hat{Y}^*_{GR}$ is of the same order as that of the GREG estimator $\hat{Y}_{GR}$.

##### Example: SRS, cont’d

Let $\pi_{2i} = k/(N - n_1)$, where $k = n - n_1$. We have

$$\pi_i\, \pi_{2i}^{-1}\, E_q(\delta^*_{2i} | s) = \frac{n}{N} \cdot \frac{N - n_1}{k} \cdot \frac{(n-1)!}{(k-1)!\,(n-k)!} \cdot \frac{k!\,(n-k)!}{n!} = 1 - \frac{n_1}{N}$$

Let $S_e^2$ be the population variance of the $e_i$'s. The variance of the first term in (5) is

$$V_p\bigg(\sum_{i \in s} E_q\Big(\frac{\delta^*_{2i}}{\pi_{2i}} \,\Big|\, s\Big)\, \pi_i\, \frac{e_i}{\pi_i}\bigg) = N^2 \Big(1 - \frac{n}{N}\Big) \frac{S_e^2}{n} \Big(1 - \frac{n_1}{N}\Big)^2$$

which is actually smaller than the approximate variance of the GREG estimator under SRS, although the difference will not be noteworthy in practical terms if the sampling fraction is small, since $(1 - n_1/N)^2 \approx 1$ in that case. Meanwhile, due to the additional variance of $E_q(b_1 | s)$, the estimator $\hat{Y}^*_{GR}$ by the unbiased RB method can possibly have a larger variance than the biased GREG estimator (with general $x_i$). It seems that one should use a large $n_1$ if possible, to keep the additional variance due to $b_1$ small.

### 2.4 Delete-one RB method

The largest possible size of $s_1$ is $n_1 = n - 1$. We refer to Rao-Blackwellisation based on SRS of $s_1$ from $s$ with $n_1 = n - 1$ as the delete-one (or leave-one-out, LOO) RB method. The conditional sampling design $p_2(s_2 | s_1)$ is not measurable in this case, in that one cannot have an unbiased variance estimator based on the single observation in $s_2$. For an approximate variance estimator, we reconsider the basic case where the $y_i$'s, for $i \in s$, form a sample of independent and identically distributed (IID) observations, in order to develop an analogy to the classic jackknife variance estimation (Tukey, 1958).

Denote by $\theta$ the population mean, which is also the expectation of each $y_i$, for $i \in U$. As before, let $\bar{y}^{(j)}$ denote the mean in the subsample $s_1 = s \setminus \{j\}$. Following (1), let

$$\hat{\theta}^{(j)} = \frac{n-1}{N}\, \bar{y}^{(j)} + \Big(1 - \frac{n-1}{N}\Big)\, y_j$$

be the delete-$j$ estimator of $\theta$, where $y_j$ acts as an unbiased estimator of the population mean outside $s \setminus \{j\}$. The RB method yields the whole-sample mean, denoted by

$$\hat{\theta}^* = \frac{1}{n} \sum_{j=1}^n \hat{\theta}^{(j)} = \frac{1}{n} \sum_{j=1}^n y_j = \bar{y}$$

Observe that we have $\hat{\theta}^* = \sum_{j=1}^n z^{(j)}/n$, where

$$z^{(j)} = \frac{1}{N - n} \big(N \hat{\theta}^{(j)} - n \hat{\theta}^*\big) = y_j \qquad (6)$$

Thus, the RB estimator $\hat{\theta}^*$ is the mean of an IID sample of observations $z^{(1)}, \dots, z^{(n)}$, as in the development of classic jackknife variance estimation, so that we obtain

$$\hat{V}(\hat{\theta}^*) = \frac{1}{n(n-1)} \sum_{j=1}^n \big(z^{(j)} - \hat{\theta}^*\big)^2$$

Notice that, in this case, the IID observations used for the classic development of the jackknife method are the pseudo-values $n\bar{y} - (n-1)\bar{y}^{(j)} = y_j$, instead of (6), where $\bar{y}^{(j)}$ is the delete-$j$ estimator of $\theta$.

For the delete-one RB method based on (1) and (2) given auxiliary $x_i$, we have $s_2 = \{j\}$, such that the estimator $b_1$ can be denoted by $b^{(j)}$, based on $s_1 = s \setminus \{j\}$, where it is simply the delete-$j$ jackknife estimator of the regression coefficients. Rewrite the corresponding population total estimator by (1) as

$$\hat{Y}^{(j)} = X^\top b^{(j)} + \frac{y_j}{\pi_{2j}} - \frac{x_j^\top b^{(j)}}{\pi_{2j}}$$

such that the RB method yields $\hat{Y}^*_{GR}$ by (2), as the mean of $\hat{Y}^{(j)}$ over $j \in s$. We propose a jackknife variance estimator for $\hat{Y}^*_{GR}$, given by

$$\hat{V}(\hat{Y}^*_{GR}) = \frac{N^2}{n(n-1)} \sum_{j=1}^n \Big(z^{(j)} - \frac{1}{n} \sum_{i=1}^n z^{(i)}\Big)^2 \qquad (7)$$

where

$$z^{(j)} = \frac{1}{N - n} \Big(\hat{Y}^{(j)} - \frac{n}{N}\, \hat{Y}^*_{GR}\Big)$$
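A sketch of the delete-one RB estimator and the jackknife variance (7) under SRS, in the intercept-only case where the RB estimator should reduce to the expansion estimator $N\bar{y}$ as shown earlier (data and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 400, 25
y_s = rng.normal(30, 5, n)
x_s = np.ones((n, 1))                  # intercept-only model, for the check below
X_tot = np.full(1, float(N))           # population total of x_i = 1 is N

pi2 = 1.0 / (N - n + 1)                # pi_{2j} under SRS with n1 = n - 1

Yj = np.empty(n)
for j in range(n):
    keep = np.delete(np.arange(n), j)
    b_j = np.linalg.lstsq(x_s[keep], y_s[keep], rcond=None)[0]  # delete-j fit
    Yj[j] = X_tot @ b_j + (y_s[j] - x_s[j] @ b_j) / pi2

Y_rb = Yj.mean()                       # delete-one RB estimator

# Jackknife variance (7): z(j) = (Y(j) - (n/N) Y_rb) / (N - n)
z = (Yj - n / N * Y_rb) / (N - n)
V_jack = N**2 / (n * (n - 1)) * ((z - z.mean())**2).sum()

# With x_i = 1 the RB estimator reduces to the expansion estimator N * ybar
assert abs(Y_rb - N * y_s.mean()) < 1e-6
assert V_jack >= 0.0
```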

Notice that it may be the case under general unequal probability sampling that the conditional inclusion probability $\pi_{2j}$ given $s_1$ is not exactly known. However, in many situations where the sampling fraction is low, it is reasonable to use

$$\pi_{2j} \approx \frac{\pi_j}{\sum_{i \notin s_1} \pi_i} = \frac{\pi_j}{n - \sum_{i \in s_1} \pi_i} \approx \frac{\pi_j}{n(1 - n_1/N)}$$

An approximate delete-one RB estimator following (2) can then be given as

$$\tilde{Y}^*_{GR} = X^\top \sum_{j=1}^n \frac{b^{(j)}}{n} + \Big(1 - \frac{n_1}{N}\Big) \sum_{j=1}^n \frac{y_j}{\pi_j} - \Big(1 - \frac{n_1}{N}\Big) \sum_{j=1}^n \frac{x_j^\top b^{(j)}}{\pi_j} \qquad (8)$$

with $z^{(j)}$ for jackknife variance estimation obtained on replacing $\hat{Y}^*_{GR}$ by $\tilde{Y}^*_{GR}$. Meanwhile, the delete-one jackknife replicates of GREG can be written as

$$\hat{Y}^{(j)}_{GR} = X^\top b^{(j)} + \frac{n}{n-1} \bigg(\sum_{i \neq j} \frac{y_i}{\pi_i} - \sum_{i \neq j} \frac{x_i^\top b^{(j)}}{\pi_i}\bigg)$$
$$\hat{Y}^{(\cdot)}_{GR} = \frac{1}{n} \sum_{j=1}^n \hat{Y}^{(j)}_{GR} = X^\top \sum_{j=1}^n \frac{b^{(j)}}{n} + \sum_{i=1}^n \frac{y_i}{\pi_i} - \sum_{i=1}^n \frac{x_i^\top}{\pi_i} \bigg(\sum_{j \neq i} \frac{b^{(j)}}{n-1}\bigg)$$

The estimator $\hat{Y}^{(\cdot)}_{GR}$ is quite close to the approximate RB estimator (8); indeed, the two are identical apart from the terms involving $b^{(j)}$ in the special case of $\pi_i \equiv n/N$. This is not surprising, since the jackknife-based $\hat{Y}^{(\cdot)}_{GR}$ is an alternative for reducing the bias of the GREG estimator. The difference is that, provided $\pi_{2j}$ is known, the proposed RB method will be exactly design-unbiased, but not the jackknife-based $\hat{Y}^{(\cdot)}_{GR}$. Finally, the resemblance between $\hat{Y}^{(\cdot)}_{GR}$ and $\tilde{Y}^*_{GR}$ is another indication that the relative efficiency of the delete-one RB method is usually not a concern compared to the standard GREG estimator $\hat{Y}_{GR}$.

### 2.5 Monte Carlo RB

Exact Rao-Blackwellisation can be computationally expensive when the cardinality of the subsample space (of $s_1$ given $s$) is large. Instead of calculating the RB estimator exactly, consider the Monte Carlo (MC) RB estimator given as follows:

$$\hat{Y}^K_{GR} = K^{-1} \sum_{k=1}^K \hat{Y}_{1k} \qquad (9)$$

where $\hat{Y}_{1k}$ is the estimator (1) based on the $k$th subsample $s_{1k}$, for $k = 1, \dots, K$, which are IID realisations of $s_1$ from $q(s_1 | s)$, such that $\hat{Y}^K_{GR}$ is a Monte Carlo approximation of $\hat{Y}^*_{GR}$.

##### Proposition

The estimator $\hat{Y}^K_{GR}$ is design-unbiased for $Y$, denoted by $E(\hat{Y}^K_{GR}) = Y$.

Proof: The result follows from $E_q(\hat{Y}_{1k} | s) = \hat{Y}^*_{GR}$ and $E_p(\hat{Y}^*_{GR}) = Y$.

Adopting a computationally manageable $K$ entails an increase of variance, i.e. $E_p\big(V_q(\hat{Y}^K_{GR} | s)\big) > 0$, compared to $\hat{Y}^*_{GR}$, so that the variance of $\hat{Y}^K_{GR}$ is given by

$$V(\hat{Y}^K_{GR}) = E_1\big(V_2(\hat{Y}_1 | s_1)\big) - E_p\big(V_q(\hat{Y}_1 | s)\big) + E_p\big(V_q(\hat{Y}^K_{GR} | s)\big) \qquad (10)$$

Due to the IID construction of $s_{11}, \dots, s_{1K}$, an unbiased estimator of $V_q(\hat{Y}^K_{GR} | s)$ is given by

$$\hat{V}_q(\hat{Y}^K_{GR} | s) = \frac{1}{K(K-1)} \sum_{k=1}^K \big(\hat{Y}_{1k} - \hat{Y}^K_{GR}\big)^2$$

This allows one to control the statistical efficiency of the MC-RB method, i.e. the choice of $K$ is acceptable when $\hat{V}_q(\hat{Y}^K_{GR} | s)$ is deemed small enough in practical terms.
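The MC-RB estimator (9) and its Monte Carlo error estimate can be sketched as follows (sizes, seed and data-generating model are illustrative): draw $K$ IID subsamples, evaluate (1) on each, then average and estimate the MC error from the spread of the replicates.

```python
import numpy as np

rng = np.random.default_rng(11)
N, n, n1, K = 300, 30, 24, 50
x = np.column_stack([np.ones(N), rng.uniform(0, 4, N)])
y = 2.0 + 1.5 * x[:, 1] + rng.normal(0, 0.5, N)

s = rng.choice(N, size=n, replace=False)
pi2 = (n - n1) / (N - n1)                  # conditional pi_{2i} under SRS subsampling

Y1k = np.empty(K)
for k in range(K):                         # K IID draws of s1 from q(s1|s)
    s1 = rng.choice(s, size=n1, replace=False)
    s2 = np.setdiff1d(s, s1)
    rest = np.setdiff1d(np.arange(N), s1)  # U \ s1
    b1 = np.linalg.lstsq(x[s1], y[s1], rcond=None)[0]
    Y1k[k] = (y[s1].sum() + (x[rest] @ b1).sum()
              + ((y[s2] - x[s2] @ b1) / pi2).sum())

Y_mc = Y1k.mean()                          # MC-RB estimator (9)

# Monte Carlo error estimate: unbiased for V_q(Y^K_GR | s)
V_mc = ((Y1k - Y_mc)**2).sum() / (K * (K - 1))

assert np.isfinite(Y_mc) and V_mc >= 0.0
```

In practice one would increase $K$ until the estimated MC error `V_mc` is negligible relative to the overall sampling variance.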

##### Proposition

Provided an unbiased variance estimator $\hat{V}(\hat{Y}_{1k})$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big(\hat{V}(\hat{Y}_{1k}) \,\big|\, s_{1k}\big) = V_2(\hat{Y}_{1k} | s_{1k})$, a design-unbiased variance estimator for $\hat{Y}^K_{GR}$ is given by

$$\hat{V}(\hat{Y}^K_{GR}) = \frac{1}{K} \sum_{k=1}^K \hat{V}(\hat{Y}_{1k}) - \frac{1}{K} \sum_{k=1}^K \big(\hat{Y}_{1k} - \hat{Y}^K_{GR}\big)^2$$

Proof: Due to the IID construction of the subsamples, $K^{-1} \sum_{k=1}^K \hat{V}(\hat{Y}_{1k})$ is an unbiased estimator of the first term on the right-hand side of (10), while $K^{-1} \sum_{k=1}^K (\hat{Y}_{1k} - \hat{Y}^K_{GR})^2$ is an unbiased estimator of $E_p\big(V_q(\hat{Y}_1 | s)\big) - E_p\big(V_q(\hat{Y}^K_{GR} | s)\big)$, i.e. minus the sum of the last two terms. The result follows.

Finally, for the delete-one RB method, where an unbiased variance estimator $\hat{V}(\hat{Y}_{1k})$ is not available now that $|s_2| = 1$, a practical option is to first apply the jackknife variance estimator (7) to the $K$ subsamples, as if $\hat{Y}^K_{GR}$ were the exact RB estimator $\hat{Y}^*_{GR}$, and then add to it the extra term $\hat{V}_q(\hat{Y}^K_{GR} | s)$ for the additional Monte Carlo error. This would allow one to use the Monte Carlo delete-one RB method in general.

## 3 Unbiased non-linear learning

In this section we consider design-unbiased estimation in survey sampling, which builds on an arbitrary ML technique that can be non-linear as well as non-parametric.

### 3.1 Design-unbiased ML for survey sampling

Denote by $M$ the model or algorithm that aims to predict $y_i$ given $x_i$. Let $s_1$ be the training set, and $s_2$ the test set. Let $M_1$ be the trained model based on $s_1$, yielding $\hat{y}_i$ as the corresponding $M$-predictor of $y_i$ given $x_i$. Applying the trained model to $s_2$ yields the prediction errors of $M_1$ conditional on $s_1$, denoted by $y_i - \hat{y}_i$ for $i \in s_2$. In contrast, the same discrepancy is referred to as a residual of $M_1$ when it is calculated for $i \in s_1$, including when the training set is equal to $s$. In standard ML, the errors in the test set are used to select among different trained algorithms, or to assess how well a trained algorithm can be expected to perform when applied to units with unknown $y_i$'s.
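The distinction between errors and residuals can be illustrated with a simple trained predictor (here an ordinary least-squares line standing in for $M$; all simulated values are illustrative): with an intercept in the model, the training residuals are constrained to average to zero, whereas no such constraint binds the test-set errors, which are genuine out-of-sample discrepancies conditional on $s_1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n1 = 60, 40
x = rng.uniform(0, 3, n)
y = np.sin(x) + rng.normal(0, 0.1, n)        # non-linear signal

idx = rng.permutation(n)
s1, s2 = idx[:n1], idx[n1:]                  # training / test split

# "Model" M: least-squares line with intercept, trained on s1 only
X1 = np.column_stack([np.ones(n1), x[s1]])
b = np.linalg.lstsq(X1, y[s1], rcond=None)[0]
yhat = lambda u: b[0] + b[1] * u

residuals = y[s1] - yhat(x[s1])              # in-sample discrepancies
errors = y[s2] - yhat(x[s2])                 # out-of-sample, conditional on s1

# OLS with an intercept forces the residuals to average to zero on s1;
# no such constraint binds the test-set errors
assert abs(residuals.mean()) < 1e-10
```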

From an inference point of view, a basic problem with the standard ML approach above arises because one needs to be able to ‘extrapolate’ the information in $s$ to the units outside $s$, in order for supervised learning to have any value at all. This is simply because $\{y_i : i \in s\}$ are all observed, and prediction in any form is unnecessary for $i \in s$. No matter how the training-test split is carried out, one cannot ensure valid predictions for $i \notin s$, unless $s$ is selected from the entire reference set of units, i.e. the population $U$, in some non-informative (or representative) manner. This is the well-known problem of observational studies in statistical science, which is sometimes recast as the problem of concept drift in the ML literature (e.g. Tsymbal, 2004).

A design-unbiased approach to $M$-assisted estimation of the population total $Y$ can be achieved with respect to

• a probability sample $s$ from $U$, with probability $p(s)$, and

• a probabilistic scheme $q(s_1 | s)$ for the training-test split of $s$ into $(s_1, s_2)$ given $s$.

Explicitly, let $\hat{Y}_{1M}$ be the estimator of $Y$ obtained from the realised sample $s$ and subsample $s_1$ given the model $M$. It is said to be design-unbiased for $Y$, provided

$$E_{pq}(\hat{Y}_{1M}) = \sum_s p(s) \sum_{s_1 \subset s} q(s_1 | s)\, \hat{Y}_{1M} = Y$$