 # A method to find an efficient and robust sampling strategy under model uncertainty

We consider the problem of deciding on sampling strategy, in particular sampling design. We propose a risk measure, whose minimizing value guides the choice. The method makes use of a superpopulation model and takes into account uncertainty about its parameters. The method is illustrated with a real dataset, yielding satisfactory results. As a baseline, we use the strategy that couples probability proportional-to-size sampling with the difference estimator, as it is known to be optimal when the superpopulation model is fully known. We show that, even under moderate misspecifications of the model, this strategy is not robust and can be outperformed by some alternatives

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the problem of choosing strategy, in particular the design, for the estimation of the total of a study variable in a finite population when a set of auxiliary variables is available in a list sampling frame. We focus on the estimation of the total.

The decision about sampling strategy involves parameters which are unknown at the stage when the decision needs to be taken. After data collection the parameters can be estimated, although sometimes only under some assumptions. In practice, we often use data from previous waves of a repeated survey, frame variables or data from another survey that is similar to the one that under a planning stage. There is a risk that the available data do not give reliable information about relevant parameters. The method presented here involves a risk measure, which takes into account the possibility of being misled by inaccurate or incorrect beliefs about the values of the needed parameters. The risk measure is derived for the difference and the generalized regression estimators. Other than that, the measure is general. This measure and the discussion of its practical use are the main result of this paper.

One aim when selecting and devising the sampling strategy is efficiency in terms of small mean-squared error. The definition of “efficiency” is not unique, however, as it depends on the inference approach. Under the design-based approach, Godambe (1955), Lanke (1973) and Cassel et al. (1977) show that there is no uniformly best linear estimator, in the sense of being best for all populations. There is no best design either. Therefore, a traditional approach for defining the strategy has been to assume that the finite population is a realization of some superpopulation model. The strategy is then defined in such a way that it minimizes the model expected value of the design mean-squared error, a parameter called anticipated mean-squared error. The adjective “anticipated” was first introduced by Isaki and Fuller (1982) to emphasize the fact that this is a conceptual mean-squared error which is calculated in advance to sampling, based only on information available prior to sampling.

Assuming that a superpopulation model holds and its parameters are known, several authors have shown that the optimal strategy should make use of a probability proportional-to-size sampling design (e.g. Hájek, 1959; Cassel et al., 1976; Nedyalkova and Tillé, 2008). In practice, however, there is not even a consensus about the existence of a generating model, let alone what model to rely on. And even if there is a model, its parameters are unknown. There is evidence, rather empirical, that probability proportional-to-size sampling is not robust towards model misspecifications (e.g. Holmberg and Swensson, 2001). A second result of this paper is to provide some theoretical evidence of this fact.

Many articles discuss robustness in the survey sampling field. Beaumont et al. (2013), for instance, propose a robust estimator that downweights influential observations; Royall and Herson (1973) consider robustness under polynomial models; Bramati (2012) and Zhai and Wiens (2015) propose robust stratification methods. We provide theoretical evidence of lack of robustness of proportional-to-size sampling and propose a method for assisting in the decision about the sampling design.

The contents of the paper are arranged as follows. The optimal strategy under the superpopulation model is defined in section 2. The lack of robustness of this strategy when the model is misspecified is studied in section 3. The method for assisting on the choice of the sampling design is presented in section 4. In section 5, the risk measure introduced in the previous section is extended to be used together with the GREG estimator. Section 6 presents numerical illustrations of the results in the paper. First, we illustrate the lack of robustness of probability proportional-to-size sampling and the flexibility of the GREG estimator with a small simulation study. Second, we illustrate the implementation of the risk measure with real survey data. Finally, section 7 presents some conclusions.

## 2 Optimal strategy under the superpopulation model

Let be a finite population of size with elements labeled . Let

be a known vector of values of

auxiliary variables and the unknown value of a study variable associated to unit . We are interested in the estimation of the total of , .

Let be the power set of . A sample is any subset and a sampling design

is a probability distribution on

, denoted by or simply . Let be the inclusion probability of and the joint inclusion probability of and . A probability sampling design is a sampling design such that for all .

An estimator is a real valued function of the sample, . By strategy we refer to the couple sampling design and estimator, .

We consider only probability sampling designs with fixed sample size. As a convenient stepping stone we begin by considering unbiased linear estimators of the form

 ^ty=(∑Uzk−∑szkπk)+∑sykπk=∑Uzk+∑sekπk (2.1)

with arbitrary known constants and . This estimator is called the difference estimator. The estimator defined in this way is said to be calibrated on as it satisfies . Note that if for all the estimator reduces to , that is, the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). In later sections we focus on the generalized regression estimator (GREG).

The design MSE of the difference estimator is

 MSEp(^ty)=MSEp(∑sekπk)=∑U∑U(πkl−πkπl)ekπkelπl. (2.2)

As mentioned in the introduction, due to the non-existence of an optimal strategy under the design-based approach, often a superpopulation model, , is proposed and we search for an optimal strategy with respect to the anticipated mean-squared error,

 MSEξ0p(^ty)=Eξ0MSE% p(^ty)=Eξ0Ep((^ty−ty)2) (2.3)

We may assume that the -values are realizations of the following model, denoted ,

 Yk=f(xk|δ1)+ϵkwithEξ0(ϵk)=0,Vξ0(ϵk)=σ20g(xk|δ2)2% andEξ0(ϵkϵl)=0∀k≠l (2.4)

where is a vector of parameters, and . Following Rosén (2000a), the terms and will be called trend and spread, respectively. The term trend should not in general be understood in a temporal sense, rather it refers to the development of -values with .

Note that under , in the difference estimator (2.1

) is a random variable that represents the distance between the value of the study variable and

, i.e. . Therefore and . With some algebra, it can be seen from (2.2) and (2.3) that the anticipated MSE of the difference estimator becomes

 MSEξ0p(^ty)=MSEp(∑sf(xk|δ1)−zkπk)+σ20(∑U(1πk−1)g(xk|δ2)2) (2.5)

Nedyalkova and Tillé (2008) derive the anticipated MSE in a more general case.

Tillé and Wilhelm (2017) give the anticipated MSE of the Horvitz-Thompson estimator. The second term in (2.5) is the Godambe-Joshi lower bound (e.g. Särndal et al. 1992, p. 453).

The anticipated MSE in (2.5) is the sum of two positive terms. It is easy to see that if

1. the estimator is calibrated on

the first term vanishes and the anticipated MSE equals the Godambe-Joshi lower bound

 MSEξ0p(^ty)=σ20(∑U(1πk−1)g(xk|δ2)2) (2.6)

Furthermore, after imposing the fixed sample size restriction , if

1. the design is such that , denoted ps(),

the second term is minimized and we obtain

 MSEoptξ0p(^ty)=σ20⎛⎝1n(∑Ug(xk|δ2))2−∑Ug(xk|δ2)2⎞⎠.

Conditions 1 and 2 suggest the specific roles of the design and the estimator in the sampling strategy. The estimator should “explain” the trend in the calibration sense of condition 1. The design should “explain” the spread. A strategy that satisfies conditions 1 and 2 simultaneously will be called optimal. In the same sense, any estimator and any design satisfying, respectively, condition 1 and 2, will be called optimal.

## 3 Robustness under a misspecified model

If the finite population is a realization of the superpopulation model (2.4), and if , and were known, then an optimal strategy could be defined. In this section we study the robustness of this strategy when the model is misspecified.

We begin by defining how “misspecification”shall be understood in this paper. The working model reflects the beliefs the statistician has about the relation between the auxiliary variables and the study variable at the design stage. We shall assume that a true, unknown model exists. Any deviation of with respect to is a misspecification of the model. In order to keep the analysis tractable, we limit ourselves to the situation when the working model is of the form (2.4) and the true model, , is

 Yk=f(xk|β1)+ϵkwithEξ(ϵk)=0,Vξ(ϵk)=σ2g(xk|β2)2andEξ(ϵkϵl)=0∀k≠l (3.1)

where is a vector of parameters, and as in (2.4) and .

###### Result 1.

If is assumed when is the true superpopulation model, the model expected value of the design MSE in (2.2), under the difference estimator satisfying condition 1 above, becomes

 MSEξp(^ty)=MSEp(∑sf(xk|β1)−f(xk|δ1)πk)+σ2(∑Ug(xk|β2)2πk−∑Ug(xk|β2)2) (3.2)

The result is proven by noting that takes the role of in (2.5) and by taking into account that , therefore and . As the model is misspecified, we have deliberately avoided the use of the adjective “anticipated” in result 1.

Using result 1, it can be seen that for a design that satisfies condition 2 we obtain

 MSEξ,πps(^ty)=(∑Ug(xk|δ2)n)2MSEπps(∑sf(xk|β1)−f(xk|δ1)g(xk|δ2))+σ2(∑Ug(xk|δ2)n∑Ug(xk|β2)2g(xk|δ2)−∑Ug(xk|β2)2). (3.3)

It is now possible to see that, even under a mild misspecification as the one considered here, the so-called optimal strategy is not necessarily optimal anymore, as its MSE (3.3) can be greater than the MSE obtained under other designs (3.2).

## 4 Guiding the choice of sampling design with the help of a risk measure

We have seen in section 3 that even a simple misspecification of the working model might result in the so-called optimal strategy not being optimal. It is therefore risky to accept a given model as correct without any type of assessment. While most of the information needed for an “objective” evaluation of the model is not available at the design stage, it is possible to reach some degree of confidence about the parameters in the working model that allows for comparing a set of designs and make the decision about which one to implement. In this section we propose a method to assist in the choice of the sampling design.

The expected MSE (3.2) in result 1 can be viewed as a function of and , as everything else is available at the design stage. To begin with, let us assume that is also known. Then we can write

 Lp(β)=MSEξp(β|x,δ,σ)=MSEp(∑sf(xk|β1)−f(xk|δ1)πk)+σ2(∑Ug(xk|β2)2πk−∑Ug(xk|β2)2)

For any design, , this function can be evaluated at any and it indicates the loss incurred by assuming that is the right parameter when it is, in fact, . We can assume a prior distribution on , , and calculate the risk under ,

 R(p)=Eh(MSEξp(β|x,δ,σ))=∫Θh(β)⋅MSEξp(β|x,δ,σ)dβ, (4.1)

where is the sample space of . The design that yields the smallest risk shall be chosen.

In practice, is unknown. We propose three ways for dealing with it. The first one is to redefine as and calculate the risk as above. The second one is to provide some “guess” about it. The third one is to take into account that (proof in the appendix)

 σ2≈Sf,f¯¯¯¯¯g2⎛⎝1R2f,y−1⎞⎠ (4.2)

where , , and is the correlation between and . (In example 3 below, we give a more convenient expression in a special case.) Although is unknown, for repeated surveys we do have some previous knowledge about it. In other cases it is often possible to have some reasonable “guess” about it.

It remains to comment on the choice of the prior distribution . The choice of the distribution and its parameters is subjective and defined by the statistician. Nevertheless, it should reflect the available knowledge about the model parameter . In particular, should be centered around

. Its variance should reflect how confident we are about the working model. Note that a full confidence on the working model would be a density with all its mass at

, in which case the risk (4.1) would be minimized by the ps design given by condition 2 in section 2.

It might be argued that by introducing an additional source of subjectivity has been added to the choice of the sampling design. The prior may add a certain Bayesian flavor to the process, but note that is only needed for choosing the design. Hence, the inference is still design-based. Furthermore, relying on an assumed model is also subjective in choice of assumption and it does involve a risk. The risk measure in (4.1) allows for quantification of that risk.

## 5 The risk measure under the Generalized Regression Estimator

The difference estimator (2.1) requires that is fully specified in order to calculate , which is undesirable from a practical standpoint. The generalized regression –GREG– estimator is an alternative that allows for the estimation of all or some of the components of at the cost of introducing a small bias. In this section we adapt the material in sections 2 to 4 to strategies using the GREG estimator.

We define the generalized regression estimator in a more general way than in Särndal et al. (1992) as follows. Let () a weight defined by the statistician and where are fixed and are to be estimated. Let also

 ^δ∗∗1s=argminδ∗∗1∑s(yk−f(xk|δ1))2akπk

and . The GREG estimator is

 ^tgreg=(∑Uf(xk|^δ1s)−∑sf(xk|^δ1s)πk)+∑sykπk (5.1)

An approximation to the design MSE of the GREG estimator is of the form (2.2) with where and

 ^δ∗∗1U=argminδ∗∗1∑U(yk−f(xk|δ1))2ak
###### Example 1.

Let us consider the case where . Let , and . In this case we obtain

 ^δ∗∗1s=(∑sxδ′kxδkakπk)−1∑sxδ′kykakπkand^δ∗∗1U=(∑Uxδ′kxδkak)−1∑Uxδ′kykak.

Letting the exponents , we obtain the classical expression of the GREG estimator found in Särndal et al. (1992).

###### Example 2.

The case with only one auxiliary variable, i.e.  with , and is known as the regression estimator. In this case we obtain the well known result that the design MSE can be approximated by expression (2.2) with where and

 ^δ11=N∑Uxδ12kyk−∑Uxδ12k∑UykN∑Ux2δ12k−(∑Uxδ12k)2and^δ10=1N∑Uyk−^δ111N∑Uxδ12k.

### The misspecified model

Let us consider again the situation where the statistician uses the working model (2.4) but the true model is of the form (3.1) with , where is the counterpart of the fixed component . The following result states a condition under which result 1 is valid for the GREG estimator.

###### Result 2.

If converges in distribution to some then

 MSEξp(^tgreg)→MSEp(∑sf(xk|β1)−f(xk|δ1)πk)+σ2(∑Ug(xk|β2)2πk−∑Ug(xk|β2)2) (5.2)

where .

The result is proven by using the fact that if and then .

###### Example 3 (Continuation of Example 1).

Let the working model be as in example 1 and the true model be . Let also , and . In this case, , where

 A=(∑Uxδ′kxδkak)−1∑Uxδ′kxβkak,

and (5.2) becomes

 MSEξp(^tgreg)→MSEp⎛⎝∑s(xβk−xδkA)β∗∗1πk⎞⎠+σ2(∑Ug(xk|β2)2πk−∑Ug(xk|β2)2). (5.3)
###### Example 4 (Continuation of Example 2).

Let the working model be as in example 2 and the true model be with and . It can be shown that (5.2) becomes

 MSEξp(^tgreg)→β211MSEp(∑svkπk)+σ2(∑Ug(xk|β2)2πk−∑Ug(xk|β2)2) (5.4)

with

 vk=(xβ12k−¯¯¯¯¯¯¯¯¯xβ12)−(xδ12k−¯¯¯¯¯¯¯¯xδ12)Sβ,δSδ,δ, (5.5)

and

 ¯¯¯¯¯¯¯¯¯xβ12 =1N∑Uxβ12k Sβ,δ =1N−1∑U(xβ12k−¯¯¯¯¯¯¯¯¯xβ12)(xδ12k−¯¯¯¯¯¯¯¯xδ12) ¯¯¯¯¯¯¯¯xδ12 =1N∑Uxδ12k Sδ,δ =1N−1∑U(xδ12k−¯¯¯¯¯¯¯¯xδ12)2

Note that (5.4) does not depend on .

If the condition for result 2 holds, i.e. if converges in distribution to some , and this value can be calculated or approximated at the design stage for any , then the risk (4.1) can be computed.

Clearly, for the model that has been used in examples 1 and 3, the risk (4.1) can be obtained by means of expression (5.3). Furthermore, for the particular case developed in examples 2 and 4, where and , an alternative approximation of is (proof in the appendix)

 σ2≈β211F0withF0=1¯¯¯¯¯¯¯¯¯x2β2S21,βS1,1(1R2x,y−1R21,β) (5.6)

where

 ¯¯¯¯¯¯¯¯¯x2β2=1N∑Ux2β2kS1,β=1N∑U(xk−¯x)(xβ12k−¯¯¯¯¯¯¯¯¯xβ12)S1,1=1N∑U(xk−¯x)2

with and and are, respectively, the correlation coefficients between and and between and . The latter is unknown but often some decent guess about it is available.

The approximation of in (5.6) is more convenient than (4.2) as now we have that (5.4) is approximated by

 MSEξp(^tgreg)≈β211[MSEp(∑svkπk)+F0(∑Ug(xk|β)2πk−∑Ug(xk|β)2)]

with given by (5.5). This expression depends neither on the intercept nor the parameter , and the slope becomes a proportionality constant that can be ignored.

## 6 Numerical examples

In sections 2 and 3 we have established that the strategy that couples ps sampling with the difference estimator is optimal under a superpopulation model, but it is not robust to misspecifications of this model. In subsection 6.1 we present a small Monte Carlo simulation study carried out to illustrate these results by comparing the optimal strategy and three alternatives.

In sections 4 and 5 we introduced a measure that allows for quantifying the risk of implementing a sampling design, so allowing to guide the choice of design. In subsection 6.2 we illustrate the use of the risk measure with real survey data.

### 6.1 Simulation study under a misspecified model

We compare the efficiency and robustness of four strategies through a simulation study. The four strategies to be compared are ps together with the difference estimator (which is optimal when the model is correct), ps together with the GREG estimator (optimal design), stratified simple random sampling –STSI– together with the difference estimator (optimal estimator) and STSI together with the GREG estimator.

Our implementation of ps makes use of Pareto ps (Rosén, 1997). There is a host of other schemes for drawing ps samples. Nevertheless, Pareto ps is a convenient method with good properties, see for example Rosén (2000a).

Our implementation of STSI makes use of model-based stratification (Wright, 1983). We consider strata with boundaries defined using the cum-rule on as in Särndal et al. 1992 (p. 463) and the sample is allocated using Neyman allocation, . Using the cum-rule may be suboptimal (see Särndal et al. 1992, p. 464) but the efficiency of stratification by a continuous size variable is fairly insensitive to the exact choice of boundaries.

We consider only misspecification of the spread. The trend term is of the form with , and , and . The true spread is with , and . The working spread is with , and .

We will use the difference estimator (2.1) calibrated on . Regarding the GREG estimator, we will fix , whereas the coefficients and will be estimated.

The simulation is set out as follows. The population size is . The

-values are independent realizations from a gamma distribution with shape

and scale plus one unit, whereas is a realization from a gamma distribution with shape and scale

 αk=(β10+β11xβ12k)2σ20x2β2kandλk=σ20x2β2kβ10+β11xβ12k,

where was set in such a way that the correlation between and is . The design MSE of a sample of size is then computed for each strategy. The process is iterated times.

Table 6.1 shows the results of the simulation study. The first three columns indicate the model parameters. The fourth column shows the (simulated) expected MSE of the strategy ps–dif, whereas the last three columns show the (simulated) efficiency of the strategies ps–GREG, STSI–dif and STSI–GREG compared to ps–dif (as a percentage), with efficiency defined as

 eff=1BB∑r=1eff(r)whereeff(r)=MSE(r)ξ,πps(^ty)%MSE(r)ξ,p(^ty),

in such a way that a value of 100 indicates that the strategy is as efficient as ps–dif and values smaller (larger) than one indicate that the strategy is less (more) efficient than ps–dif.

The upper part of Table 6.1 shows the case when the working model coincides with the true model. As expected, the strategy that couples ps with the difference estimator (ps–dif) was always more efficient than the remaining strategies. Nevertheless, the loss in efficiency due to estimating some parameters through the GREG estimator is negligible. On the other hand, there is a remarkable loss in efficiency due to the use of STSI instead of ps. Finally, it is noted from (2.6) that as the anticipated MSE for all strategies does not depend on the trend but only on the spread , the efficiency remains constant under the same value of , independently of the value of .

The lower part of Table 6.1 shows some comparisons under a misspecified model, in particular, a misspecified spread. It can be noted that even under this mild misspecification of the model, ps–dif is not necessarily the best strategy anymore as the strategies using STSI were more efficient in several cases. However, it is not evident when will STSI be more efficient than ps or vice versa. The risk measure introduced in section 4 can be used to guide the choice between designs. The results shown in this section agree with those shown by for example Holmberg and Swensson (2001).

### 6.2 Using the risk measure for choosing the design in a real survey

In this subsection we illustrate the implementation of the risk measure using data from a real survey. We want to estimate where is the set of residential properties in Bogotá, Colombia (of size ) and is the value of the th property in 2017 in COP. , the built-up area of the th property in square meters, is known for every . The auxiliary variable

has mean 184, standard deviation 110 and skewness 2.57. The desired sample size is

.

We assume that a model of the type with and adequately describes the association between and . We plan to use the GREG estimator for estimating and , i.e. . We will use the risk (4.1) in order to assist the decision between ps or STSI. strata are used and we take

as a bivariate normal distribution with no correlation between

and . We consider two cases with different degrees of confidence regarding the working model.

#### Case 1.

In this case no information about , or is available. Naive values of , and are considered. In order to reflect the uncertainty, should have a large variance, therefore we set

 [β12β2]∼N([1.01.0],[0.32952000.32952]).

Evaluation of (4.1) yields and , suggesting that a stratified design should be used.

The design MSE of both strategies is computed and we get, and . The strategy suggested by (4.1) was indeed the best choice.

#### Case 2.

Using a sample from 2010, prior values of , and are proposed. As the uncertainty here is smaller than that in case 1, we set a smaller variance,

 [β12β2]∼N([1.92.0],[0.24712000.24712])

Evaluation of (4.1) yields and , suggesting that a stratified design should be used.

The design MSE of both strategies is computed and we get and . Note that the use of (4.1) prevented us from using ps, whose MSE is almost one thousand times bigger than the one under stratified sampling!

## 7 Conclusions

The strategy that couples ps with the difference estimator is optimal when the parameters of the superpopulation model are known. Taking into account that these assumptions are seldom satisfied, it was shown in section 3 and illustrated in subsection 6.1 that this optimality breaks down even under small misspecifications of the model.

In section 4 we propose a method for choosing the sampling design, which is extended to its use with the GREG estimator in section 5. The method allows for taking the uncertainty about the model parameters into account by introducing a prior distribution on them. Although it could be argued that a source of subjectivity is added by introducing a prior distribution on the parameters, our view is that it is more subjective to choose the design without any type of assessment of the assumptions. Furthermore, inference is still design-based, as the prior is used only for choosing the design.

The method was illustrated with a real dataset, yielding satisfactory results. It should be noted that although the illustrations used stratified simple random sampling, the method in this article is valid for any sampling design.

## Appendix. Proof of (4.2)

###### Proof.

The following expectations are required in the proof,

 EξYk =Eξ[f(xk|β1)+ϵk]=f(xk|β1) (.1) EξY2k =Eξ[(f(xk|β1)+ϵk)2]=f(xk|β1)2+σ2g(xk|β2)2 (.2)

, and are obtained using (.1) and (.2),

 Eξ¯¯¯¯Y =Eξ[1N∑UYk]=1N∑UEξYk=1N∑Uf(xk|β1)≡¯¯¯f (.3) Eξ¯¯¯¯¯¯¯Y2 =Eξ[1N∑UY2k]=1N∑U(f(xk|β1)2+σ2g(xk|β2)