# Bayesian Predictive Inference For Finite Population Quantities Under Informative Sampling

We investigate Bayesian predictive inference for finite population quantities when there are unequal probabilities of selection. Only limited information about the sample design is available; i.e., only the first-order selection probabilities corresponding to the sample units are known. Our methodology, unlike that of Chambers, Dorfman and Wang (1998), can be used to make inference for finite population quantities and provides measures of precision and intervals. Moreover, our methodology, using Markov chain Monte Carlo methods, avoids the necessity of using asymptotic closed form approximations, necessary for the other approaches that have been proposed. A set of simulated examples shows that the informative model provides improved precision over a standard ignorable model, and corrects for the selection bias.


## 1 Introduction

Over the last 20 years there has been considerable research on model based inference for finite population quantities when there is a selection bias. Most of this research is summarized in Pfeffermann and Sverchkov (2009). Our work is patterned after that of Chambers, Dorfman and Wang (1998), hereafter CDW, who assumed that the only information about the survey design available to the survey analyst is the set of first-order inclusion probabilities for the sampled units. CDW noted that “it is almost impossible to proceed without fixing ideas on an example”. The example they used is a generalization of one presented by Krieger and Pfeffermann (1992), and we modify it further to make it more plausible for applications. Note that CDW made inference only for superpopulation parameters rather than finite population quantities as we do.

The purpose of our paper is to demonstrate, by example, the value of using Bayesian methods in complicated sample survey situations such as this one, i.e., where there is a selection bias and limited sample information. While completely general solutions to such problems are not available because of differences in the assumptions, our specification should be close to those seen in many surveys. For example, in establishment surveys the selection probability is often proportional to a measure of size which is linearly related to the variable of interest, Y. Also, the distribution of Y is positively skewed, thereby motivating a logarithmic transformation of Y.

For the theoretical development the sample units are assumed to be chosen by Poisson sampling, as CDW did. For the numerical examples, though, the more conventional systematic pps sampling method is used; see Section 3. Let $I_1,\ldots,I_N$ denote independent Bernoulli random variables where $I_i=1$ if unit $i$ is selected in the sample and $I_i=0$ otherwise. Specifically, CDW assumed

$$\Pr(\tilde{I}\mid \pi_1,\ldots,\pi_N)=\Big(\prod_{i\in s}\pi_i\Big)\Big(\prod_{i\notin s}(1-\pi_i)\Big), \qquad (1)$$

where

$$\pi_i=\Pr(I_i=1)=\frac{n\nu_i}{\sum_{j=1}^{N}\nu_j}. \qquad (2)$$

Also, given $\beta_0$, $\beta_1$, $\sigma_e^2$ and the $Y_i$,

$$\nu_i=\beta_0+\beta_1 Y_i+\epsilon_i,\qquad i=1,\ldots,N, \qquad (3)$$

with, independently, $\epsilon_i\sim N(0,\sigma_e^2)$. In (2), the $\pi_i$ corresponding to the sampled units are known prior to sampling.

While (3) is a realistic specification for many establishment surveys, with $\nu_i$ a measure of the size of unit $i$, in other surveys a heavy-tailed distribution for the $\epsilon_i$ may be more appropriate. While CDW assumed that the $Y_i$ follow a normal distribution, we take

$$\log(Y_i)\sim N(\mu,\sigma^2). \qquad (4)$$

Following CDW we assume that the data available to the analyst are the vector of sampled values, $\tilde{y}_s$, the vector of the $\pi_i$ corresponding to the sampled units, and $N$. From (2) it is clear that the $\nu_i$ are not identifiable. While many restrictions are possible, fixing the population sum $\sum_{j=1}^{N}\nu_j$, say, is convenient, and this population sum is commonly available.
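To fix ideas, the data-generating process in (2)-(4) can be sketched as follows. This is a minimal illustration, not the paper's simulation code, and the parameter values are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical settings for illustration only (not the paper's choices).
N, n = 500, 50
mu, sigma = 1.0, 0.5                     # log(Y_i) ~ N(mu, sigma^2), eq. (4)
beta0, beta1, sigma_e = 0.0, 1.0, 0.2    # size-measure model, eq. (3)

# Draw the finite population of responses and size measures.
y = np.exp(rng.normal(mu, sigma, size=N))            # Y_i
nu = beta0 + beta1 * y + rng.normal(0, sigma_e, N)   # nu_i
nu = np.clip(nu, 1e-6, None)                         # size measures must be positive

# First-order selection probabilities, eq. (2); by construction they sum to n.
pi = n * nu / nu.sum()
```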

Our objective is a fully Bayesian analysis yielding exact inferences about any finite population quantity of interest, e.g., finite population quantiles. The analysis here is complicated because, from (2), the selection probabilities for the non-sampled units depend on their unknown $\nu_i$, and we must make inference for the non-sampled $\nu_i$ subject to the restriction that, together with the sampled $\nu_i$, they reproduce the known population sum.

For convenience, we let the units indexed by $1,\ldots,n$ denote the sampled units and the units indexed by $n+1,\ldots,N$ the non-sampled units. Thus, we have $I_i=1$ for $i=1,\ldots,n$ and $I_i=0$ for $i=n+1,\ldots,N$. The selection probabilities, $\pi_i$, and the values, $y_i$, for the sampled units are assumed known and denoted by $\tilde{\pi}_s$ and $\tilde{y}_s$. We use $\tilde{\pi}_{ns}$ and $\tilde{y}_{ns}$ to denote the selection probabilities and response values for the non-sampled units. The vector $(I_1,\ldots,I_N)$ is denoted by $\tilde{I}$. Also $\tilde{\psi}=(\mu,\sigma^2)$ and $\tilde{\eta}=(\beta_0,\beta_1,\sigma_e^2)$ denote the parameters in (4) and in (3), respectively.

We transform $\tilde{\nu}$ to $\tilde{Z}$ so that the first $N-1$ of the $Z_i$'s are centered at 0,

$$Z_1=\nu_1-\bar{\nu},\;Z_2=\nu_2-\bar{\nu},\;\ldots,\;Z_{N-1}=\nu_{N-1}-\bar{\nu},\;Z_N=\bar{\nu},$$

where $\bar{\nu}=\sum_{i=1}^{N}\nu_i/N$.

Since $\tilde{\pi}_s$, $N$ and the population sum of the $\nu_i$ are known, it is clear from (2) that $\tilde{\nu}_s$ is known. The size measures for the non-sampled units, denoted by $\tilde{\nu}_{ns}$, are not known at the estimation stage. From the transformation from $\tilde{\nu}$ to $\tilde{Z}$, $Z_1,\ldots,Z_n$ and $Z_N$ are known and denoted by $\tilde{z}_s$, while $Z_{n+1},\ldots,Z_{N-1}$ are unknown and denoted by $\tilde{z}_{ns}$.
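A minimal sketch of this centering transformation and its inverse (the inverse is spelled out in Appendix A), assuming nothing beyond the definitions above:

```python
import numpy as np

def nu_to_z(nu):
    """(nu_1,...,nu_N) -> (Z_1,...,Z_N): Z_i = nu_i - nu_bar for i < N,
    and Z_N = nu_bar."""
    nu = np.asarray(nu, dtype=float)
    nu_bar = nu.mean()
    return np.append(nu[:-1] - nu_bar, nu_bar)

def z_to_nu(z):
    """Reverse map: nu_i = Z_i + Z_N for i < N, nu_N = Z_N - sum_{i<N} Z_i."""
    z = np.asarray(z, dtype=float)
    return np.append(z[:-1] + z[-1], z[-1] - z[:-1].sum())
```

The round trip `z_to_nu(nu_to_z(nu))` recovers the original size measures.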

Most of the research in this area is frequentist, well summarized in Pfeffermann and Sverchkov (2009), often using approximations, and limited to inferences about finite population means and totals. There are four relevant papers using Bayesian methods. To incorporate selection bias, Malec, Davis and Cao (1999) used a hierarchical Bayes method to estimate a finite population mean when there are binary data. Difficulty in including the selection probabilities directly in the model forces them to make an ad hoc adjustment to the likelihood function and use a Bayes, empirical Bayes (i.e., not a full Bayesian) approach. Nandram and Choi (2010) and Nandram, Bhatta, Bhadra and Shen (2013) extended the Malec, Davis and Cao (1999) model. Pfeffermann et al. (2006) assumed a more complex design than we do, i.e., two level modeling with informative probability sampling at each of the two levels. For the most part they used a conditional likelihood, but presented the methodology (Section 3.2) and an example where they used the full likelihood, but made inference only for the “super-population” parameters. Pfeffermann et al. (2006) assumed that the covariates are known for all units in the population, thus greatly simplifying their analysis. The two differences, i.e., assuming limited information and making inferences for the non-sampled covariates, provide a challenging computational problem. However, once solved, as in our paper, the methodology can be applied.

The work in this paper is based on methodology developed in Ma (2010). Recently, Zangeneh and Little (2015) provided a different approach to a related problem of inference for the finite population total of Y when sampling is with probability proportional to size X. They use a Bayesian bootstrap to make inference for the values of X associated with the non-sampled units, taking account of the assumed known population total of X. Given X, they model Y using penalized splines. The use of the bootstrap avoids parametric assumptions, but assumes, perhaps unreasonably, that only the sampled values of X occur in the population. Their inferential approach, reversing the usual factorization of the distributions of X and the corresponding Y, leads to a dependence, of unknown importance, between the parameters associated with the selection effect and the sampling distribution of Y. For a similar method based on poststratification, see Si, Pillai and Gelman (2015).

In Section 2 we describe the Bayesian methodology while in Section 3 we use simulated examples to compare informative sampling with ignorable sampling and with standard design based methodology based on the Horvitz-Thompson estimator. There is further discussion in Section 4.

## 2 Methodology for Informative Sampling

We describe the Bayesian model and inference in Section 2.1, and the computational methods in Section 2.2.

### 2.1 Model for Informative Sampling and Inference

We have observed $\tilde{y}_s$ and $\tilde{z}_s$; in addition, $N$ is known. The posterior distribution for $\tilde{y}_{ns}$, $\tilde{z}_{ns}$, $\tilde{\psi}$ and $\tilde{\eta}$ can be written as

$$\pi(\tilde{y}_{ns},\tilde{z}_{ns},\tilde{\psi},\tilde{\eta}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I})\propto\pi(\tilde{y},\tilde{z},\tilde{\psi},\tilde{\eta},\tilde{I}),$$

where $\tilde{y}$ and $\tilde{z}$ are the vectors of $y_i$ and $z_i$ for the $N$ units in the finite population, and $(z_{n+1},\ldots,z_{N-1})\in R$,

$$R=\Big\{(z_{n+1},z_{n+2},\ldots,z_{N-1})\ \Big|\ -t_N\le z_i\le t_n-t_N \text{ for } i=n+1,n+2,\ldots,N-1;\ t_N-t_n-\sum_{j=1}^{n}z_j\le\sum_{j=n+1}^{N-1}z_j\le t_N-\sum_{j=1}^{n}z_j\Big\}.$$

In Appendix A we show that the density of $\tilde{z}$ conditional on $\tilde{y}$, $\tilde{\psi}$ and $\tilde{\eta}$ is

$$P(\tilde{z}\mid\tilde{y},\tilde{\psi},\tilde{\eta})=\frac{1}{C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})}\,N(\sqrt{2\pi}\sigma_e)^{-N}\exp\Big\{-\frac{1}{2\sigma_e^2}\Big[t_N^2-2t_N\sum_{i=1}^{N}\theta_i+\Big(\sum_{i=1}^{N-1}z_i+\theta_N\Big)^2+\sum_{i=1}^{N-1}(z_i-\theta_i)^2\Big]\Big\},$$

where $\theta_i=\beta_0+\beta_1 y_i$ and $C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})$ is the normalization constant. From (4), the density of $\tilde{y}$ conditional on $\tilde{\psi}$ and $\tilde{\eta}$ is straightforward,

$$P(\tilde{y}\mid\tilde{\psi},\tilde{\eta})=(\sqrt{2\pi}\sigma)^{-N}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(\log(y_i)-\mu)^2\Big\},$$

and $\pi(\tilde{\psi},\tilde{\eta})$ is the prior distribution. For $(\tilde{\psi},\tilde{\eta})$ we use the non-informative prior distribution, i.e., $\pi(\tilde{\psi},\tilde{\eta})\propto 1/(\sigma^2\sigma_e^2)$.

### 2.2 Computational Methods

For convenience, the posterior distribution can be further rewritten as

$$\pi(\tilde{y}_{ns},\tilde{z}_{ns},\tilde{\psi},\tilde{\eta}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I})\propto\frac{1}{C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})}\,\pi_a(\tilde{y}_{ns},\tilde{z}_{ns},\tilde{\psi},\tilde{\eta}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I}) \qquad (5)$$

with

$$\pi_a(\tilde{y}_{ns},\tilde{z}_{ns},\tilde{\psi},\tilde{\eta}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I})=L\,(\sigma_e\sigma)^{-N-2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(\log(y_i)-\mu)^2\Big\}\exp\Big\{-\frac{1}{2\sigma_e^2}\Big[t_N^2-2t_N\sum_{i=1}^{N}\theta_i+\Big(\sum_{i=1}^{N-1}z_i+\theta_N\Big)^2+\sum_{i=1}^{N-1}(z_i-\theta_i)^2\Big]\Big\},$$

where

$$L=\prod_{i=n+1}^{N-1}\Big[1-\frac{n}{Nz_N}(z_i+z_N)\Big]\cdot\Big[1-\frac{n}{Nz_N}\Big(z_N-\sum_{i=1}^{N-1}z_i\Big)\Big],$$

the product of the factors $1-\pi_i$ over the non-sampled units.

We use the sampling importance resampling (SIR) algorithm (Smith and Gelfand 1992) and the Gibbs sampler (Gelfand and Smith 1990) to perform the computation. The Gibbs sampler is used to draw samples from $\pi_a$, and the SIR algorithm is used to subsample the Gibbs sample to get a sample from the posterior distribution in (5). The Gibbs sampler is described in Appendix B.

Letting $k=1,\ldots,M$, where $M$ is the number of iterates obtained from the Gibbs sampler for draws made from $\pi_a$, the weights in the SIR algorithm are $w_k=\tilde{w}_k/\sum_{k'=1}^{M}\tilde{w}_{k'}$, where

$$\tilde{w}_k=\frac{\pi(\tilde{y}^{(k)}_{ns},\tilde{z}^{(k)}_{ns},\tilde{\psi}^{(k)},\tilde{\eta}^{(k)}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I})}{\pi_a(\tilde{y}^{(k)}_{ns},\tilde{z}^{(k)}_{ns},\tilde{\psi}^{(k)},\tilde{\eta}^{(k)}\mid\tilde{y}_s,\tilde{z}_s,\tilde{I})}.$$

By (5), $\tilde{w}_k\propto 1/C(\tilde{y}^{(k)}_N,\tilde{\psi}^{(k)},\tilde{\eta}^{(k)})$. Thus, the SIR algorithm is performed by drawing iterates from the Gibbs sample without replacement, with probabilities proportional to the $w_k$.

When the SIR algorithm is executed, the normalization constant $C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})$ needs to be evaluated. One can compute $C$ by drawing samples from the multivariate normal density and counting how many samples fall in the region $R_0$ defined in Appendix A. This procedure performs poorly because, when a single value of $\tilde{z}$ is drawn from the multivariate normal density, typically at least one restriction defining $R_0$ is not satisfied. Hence, the estimate of $C$ can be zero, which is not desirable. We get around this difficulty by splitting $R_0$ into two regions and converting this problem into a classical high-dimensional integration problem and a multivariate normal probability problem. The details of evaluating $C$ are given in Appendix C.
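The SIR step itself is simple once the importance weights are in hand. A sketch (the helper below is illustrative; by (5) the weights are proportional to the $1/C$ factors of the iterates):

```python
import numpy as np

def sir_subsample(draws, weights, m, rng=None):
    """Sampling importance resampling: subsample m of the M Gibbs iterates,
    without replacement, with probabilities proportional to the importance
    weights (here, proportional to 1/C for each iterate)."""
    rng = rng or np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(draws), size=m, replace=False, p=w / w.sum())
    return [draws[k] for k in idx]
```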

## 3 Simulation Study

In this section we compare the methodology presented in Section 2, NIG (non-ignorable), with inferences assuming an ignorable model (IG) and standard design based methodology using the Horvitz-Thompson (HT) estimator. For IG the model is given by (4) alone, i.e., without any consideration of the selection probabilities. For the standard design based inference we use the Horvitz-Thompson point estimator, $\hat{Y}_{HT}=\sum_{i\in s}y_i/\pi_i$. (Recall that $N$ is known.) The standard nominal 95 percent interval is $\hat{Y}_{HT}\pm 1.96\sqrt{\hat{V}}$, where $\hat{V}$ is the usual design-based variance estimator.
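As a sketch, here are the HT point estimate of the finite population total and a normal-theory interval. The variance below uses the with-replacement pps approximation, an assumption for illustration; the paper's exact variance estimator is not reproduced here.

```python
import numpy as np

def ht_total(y_s, pi_s):
    """Horvitz-Thompson estimator of the finite population total."""
    y_s, pi_s = np.asarray(y_s, float), np.asarray(pi_s, float)
    return float(np.sum(y_s / pi_s))

def ht_interval(y_s, pi_s, z=1.96):
    """Nominal 95 percent interval; the variance is the with-replacement
    pps approximation, a stand-in for the standard design-based
    variance estimator."""
    y_s, pi_s = np.asarray(y_s, float), np.asarray(pi_s, float)
    n = len(y_s)
    t_hat = np.sum(y_s / pi_s)
    e = n * y_s / pi_s               # single-draw estimates under pps-wr
    v_hat = np.var(e, ddof=1) / n
    half = z * np.sqrt(v_hat)
    return t_hat - half, t_hat + half
```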

Here we emphasize inference for the finite population total, i.e., point estimation and nominal 95 percent intervals. Inference for finite population quantiles and other quantities can be made in a straightforward manner: use (5) and the methodology in the Appendices to make inference for $\tilde{y}_{ns}$, and then for the finite population quantity of interest.

We start by choosing values for the super-population parameters, i.e., $\tilde{\psi}$ and $\tilde{\eta}$. Then we draw a finite population of size $N$ from the joint distribution of $\tilde{Y}$ and $\tilde{\nu}$. From this finite population we draw a sample of size $n$ using systematic pps sampling. We repeat these steps a large number of times. The examples presented in this section use a single choice of $N$ and $n$; we have also used larger values of $N$ but have seen no qualitative differences in our results.
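Systematic pps selection can be sketched as follows; this is a standard textbook implementation, assuming all $\pi_i \le 1$ (the paper does not give its implementation):

```python
import numpy as np

def systematic_pps(pi, rng=None):
    """Systematic pps sampling: lay the inclusion probabilities end to end
    (they sum to n), pick a uniform random start in [0, 1), and take one
    selection point per unit length; return the n sampled indices."""
    rng = rng or np.random.default_rng()
    pi = np.asarray(pi, dtype=float)
    cum = np.cumsum(pi)                  # cum[-1] == n
    n = int(round(cum[-1]))
    points = rng.uniform(0.0, 1.0) + np.arange(n)
    return np.searchsorted(cum, points)  # unit whose interval contains each point
```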

For NIG the methodology is described in Section 2.2. In this study both HPD and equal tailed intervals were used. One thousand Gibbs samples were selected from the approximate posterior distribution (after a burn-in of 5000 runs), and 200 without replacement samples were chosen to implement the SIR algorithm. We have also used larger numbers of Gibbs samples and SIR subsamples, but have seen no qualitative differences in our results.

Our first comparisons use the relative bias, interval coverage and width, averaged over the finite populations, i.e., these are unconditional evaluations. First, in Figure 1 we plot the values of the sample mean (horizontal axis) vs. the posterior mean for the non-sampled units (vertical axis) for NIG for 200 finite populations selected from the super-population. Clearly, as expected, the sample means tend to be large and the posterior means (for the non-sampled units) small, as they should be.

Tables 1 and 2 compare NIG with IG for several choices of the super-population parameters, chosen to yield a range of correlations between $Y$ and $\nu$ but restricted to cases where $\nu$ is proportional to $Y$ (i.e., $\beta_0=0$).

Table 1 presents the relative bias associated with the estimates of the finite population mean. As expected, the relative bias for NIG is very small while that for IG is large for moderate to large correlations. Table 2 compares the average widths and coverages for NIG and IG. Clearly, for moderate to large correlations the average widths of the intervals for IG are much too large, e.g., 4.0554 for IG vs. 0.9287 for NIG. To make the two methods comparable we adjusted the width of each IG interval to make it the same as the corresponding NIG interval and used these intervals to evaluate the adjusted coverage of IG in the last column. For example, in one parameter setting the widths are quite different for IG (2.3391) and NIG (1.2768). Making the width for IG equal to that for NIG, 1.2768, the coverage for IG is 0.7300 (column 6), which should be compared with 0.9600 (column 11), the coverage for NIG.

That is, compare the values of “adjusted CP” for IG with “CP” for NIG. For small correlations the two coverages are similar. However, as expected, for moderate to large correlations the coverage for NIG is generally close to the nominal 0.95 while that for IG is smaller, markedly so for the large correlations.

In Table 3 we compare NIG with the standard design based method using the same set of superpopulation parameters as in Tables 1 and 2. The relative bias of each method is small, as expected.

However, for each set of parameter values, the average width of the HT interval is much larger than that of its NIG counterpart, leading to somewhat better coverage. To make the two methods comparable we adjusted the width of each HT interval to make it the same as the corresponding NIG interval and used these intervals to evaluate the adjusted coverage of HT in the last column. Thus, the coverage for NIG is much better when there is small to moderate correlation while there is little difference when there is a large correlation. This, too, is not surprising since the HT based interval should perform very well when $\nu$ is proportional to $Y$.

We next compare NIG with HT assuming that the intercept, $\beta_0$, is larger than zero. Table 4 gives the relative biases and interval widths and coverages. It is clear that the biases are small, as expected. When the correlation is moderate to large, the nominal 95 percent intervals for HT are much wider than those for NIG. Making the adjustment described above so that the widths of the HT and NIG intervals are the same, the coverage for NIG is better than that for HT, markedly so when the correlation is large, e.g., 0.94 vs. 0.21 when the correlation is 0.99. A referee noted that a comparison with the GREG estimator may have been more appropriate when, as here, $\beta_0>0$.

The results presented in Tables 1-4 are averages over a large set of finite populations and do not display variation over the populations. Figure 2 shows the relative bias (vertical axis) plotted against the 200 finite populations for NIG (filled circle) and HT (empty circle). This is a case with $\beta_0=0$, supposedly favorable to the HT estimator, but one can see that the relative bias of HT varies more widely than the relative bias of NIG. Thus, many individual HT samples will have a large relative bias. In Figure 3, we assume an intercept greater than zero. Here, the relative bias of NIG is essentially constant over the populations while that for HT varies widely. Clearly, these figures show the advantage of using NIG, i.e., more consistent conditional performance.

Finally, note that we have considered a large number of combinations of the super-population parameters $\tilde{\psi}$ and $\tilde{\eta}$. The results shown are typical of this large ensemble of evaluations.

## 4 Summary

We have shown how to carry out a complete analysis of a complicated problem using survey data, i.e., where the analyst has only limited information about the survey design and there is a selection bias. Our model for sample selection in (2) and (3) is appropriate for many establishment surveys, while our specification in (4) should provide a good approximation to the distribution of the responses in many surveys.

Our examples show that relating the selection probabilities to the responses provides more appropriate intervals vis-à-vis a model that does not account for selection bias. This is especially true when the correlation between the variable of interest, $Y$, and the size measure, $\nu$, is high, a situation common in establishment surveys.

There are also (overall) gains for our method when compared with a standard approach based on the Horvitz-Thompson (HT) estimator. Of special note is the improved conditional performance. While the unconditional bias for the HT method may be small, it is common to have substantial variation over the samples. Conversely, the conditional bias associated with our method has significantly less variation (Figures 2-3).

## Appendix A Distribution of ~Z Under Restrictions

Based on the transformation from $\tilde{\nu}$ to $\tilde{Z}$, the reverse transformation is

$$\nu_1=Z_1+Z_N,\;\nu_2=Z_2+Z_N,\;\ldots,\;\nu_{N-1}=Z_{N-1}+Z_N,\;\nu_N=Z_N-\sum_{i=1}^{N-1}Z_i.$$

The Jacobian matrix of this transformation is

$$J=\begin{pmatrix}1&0&0&\cdots&0&1\\0&1&0&\cdots&0&1\\\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\0&0&0&\cdots&1&1\\-1&-1&-1&\cdots&-1&1\end{pmatrix}_{N\times N}$$

and $|J|=N$. For $i=1,\ldots,N$ the distribution of $\nu_i$ given $\tilde{\eta}$ and $Y_i$ is

$$\nu_i\mid\beta_0,\beta_1,\sigma_e^2,Y_i\sim N(\beta_0+\beta_1 Y_i,\ \sigma_e^2),$$

i.e.,

$$f_{\nu_i}(\nu_i)=\frac{1}{\sqrt{2\pi}\sigma_e}\exp\Big\{-\frac{(\nu_i-\beta_0-\beta_1 y_i)^2}{2\sigma_e^2}\Big\}.$$

Because the $\nu_i$'s are independent, the joint distribution of $\tilde{\nu}$ given $\tilde{\eta}$ and $\tilde{y}$ is

$$f_{\tilde{\nu}}(\nu_1,\nu_2,\ldots,\nu_N)=\prod_{i=1}^{N}\Big[\frac{1}{\sqrt{2\pi}\sigma_e}\exp\Big\{-\frac{(\nu_i-\beta_0-\beta_1 y_i)^2}{2\sigma_e^2}\Big\}\Big].$$

Then the distribution of $\tilde{Z}$ given $\tilde{y}$ and $\tilde{\eta}$ can be written as

$$f_{\tilde{Z}}(z_1,z_2,\ldots,z_N)=N(\sqrt{2\pi}\sigma_e)^{-N}\exp\Big\{-\frac{1}{2\sigma_e^2}\Big[Nz_N^2-2z_N\sum_{i=1}^{N}\theta_i+\Big(\sum_{i=1}^{N-1}z_i+\theta_N\Big)^2+\sum_{i=1}^{N-1}(z_i-\theta_i)^2\Big]\Big\},$$

where $\theta_i=\beta_0+\beta_1 y_i$ for $i=1,\ldots,N$.
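The Jacobian factor $|J|=N$ can be verified numerically for small $N$ (a quick sanity check, not part of the paper):

```python
import numpy as np

def jacobian(N):
    """Matrix of the reverse map (Z_1,...,Z_N) -> (nu_1,...,nu_N):
    nu_i = Z_i + Z_N for i < N, and nu_N = Z_N - sum_{i<N} Z_i."""
    J = np.zeros((N, N))
    J[: N - 1, : N - 1] = np.eye(N - 1)   # each nu_i (i < N) gets Z_i
    J[:, N - 1] = 1.0                     # every nu_i gets +Z_N
    J[N - 1, : N - 1] = -1.0              # nu_N subtracts Z_1,...,Z_{N-1}
    return J

# det(J) = N, which is the factor N appearing in the density of Z.
```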

There are also some restrictions for $\tilde{Z}$, which are related to the restrictions for $\tilde{\nu}$. The restrictions for $\tilde{\nu}$ can be summarized as

1. $0\le\nu_i\le t_n$ for $i=1,\ldots,N$.

Based on the transformation from $\tilde{\nu}$ to $\tilde{Z}$, the restrictions for $\tilde{Z}$ are

1. $Z_N=t_N$;

2. $-t_N\le Z_i\le t_n-t_N$ for $i=1,\ldots,N-1$;

3. $t_N-t_n\le\sum_{j=1}^{N-1}Z_j\le t_N$.

Given $Z_N=t_N$, the conditional distribution of $(Z_1,\ldots,Z_{N-1})$ is

$$f(z_1,z_2,\ldots,z_{N-1}\mid z_N=t_N)=\frac{1}{f(z_N=t_N)}f(z_1,z_2,\ldots,z_{N-1},z_N=t_N).$$

Then, under the second and third restrictions for $\tilde{Z}$, the conditional distribution of $(Z_1,\ldots,Z_{N-1})$ is

$$f_{R_0}(z_1,z_2,\ldots,z_{N-1}\mid z_N=t_N)=\frac{f(z_1,z_2,\ldots,z_{N-1}\mid z_N=t_N)}{\int_{R_0}f(z_1,z_2,\ldots,z_{N-1}\mid z_N=t_N)\,dz_1\,dz_2\cdots dz_{N-1}}=\frac{f(z_1,z_2,\ldots,z_{N-1},z_N=t_N)}{\int_{R_0}f(z_1,z_2,\ldots,z_{N-1},z_N=t_N)\,dz_1\,dz_2\cdots dz_{N-1}},$$

where

$$R_0=\Big\{(z_1,z_2,\ldots,z_{N-1})\ \Big|\ -t_N\le z_i\le t_n-t_N\text{ for }i=1,2,\ldots,N-1;\ t_N-t_n\le\sum_{j=1}^{N-1}z_j\le t_N\Big\}.$$

Finally, the distribution of $\tilde{Z}$ under the restrictions is

$$f_{R_0}(z_1,z_2,\ldots,z_{N-1},z_N)=\frac{1}{C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})}\,N(\sqrt{2\pi}\sigma_e)^{-N}\exp\Big\{-\frac{1}{2\sigma_e^2}\Big[t_N^2-2t_N\sum_{i=1}^{N}\theta_i+\Big(\sum_{i=1}^{N-1}z_i+\theta_N\Big)^2+\sum_{i=1}^{N-1}(z_i-\theta_i)^2\Big]\Big\},$$

where

$$C(\tilde{y}_N,\tilde{\psi},\tilde{\eta})=\int_{R_0}N(\sqrt{2\pi}\sigma_e)^{-N}\exp\Big\{-\frac{1}{2\sigma_e^2}\Big[t_N^2-2t_N\sum_{i=1}^{N}\theta_i+\Big(\sum_{i=1}^{N-1}z_i+\theta_N\Big)^2+\sum_{i=1}^{N-1}(z_i-\theta_i)^2\Big]\Big\}\,dz_1\,dz_2\cdots dz_{N-1}.$$
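To make the difficulty noted in Section 2.2 concrete, here is a sketch of the naive Monte Carlo estimate of $C$: check candidate draws against the $R_0$ restrictions and count the accepted fraction. The draw mechanism is left to the caller, and the function names are illustrative.

```python
import numpy as np

def in_R0(z, t_n, t_N):
    """Do (z_1,...,z_{N-1}) satisfy the R_0 restrictions above?"""
    z = np.asarray(z, dtype=float)
    return bool(np.all(z >= -t_N) and np.all(z <= t_n - t_N)
                and t_N - t_n <= z.sum() <= t_N)

def accepted_fraction(draws, t_n, t_N):
    """Fraction of candidate draws (rows) landing in R_0.  The naive Monte
    Carlo estimate of C is proportional to this fraction, which is often
    exactly zero -- the failure mode that motivates Appendix C."""
    return float(np.mean([in_R0(z, t_n, t_N) for z in draws]))
```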

## Appendix B Gibbs Sampler

We use the Gibbs sampler to draw samples from $\pi_a$. In order to perform the Gibbs sampler, we need to find the conditional distributions of $\tilde{y}_{ns}$, $\tilde{z}_{ns}$, $\tilde{\psi}$ and $\tilde{\eta}$, each given everything else.

The conditional distribution of $\tilde{y}_{ns}$, given everything else, is

$$P(\tilde{y}_{ns}\mid\tilde{y}_s,\tilde{z}_N,\tilde{\psi},\tilde{\eta},\tilde{I})\propto\exp\Big\{T_1\sum_{i=n+1}^{N-1}\Big(y_i+\frac{T_i}{2T_1}\Big)^2+T_1\Big(y_N+\frac{T_2}{2T_1}\Big)^2\Big\}, \qquad (6)$$

where

$$T_1=-\frac{1}{2\sigma^2}-\frac{\beta_1^2}{2\sigma_e^2},\qquad T_2=\frac{\mu}{\sigma^2}-\frac{\beta_1\big(\beta_0+\sum_{i=1}^{N-1}z_i-z_N\big)}{\sigma_e^2},$$

$$T_i=\frac{\mu}{\sigma^2}-\frac{\beta_1(\beta_0-z_i-z_N)}{\sigma_e^2},\qquad i=n+1,n+2,\ldots,N-1.$$

From (6), we see that, given $y_{n+1},\ldots,y_{N-1}$ and everything else, $y_N$ has a normal distribution with mean $-T_2/(2T_1)$ and variance $-1/(2T_1)$, i.e.,

$$y_N\mid y_{n+1},\ldots,y_{N-1},\tilde{y}_s,\tilde{z}_N,\tilde{\psi},\tilde{\eta},\tilde{I}\sim N\Big(-\frac{T_2}{2T_1},\ -\frac{1}{2T_1}\Big).$$

Also, given the other components of $\tilde{y}_{ns}$ and everything else, $y_i$ has a normal distribution with mean $-T_i/(2T_1)$ and variance $-1/(2T_1)$, i.e.,

$$y_i\mid y_{n+1},\ldots,y_{i-1},y_{i+1},\ldots,y_{N-1},y_N,\tilde{y}_s,\tilde{z}_N,\tilde{\psi},\tilde{\eta},\tilde{I}\sim N\Big(-\frac{T_i}{2T_1},\ -\frac{1}{2T_1}\Big)$$

for $i=n+1,\ldots,N-1$.
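One of these full-conditional draws can be sketched as follows (an illustrative helper, with $T_1$ and $T_2$ computed as defined above; `z` holds $(z_1,\ldots,z_N)$):

```python
import numpy as np

def draw_yN(mu, sigma, beta0, beta1, sigma_e, z, rng=None):
    """Draw y_N from its full conditional: normal with mean -T_2/(2 T_1)
    and variance -1/(2 T_1); note T_1 < 0, so the variance is positive."""
    rng = rng or np.random.default_rng()
    z = np.asarray(z, dtype=float)
    T1 = -1.0 / (2 * sigma**2) - beta1**2 / (2 * sigma_e**2)
    T2 = mu / sigma**2 - beta1 * (beta0 + z[:-1].sum() - z[-1]) / sigma_e**2
    return rng.normal(-T2 / (2 * T1), np.sqrt(-1.0 / (2 * T1)))
```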