1 Introduction
Over the last 20 years there has been considerable research on model-based inference for finite population quantities in the presence of selection bias. Most of this research is summarized in Pfeffermann and Sverchkov (2009). Our work is patterned after that of Chambers, Dorfman and Wang (1998), hereafter CDW, who assumed that the only information about the survey design available to the survey analyst is the set of first-order inclusion probabilities for the sampled units. CDW noted that “it is almost impossible to proceed without fixing ideas on an example”. The example they used is a generalization of one presented by Krieger and Pfeffermann (1992), and we modify it further to make it more plausible for applications. Note that CDW made inference only for superpopulation parameters rather than for finite population quantities as we do.
The purpose of our paper is to demonstrate, by example, the value of using Bayesian methods in complicated sample survey situations such as this one, i.e., where there is a selection bias and limited sample information. While completely general solutions to such problems are not available because of differences in the assumptions, our specification should be close to those seen in many surveys. For example, in establishment surveys the selection probability is often proportional to a measure of size which is linearly related to the variable of interest, Y. Also, the distribution of Y is positively skewed, thereby motivating a logarithmic transformation of Y.
For the theoretical development the sample units are assumed to be chosen by Poisson sampling, as in CDW. For the numerical examples, though, the more conventional systematic pps sampling method is used; see Section 3. Let I_1, ..., I_N denote independent Bernoulli random variables, where I_i = 1 if unit i is selected in the sample and I_i = 0 otherwise. Specifically, CDW assumed
(1)
where
(2) 
Also given and the ,
(3) 
with independent errors. In (2), the quantities corresponding to the sampled units are known prior to sampling.
While (3) is a realistic specification for many establishment surveys with a measure of size of unit , in other surveys a heavy-tailed distribution for the may be more appropriate. Whereas CDW assumed that the follow a normal distribution, we take
(4) 
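Although the exact forms of (1)-(4) rest on notation defined above, the narrative description (a positive, skewed size measure; Y linearly related to the size measure on the log scale; selection probability proportional to size; Poisson sampling) can be sketched as follows. All functional forms and parameter values here are illustrative assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population: skewed size measures x, y related to x on the
# log scale, inclusion probabilities proportional to x (all assumed forms).
N, n = 1000, 100
x = rng.lognormal(mean=0.0, sigma=0.4, size=N)         # positive, skewed sizes
log_y = 0.5 + 1.0 * np.log(x) + rng.normal(0, 0.3, N)  # linear on the log scale
y = np.exp(log_y)

pi = np.minimum(n * x / x.sum(), 1.0)  # first-order inclusion probabilities
I = rng.random(N) < pi                 # Poisson sampling: independent Bernoulli draws

# The induced selection bias: units with large y are more likely to be
# sampled, so the expected mean of the sampled y's exceeds the population mean.
expected_sampled_mean = np.sum(pi * y) / np.sum(pi)
print(expected_sampled_mean > y.mean())
```

With the inclusion probabilities increasing in the size measure and y positively related to it, the inclusion-weighted mean exceeds the unweighted population mean; this is the bias that a nonignorable analysis must correct.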
Following CDW we assume that the data available to the analyst are the vector of sampled values,
, the vector of corresponding to the sampled units, and . From (2) it is clear that the are not identifiable. While many restrictions are possible, taking , say, is convenient, and this population sum is commonly available.

Our objective is a fully Bayesian analysis yielding exact inferences about any finite population quantity of interest, e.g., finite population quantiles. The analysis here is complicated because, from (2),
(i.e., for the nonsampled ) and we must make inference for the nonsampled subject to the restriction that .

For convenience, we let the units indexed by denote the sampled units and the units indexed by denote the nonsampled units. Thus, we have for and for . The selection probabilities, , and the values, , for the sampled units are assumed known and denoted by and . We use and to denote the selection probabilities and response values for the nonsampled units. The vector is denoted by . Also, and denote the parameters in (4) and in (3), respectively.
We transform to so that the ’s are centered at 0,
where .
Since and are known, it is clear from (2) that is known. The size measures for the nonsampled units, , denoted by
, are not known at the estimation stage. From the transformation from
to , are known and denoted by while are unknown and denoted by .

Most of the research in this area is frequentist, well summarized in Pfeffermann and Sverchkov (2009), often using approximations, and limited to inferences about finite population means and totals. There are four relevant papers using Bayesian methods. To incorporate selection bias, Malec, Davis and Cao (1999) used a hierarchical Bayes method to estimate a finite population mean when there are binary data. Difficulty in including the selection probabilities directly in the model forced them to make an ad hoc adjustment to the likelihood function and to use a Bayes, empirical Bayes (i.e., not fully Bayesian) approach. Nandram and Choi (2010) and Nandram, Bhatta, Bhadra and Shen (2013) extended the Malec, Davis and Cao (1999) model. Pfeffermann et al. (2006) assumed a more complex design than we do, i.e., two-level modeling with informative probability sampling at each of the two levels. For the most part they used a conditional likelihood, but they also presented methodology (their Section 3.2) and an example using the full likelihood, making inference only for the “superpopulation” parameters. Pfeffermann et al. (2006) assumed that the covariates are known for all units in the population, thus greatly simplifying their analysis. The two differences, i.e., assuming limited information and making inferences for the nonsampled covariates, pose a challenging computational problem. However, once this problem is solved, as in our paper, the methodology can be applied.
The work in this paper is based on methodology developed in Ma (2010). Recently, Zangeneh and Little (2015) provide a different approach to a related problem of inference for the finite population total of Y when sampling is with probability proportional to size X. They use a Bayesian bootstrap to make inference for the values of X associated with the nonsampled units, taking account of the assumed known population total of X. Given X, they model Y using penalized splines. The use of the bootstrap avoids parametric assumptions, but assumes, perhaps unreasonably, that only the sampled values of X occur in the population. Their inferential approach, reversing the usual factorization of the distributions of and corresponding , leads to a dependence, of unknown importance, between the parameters associated with the selection effect and the sampling distribution of . For a similar method based on poststratification, see Si, Pillai and Gelman (2015).
In Section 2 we describe the Bayesian methodology, while in Section 3 we use simulated examples to compare informative sampling with ignorable sampling and with standard design-based methodology based on the Horvitz-Thompson estimator. There is further discussion in Section 4.
2 Methodology for Informative Sampling
We describe the Bayesian model and inference in Section 2.1, and the computational methods in Section 2.2.
2.1 Model for Informative Sampling and Inference
We have observed and , the vector of sampled . In addition, is known. The posterior distribution for and can be written as
where , and and are the vectors of , and for the N units in the finite population, and ,
From (4), we have
In Appendix A we have shown that the density of conditional on and is
where and is the normalization constant. The density of conditional on and is straightforward,
and is the prior distribution. For we use the noninformative prior distribution, i.e., .
2.2 Computational Methods
For convenience, the posterior distribution can be further rewritten as
(5) 
with and
where
We use the sampling importance resampling (SIR) algorithm (Smith and Gelfand 1992) and the Gibbs sampler (Gelfand and Smith 1990) to perform the computation. The Gibbs sampler is used to draw samples from and the SIR algorithm is used to subsample the Gibbs sample to get a sample from the posterior distribution, . The Gibbs sampler is described in Appendix B.
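The SIR step can be illustrated in a toy setting where the proposal can be sampled directly (in the paper the proposal draws come from the Gibbs sampler); the normal target and proposal below are illustrative assumptions, not the paper's densities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Proposal: N(0, 2^2); target: N(1, 1).  M proposal draws are reweighted and
# a subsample of size m is drawn without replacement, as in the SIR step.
M, m = 20000, 2000
draws = rng.normal(0.0, 2.0, size=M)

# Unnormalized log importance weights: log target density - log proposal density.
log_w = -0.5 * (draws - 1.0) ** 2 + 0.5 * (draws / 2.0) ** 2
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Resample without replacement with probability proportional to the weights.
idx = rng.choice(M, size=m, replace=False, p=w)
resampled = draws[idx]

print(round(resampled.mean(), 2))  # should be close to the target mean, 1.0
```

Subtracting the maximum log weight before exponentiating is the usual guard against overflow; the without-replacement subsampling matches the SIR step described above.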
Letting where M is the number of iterates obtained from the Gibbs sampler for draws made from , the weights in the SIR algorithm are , where
By (5), Thus, the SIR algorithm is performed by drawing iterates from without replacement.
When the SIR algorithm is executed, the normalization constant must be evaluated. One can estimate it by drawing samples from the multivariate normal density and counting how many fall in the region defined in Appendix A. This procedure performs poorly here because a single draw from the multivariate normal density almost always violates at least one of the restrictions, so the estimate can be exactly zero, which is unacceptable. We get around this difficulty by splitting the region into two parts, converting the problem into a classical high-dimensional integration problem and a multivariate normal probability computation. The details are given in Appendix C.
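The failure mode of the naive hit-or-miss estimator is easy to demonstrate; the rectangular regions below are hypothetical stand-ins for the region of Appendix A, chosen only to show the effect of the region's probability content.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hit-or-miss Monte Carlo estimate of C = P(Z in R) for Z ~ N(0, I_d):
# draw from the multivariate normal and count the fraction landing in R.
d, M = 10, 10000
z = rng.standard_normal((M, d))

# A high-probability region: every coordinate below 2.  Hits are common
# and the estimate is usable.
c_easy = np.all(z < 2.0, axis=1).mean()

# A low-probability region: every coordinate below -1.  Virtually no draw
# satisfies all d restrictions at once, so the estimate is (almost surely)
# exactly zero -- the failure mode noted in the text.
c_hard = np.all(z < -1.0, axis=1).mean()

print(c_easy, c_hard)
```

Splitting the region, as in Appendix C, replaces the rare-event count with quantities that can each be estimated reliably.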
3 Simulation Study
In this section we compare the methodology presented in Section 2, denoted NIG (nonignorable), with inferences assuming an ignorable model (IG) and with standard design-based methodology using the Horvitz-Thompson (HT) estimator. For IG the model is given by (2) and (3), i.e., without any consideration of the selection probabilities. For the standard design-based inference we use the Horvitz-Thompson point estimator, . (Recall that is known.) The standard 95 percent interval is where .
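For concreteness, here is a minimal sketch of the HT point estimator and a normal-theory interval for the population total; the variance estimator below is the one exact under Poisson sampling, used only for illustration (the simulations use systematic pps, for which other variance estimators are standard).

```python
import numpy as np

def ht_total(y, pi, z=1.96):
    """Horvitz-Thompson estimate of the population total with a
    normal-theory interval (variance formula exact for Poisson sampling)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    t_hat = np.sum(y / pi)                       # sum of design-weighted values
    v_hat = np.sum((1.0 - pi) * (y / pi) ** 2)   # Poisson-sampling variance estimate
    half = z * np.sqrt(v_hat)
    return t_hat, (t_hat - half, t_hat + half)

# Tiny worked example with hypothetical data.
t_hat, ci = ht_total([3.0, 5.0, 2.0, 8.0], [0.2, 0.4, 0.2, 0.5])
print(t_hat)  # 3/0.2 + 5/0.4 + 2/0.2 + 8/0.5 = 53.5
```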
Here we emphasize inference for the finite population total, i.e., point estimation and nominal 95 percent intervals. Inference for finite population quantiles and other quantities can be made in a straightforward manner: use (5) and the methodology in the Appendices to make inference for the nonsampled values, and then for the quantity of interest.
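Once draws of the nonsampled responses are in hand (the SIR output), any finite population quantity is obtained draw by draw; the following sketch uses simulated stand-ins for the sampled values and for the posterior draws of the nonsampled values.

```python
import numpy as np

rng = np.random.default_rng(3)

n, N = 100, 1000
y_s = rng.lognormal(0.5, 0.6, size=n)                  # sampled y's (stand-in)
draws_ns = rng.lognormal(0.2, 0.6, size=(200, N - n))  # 200 posterior draws (stand-in)

# Posterior draws of the finite population median: for each draw, combine the
# fixed sampled values with that draw of the nonsampled values.
pop_medians = np.array(
    [np.median(np.concatenate([y_s, d])) for d in draws_ns]
)

point = np.median(pop_medians)                     # posterior point estimate
lo, hi = np.quantile(pop_medians, [0.025, 0.975])  # equal-tailed 95% interval
print(lo < point < hi)
```

The same loop with `np.quantile` or `np.sum` in place of `np.median` gives any other finite population quantile or the total.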
We start by choosing values for the superpopulation parameters, i.e., and . Then we draw a finite population of size
from the joint distribution of
and . From this finite population we draw a sample of size using systematic pps sampling. We repeat these steps times. The examples presented in this section use and . We have also used and larger values of but have seen no qualitative differences in our results.

For NIG the methodology is described in Section 2.2. In this study both HPD and equal-tailed intervals were used. One thousand Gibbs samples were selected from the approximate posterior distribution (after a burn-in of 5,000 runs), and 200 without-replacement samples were chosen to implement the SIR algorithm. We have also used larger numbers of Gibbs samples and SIR subsamples, but have seen no qualitative differences in our results.
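A standard implementation of systematic pps selection (the textbook cumulative-size scheme; the authors' exact implementation is not specified) looks like this:

```python
import numpy as np

def systematic_pps(x, n, rng):
    """Systematic pps sample of size n: lay the inclusion probabilities
    n*x/sum(x) end to end on a cumulative scale and select the units hit
    by n points spaced one unit apart from a random start."""
    p = n * np.asarray(x, dtype=float) / np.sum(x)  # first-order inclusion probs
    if np.any(p > 1.0):
        raise ValueError("some size measure is too large for this n")
    cum = np.cumsum(p)                              # cumulative scale, ending at n
    points = rng.uniform(0.0, 1.0) + np.arange(n)   # n equally spaced points
    # unit i is hit when a point falls in (cum[i-1], cum[i]]
    return np.minimum(np.searchsorted(cum, points), len(p) - 1)

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.3, size=500)  # skewed size measures
sample = systematic_pps(x, n=100, rng=rng)
print(len(sample), len(set(sample.tolist())))  # 100 distinct units
```

Because every inclusion probability is below one and the points are exactly one unit apart, no unit can be selected twice.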
Our first comparisons use the relative bias, interval coverage and width, averaged over the finite populations, i.e., these are unconditional evaluations. First, in Figure 1 we plot the sample mean against the posterior mean for the nonsampled units under NIG for 200 finite populations selected from the superpopulation with . Clearly, as expected, the sample means tend to be large while the posterior means for the nonsampled units tend to be small, as they should be when larger units are more likely to be selected.
Tables 1 and 2 compare NIG with IG for several choices of the superpopulation parameters, chosen to yield a range of correlations between and but restricted to cases where is proportional to .
Relative Bias

                      IG                  NIG
               corr   total     E(Y)     total     E(Y)
0.5    0.16    0.25   0.0355    0.0283   0.0106    0.0034
0.5    0.38    0.57   0.2177    0.1893   0.0159    0.0058
0.5    0.70    0.86   0.9884    0.8246   0.0049    0.0694
1.0    0.10    0.26   0.0108    0.0108   0.0008    0.0010
1.0    0.25    0.58   0.0754    0.0746   0.0050    0.0045
1.0    0.50    0.85   0.3798    0.3544   0.0123    0.0257
1.5    0.06    0.26   0.0069    0.0067   0.0027    0.0026
1.5    0.15    0.56   0.0320    0.0305   0.0060    0.0046
1.5    0.35    0.86   0.1812    0.1765   0.0034    0.0003
2.0    0.04    0.28   0.0011    0.0015   0.0004    0.0000
2.0    0.10    0.59   0.0158    0.0151   0.0058    0.0051
2.0    0.20    0.83   0.0529    0.0524   0.0040    0.0036
Table 1 presents the relative bias associated with the finite population total and the finite population mean. As expected, the relative bias for NIG is very small while that for IG is large for moderate to large correlations. Table 2 compares the average widths and coverages for NIG and IG. Clearly, for moderate to large correlations the average widths of the intervals for IG are much too large, e.g., 4.0554 for IG vs. 0.9287 for NIG. To make the two methods comparable we adjusted the width of each IG interval to make it the same as the corresponding NIG interval and used these intervals to evaluate the adjusted coverage of IG. For example, in the last row of Table 2 the widths are quite different for IG (2.3391) and NIG (1.2768). Making the width for IG equal to that for NIG, 1.2768, the coverage for IG is 0.7300 (“adjusted CP”), which should be compared with 0.9600, the coverage for NIG.
95% CI

                      IG: total                       IG: E(Y)                        NIG: total         NIG: E(Y)
               corr   width     CP      adjusted CP   width    CP      adjusted CP    width     CP       width     CP
0.5    0.16    0.25   0.3961    0.9400  0.9200        0.3753   0.9300  0.9100         0.3488    0.9050   0.3301    0.9400
0.5    0.38    0.57   1.5487    0.8850  0.5400        1.3378   0.9000  0.5400         0.7681    0.9150   0.7076    0.9100
0.5    0.70    0.86   10.6780   0.9650  0.0150        8.7670   0.9700  0.0050         1.0286    0.9500   0.7913    0.8100
1.0    0.10    0.26   0.3679    0.9150  0.8600        0.3489   0.9300  0.8700         0.3295    0.8850   0.3092    0.8850
1.0    0.25    0.58   1.0931    0.9450  0.7750        1.0354   0.9400  0.7900         0.7685    0.9350   0.7128    0.9200
1.0    0.50    0.85   4.3844    0.9700  0.1250        4.0554   0.9600  0.0750         1.1294    0.9400   0.9287    0.9100
1.5    0.06    0.26   0.3771    0.9350  0.9150        0.3577   0.9450  0.9100         0.3351    0.9150   0.3142    0.9150
1.5    0.15    0.56   0.9624    0.9500  0.8350        0.9121   0.9550  0.8200         0.7122    0.9050   0.6501    0.9200
1.5    0.35    0.86   3.4085    0.9600  0.3450        3.2234   0.9700  0.2800         1.2770    0.9250   1.0563    0.9450
2.0    0.04    0.28   0.4136    0.9500  0.8750        0.3923   0.9350  0.8850         0.3524    0.8800   0.3286    0.8750
2.0    0.10    0.59   1.0556    0.9750  0.9150        1.0015   0.9700  0.8800         0.7795    0.9350   0.7082    0.9100
2.0    0.20    0.83   2.3391    1.0000  0.7300        2.2170   0.9800  0.6450         1.2768    0.9600   1.0809    0.8950
That is, for either quantity, compare the values of “adjusted CP” for IG with “CP” for NIG. For small correlations the two coverages are similar. However, as expected, for moderate to large values of the correlation the coverage for NIG is generally close to the nominal 0.95 while that for IG is smaller, markedly so for the large correlations.
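The width adjustment used in Tables 2-4 can be mimicked as follows; keeping each interval centred at its own midpoint while shrinking it to the comparison width is our reading of the description above, and the numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

def coverage(centers, widths, truth):
    """Fraction of intervals (center +/- width/2) containing the truth."""
    return float(np.mean(np.abs(centers - truth) <= widths / 2.0))

truth = 10.0
centers = truth + rng.normal(1.0, 0.5, size=200)  # biased interval midpoints
wide = np.full(200, 6.0)                          # original (wide) widths
narrow = np.full(200, 2.0)                        # comparison (NIG-like) widths

cp = coverage(centers, wide, truth)        # raw coverage: wide intervals cover easily
adj_cp = coverage(centers, narrow, truth)  # adjusted: same centres, narrower widths
print(cp >= adj_cp)                        # shrinking widths can only lower coverage
```

The adjustment makes the coverage comparison fair: an interval that covers only because it is very wide loses that advantage once its width is matched to the competitor's.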
In Table 3 we compare NIG with the standard design-based method using the same set of superpopulation parameters as in Tables 1 and 2. The relative bias of each method is small, as expected.
Relative Bias and 95% CI

                      Relative Bias        95% CI: HT                      95% CI: NIG
               corr   HT        NIG        width    CP      adjusted CP    width     CP
0.50   0.16    0.25   0.0312    0.0037     1.0936   0.9700  0.5650         0.3295    0.9400
0.50   0.38    0.57   0.0335    0.0055     1.0027   0.9150  0.8250         0.7082    0.9100
0.50   0.70    0.86   0.0084    0.0694     1.1849   0.8950  0.8150         0.7915    0.7950
1.00   0.10    0.26   0.0210    0.0010     1.2400   0.9900  0.5700         0.3087    0.8800
1.00   0.25    0.58   0.0166    0.0045     1.2630   0.9400  0.8050         0.7099    0.9250
1.00   0.50    0.85   0.0181    0.0261     1.1703   0.8850  0.8850         0.9275    0.8950
1.50   0.06    0.26   0.0040    0.0026     1.2794   1.0000  0.6050         0.3154    0.9250
1.50   0.15    0.56   0.0035    0.0047     1.2439   0.9950  0.8400         0.6574    0.9150
1.50   0.35    0.86   0.0045    0.0002     1.2843   0.9550  0.9150         1.0626    0.9400
2.00   0.04    0.28   0.0012    0.0000     1.2845   1.0000  0.6100         0.3291    0.8800
2.00   0.10    0.59   0.0038    0.0051     1.2869   1.0000  0.8500         0.7078    0.9150
2.00   0.20    0.83   0.0004    0.0037     1.2739   0.9500  0.9250         1.0825    0.9050
However, for each set of parameter values, the average width of the HT interval is much larger than that of its NIG counterpart, leading to somewhat better coverage. To make the two methods comparable we adjusted the width of each HT interval to make it the same as the corresponding NIG interval and used these intervals to evaluate the adjusted coverage of HT in the “adjusted CP” column. With this adjustment, the coverage for NIG is much better when there is small to moderate correlation, while there is little difference when the correlation is large. This, too, is not surprising, since the HT-based interval should perform very well when is proportional to .
We next compare NIG with HT assuming that the intercept is larger than zero. Table 4 gives the relative biases and the interval widths and coverages. It is clear that the biases are small, as expected. When the correlation is moderate to large, the nominal 95 percent intervals for HT are much wider than those for NIG. Making the adjustment described above so that the widths of the HT and NIG intervals are the same, the coverage for NIG is better than that for HT, markedly so when the correlation is large, e.g., 0.94 vs. 0.21 when the correlation is 0.99. A referee noted that a comparison with the GREG estimator may have been more appropriate when, as here, the intercept is nonzero.
Relative Bias and 95% CI

                      Relative Bias        95% CI: HT                      95% CI: NIG
               corr   HT        NIG        width    CP      adjusted CP    width     CP
10     5.5     0.27   0.0130    0.0279     1.9767   0.8850  0.8900         2.0714    0.9100
10     2.5     0.57   0.0056    0.0439     1.5703   0.9350  0.9400         1.7180    0.9300
10     1       0.86   0.0062    0.0020     1.5234   0.9850  0.9350         1.0249    0.9500
10     0.1     0.99   0.0026    0.0016     1.5505   1.0000  0.2400         0.1336    0.9300
50     5.5     0.27   0.0249    0.0310     1.7155   0.8800  0.8950         1.9732    0.9100
50     2.5     0.57   0.0111    0.0370     1.8071   0.9100  0.8750         1.6656    0.9150
50     1       0.86   0.0125    0.0107     1.7643   0.9700  0.8150         1.0444    0.9300
50     0.1     0.99   0.0097    0.0019     1.8269   1.0000  0.2100         0.1414    0.9400
400    5.5     0.27   0.0072    0.0556     1.9680   0.8950  0.8900         2.0915    0.9000
400    2.5     0.57   0.0051    0.0198     1.8365   0.9300  0.9250         1.7082    0.9350
400    1       0.86   0.0048    0.0061     1.8423   0.9700  0.8250         1.0287    0.9350
400    0.1     0.99   0.0058    0.0008     1.9884   1.0000  0.2250         0.1402    0.9550
The results presented in Tables 1-4 are averages over a large set of finite populations and do not display the variation over the populations. Figure 2 plots the relative bias against the 200 finite populations for NIG (filled circles) and HT (empty circles). This is a case with proportional to , supposedly favorable to the HT estimator, but one can see that the relative bias of HT varies much more widely than that of NIG. Thus, many individual HT samples will have a large relative bias. In Figure 3, we assume an intercept greater than zero. Here, the relative bias of NIG is essentially constant over the populations while that for HT varies widely. Clearly, these figures show the advantage of using NIG, i.e., more consistent conditional performance.
Finally, note that we have considered a large number of combinations of the superpopulation parameters, i.e., . The results shown are typical of this large ensemble of evaluations.
4 Summary
We have shown how to carry out a complete analysis of a complicated problem using survey data, i.e., where the analyst has only limited information about the survey design and there is a selection bias. Our model in (2) and (3) is appropriate for many establishment surveys, while our specification of the model for sample selection in (4) should provide a good approximation for many survey designs.
Our examples show that relating the selection probabilities to the responses provides more appropriate intervals vis-à-vis a model that does not account for selection bias. This is especially true when the correlation between the variable of interest, Y, and the covariate is high, a situation common in establishment surveys.
There are also (overall) gains for our method when compared with a standard approach based on the Horvitz-Thompson (HT) estimator. Of special note is the improved conditional performance. While the unconditional bias of the HT method may be small, there is commonly substantial variation over the samples. Conversely, the conditional bias associated with our method has much less variation (Figures 2-3).
Appendix A Distribution of Under Restrictions
Based on the transformation from to , the reverse transformation is
The Jacobian of this transformation can be computed as
and . For the distribution of given and is
i.e.,
Because the ’s are independent, the joint distribution of given and is
Then the distribution of given and can be written as
where for .
There are also some restrictions for , which are related to the restrictions for . Based on the transformation from to , these restrictions carry over to a corresponding (ordered) list of restrictions for .
Given , the conditional distribution of is
Then under the second and third restrictions for , the conditional distribution of is
where
Finally, the distribution of under the restrictions for is
where and
Appendix B Gibbs Sampler
We use the Gibbs sampler to draw samples from . In order to perform the Gibbs sampler, we need to find the conditional distributions of , and respectively given everything else.
The conditional distribution of , given is
(6) 
where
From (6), we see that, given , has a normal distribution with mean and variance , i.e.,
Also, given , has a normal distribution with mean and variance , i.e.,
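The alternating draws from normal full conditionals described above can be illustrated on a generic bivariate normal target (a stand-in for the conditionals in (6); the correlation value is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(11)

# Gibbs sampler for a bivariate normal with unit variances and correlation
# rho: each full conditional is normal, as in the appendix's conditionals.
rho, M, burn = 0.8, 6000, 1000
th1 = th2 = 0.0
out = []
for t in range(M + burn):
    th1 = rng.normal(rho * th2, np.sqrt(1.0 - rho**2))  # th1 | th2 ~ N(rho*th2, 1-rho^2)
    th2 = rng.normal(rho * th1, np.sqrt(1.0 - rho**2))  # th2 | th1 ~ N(rho*th1, 1-rho^2)
    if t >= burn:
        out.append((th1, th2))
out = np.array(out)

print(round(float(np.corrcoef(out.T)[0, 1]), 1))  # sample correlation, near rho
```

After the burn-in, the retained pairs behave as (correlated) draws from the joint target, so their sample correlation recovers rho.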