# Unit Level Modeling of Survey Data for Small Area Estimation Under Informative Sampling: A Comprehensive Overview with Extensions

Model-based small area estimation is frequently used in conjunction with survey data in order to establish estimates for under-sampled or unsampled geographies. These models can be specified at either the area-level, or the unit-level, but unit-level models often offer potential advantages such as more precise estimates and easy spatial aggregation. Nevertheless, relative to area-level models literature on unit-level models is less prevalent. In modeling small areas at the unit level, challenges often arise as a consequence of the informative sampling mechanism used to collect the survey data. This paper provides a comprehensive methodological review for unit-level models under informative sampling, with an emphasis on Bayesian approaches. To provide insight into the differences between methods, we conduct a simulation study that compares several of the described approaches. In addition, the methods used for simulation are further illustrated through an application to the American Community Survey. Finally, we present several extensions and areas for future research.

## Authors

• 6 publications
• 4 publications
• 11 publications
• ### Conjugate Bayesian Unit-level Modeling of Count Data Under Informative Sampling Designs

Unit-level models for survey data offer many advantages over their area-...
10/15/2019 ∙ by Paul A. Parker, et al. ∙ 0

• ### Small Area Estimation of Health Outcomes

Small area estimation (SAE) entails estimating characteristics of intere...
06/18/2020 ∙ by Jon Wakefield, et al. ∙ 0

• ### Small area estimation for grouped data

This paper proposes a new model-based approach to small area estimation ...
03/18/2019 ∙ by Yuki Kawakubo, et al. ∙ 0

• ### Computationally Efficient Deep Bayesian Unit-Level Modeling of Survey Data under Informative Sampling for Small Area Estimation

The topic of deep learning has seen a surge of interest in recent years ...
09/16/2020 ∙ by Paul A. Parker, et al. ∙ 0

• ### Computationally Efficient Bayesian Unit-Level Models for Non-Gaussian Data Under Informative Sampling

Statistical estimates from survey samples have traditionally been obtain...
09/11/2020 ∙ by Paul A. Parker, et al. ∙ 0

• ### Interpolating Distributions for Populations in Nested Geographies using Public-use Data with Application to the American Community Survey

Statistical agencies often publish multiple data products from the same ...
02/07/2018 ∙ by Matthew Simpson, et al. ∙ 0

• ### Multilevel Models Allow Modular Specification of What and Where to Regularize, Especially in Small Area Estimation

Through the lense of multilevel model (MLM) specification and regulariza...
05/21/2018 ∙ by Michael Tzen, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Government agencies have seen an increase in demand for data products in recent years. One trend that has accompanied this demand is the need for granular estimates of parameters of interest at small spatial scales or subdomains of the finite population. Typically, sample surveys are designed to provide reliable estimates of the parameters of interest for large domains. However, for some subpopulations, the area-specific sample size may be too small to produce estimates with adequate precision. The term small area is used to refer to any domain of interest, such as a geographic area or demographic cross classification, for which the domain-specific sample size is not large enough for reliable direct estimation. To improve precision, model-based methods can be used to ‘borrow strength,’ by relating the different areas of interest through use of linking models, and by introducing area-specific random effects and covariates. The Small Area Income and Poverty Estimates (SAIPE) program, and the Small Area Health Insurance Estimates (SAHIE) program within the U. S. Census Bureau are two examples of government programs which produce county and sub-county level estimates for different demographic cross classifications across the entire United States using small area estimation (SAE) methods (Luery, 2011; Bauder et al., 2018). It can be difficult to generate small area estimates such as these for a number of reasons, including the fact that many geographic areas may have very limited sample sizes, if they have been sampled at all.

Models for SAE may be specified either at the area level or the unit level (see Rao and Molina, 2015, for an overview of small area estimation methodology). Area-level models treat the direct estimate (for example, the survey-weighted estimate of a mean) as the response, and typically induce some type of smoothing across areas. In this way, the areas with limited sample sizes may “borrow strength” from areas with larger samples. While area-level models are popular, they are limited, in that it is difficult to make estimates and predictions at a geographic or demographic level that is finer than the level of the aggregated direct estimate.

In contrast, unit-level models use individual survey units as the response data, rather than the direct estimates. Use of unit-level models can overcome some of the limitations of area-level models, as they constitute a bottom-up approach (i.e., they utilize the finest scale of resolution of the data). Since model inputs are at the unit-level (person-level, household-level, or establishment-level), predictions and estimates can be made at the same unit-level, or aggregated up to any desired level. Unit-level modeling also has the added benefit of ensuring logical consistency of estimates at different geographic levels. For example, model-based county estimates are forced to aggregate to the corresponding state-level estimates, eliminating the need for ad hoc benchmarking. In addition, because the full unit-level data set is used in the modeling, rather than the summary statistics used with area-level models, there is potential for improved precision of estimated quantities.

Although unit-level models may lead to more precise estimates that aggregate naturally across different spatial resolutions, they also introduce new challenges. Perhaps the biggest challenge is how to account for the survey design in the model. With area-level models, the survey design is incorporated into the model through specification of a sampling distribution (typically taken to be Gaussian) and inclusion of direct variance estimates. With unit-level models, accounting for the survey design is not as simple. One challenge is that the sample unit response may be dependent on the probability of selection, even after conditioning on the design variables. When the response variables are correlated with the sample selection variables, the sampling scheme is said to be

informative, and in these scenarios it is critical to capture the sample design by including the survey weights, or the design variables used to construct the survey weights in the model, in order to avoid bias.

The aim of this paper is to present a comprehensive literature review of unit-level small area modeling strategies, with an emphasis on Bayesian approaches, and to evaluate a selection of these strategies by fitting different unit-level models on both simulated data, and on real American Community Survey (ACS) data, thereby comparing model-based predictions and uncertainty estimates. In this paper we focus mainly on model specification and methods which incorporate informative sampling designs into the small area model. Some important, related issues, that will be outside the scope of this paper include issues related to measurement error and adjustments for nonresponse. Generally, we assume that observed survey weights have been modified to take into account nonresponse. We also avoid discussion on the relative merits of frequentist versus Bayesian methods for inference. In the simulation studies and data examples given in Sections 7 and 8, we fit three unit-level small area Bayesian models, with vague, proper priors on all unknown model parameters. Inference on the finite population parameters of interest is done using the posterior mean as a point estimate, and the posterior variance as a measure of uncertainty.

Some related work includes Hidiroglou and You (2016), who present a simulation study to compare area-level and unit-level models, both fit in a frequentist setting. They fit their models under both informative and noninformative sampling, and found that overall the unit-level models lead to better interval coverage and more precise estimates. Gelman (2007) discuss poststratification using survey data, and compare the implied weights of various models including hierarchical regression. Lumley et al. (2017) discuss general techniques for modeling survey data with included weights. They focus on frequentist pseudo-likelihood estimation as well as hypothesis testing. Kim et al. (2017) considers a psuedo-likelihood method with EM Monte Carlo estimation under a two‐-stage cluster sampling scheme. Chapter 7 of Rao and Molina (2015) provides an overview of some commonly used unit-level small area models. The current paper adds to this literature by providing a comprehensive review of unit-level small area modeling techniques, with a focus on methods which account for informative sampling designs. We mainly use Bayesian methods for inference, but note that many model-based methods are general enough to be implemented in either setting, and we highlight some scenarios where Bayesian methodology may be used.

The remainder of this paper is organized as follows. Section 2 introduces the sampling framework and notation to be used throughout the paper. We aim to keep the notation internally consistent. This may lead to differences compared to the original authors’ notation styles, but should lead to easier comparison across methodologies. In Section 3 we cover modeling techniques that assume a noninformative survey design. The basic unit-level model is introduced, as well as extensions of this model which incorporate the design variables and survey weights. Methods which allow for an informative design are then discussed, beginning in Section 4. Here, we discuss methods which adjust for an informative design using survey-weighted pseudo-likelihoods. In Section 5 we focus on models that use a sample distribution that differs from the population distribution. We conclude the review component of this paper in Section 6, where we will review models that are specific to a Binomial likelihood, as many variables collected from survey data are binary in nature . In Section 7 we compare three selected models to a direct estimator under a simulation study designed around American Community Survey (ACS) data. Specifically, this simulation examines three Bayesian methods that span different general modeling approaches (pseudo-likelihood, nonparametric regression on the weights, and differing sample/population likelihoods) with the goal of examining the utility of each approach. Similarly, Section 8 uses the same models for a poverty estimates application similar to the Small Area Income and Poverty Estimates program (SAIPE). Finally, we provide concluding remarks in Section 9.

## 2 Background and notation

Consider a finite population of size , which is subset into nonoverlapping domains, , where . These subgroups will typically be small areas of interest, or socio-demographic cross-classifications, such as age by race by gender within the different counties. We use to represent a particular response characteristic associated with unit , and

a vector of predictors for the response variables

.

Let be a vector of design variables which characterize the sampling process. For example, may contain geographic variables used for stratifying the population, or size variables used in a probability proportional to size sampling scheme. A sample is selected according to a sampling scheme with unit inclusion probabilities . Let denote the sampled units in small area , and let . The inverse probability sampling weights are denoted with . We note that as analysts, we may not have access to the functional form of , and may not even have access to the design variables , so that the only information available to us about the survey design is through the observed values of or , for the sampled units in the population. Finally, we let represent the observed data. This simply consists of the response, predictors, and sampling weights for all units included in the sample. In this context, is random, is typically considered fixed and know, and can either be fixed or random depending on the specific modeling assumptions.

The usual inferential goal, and the main focus of this paper, is on estimation of the small area means, , or totals, . In some situations, interest could be on estimation of descriptive population parameters, such as regression coefficients in a linear model, or on estimation of a distribution function. The best predictor, of , under squared error loss, given the observed data , is

 ^¯yi=E(¯yi∣DS)=1Ni∑j∈UiE(yij∣DS)=1Ni∑j∈Siyij+1Ni∑j∈SciE(yij∣DS). (1)

The first term on the right hand side of (1) is known from the observed sample. However, computation of the conditional expectation in the second term requires specification of a model, and potentially, depending on the model specified, auxiliary information, such as knowledege of the covariates or sampling weights for the nonsampled units. For the case where the predictors are categorical, the assumption of known covariates for the nonsampled units is not necessarily restrictive, if the totals, , for each cross-classification in each of the small areas are known. In this case, the last term in (1) reduces to , and only predictions for each cross-classification need to be made.

The predictor given in (1) is general, and the different unit-level modeling methods discussed in this paper are essentially different methods for predicting the nonsampled units under different sets of assumptions on the finite population and the sampling scheme. An entire finite population can then be generated, consisting of the observed, sampled values, along with model-based predictions for the nonsampled individuals. The small area mean can then be estimated by simply averaging appropriately over this population. If the sampling fraction in each small area is small, inference using predicted values for the entire population will be nearly the same as inference using a finite population consisting of the observed values and predicted values for the nonsampled units. In this situation, it may be more convenient to use a completely model-based approach for prediction of the small area means (Battese et al., 1988).

## 3 Unweighted analysis

### 3.1 Ignorable design

First, assume the survey design is ignorable or noninformative. Ignorable designs, such as simple random sampling with replacement, arise when the sample inclusion variable is independent of the response variable . In this situation, the distribution of the sampled responses will be identical to the distribution of nonsampled responses. That is, if a model is assumed to hold for all nonsampled units in the population, then it will also hold for the sampled units, since the sample distribution of , is identical to the population distribution of . In this case, a model can be fit to the sampled data, and the fitted model can then be used directly to predict the nonsampled units, without needing any adjustments due to the survey design.

The nested error regression model or, using the terminology of Rao and Molina (2015), the basic unit-level model, was introduced by Battese et al. (1988) for estimation of small area means using data obtained from a survey with an ignorable design. Consider the linear mixed effects model

 yij=xijβ+vi+eij, (2)

where indexes the different small areas of interest, and indexes the sampled units in small area . Here, the model errors, , are i.i.d. random variables, and the sampling errors, are i.i.d. random variables, independent of the model errors.

Let be the covariance matrix consisting of diagonal elements , and off-diagonal elements . Assuming (2) holds for the sampled units, and the variance parameters and are known, the best linear unbiased predictor (BLUP) of is

 ^¯yi=1Ni∑j∈Siyij+1Ni∑j∈Sci(xTij~β+~vi), (3)

where

 ~β=(m∑i=1XTiV−1iXi)−1(m∑i=1XTiV−1iyi),

is the matrix with rows , and . In (3), as in the general expression in (1), the unobserved are replaced by model predictions. Note that evaluation of (3) requires knowledge of the population mean, of the covariates.

In practice, the variance components and are unknown and need to be estimated. The empirical best linear unbiased predictor (EBLUP) is obtained by substituting estimates, and

, (typically MLE, REML, or moment estimates) of the variance components in the above expressions

(Prasad and Rao, 1990). In addition, this model could easily be fit using Bayesian hierarchical modeling rather than using the EBLUP, which would incorporate the uncertainty from the variance parameters. Molina et al. (2014) developed a Bayesian version of the nested error regression model (2), using noninformative priors on the variance components.

The survey weights do not enter into either the nested error regression model (2) or the EBLUPs of the small area means (3). Because of this, the EBLUP is not design-consistent, unless the sampling design is self-weighting within each small area (Rao and Molina, 2015). You and Rao (2002) proposed a pseudo-EBLUP of the small area means based on the nested error regression model (2), which incorporates the survey weights. In their approach, the regression parameters in (2) are estimated by solving a system of survey-weighted estimating equations

 m∑i=1∑j∈Siwijxij{yij−xTijβ−γiw(¯yiw−¯xTiwβ)}=0, (4)

where , , , and . This is an example of the pseudo-likelihood approach to incorporating survey weights, which is discussed in greater detail in Section 4.

The pseudo-BLUP is the solution to (4) when the variance components and are known, and the pseudo-EBLUP, , is the solution to (4) using plug-in estimates and of the variance components. The pseudo-EBLUP, of the small area mean is then

 ^θiw=^γiw¯yiw+(¯Xi−^γiw¯xiw)T^βw.

Similar to Battese et al. (1988), You and Rao (2002) assumed an ignorable survey design, so that the model (2) holds for both the sampled and nonsampled units. However, You and Rao (2002) showed that inclusion of the survey weights in the pseudo-EBLUP results in a design-consistent estimator. In addition, when the survey weights are calibrated to the population total, so that , the pseudo-EBLUP has a natural benchmarking property, without any additional adjustment, in the sense that

 m∑i=1Ni^θiw=^Yw+(X−^Xw)T^βw,

where and . That is, the weighted sum of area-level pseudo-EBLUPs is equal to a GREG estimator of the population total.

### 3.2 Including design variables in the model

Suppose now that the survey design is informative, so that the way in which individuals are selected in the sample depends in an important way on the value of the response variable . It is well established that when the survey design is informative, that ignoring the survey design and performing unweighted analyses without adjustment can result in substantial biases (Nathan and Holt, 1980; Pfeffermann and Sverchkov, 2007).

One method to eliminate the effects of an informative design is to condition on all design variables (Gelman et al., 1995, Chap. 7). To see this, decompose the response variables as , where are the observed responses for the sampled units in the population, and represents the unobserved variables corresponding to nonsampled individuals. Let be the matrix of sample inclusion variables, so that if is observed and otherwise. The observed data likelihood, conditional on covariate information , and model parameters and , is then

 f(ys,I∣X,θ,ϕ)=∫f(ys,yns,I∣X,θ,ϕ)dyns=∫f(I∣y,X,ϕ)f(y∣X,θ)dyns.

If , the inclusion variables are independent of , conditional on , and the survey design can be ignored. For example, if the design variables are included in , the ignorability condition may hold, and inference can be based on . The problem with attempting to eliminate the effect of the design by conditioning on design variables is often more of a practical one, because neither the full set of design variables, nor the functional relationship between the design and the response variables will be fully known. Furthermore, expanding the model by including sufficient design information so as to ignore the design may make the likelihood extremely complicated or even intractable.

Little (2012)

advocates for a general framework using unit-level Bayesian modeling that incorporates the design variables. For example, if cluster sampling is used, one could incorporate a cluster level random effect into the model, or if a stratified design is used, one might incorporate fixed effects for the strata. The idea is that when all design variables are accounted for in the model, the conditional distribution of the response given the covariates for the sampled units is independent of the inclusion probabilities. Because the model is unit-level and Bayesian, the unsampled population can be generated via the posterior predictive distribution. Doing so provides a distribution for any finite population quantity and incorporates the uncertainty in the parameters. For example, if the population response is generated at draw

of a Markov chain Monte Carlo algorithm,

, then one has implicitly generated a draw from the posterior distribution of the population mean for a given area :

 ¯yU(k)i=∑Nj=1yU(k)jI(j∈Ui)∑Nj=1I(j∈Ui).

If there are

total posterior draws, one could then estimate the mean and standard error of

with

 ^¯yUi=1KK∑j=1¯yU(j)i

and

 ˆSE(^¯yUi)=√Var(^¯yUi).

For complex survey designs, for those practitioners outside of a statistical agency, complete design knowledge might not be realistic. In these cases, the conditional independence assumption might not be appropriate. Nevertheless, it may be possible to evaluate this assumption by comparing a model without the weights to one that regresses on the weights as well as the design variables.

### 3.3 Poststratification

Little (1993) gives an overview of poststratification. To perform poststratification, the population is assumed to contain

categories, or poststratification cells, such that within each category units are independent and identically distributed. Usually these categories are cross-classifications of categorical predictor variables such as county, race, and education level. When a regression model is fit relating the response to the predictors, predictions can be generated for each unit within a cell, and thus for the entire population. Importantly, any desired aggregate estimates can easily be generated from the unit level population predictions.

Gelman and Little (1997) and Park et al. (2006) develop a framework for poststratification via hierarchical modeling. By using a hierarchical model with partial pooling, parameter estimates can be made for poststratification cells without any sampled units, and variance is reduced for cells having few sampled units. Gelman and Little (1997) and Park et al. (2006) provide an example for binary data that uses the following model

 yij|pij∼Bernoulli(pij)logit(pij)=x′ijββ=(γ1,…,γK)γkind∼Nck(0,σ2kIck),k=1,…,K, (5)

where

is a vector of dummy variables for

categorical predictor variables with classes in variable .

Bayesian inference can be performed on this model, leading to a probability, , that is constant within each cell . The number of positive responses within cell can be estimated with , and any higher level aggregate estimates can be made by aggregating the corresponding cells. In some scenarios, the number of units within each cell may not be known, in which case further modeling would be necessary.

## 4 Models with survey weight adjustments

### 4.1 Regressing on the Survey Weights

Prediction of small area quantities using (1) requires estimation of for all nonsampled units in the population. One of the main difficulties in using unit-level model-based methods is the lack of knowledge of the covariates, sampling weights, or population sizes associated with the nonsampled units and small areas, that are needed to make these model-based predictions. To overcome this difficulty, Si et al. (2015) modeled the observed poststratification cells , conditional on , using the multinomial distribution

 (n1,…,nm)∼Multinomial(n;N1/w1∑mi=1Ni/wi,…,Nm/wm∑mi=1Ni/wi) (6)

for poststratification cells .

This model assumes that the unique values of the sample weights determine the poststratification cells, and that the sampling weight and response are the only values known for sampled units. The authors state that in general, this assumption is untrue, because there will be cells with low probability of selection that do not show up in the sample, but the assumption is necessary in order to proceed with the model. This model yields a posterior distribution over the cell population sizes which can be used for poststratification with their response model, which uses a nonparametric Gaussian process regression on the survey weights,

 yij|μ(wi),σ2∼%N(μ(wi),σ2)μ(wi)|β,C(wi,wi′|θ)∼GP(wiβ,C(wi,wi′|θ))π(σ2,β,θ), (7)

for observation in poststratification cell . Here, represents a valid covariance function that depends on parameters

. The authors use a squared exponential function, but other covariance functions could be used in its place. The normal distribution placed over

could be replaced with another distribution in the case of non-Gaussian data. Specifically, the authors explore the Bernoulli response case. This model implicitly assumes that units with similar weights will tend to have similar response values, which is likely not true in general. However, in the absence of any other information about the sampled units, this may be the most practical assumption. Because Si et al. (2015) assume that only the survey weights and response values are known, this methodology cannot be used for small area estimation as presented. However, the model can be extended to include other variables such as county, which would allow for area level estimation.

Vandendijck et al. (2016) extend the work of Si et al. (2015) to be applied to small area estimation. They assume that the poststratification cells are designated by the unique weights within each area. Rather than using the raw weights, they use the weights scaled to sum to the sample size within each area. They then use a similar multinomial model to Si et al. (2015) in order to perform poststratification using the posterior distribution of the poststratification cell population sizes. Assuming a Bernoulli response, they use the data model

 yij|ηij∼Bernoulli(ηij)logit(ηij)=β0+μ(∼wij)+ui+vi (8)

for unit in small area , with designating the scaled weights. Independent area level random effects are denoted by , whereas denotes spatially dependent area level random effects, for which the authors use an intrinsic conditional autoregressive (ICAR) prior. They explore the use of a Gaussian process prior over the function as well as a penalized spline approach. For their Gaussian process prior, they assume a random walk of order one. The multinomial model

 (n1i,…,nLii)∼Multinomial⎛⎝ni;N1i/w(1)i∑Lil=1Nli/w(l)i,…,NLii/w(Li)i∑Lil=1Nli/w(l)i⎞⎠ (9)

is used for poststratification, where and represent the known sample size and unknown population size respectively for poststrata cell in area . The cells are determined by the unique weights in area , with the value of the weight represented by . Although Vandendijck et al. (2016) implement their model with a Bernoulli data example, this is a type of a Generalized Additive Model, and thus other response types in the exponential family may be used as well.

### 4.2 Pseudo-likelihood approaches

Suppose the finite population values , are independent, identically distributed realizations from a superpopulation distribution, . A sample, , is selected according to a survey design with inclusion probabilities . Standard likelihood analysis for inference on using the observed sample, , could produce asymptotically biased estimates when the sampling design is informative (Pfeffermann et al., 1998b). Pseudo-likelihood analysis, introduced by Binder (1983) and Skinner (1989), incorporates the survey weights into the likelihood for design-consistent estimation of .

The pseudo-log-likelihood is defined as

 ∑i∈Swilogfp(yi∣θ). (10)

Inference on can be based on the maximizer, (designating the maximum of the pseudo-likelihood rather than the likelihood), of (10), or equivalently, by solving the system

 ∑i∈Swi∂∂θlogfp(yi∣θ)=0. (11)

An important aspect of small area modeling is the introduction of area specific random effects to link the different small areas and to “borrow strength,” by relating the different areas through the linking model, and introducing auxiliary covariate information, such as administrative records. The presence of random effects and the multilevel structure of small area models means that neither the pseudo-likelihood method nor the related estimating function approach (discussed in Section 4.3) can be directly applied to small area estimation problems. However, Grilli and Pratesi (2004), Asparouhov (2006), Rabe-Hesketh and Skrondal (2006) extended the pseudo-likelihood approach to to accommodate models with hierarchical structure.

Let denote the area specific random effects with common density . The usual choice for is the mean zero normal distribution with unknown variance . Suppose now that the finite population in small areas are i.i.d. realizations from the superpopulation , and let be the sampled units in area . The census marginal log-likelihood is obtained by integrating out the random effects from the likelihood:

 logL(θ)=m∑i=1log∫∏j∈Uifi(yij∣θ,v)φ(v)dv=m∑i=1log∫exp⎧⎨⎩∑j∈Uilogfi(yij∣θ,v)⎫⎬⎭φ(v)dv. (12)

Suppose the survey weights are decomposed into two components, , and , where is the weight for unit in area , given that area has been sampled, and is the weight associated to small area . The pseudo-log-likelihood for the multilevel model can be defined by replacing in (12

) by the design-unbiased estimate,

, to get

 log^L(θ)=m∑i=1wilog∫exp⎧⎨⎩∑j∈Siwj∣ilogf(yij∣θ,v)⎫⎬⎭φ(v)dv. (13)

Analytical expressions for the maximizer of (13) generally do not exist, so the maximum pseudo-likelihood estimator, , must be found by numerical maximization of (13). Grilli and Pratesi (2004) used the NLMIXED procedure within SAS, using appropriately adjusted weights in the replicate statement and a bootstrap for mean squared error estimation. Rabe-Hesketh and Skrondal (2006) used an adaptive quadrature routine using the gllamm program within Stata, and derived a sandwich estimator of the standard errors, finding good coverages in their simulation studies with this estimate.

Rao et al. (2013) noted that both design consistency and design-model consistency of as an estimator of the finite population parameter , or the model parameter , respectively, requires that both the number of areas (or clusters), , and the number of elements within each cluster, , tend to infinity, and that the relative bias of the estimators can be large when the are small. Rao et al. (2013) showed that consistency can be achieved with only tending to infinity (allowing the to be small) if the joint inclusion probabilities, , are available. Their method is to use the marginal joint densities , of elements in a cluster, integrating out the random effects, in the pseudo-log likelihood, and to estimate by maximizing the design-weighted pseudo log likelihood

 lwC(θ)=∑i∈Swi∑j

It was shown in Yi et al. (2016) that the maximizer of (14), , is consistent for the second-level parameters , with respect to the joint superpopulation model and the sampling design.

There are two important considerations when using the pseudo-likelihood in multilevel models. The first is that two sets of survey weights, and (and in the case of the method of Rao et al. (2013), higher-order inclusion probabilities, ) are required, which is not typically the case; access to only the joint survey weights is not sufficient to use the multilevel models, unless all clusters sampled with certainty. The second consideration is that use of unadjusted, second level weights can cause significant bias in estimates of variance components. Scaling the second level weights by cluster has been suggested to reduce small sample bias. For single level models, scaling the weights by any constant factor does not change inference, as the solution to (11) is clearly invariant to any scaling of the weights. However, for multilevel models, the maximum pseudo-likelihood estimator and the associated mean squared prediction error may change depending on how the weights are scaled. Some suggestions include using scaled weights such that: 1) (Asparouhov, 2006; Grilli and Pratesi, 2004; Pfeffermann et al., 1998b), 2) constant scaling across clusters, so that (Asparouhov, 2006) and 3) , where is the effective sample size in cluster (Potthoff et al., 1992; Pfeffermann et al., 1998b; Asparouhov, 2006), and 4) unscaled (Pfeffermann et al., 1998b; Grilli and Pratesi, 2004; Asparouhov, 2006).

Different scaling methods have been reported to have different strengths and weaknesses, depending on the sampling design, although there seems to be agreement that using unscaled weights is generally unacceptable, except under noninformative sampling designs, as this has the potential to produce seriously biased estimates. In the special case of the linear model, different scaling factors have been studied, notably in Pfeffermann et al. (1998b) and Korn and Graubard (2003). Simulation studies indicate that regression coefficients can be estimated with little bias using unscaled survey weights, but that estimates of variance components can suffer from serious bias, particularly when the within cluster sample sizes are small. Asparouhov (2006) suggested weighting Method 1 when the sampling is not too informative and the area sizes are not small. Likewise, Pfeffermann et al. (1998b) tentatively recommended Method 1 as a default scaling method based on the results of numerical studies using nested error regression model (2) with a multilevel design, with Level 2 units sampled with probability proportional size, and Level 1 units selected by stratification. Pfeffermann et al. (1998b) reported better performance of Scaling Method 1 over Method 3 in terms of reducing bias, as Method 3 could ‘over correct’ for certain informative sample designs. However, Korn and Graubard (2003) showed that any scaling method can produce seriously biased variance estimates under informative sampling schemes, even with large sample sizes in each cluster. Korn and Graubard (2003) showed that it is possible to construct a nearly unbiased estimate of the variance component in a simple one-way random-effects model, regardless of the sampling design, although the construction requires knowledge of joint inclusion probabilities. There does not seem to be a single ‘best’ scaling method that can be used without consideration of the sampling scheme, or the working likelihood. Carle (2009) compared the different scaling methods in a series of simulation studies and recommended fitting the multilevel model with a variety of scaling methods, including unweighted methods, as a part of a sensitivity analysis.

The pseudo-log-likelihoods (10) and (13) suggest pseudo-likelihoods

 m∏i∏j∈Sif(yij∣θ)wij (15)

for single level models, and

 m∏i=1⎧⎨⎩∫∏j∈Sif(yij∣xij,θ,vj)wj∣iϕ(vj)dvj⎫⎬⎭wi (16)

for multilevel models (Asparouhov, 2006). The pseudo-likelihood (15) is sometimes called the composite likelihood in general statistical problems, when the weights (not necessarily survey weights) are known positive constants, and its use is popular in problems where the exact likelihood is intractable or computationally prohibitive (Varin et al., 2011).

The pseudo-likelihood (15) is not a genuine likelihood, as it does not incorporate the dependence structure in the sampled data nor the relationship between the responses and the design variables beyond inclusion of the survey weights. However, the pseudo-likelihood has been shown to be a useful tool for likelihood analysis for finite population inference in both the frequentist and Bayesian framework.

By treating the pseudo-likelihood as a genuine likelihood, and specifying a prior distribution on the model parameters , Bayesian inference can be performed on , which in turn allows for prediction of the nonsampled units. For general models, Savitsky and Toth (2016) showed for certain sampling schemes, and for a class of population distributions, that the pseudo-posterior distribution using the survey weighted pseudo-likelihood, with survey weights scaled to sum to the sample size, (15) converges in to the population posterior distribution. This result justifies use of (15) in place of the likelihood in Bayesian analysis of population parameters conditional on the observed sampled units, even when the sample design is informative.

While Savitsky and Toth (2016) showed asymptotic validity of the pseudo-posterior distribution, for finite samples, the pseudo-posterior distribution can be too concentrated, because the survey weights enter the pseudo-likelihood as frequency weights, so that each sampled observation enters the pseudo-likelihood as independent replicates. They recommend rescaling the weights to sum to the sample size.

The authors focus on parameter inference and do not give any advice for making area-level estimates. However, it is straightforward to model with a Bayesian pseudo-likelihood and then apply poststratification after the fact by generating the population, and thus any desired area-level estimates using (1). This type of pseudo-likelihood with poststratification for SAE was demonstrated in a frequentist setting by Zhang et al. (2014).

Ribatet et al. (2012) provides a discussion on the validity of Bayesian inference using the composite likelihood (15) in place of the exact likelihood in Bayes’ formula. An example of this method used in the sample survey context can be found in Dong et al. (2014), which used a weighted pseudo-likelihood with a multinomial distribution as a model for binned response variables. They assumed an improper Dirichlet distribution on the cell probabilities, and used the associated posterior and posterior predictive distributions for prediction of the nonsampled population units.

In a similar spirit, (Rao and Wu, 2010) use a Bayesian pseudo-empirical likelihood to create estimates for a population mean. They form the pseudo-empirical likelihood function as

 LPEL(p1,…,pn)=∏i∈Sp~wii

where the weights are scaled to sum to the sample size. They accompany this with a Dirichlet prior distribution over , and thus conjugacy yields a Dirichlet posterior distribution

 π(p1,…,pn|DS)=c(~w1+α1,…,~wn+αn)∏i∈Sp~wi+αi−1i,

where represents the normalizing constant. The posterior distribution of the population mean, , corresponds to the posterior distribution of . It is straightforward to use Monte Carlo techniques to sample from this posterior. The authors also note that the design weights can be replaced with calibration weights in order to include auxiliary variables.

### 4.3 Other survey weight adjustments

The system (11) is an example of the use of survey-weighted estimating functions for inference on a superpopulation parameter (Binder, 1983; Binder and Patak, 1994). Let be a general superpopulation parameter of interest, which is a function of the finite population values , that can be obtained as a solution to a “census” estimating equation

 Φ(y;θ)=N∑i=1ϕi(yi;θ)=0, (17)

where the are known functions of the data and the parameter, with mean zero under the superpopulation model. The term “census” is used to describe the estimating function (17), because (17) can only be calculated if all finite population values are observed, or if a census of the population is conducted.

The target parameter is defined implicitly as a solution to the census estimating equation (17). A point estimate for can be obtained by replacing in (17), by a design-based estimate . When the survey weights are inverses of the inclusion probabilities , a design-unbiased estimate of the census estimating function can be obtained with a Horvitz-Thompson version of the estimating function . The parameter can then be estimated using the observed sample by finding a root of the survey-weighted estimating function

 ∑i∈Swiϕi(yi;θ)=0. (18)

The use of an estimating function (rather than the score function) reduces the number of assumptions about the superpopulation that need to be made, as full distributional specification is not required, and instead only assumptions about the moment structure are needed. The choice of the specific estimating function may be motivated by a conceptual superpopulation model, or a finite population parameter of interest. Regardless of whether a superpopulation model is assumed, most finite population parameters of interest can be formulated as a solution to a census estimating equation, and well-known ‘model assisted’ estimators of the finite population parameters can be derived as solutions of survey-weighted estimating equations. For example, the estimating function leads to the population total, , and its estimator . If is known, the pair of estimating functions and , give the population total and its ratio estimator . If the covariate vector contains an intercept, the pair of estimating functions and leads to the finite population total and its GREG estimator , where .

Breidt et al. (2017) show that model-assisted estimators for a population total are of the general form

 ^ytot=∑j∈U^m(xj)+∑j∈Syj−^m(xj)πj,

where is a method or model for prediction that depends on the sample. A design based estimate of the variance can then be used for inference, assuming asymptotic normality, via

 ˆVar(^ytot)=∑j,l∈U(πjl−πjπl)yj−^m(xj)πjyl−^m(xl)πlIjIlπjl,

where denotes the pairwise inclusion probability of units and

. The authors review specific examples for linear mixed models as well as some nonlinear methods such as random forests and neural networks. Although model-assisted estimators use a model to improve the efficiency of the estimate, they are inherently still a design-based method.

Under general regularity conditions, the design-based limiting distribution of is a multivariate, mean zero normal distribution. The variance can be consistently estimated using a Taylor series linearization based ‘sandwich’ estimator , where and is a design-consistent estimate of the variance of (17) (Binder, 1983; Binder and Patak, 1994).

Godambe and Thompson (1986) showed an optimality property of the survey weighted estimating function (18) for inference on a finite population parameter defined implicitly as a solution to the census estimating equation (17), when the survey weights are reciprocals of the inclusion probabilities . Their result shows that the estimating function in (18) minimizes the distance to the census estimating function in (17) over the class of all unbiased estimating functions which are functions of the observed data. In addition, it was shown that this is equivalent to minimizing the quantity , which is the covariance matrix in the limiting normal distribution of . This implies that the asymptotic variance of the parameter estimate is minimized using the survey weighted estimating function (18), with weights proportional to the inverse of the inclusion probabilities.

## 5 Likelihood-based inference using the sample distribution

The pseudo-likelihood methods discussed in Section 4

require specification of a superpopulation model, which is a distribution which holds for all units in the finite population. However, validating the superpopulation model based on the observed sampled values is generally not possible, unless the sampling design is not informative, in which case, the distribution for the sampled units is the same as for the nonsampled units. Under an informative sampling design, the model for the population data does not hold for the sampling data. This can be seen by application of Bayes’ Theorem. Suppose the finite population values

are independent realizations from a population with density , conditional on a vector of covariates , and model parameters . Given knowledge of this superpopulation model, as well as the distribution of the inclusion variables, the distribution of the sampled values can be derived. Define the sample density, , (Pfeffermann et al., 1998a) as the density function of , given that has been sampled, that is,

 fs(yij∣xij,θ)=fp(yij∣xij,θ,Iij=1)=P(Iij=1∣yij,xij,θ)fp(yij∣xij,θ)P(Iij=1∣xij,θ). (19)

From (19), the sample distribution differs from the population distribution, unless , which occurs in ignorable sampling designs. Note that the inclusion probabilities, , may differ from the probabilities in (19), because the latter are not conditional on the design variables .

Equation (19) can be used for likelihood-based inference if the simplifying assumption that the sampled values are independent is made. While this is not true in general, asymptotic results given in Pfeffermann et al. (1998a) justify an assumption of independence of the data for certain sampling schemes when the overall sample size is large. However, direct use of (19) for finite population inference requires additional model specifications for the sample inclusion variables as well as . It was shown in Pfeffermann et al. (1998a) that , and that , so that a superpopulation model still needs to be specified for likelihood-based inference.

Ideally, one would like to specify a model for the sampled data, and to use this model fit to the sampled data to infer the nonsampled values, without specifying a superpopulation model. Pfeffermann and Sverchkov (1999) derived an important identity linking the moments of the sample and population-based moments, which allows for likelihood inference using the observed data, without explicit specification of a population model. They showed that

 P(Iij=1∣yij,xij)=Ep(πij∣yij,xij)=1/Es(wij∣yij,xij).

Similarly, it was shown that

 P(Iij=1∣xij)=Ep(πij∣xij)=1/Es(wij∣xij).

Combining these results with an application of Bayes’ Theorem, as was done to arrive at Equation (19), gives the distribution for the nonsampled units in the finite population

 fc(yij∣xij)≡fp(yij∣xij,Iij=0)=Es(wij−1∣yij,xij)fs(yij∣xij)Es(wij−1∣xij), (20)

where represents the density function of , given that has not been sampled. This result allows one to specify only a distribution for the sampled responses and a distribution for the sampled survey weights for inference on the nonsampled units, without any hypothetical distribution for the finite population. Importantly, this allows for identification of the finite population generating distribution through the sample-based likelihood. It also establishes the relationship between the moments of the sample distribution and the population distribution, allowing for prediction of nonsampled units.

In the small area estimation context, the goal is prediction of the small area means, , which requires estimation of in (1) for the nonsampled units in each area . Suppose there is an area-specific random effect, , common to all units in the population in small area , so that the population distribution can be written . Pfeffermann and Sverchkov (2007) used the result in (20), to show how small area means can be predicted using the observed unit level data under an informative survey design. Under the assumption that ,

 Ep(yij∣Ds,Iij=0)=Ec(yij∣Ds)=Ec(Ec(yij∣xij,vi)∣Ds).

Combining this with (20) allows for prediction of the small area means after specification of a model for the sampled responses, , and a model for the sampled weights, .

The model for the survey weights can be specified conditionally on the response variables to account for the informativeness of the survey design. Possible models for the sample weights considered in the literature include the linear model (Beaumont, 2008)

 wij=a0+a1yij+a2y2ij+xTijα+ϵij,

and the exponential model for the mean (Pfeffermann et al., 1998a; Kim, 2002; Beaumont, 2008)

 Es(wij∣xij,yij)=kiexp(ayij+xTijβ). (21)

Pfeffermann and Sverchkov (2007) considered the case of continuous response variables, , and modeled the sampled response data using the nested error regression model (2). The exponential model for the survey weights in (21) was used to model the informative survey design. Under this modeling framework, they showed that the best predictor of is approximately

 Ep(¯Yi∣Ds)=N−1i[(Ni−ni)^θi+ni{¯yi+(¯Xi−¯xi)Tβ}+(Ni−ni)bσ2e], (22)

where . The term in (22) is an additional term from the usual best predictor in the nested error regression model (2), which gives a bias correction proportional to the sampling error variance .

Novelo and Savitsky (2017) take a fully Bayesian approach by specifying a population level model for the response, , as well as a population level model for the inclusion probabilities, . Through a Bayes rule argument similar to (19

), they show that the implied joint distribution for the sampled units is

 fs(yij,πij|xij,θ)=fp(yij,πij|xij,θ,Iij=1)=πijfp(πij|yij,xij,θ)Eyij|xij,θ{E(πij|yij,xij,θ)}×fp(yij|xij,θ). (23)

This joint likelihood for the sample can then be used in a Bayesian model by placing a prior distribution on . Note that can be split into two vectors corresponding to and if desired. Consequently, the covariates for the response model and the inclusion probability model need not be the same.

Two computational concerns arise when using the likelihood as in (23). The first issue is that in general, the structure will not lead to conjugate full conditional distributions. To this effect, the authors recommend using the probabilistic programming language Stan (Carpenter et al. (2017)), which implements HMC for efficient mixing. The second concern is that the integral involved in the expectation term of (23) needs to be solved for every sampled observation at every iteration of the sampler. If the integral is intractable, it will need to be evaluated numerically, greatly increasing the necessary computation time. They show that if the lognormal distribution is used for the population inclusion probability model, then a closed form can be found for the expectation. Specifically, let , where represents a normal distribution with mean and variance , is a regression coefficient, and