 # An approximate Bayesian approach to regression estimation with many auxiliary variables

Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting the variables significantly associated with the study variable is necessary to achieve efficient estimation of the population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian inference, using the penalty function as the shrinkage prior for model selection. The proposed Bayesian approach yields not only efficient point estimates but also reasonable credible intervals for population means. Results from two limited simulation studies are presented to facilitate comparison with existing frequentist methods.


## 1 Introduction

Probability sampling is a scientific tool for obtaining a representative sample from the target population. To estimate a finite population total, the Horvitz-Thompson estimator computed from a probability sample is often used; it is design consistent, and the resulting inference is justified from the randomization perspective (Horvitz and Thompson, 1952). However, the Horvitz-Thompson estimator uses the first-order inclusion probabilities only and does not fully incorporate all available information from the finite population. To improve efficiency, regression estimation is often used to incorporate auxiliary information in survey sampling. Deville and Särndal (1992), Fuller (2002), Kim and Park (2010), and Breidt and Opsomer (2017) present comprehensive overviews of such variants of regression estimation in survey sampling.

The regression estimation approaches in survey sampling assume a model for the finite population, i.e., the superpopulation model, as

$$y_i = \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta} + e_i, \qquad (1)$$

where $E(e_i \mid \boldsymbol{x}_i) = 0$ and $V(e_i \mid \boldsymbol{x}_i) = \sigma^2$. The superpopulation model does not necessarily hold in the sample, as the sampling design can be informative in the sense of Pfeffermann and Sverchkov (1999). Under the regression superpopulation model in (1), Isaki and Fuller (1982) show that the asymptotic variance of the regression estimator achieves the lower bound of Godambe and Joshi (1965). Thus, the regression estimator is asymptotically efficient in the sense of achieving the minimum variance under the joint distribution of the sampling design and the superpopulation model in (1).

However, the above optimality of the regression estimator is untenable if the dimension of the auxiliary variables is large. When there are many auxiliary variables, the asymptotic bias of the regression estimator using all the auxiliary variables is no longer negligible and the resulting inference can be problematic. Simply put, including irrelevant auxiliary variables can introduce substantial variability in point estimation, but the uncertainty can fail to be fully accounted for by the standard linearization variance estimation, resulting in misleading inference.

To overcome this problem, Särndal and Lundström (2005) select a subset of the auxiliary variables for regression estimation. The classical model selection approach is based on step-wise methods. However, step-wise methods do not necessarily produce the best model if there are redundant predictors. Another approach is to employ regularized estimation of the regression coefficients. For example, McConville et al. (2017) propose a regularized regression estimation approach based on the LASSO penalty of Tibshirani (1996). However, there are two main problems in the regularization approach. First, the choice of the regularization parameter is somewhat unclear. Second, after model selection, valid frequentist inference is notoriously difficult to make.

In this paper, we propose a unified Bayesian framework to handle regularized regression estimation. We first present a Bayesian approach for regression estimation when the dimension $p$ of the auxiliary variables is fixed, using the approximate Bayesian approach considered in Wang et al. (2018). The proposed Bayesian method fully captures the uncertainty in parameter estimation for the regression estimator and has better coverage properties. Second, the proposed Bayesian method solves the problem of large $p$ in regularized regression estimation.

The penalty function for regularization is incorporated into the prior distribution and the uncertainty associated with model selection and parameter estimation is fully captured in the Bayesian machinery. Furthermore, the penalty parameter can be optimized by having its own prior distribution. The proposed method provides a unified approach to Bayesian inference with sparse regression estimation. It is a calibrated Bayesian (Little, 2012) in the sense that it is asymptotically equivalent to the frequentist design-based approach.

The paper is organized as follows. In Section 2, the basic setup is introduced. In Section 3, the approximate Bayesian inference using regression estimation is proposed under a fixed $p$. In Section 4, the proposed method is extended to handle sparse regression estimation using shrinkage prior distributions. In Section 5, the proposed method is extended to non-linear regression models. In Section 6, results from two limited simulation studies are presented. The proposed method is applied to a real data example in Section 7. Some concluding remarks are made in Section 8.

## 2 Basic setup

Consider a finite population of known size $N$. Associated with unit $i$ in the finite population, we consider the measurement $(\boldsymbol{x}_i, y_i)$, where $\boldsymbol{x}_i$ is the vector of auxiliary variables with dimension $p$ and $y_i$ is the study variable of interest. We are interested in estimating the finite population mean $\bar{Y} = N^{-1}\sum_{i=1}^{N} y_i$ from a sample selected by a probability sampling design. Let $A$ be the index set of the sample, so that we observe $\{(\boldsymbol{x}_i, y_i) : i \in A\}$. The Horvitz-Thompson estimator $\hat{\bar{Y}}_{\rm HT} = N^{-1}\sum_{i\in A}\pi_i^{-1} y_i$ is design unbiased, but it is not necessarily efficient.

If the finite population mean $\bar{\boldsymbol{X}} = N^{-1}\sum_{i=1}^{N}\boldsymbol{x}_i$ is known, then we can improve the efficiency of $\hat{\bar{Y}}_{\rm HT}$ by using the following regression estimator:

$$\hat{\bar{Y}}_{\rm reg} = \bar{\boldsymbol{X}}^{\rm t}\hat{\boldsymbol{\beta}} + \frac{1}{N}\sum_{i\in A}\frac{1}{\pi_i}\left(y_i - \boldsymbol{x}_i^{\rm t}\hat{\boldsymbol{\beta}}\right), \qquad (2)$$

where $\pi_i$ is the first-order inclusion probability of unit $i$ and $\hat{\boldsymbol{\beta}}$ is an estimator of $\boldsymbol{\beta}$ in (1). Typically, $\hat{\boldsymbol{\beta}}$ is obtained by minimizing the weighted quadratic loss

$$Q(\boldsymbol{\beta}) = \sum_{i\in A}\pi_i^{-1}\left(y_i - \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta}\right)^2, \qquad (3)$$

motivated from model (1).
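As a concrete illustration, the weighted least-squares estimator from (3) and the regression estimator (2) take only a few lines of numpy. This is a hedged sketch on simulated data; the population size, coefficients, and inclusion probabilities below are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite population: N units, p auxiliary variables
N, p = 10_000, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N)
Xbar = X.mean(axis=0)          # known population mean of x

# Poisson sampling with known first-order inclusion probabilities pi_i
pi = np.where(X[:, 0] > 0, 0.05, 0.03)
A = rng.random(N) < pi
xs, ys, ps = X[A], y[A], pi[A]

# beta-hat minimizing the weighted quadratic loss Q(beta) in (3)
w = 1.0 / ps
beta_hat = np.linalg.solve((xs * w[:, None]).T @ xs,
                           (xs * w[:, None]).T @ ys)

# Regression estimator (2) for the population mean of y
Y_reg = Xbar @ beta_hat + np.sum(w * (ys - xs @ beta_hat)) / N
print(beta_hat, Y_reg)
```

The design weights $1/\pi_i$ enter both the loss and the bias-correction term, so the estimator remains design consistent even when the working model (1) is misspecified.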

To derive the asymptotic properties of $\hat{\bar{Y}}_{\rm reg}$, we may use

$$\hat{\bar{Y}}_{\rm reg} - \bar{Y} = \hat{\bar{Y}}_{\rm HT} - \bar{Y} + (\bar{\boldsymbol{X}} - \hat{\bar{\boldsymbol{X}}}_{\rm HT})^{\rm t}\hat{\boldsymbol{\beta}} = \hat{\bar{Y}}_{\rm HT} - \bar{Y} + (\bar{\boldsymbol{X}} - \hat{\bar{\boldsymbol{X}}}_{\rm HT})^{\rm t}\boldsymbol{\beta}^{*} + R_n, \qquad (4)$$

where $\hat{\bar{\boldsymbol{X}}}_{\rm HT} = N^{-1}\sum_{i\in A}\pi_i^{-1}\boldsymbol{x}_i$ and

$$R_n = (\bar{\boldsymbol{X}} - \hat{\bar{\boldsymbol{X}}}_{\rm HT})^{\rm t}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{*})$$

for any $\boldsymbol{\beta}^{*}$. If we choose $\boldsymbol{\beta}^{*}$ to be the probability limit of $\hat{\boldsymbol{\beta}}$ and the dimension $p$ is fixed in the asymptotic setup, then we can obtain $R_n = o_p(n^{-1/2})$ and safely use the main terms of (4) to describe the asymptotic behavior of $\hat{\bar{Y}}_{\rm reg}$. To emphasize the dependence of the regression estimator on $\hat{\boldsymbol{\beta}}$, we can write $\hat{\bar{Y}}_{\rm reg} = \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}})$. Roughly speaking, we can obtain

$$\sqrt{n}\left\{\hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}) - \hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}^{*})\right\} = O_p(n^{-1/2}p), \qquad (5)$$

and, if $p = o(n^{1/2})$, then we can safely ignore the effect of estimating $\boldsymbol{\beta}$ in the regression estimator. See Appendix A for a sketch proof of (5).

If, on the other hand, the dimension $p$ is large, then we cannot ignore the effect of estimating $\boldsymbol{\beta}$. In this case, we can consider using some variable selection idea to reduce the dimension of $\boldsymbol{x}_i$. For variable selection, we may employ techniques of regularized estimation of regression coefficients. The regularization method can be described as finding

$$\hat{\boldsymbol{\beta}}^{(R)} = \operatorname*{arg\,min}_{\boldsymbol{\beta}}\left\{Q(\boldsymbol{\beta}) + p_{\lambda}(\boldsymbol{\beta})\right\}, \qquad (6)$$

where $Q(\boldsymbol{\beta})$ is defined in (3) and $p_{\lambda}(\boldsymbol{\beta})$ is a penalty function with regularization parameter $\lambda$. Some popular penalty functions are presented in Table 1. Once the solution to (6) is obtained, the regularized regression estimator is given by

$$\hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}^{(R)}) = \bar{\boldsymbol{X}}^{\rm t}\hat{\boldsymbol{\beta}}^{(R)} + \frac{1}{N}\sum_{i\in A}\frac{1}{\pi_i}\left(y_i - \boldsymbol{x}_i^{\rm t}\hat{\boldsymbol{\beta}}^{(R)}\right). \qquad (7)$$

Statistical inference with the regularized regression estimator in (7) has not been fully investigated in the literature. For example, Chen et al. (2018) consider the regularized regression estimator using the adaptive LASSO of Zou (2006), but they assume the sampling design is non-informative, and the uncertainty in model selection is not fully incorporated in their inference. Generally speaking, making frequentist inference after model selection is difficult. The approximate Bayesian method we propose in this paper captures the full uncertainty within the Bayesian framework.

## 3 Approximate Bayesian survey regression estimation

Developing a design-based Bayesian inference under complex sampling is a challenging problem in statistics. Wang et al. (2018) propose the so-called approximate Bayesian method for design-based inference using the asymptotic normality of a design-consistent estimator. Specifically, for a parameter $\theta$ with a prior distribution $\pi(\theta)$, if one can find a design-consistent estimator $\hat{\theta}$ of $\theta$, then the approximate posterior distribution of $\theta$ is given by

$$p(\theta \mid \hat{\theta}) = \frac{f(\hat{\theta} \mid \theta)\,\pi(\theta)}{\int f(\hat{\theta} \mid \theta)\,\pi(\theta)\,d\theta}, \qquad (8)$$

where $f(\hat{\theta} \mid \theta)$ is the sampling distribution of $\hat{\theta}$, which is often approximated by a normal distribution.

Drawing on this idea, one can develop an approximate Bayesian approach to capture the full uncertainty in the regression estimator. Let

$$\hat{\boldsymbol{\beta}} = \left(\sum_{i\in A}\pi_i^{-1}\boldsymbol{x}_i\boldsymbol{x}_i^{\rm t}\right)^{-1}\sum_{i\in A}\pi_i^{-1}\boldsymbol{x}_i y_i$$

be the design-consistent estimator of $\boldsymbol{\beta}$, and let $\hat{\boldsymbol{V}}_{\beta}$ be the corresponding asymptotic variance-covariance matrix of $\hat{\boldsymbol{\beta}}$, given by

$$\hat{\boldsymbol{V}}_{\beta} = \left(\sum_{i\in A}\pi_i^{-1}\boldsymbol{x}_i\boldsymbol{x}_i^{\rm t}\right)^{-1}\left(\sum_{i\in A}\sum_{j\in A}\frac{\Delta_{ij}}{\pi_{ij}}\,\frac{\hat{e}_i\boldsymbol{x}_i}{\pi_i}\,\frac{\hat{e}_j\boldsymbol{x}_j^{\rm t}}{\pi_j}\right)\left(\sum_{i\in A}\pi_i^{-1}\boldsymbol{x}_i\boldsymbol{x}_i^{\rm t}\right)^{-1}, \qquad (9)$$

where $\hat{e}_i = y_i - \boldsymbol{x}_i^{\rm t}\hat{\boldsymbol{\beta}}$, $\Delta_{ij} = \pi_{ij} - \pi_i\pi_j$, and $\pi_{ij}$ is the joint inclusion probability of units $i$ and $j$. Under some regularity conditions, as discussed in Chapter 2 of Fuller (2009), we can establish

$$\hat{\boldsymbol{V}}_{\beta}^{-1/2}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \mid \boldsymbol{\beta} \;\xrightarrow{\;\mathcal{L}\;}\; N(0, I) \qquad (10)$$

as $n \to \infty$, where

$$\boldsymbol{\beta} = \left(\sum_{i=1}^{N}\boldsymbol{x}_i\boldsymbol{x}_i^{\rm t}\right)^{-1}\sum_{i=1}^{N}\boldsymbol{x}_i y_i.$$

Thus, using (8) and (10), we can obtain the approximate posterior distribution of $\boldsymbol{\beta}$ as

$$p(\boldsymbol{\beta} \mid \hat{\boldsymbol{\beta}}) = \frac{\phi_p(\hat{\boldsymbol{\beta}}; \boldsymbol{\beta}, \hat{\boldsymbol{V}}_{\beta})\,\pi(\boldsymbol{\beta})}{\int \phi_p(\hat{\boldsymbol{\beta}}; \boldsymbol{\beta}, \hat{\boldsymbol{V}}_{\beta})\,\pi(\boldsymbol{\beta})\,d\boldsymbol{\beta}}, \qquad (11)$$

where $\phi_p(\cdot\,; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the $p$-dimensional multivariate normal density with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, and $\pi(\boldsymbol{\beta})$ is a prior distribution for $\boldsymbol{\beta}$.

Now, we wish to find the posterior distribution of $\bar{Y}$ for a given $\boldsymbol{\beta}$. First, define

$$\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}) = \bar{\boldsymbol{X}}^{\rm t}\boldsymbol{\beta} + \frac{1}{N}\sum_{i\in A}\frac{1}{\pi_i}\left(y_i - \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta}\right).$$

Note that $\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta})$ is a design-unbiased estimator of $\bar{Y}$, regardless of $\boldsymbol{\beta}$. Under some regularity conditions, we can show that $\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta})$ asymptotically follows a normal distribution. Thus, we obtain

$$\frac{\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}) - \bar{Y}}{\sqrt{\hat{V}_e(\boldsymbol{\beta})}}\;\Bigg|\;\bar{Y}, \boldsymbol{\beta} \;\xrightarrow{\;\mathcal{L}\;}\; N(0, 1), \qquad (12)$$

where

$$\hat{V}_e(\boldsymbol{\beta}) = \frac{1}{N^2}\sum_{i\in A}\sum_{j\in A}\frac{\Delta_{ij}}{\pi_{ij}}\,\frac{1}{\pi_i}\,\frac{1}{\pi_j}\left(y_i - \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta}\right)\left(y_j - \boldsymbol{x}_j^{\rm t}\boldsymbol{\beta}\right) \qquad (13)$$

is a design-consistent variance estimator of $\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta})$ for given $\boldsymbol{\beta}$. We then use $\phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))$ as the density for the approximate sampling distribution in (12), where $\phi(\cdot\,; \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. Thus, the approximate posterior distribution of $\bar{Y}$ given $\boldsymbol{\beta}$ can be defined as

$$p(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}), \boldsymbol{\beta}) \propto \phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))\,\pi(\bar{Y} \mid \boldsymbol{\beta}), \qquad (14)$$

where $\pi(\bar{Y} \mid \boldsymbol{\beta})$ is a conditional prior distribution of $\bar{Y}$ given $\boldsymbol{\beta}$. Without extra assumptions, we can use a flat prior distribution for $\pi(\bar{Y} \mid \boldsymbol{\beta})$. See Remark 1 below.

Therefore, combining (11) and (14), the approximate posterior distribution of $\bar{Y}$ can be obtained as

$$p(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\boldsymbol{\beta}}) = \frac{\int \phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))\,\phi_p(\hat{\boldsymbol{\beta}}; \boldsymbol{\beta}, \hat{\boldsymbol{V}}_{\beta})\,\pi(\boldsymbol{\beta})\,\pi(\bar{Y} \mid \boldsymbol{\beta})\,d\boldsymbol{\beta}}{\iint \phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))\,\phi_p(\hat{\boldsymbol{\beta}}; \boldsymbol{\beta}, \hat{\boldsymbol{V}}_{\beta})\,\pi(\boldsymbol{\beta})\,\pi(\bar{Y} \mid \boldsymbol{\beta})\,d\boldsymbol{\beta}\,d\bar{Y}}. \qquad (15)$$

Generating posterior samples from (15) can be easily carried out via the following two steps:

• Generate a posterior sample $\boldsymbol{\beta}^{*}$ of $\boldsymbol{\beta}$ from (11).

• Generate a posterior sample of $\bar{Y}$ from the conditional posterior (14) given $\boldsymbol{\beta} = \boldsymbol{\beta}^{*}$.

Based on the approximate posterior samples of $\bar{Y}$, we can compute the posterior mean as a point estimator, as well as credible intervals for uncertainty quantification for $\bar{Y}$ that include the variability in estimating $\boldsymbol{\beta}$.
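The two-step sampler can be sketched end-to-end in numpy. The sketch below assumes Poisson sampling, under which the double sums in (9) and (13) reduce to single sums over the sample (the joint-inclusion term $\Delta_{ij}$ vanishes for $i \neq j$); all sizes and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative population and Poisson sample
N, p, n_draws = 10_000, 3, 2000
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N)
Xbar = X.mean(axis=0)
pi = np.full(N, 0.04)
A = rng.random(N) < pi
xs, ys, ps = X[A], y[A], pi[A]
w = 1.0 / ps

# Design-consistent beta-hat and its sandwich variance (9);
# under Poisson sampling the middle term is a single weighted sum
M = np.linalg.inv((xs * w[:, None]).T @ xs)
beta_hat = M @ ((xs * w[:, None]).T @ ys)
e = ys - xs @ beta_hat
mid = (xs * ((1 - ps) * (e / ps) ** 2)[:, None]).T @ xs
V_beta = M @ mid @ M

draws = np.empty(n_draws)
for k in range(n_draws):
    # Step 1: draw beta from the approximate posterior (11), flat prior
    b = rng.multivariate_normal(beta_hat, V_beta)
    # Step 2: draw Ybar from the conditional posterior (14)
    r = ys - xs @ b
    Y_reg = Xbar @ b + np.sum(r / ps) / N
    V_e = np.sum((1 - ps) * (r / ps) ** 2) / N**2
    draws[k] = rng.normal(Y_reg, np.sqrt(V_e))

print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```

The posterior spread combines the design-based variance (13) with the extra variability from drawing $\boldsymbol{\beta}$ in step 1, which is precisely the uncertainty that plug-in linearization can miss.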

###### Remark 1.

If an intercept term is included in $\boldsymbol{x}_i$, that is, $\boldsymbol{x}_i = (1, \boldsymbol{x}_{1i}^{\rm t})^{\rm t}$ for some $\boldsymbol{x}_{1i}$, then we have $\bar{Y} = \bar{\boldsymbol{X}}^{\rm t}\boldsymbol{\beta} + N^{-1}\sum_{i=1}^{N} e_i \approx \bar{\boldsymbol{X}}^{\rm t}\boldsymbol{\beta}$, and the parameter $\bar{Y}$ is completely determined from $\boldsymbol{\beta}$. In this case, the posterior distribution in (15) reduces to

$$p(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\boldsymbol{\beta}}) = \int p(\boldsymbol{\beta} \mid \hat{\boldsymbol{\beta}})\,\pi(\bar{Y} \mid \boldsymbol{\beta})\,d\boldsymbol{\beta},$$

where $p(\boldsymbol{\beta} \mid \hat{\boldsymbol{\beta}})$ is defined in (11) and $\pi(\bar{Y} \mid \boldsymbol{\beta})$ is a degenerate distribution at $\bar{Y} = \bar{\boldsymbol{X}}^{\rm t}\boldsymbol{\beta}$.

The following theorem presents an asymptotic property of the proposed approximate Bayesian method.

###### Theorem 1.

Under the regularity conditions described in the Appendix, conditional on the full sample data,

$$\sup_{\bar{Y} \in \Theta_Y}\left|\,p(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\boldsymbol{\beta}}) - \phi(\bar{Y}; \hat{\bar{Y}}_{\rm reg}, \hat{V}_e)\,\right| \to 0, \qquad (16)$$

as $n \to \infty$ in probability, where $\Theta_Y$ is the feasible set for $\bar{Y}$ and $p(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\boldsymbol{\beta}})$ is given in (15).

Theorem 1 is a special case of the Bernstein-von Mises theorem (van der Vaart, 2000, Section 10.2) in survey regression estimation, and its proof is given in the Appendix. According to Theorem 1, the credible interval for $\bar{Y}$ constructed from the approximate posterior distribution (15) is asymptotically equivalent to the frequentist confidence interval based on the asymptotic normality of the common survey regression estimator. Therefore, the frequentist survey regression estimator can be formally interpreted through Bayesian inference. The consistency of the Bayesian point estimator (e.g., the posterior mean) follows directly from (16), since $\hat{V}_e \to 0$ in probability as $n \to \infty$.

## 4 Approximate Bayesian method with shrinkage priors

We now consider the case where there are many auxiliary variables available for regression estimation. When $p$ is large, it is important to select suitable auxiliary variables associated with the response variable, to prevent irrelevant covariates from rendering the resulting estimator inefficient. To this end, we assume that the regression model in (1) contains an intercept term. That is,

$$E(y_i \mid \boldsymbol{x}_i) = \beta_0 + \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta}_1, \qquad (17)$$

where $\beta_0$ is an intercept term.

To deal with the problem in the Bayesian way, we may define the approximate posterior distribution of $\bar{Y}$ given both $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$, similarly to (15). That is, we use the asymptotic distribution of the estimators $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$ of $\beta_0$ and $\boldsymbol{\beta}_1$, respectively, and assign a shrinkage prior for $\boldsymbol{\beta}_1$ and a flat prior for $\beta_0$. Let $\pi_{\lambda}(\boldsymbol{\beta}_1)$ be the shrinkage prior for $\boldsymbol{\beta}_1$ with a structural parameter $\lambda$, which might be multivariate.

Among the several choices of shrinkage priors, we specifically consider two priors for $\boldsymbol{\beta}_1$: the Laplace prior (Park and Casella, 2008) and the horseshoe prior (Carvalho et al., 2009, 2010). The Laplace prior is given by $\pi_{\lambda}(\boldsymbol{\beta}_1) = \prod_{k=1}^{p}(\lambda/2)\exp(-\lambda|\beta_k|)$, which is related to Lasso regression (Tibshirani, 1996), so that the proposed approximate Bayesian method can be seen as a Bayesian version of the survey regression estimator with the Lasso (McConville et al., 2017). The horseshoe prior is a more advanced shrinkage prior of the form

$$\pi_{\lambda}(\boldsymbol{\beta}_1) = \prod_{k=1}^{p}\int_{0}^{\infty}\phi(\beta_k; 0, \lambda^2 u_k^2)\,\frac{2}{\pi(1+u_k^2)}\,du_k, \qquad (18)$$

where $\phi(\cdot\,; \mu, \sigma^2)$ denotes the normal density function with mean $\mu$ and variance $\sigma^2$. It is known that the horseshoe prior shrinks the zero elements of $\boldsymbol{\beta}_1$ more severely than the Laplace prior while allowing strong signals to remain large (Carvalho et al., 2009).
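To make this comparison tangible, the mixture integral in (18) can be evaluated numerically for a single coefficient. The sketch below (with the illustrative choice $\lambda = 1$) substitutes $u = \tan\theta$, which turns the half-Cauchy mixing density into a uniform density on $(0, \pi/2)$, and compares the tails of the two priors:

```python
import math
import numpy as np

def horseshoe_pdf(beta, lam=1.0, m=50_000):
    """Marginal horseshoe density (18) for one coefficient, evaluated
    by the midpoint rule after the substitution u = tan(theta)."""
    dt = (math.pi / 2) / m
    theta = (np.arange(m) + 0.5) * dt
    s2 = (lam * np.tan(theta)) ** 2
    normal = np.exp(-beta**2 / (2 * s2)) / np.sqrt(2 * math.pi * s2)
    return float(np.sum(normal) * (2 / math.pi) * dt)

def laplace_pdf(beta, lam=1.0):
    return 0.5 * lam * math.exp(-lam * abs(beta))

# A large coefficient is more plausible under the horseshoe prior,
# so strong signals are shrunk less
print(horseshoe_pdf(4.0), laplace_pdf(4.0))
```

At the origin the behavior reverses: the horseshoe density diverges as $\beta_k \to 0$, which is what produces the aggressive shrinkage of null coefficients.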

Let $\hat{\boldsymbol{V}}_{\beta}$ be the asymptotic variance-covariance matrix of $(\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1^{\rm t})^{\rm t}$. Then, under the flat prior for $\beta_0$, the approximate posterior distribution of $(\beta_0, \boldsymbol{\beta}_1)$ can be defined as

$$p_{\lambda}(\beta_0, \boldsymbol{\beta}_1 \mid \hat{\beta}_0, \hat{\boldsymbol{\beta}}_1) = \frac{\phi((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1^{\rm t})^{\rm t}; (\beta_0, \boldsymbol{\beta}_1^{\rm t})^{\rm t}, \hat{\boldsymbol{V}}_{\beta})\,\pi_{\lambda}(\boldsymbol{\beta}_1)}{\iint \phi((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1^{\rm t})^{\rm t}; (\beta_0, \boldsymbol{\beta}_1^{\rm t})^{\rm t}, \hat{\boldsymbol{V}}_{\beta})\,\pi_{\lambda}(\boldsymbol{\beta}_1)\,d\beta_0\,d\boldsymbol{\beta}_1}. \qquad (19)$$

The marginal posterior of $\boldsymbol{\beta}_1$ is given by

$$p_{\lambda}(\boldsymbol{\beta}_1 \mid \hat{\boldsymbol{\beta}}_1) = \frac{\phi(\hat{\boldsymbol{\beta}}_1; \boldsymbol{\beta}_1, \hat{\boldsymbol{V}}_{\beta 11})\,\pi_{\lambda}(\boldsymbol{\beta}_1)}{\int \phi(\hat{\boldsymbol{\beta}}_1; \boldsymbol{\beta}_1, \hat{\boldsymbol{V}}_{\beta 11})\,\pi_{\lambda}(\boldsymbol{\beta}_1)\,d\boldsymbol{\beta}_1}, \qquad (20)$$

where $\hat{\boldsymbol{V}}_{\beta 11}$ is the asymptotic variance-covariance matrix of $\hat{\boldsymbol{\beta}}_1$, a sub-matrix of $\hat{\boldsymbol{V}}_{\beta}$. Under both shrinkage priors, we can derive efficient algorithms for posterior computation of $\boldsymbol{\beta}_1$ as well as $\lambda$. The details are provided in the Appendix. On the other hand, the conditional posterior of $\beta_0$ given $\boldsymbol{\beta}_1$ is the normal distribution with mean $\hat{\beta}_0 + \hat{\boldsymbol{V}}_{\beta 01}\hat{\boldsymbol{V}}_{\beta 11}^{-1}(\boldsymbol{\beta}_1 - \hat{\boldsymbol{\beta}}_1)$ and variance $\hat{V}_{\beta 00} - \hat{\boldsymbol{V}}_{\beta 01}\hat{\boldsymbol{V}}_{\beta 11}^{-1}\hat{\boldsymbol{V}}_{\beta 10}$, where

$$\hat{\boldsymbol{V}}_{\beta} = \begin{pmatrix} \hat{V}_{\beta 00} & \hat{\boldsymbol{V}}_{\beta 01} \\ \hat{\boldsymbol{V}}_{\beta 10} & \hat{\boldsymbol{V}}_{\beta 11} \end{pmatrix}.$$

Thus, we can generate posterior samples of $\beta_0$ and $\boldsymbol{\beta}_1$ from (19) via Markov chain Monte Carlo, in which we iteratively sample from the marginal posterior distribution of $\boldsymbol{\beta}_1$ and the conditional posterior distribution of $\beta_0$ given $\boldsymbol{\beta}_1$. Once $(\beta_0, \boldsymbol{\beta}_1)$ is sampled from (19), we can use (14) to obtain the posterior distribution of $\bar{Y}$ for the given $(\beta_0, \boldsymbol{\beta}_1)$.
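A minimal sketch of the $\beta_0$ step: given a current draw of $\boldsymbol{\beta}_1$, $\beta_0$ is drawn from the partitioned-Gaussian conditional. The point estimates and covariance matrix below are made-up numbers for illustration, and a plain normal stands in for the shrinkage-prior update of $\boldsymbol{\beta}_1$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed (illustrative) estimates and joint asymptotic covariance
beta0_hat = 1.2
beta1_hat = np.array([0.8, -0.3])
V = np.array([[0.050, 0.010, 0.005],
              [0.010, 0.040, 0.008],
              [0.005, 0.008, 0.030]])
V00, V01, V11 = V[0, 0], V[0, 1:], V[1:, 1:]

def draw_beta0(beta1):
    """Conditional posterior of beta0 given beta1: normal with the
    standard partitioned-Gaussian mean and variance."""
    mean = beta0_hat + V01 @ np.linalg.solve(V11, beta1 - beta1_hat)
    var = V00 - V01 @ np.linalg.solve(V11, V01)
    return rng.normal(mean, np.sqrt(var))

# One Gibbs-style pass: a stand-in draw of beta1, then beta0 from
# its exact conditional
beta1 = rng.multivariate_normal(beta1_hat, V11)
beta0 = draw_beta0(beta1)
print(beta0, beta1)
```

Because the conditional variance of $\beta_0$ is $\hat{V}_{\beta 00}$ minus a positive semi-definite correction, conditioning on $\boldsymbol{\beta}_1$ never inflates the intercept's uncertainty.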

Therefore, the approximate posterior distribution of $\bar{Y}$ can be obtained as

$$p_{\lambda}(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\boldsymbol{\beta}}) = \frac{\int \phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))\,\phi((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1^{\rm t})^{\rm t}; (\beta_0, \boldsymbol{\beta}_1^{\rm t})^{\rm t}, \hat{\boldsymbol{V}}_{\beta})\,\pi_{\lambda}(\boldsymbol{\beta}_1)\,\pi(\bar{Y} \mid \boldsymbol{\beta})\,d\boldsymbol{\beta}}{\iint \phi(\hat{\bar{Y}}_{\rm reg}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_e(\boldsymbol{\beta}))\,\phi((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1^{\rm t})^{\rm t}; (\beta_0, \boldsymbol{\beta}_1^{\rm t})^{\rm t}, \hat{\boldsymbol{V}}_{\beta})\,\pi_{\lambda}(\boldsymbol{\beta}_1)\,\pi(\bar{Y} \mid \boldsymbol{\beta})\,d\boldsymbol{\beta}\,d\bar{Y}}, \qquad (21)$$

where $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1^{\rm t})^{\rm t}$. Generating posterior samples from (21) can be easily carried out via the following two steps:

• For a given $\lambda$, generate a posterior sample $(\beta_0^{*}, \boldsymbol{\beta}_1^{*})$ of $(\beta_0, \boldsymbol{\beta}_1)$ from (19).

• Generate a posterior sample of $\bar{Y}$ from the conditional posterior (14) given $(\beta_0, \boldsymbol{\beta}_1) = (\beta_0^{*}, \boldsymbol{\beta}_1^{*})$.

###### Remark 2.

Let $\hat{\beta}_0^{(R)}$ and $\hat{\boldsymbol{\beta}}_1^{(R)}$ be the estimators of $\beta_0$ and $\boldsymbol{\beta}_1$ defined as

$$(\hat{\beta}_0^{(R)}, \hat{\boldsymbol{\beta}}_1^{(R)}) = \operatorname*{arg\,min}_{\beta_0, \boldsymbol{\beta}_1}\left\{\sum_{i\in A}\frac{1}{\pi_i}\left(y_i - \beta_0 - \boldsymbol{x}_i^{\rm t}\boldsymbol{\beta}_1\right)^2 + P_{\lambda}(\boldsymbol{\beta}_1)\right\}, \qquad (22)$$

where $P_{\lambda}(\boldsymbol{\beta}_1)$ is the penalty (regularization) term for $\boldsymbol{\beta}_1$ induced by the prior $\pi_{\lambda}(\boldsymbol{\beta}_1)$. For example, the Laplace prior for $\boldsymbol{\beta}_1$ leads to the penalty term $P_{\lambda}(\boldsymbol{\beta}_1) = \lambda\sum_{k=1}^{p}|\beta_k|$, in which case (22) corresponds to the regularized estimator used in McConville et al. (2017). Since the exponential of the (scaled) negative objective in (22) is close to the approximated likelihood used in the approximate Bayesian method when $n$ is large, the mode of the approximate posterior of $(\beta_0, \boldsymbol{\beta}_1)$ would be close to the frequentist estimator (22) as well.

###### Remark 3.

In the frequentist approach, $\lambda$ is often called the tuning parameter, and it can be selected via a data-dependent procedure such as cross-validation, as used in McConville et al. (2017). On the other hand, in the Bayesian approach, we assign a prior distribution to the hyperparameter $\lambda$ and consider integration with respect to the posterior distribution of $\lambda$, which means that the uncertainty of the hyperparameter estimation can be taken into account. Specifically, we assign a gamma prior for $\lambda^2$ under the Laplace prior and a half-Cauchy prior for $\lambda$ under the horseshoe prior (18). Both lead to familiar forms of the full conditional posterior distributions of $\lambda$ or $\lambda^2$. The details are given in the Appendix.

As in Section 3, we obtain the following asymptotic properties of the proposed approximate Bayesian method.

###### Theorem 2.

Under the regularity conditions described in the Appendix, conditional on the full sample data,

$$\sup_{\bar{Y} \in \Theta_Y}\left|\,p_{\lambda}(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\beta}_0, \hat{\boldsymbol{\beta}}_1) - \phi(\bar{Y}; \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}^{(R)}), \hat{V}_e(\hat{\beta}_0^{(R)}, \hat{\boldsymbol{\beta}}_1^{(R)}))\,\right| \to 0, \qquad (23)$$

as $n \to \infty$ in probability, where $\Theta_Y$ is the feasible set for $\bar{Y}$ and $p_{\lambda}(\bar{Y} \mid \hat{\bar{Y}}_{\rm reg}(\hat{\boldsymbol{\beta}}), \hat{\beta}_0, \hat{\boldsymbol{\beta}}_1)$ is given in (21).

Theorem 2 ensures that the proposed approximate Bayesian method is asymptotically equivalent to the frequentist version in which $\boldsymbol{\beta}$ is estimated by the regularized method with the penalty corresponding to the shrinkage prior used in the Bayesian method. Moreover, the proposed Bayesian method can be extended to general non-linear regression, as demonstrated in the next section.

## 5 An Extension to non-linear models

The proposed Bayesian methods can be readily extended to work with non-linear regression. Some extensions of the regression estimator to nonlinear models are also considered in Wu and Sitter (2001), Breidt et al. (2005), and Montanari and Ranalli (2005).

We consider a general working model for $y_i$ given $\boldsymbol{x}_i$ with mean $E(y_i \mid \boldsymbol{x}_i) = m(\boldsymbol{x}_i; \boldsymbol{\beta})$ and variance $V(y_i \mid \boldsymbol{x}_i) = v(\boldsymbol{x}_i; \boldsymbol{\beta})$ for some known functions $m(\cdot)$ and $v(\cdot)$. The model-assisted regression estimator for $\bar{Y}$ with known $\boldsymbol{\beta}$ is then

$$\hat{\bar{Y}}_{{\rm reg},m}(\boldsymbol{\beta}) = \frac{1}{N}\left\{\sum_{i=1}^{N} m(\boldsymbol{x}_i; \boldsymbol{\beta}) + \sum_{i\in A}\frac{1}{\pi_i}\left(y_i - m(\boldsymbol{x}_i; \boldsymbol{\beta})\right)\right\},$$

and its design-consistent variance estimator is obtained by

$$\hat{V}_{e,m}(\boldsymbol{\beta}) = \frac{1}{N^2}\sum_{i\in A}\sum_{j\in A}\frac{\Delta_{ij}}{\pi_{ij}}\,\frac{1}{\pi_i}\,\frac{1}{\pi_j}\left\{y_i - m(\boldsymbol{x}_i; \boldsymbol{\beta})\right\}\left\{y_j - m(\boldsymbol{x}_j; \boldsymbol{\beta})\right\},$$

which gives the approximate conditional posterior distribution of $\bar{Y}$ given $\boldsymbol{\beta}$. That is, similarly to (14), we can obtain

$$p(\bar{Y} \mid \hat{\bar{Y}}_{{\rm reg},m}(\boldsymbol{\beta}), \boldsymbol{\beta}) \propto \phi(\hat{\bar{Y}}_{{\rm reg},m}(\boldsymbol{\beta}); \bar{Y}, \hat{V}_{e,m}(\boldsymbol{\beta}))\,\pi(\bar{Y} \mid \boldsymbol{\beta}). \qquad (24)$$

To generate the posterior values of $\boldsymbol{\beta}$, we first find a design-consistent estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$. Note that a consistent estimator can be obtained by solving

$$\hat{U}(\boldsymbol{\beta}) \equiv \sum_{i\in A}\pi_i^{-1}\left\{y_i - m(\boldsymbol{x}_i; \boldsymbol{\beta})\right\} h(\boldsymbol{x}_i; \boldsymbol{\beta}) = 0,$$

where $h(\boldsymbol{x}_i; \boldsymbol{\beta}) = \dot{m}(\boldsymbol{x}_i; \boldsymbol{\beta})/v(\boldsymbol{x}_i; \boldsymbol{\beta})$. For example, for binary $y_i$, we may use a logistic model with $m(\boldsymbol{x}_i; \boldsymbol{\beta}) = \{1 + \exp(-\boldsymbol{x}_i^{\rm t}\boldsymbol{\beta})\}^{-1}$ and $v(\boldsymbol{x}_i; \boldsymbol{\beta}) = m(\boldsymbol{x}_i; \boldsymbol{\beta})\{1 - m(\boldsymbol{x}_i; \boldsymbol{\beta})\}$, which leads to $h(\boldsymbol{x}_i; \boldsymbol{\beta}) = \boldsymbol{x}_i$.

Under some regularity conditions, we can establish the asymptotic normality of $\hat{\boldsymbol{\beta}}$. That is,

$$\hat{\boldsymbol{V}}_{\beta}^{-1/2}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \mid \boldsymbol{\beta} \;\xrightarrow{\;\mathcal{L}\;}\; N(0, I),$$

where

$$\hat{\boldsymbol{V}}_{\beta} = \left\{\sum_{i\in A}\frac{1}{\pi_i}\hat{\boldsymbol{h}}_i\,\dot{m}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^{\rm t}\right\}^{-1}\left(\sum_{i\in A}\sum_{j\in A}\frac{\Delta_{ij}}{\pi_{ij}}\,\frac{\hat{e}_i\hat{\boldsymbol{h}}_i}{\pi_i}\,\frac{\hat{e}_j\hat{\boldsymbol{h}}_j^{\rm t}}{\pi_j}\right)\left\{\sum_{i\in A}\frac{1}{\pi_i}\hat{\boldsymbol{h}}_i\,\dot{m}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^{\rm t}\right\}^{-1},$$

with $\hat{e}_i = y_i - m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})$, $\hat{\boldsymbol{h}}_i = h(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})$, and $\dot{m}(\boldsymbol{x}_i; \boldsymbol{\beta}) = \partial m(\boldsymbol{x}_i; \boldsymbol{\beta})/\partial\boldsymbol{\beta}$. Note that $\hat{\boldsymbol{h}}_i = \boldsymbol{x}_i$ under a logistic model.

Thus, the posterior distribution of $\boldsymbol{\beta}$ given $\hat{\boldsymbol{\beta}}$ can be obtained by

$$p(\boldsymbol{\beta} \mid \hat{\boldsymbol{\beta}}) \propto \phi_p(\hat{\boldsymbol{\beta}}; \boldsymbol{\beta}, \hat{\boldsymbol{V}}_{\beta})\,\pi(\boldsymbol{\beta}). \qquad (25)$$

We can use a shrinkage prior for $\pi(\boldsymbol{\beta})$ in (25) if necessary. Once $\boldsymbol{\beta}$ is generated from (25), the posterior values of $\bar{Y}$ are generated from (24) for the given $\boldsymbol{\beta}$.

This formulation enables us to define the approximate posterior distribution of $\boldsymbol{\beta}$ of the form (11), so that the approximate Bayesian inference for $\bar{Y}$ can be carried out in the same way as in the linear regression case. Note that Theorem 1 still holds under this general setup as long as the regularity conditions given in the Appendix are satisfied.
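For the logistic case, where $h(\boldsymbol{x}_i; \boldsymbol{\beta}) = \boldsymbol{x}_i$, the estimating equation $\hat{U}(\boldsymbol{\beta}) = 0$ can be solved by design-weighted Newton iterations. The sketch below uses made-up population values and Poisson sampling purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative binary-y population with an intercept column
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([0.5, 1.0, 0.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# Poisson sample with equal inclusion probabilities
pi = np.full(N, 0.03)
A = rng.random(N) < pi
xs, ys, w = X[A], y[A], 1.0 / pi[A]

# Solve U(beta) = sum_{i in A} w_i (y_i - m_i) x_i = 0 by Newton's method
beta = np.zeros(3)
for _ in range(25):
    m = 1.0 / (1.0 + np.exp(-xs @ beta))
    U = xs.T @ (w * (ys - m))
    H = (xs * (w * m * (1 - m))[:, None]).T @ xs   # -dU/dbeta
    beta = beta + np.linalg.solve(H, U)

# Model-assisted estimator for the population mean of y
m_pop = 1.0 / (1.0 + np.exp(-X @ beta))
m_samp = 1.0 / (1.0 + np.exp(-xs @ beta))
Y_reg = (m_pop.sum() + np.sum(w * (ys - m_samp))) / N
print(beta, Y_reg)
```

Since the model contains an intercept and $\hat{\boldsymbol{h}}_i = \boldsymbol{x}_i$, the weighted residuals sum to zero at the solution, so the bias-correction term vanishes and the estimator reduces to the population average of the fitted values.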

## 6 Simulation

We investigate the performance of the proposed approximate Bayesian methods against standard frequentist methods through two limited simulation studies. In the first simulation, we consider a linear regression model for a continuous study variable. In the second simulation, we consider a binary study variable and apply the logistic regression model for non-linear regression estimation.

### 6.1 Simulation study: linear regression

In the first simulation, we generate , , from a multivariate normal distribution with mean vector and variance-covariance matrix , where and the -th element of is . The response variables are generated from the following linear regression model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p^*} x_{ip^*} + \varepsilon_i, \qquad i = 1, \ldots, N,$$

where , , , , , and the other ’s are set to zero. For the dimension of the auxiliary information, we consider four scenarios for of and . For each , we assume that we can access only a subset of the full information . Note that for all scenarios the auxiliary variables significantly related with are included, and so only the amount of irrelevant information gets larger as gets larger. We consider two scenarios for the sampling probability: (A) and (B) . The sampled units are selected via Poisson sampling, which leads to an average sample size of around 400 in both scenarios. The parameter of interest is . We assume that is known for all .

For the simulated datasets, we apply the proposed approximate Bayesian methods with the uniform prior, the Laplace prior, and the horseshoe prior (18) for $\boldsymbol{\beta}_1$, denoted by AB, ABL and ABH, respectively. For all the Bayesian methods, we use . We generate 5,000 posterior samples of $\bar{Y}$ after discarding the first 500 samples and compute the posterior mean of $\bar{Y}$ as the point estimate. As for the frequentist methods, we apply the original generalized regression estimator without variable selection (GREG), as well as the GREG method with Lasso regularization (GREG-L; McConville et al., 2017) and with ridge estimation of $\boldsymbol{\beta}$ (GREG-R; Rao and Singh, 1997). We also apply the Horvitz-Thompson (HT) estimator as a benchmark for efficiency comparison. In GREG-L, the tuning parameter is selected via 10-fold cross-validation, and we use the gamma prior