Probability sampling is regarded as the gold-standard in survey statistics for finite population inference. Fundamentally, probability samples are selected under known sampling designs and therefore are representative of the target population. However, many practical challenges arise in collecting and analyzing probability sample data such as cost, time duration, and increasing non-response rates (Keiding and Louis, 2016). As the advancement of technology, big non-probability samples become increasingly available for research purposes, such as remote sensing data, Internet samples, etc. Although non-probability samples do not contain information on the sampling mechanism, they provide rich information about the target population and can be potentially helpful for finite population inference. These complementary features of probability samples and non-probability samples raise the question of whether it is possible to develop data integration methods that leverage the advantages of both data sources.
Existing methods for data integration can be categorized into three types. The first type is the so-called propensity score adjustment (Rosenbaum and Rubin, 1983). In this approach, the probability of a unit being selected into the non-probability sample, which is referred to as the propensity or sampling score, is modeled and estimated for all units in the non-probability sample. The subsequent adjustments, such as propensity score weighting or stratification, can then be used to adjust for selection biases; see, e.g., Lee and Valliant (2009), Valliant and Dever (2011), Elliott et al. (2017) and Chen et al. (2018). Stuart et al. (2011; 2015) and Buchanan et al. (2018) use propensity score weighting to generalize results from randomized trials to a target population. O’Muircheartaigh and Hedges (2014) propose propensity score stratification for analyzing a non-randomized social experiment. One notable disadvantage of the propensity score methods is that they rely on an explicit propensity score model and are biased and highly variable if the model is misspecified (Kang and Schafer, 2007). The second type uses calibration weighting (Deville and Särndal, 1992, Kott, 2006). This technique calibrates auxiliary information in the non-probability sample with that in the probability sample, so that after calibration the weighted distribution of the non-probability sample is similar to that of the target population (DiSogra et al., 2011)
. The third type is mass imputation, which imputes the missing values for all units in the probability sample. In the usual imputation for missing data analysis, the respondents in the sample constitute a training dataset for developing an imputation model. In the mass imputation, an independent non-probability sample is used as a training dataset, and imputation is applied to all units in the probability sample; see, e.g.,Breidt et al. (1996), Rivers (2007), Kim and Rao (2012), Chipperfield et al. (2012), Bethlehem (2016), and Yang and Kim (2018).
be a vector of auxiliary variables (including an intercept) that are available from two data sources, and letbe the study variable of interest. We consider combining a probability sample with , referred to as Sample A, and a non-probability sample with , referred to as Sample B, to estimate the population mean of . Because the sampling mechanism of a non-probability sample is unknown, the target population quantity is not identifiable in general. Researchers rely on an identification strategy that requires a non-informative sampling assumption imposed on the non-probability sample. To ensure this assumption holds, researchers should control for all covariates that are predictors of both sampling and the outcome variable. In practice, subject matter experts will recommend a rich set of potential useful variables but will not identify the exact variables to adjust for. In the presence of many auxiliary variables, variable selection becomes important, because existing methods may become unstable or even infeasible, and irrelevant auxiliary variables can introduce a large variability in estimation. There is a large literature on variable selection methods for prediction, but little work on variable selection for data integration that can successfully recognize the strengths and the limitations of each data source and utilize all information captured for finite population inference. Gao and Carroll (2017) propose a pseudo-likelihood approach to combining multiple non-survey data with high dimensionality; this approach requires all likelihoods be correctly specified and therefore is sensitive to model misspecification. Robust inference has not been addressed in the context of data integration with high-dimensional data.
We propose a doubly robust variable selection and estimation strategy that harnesses the representativeness of the probability sample and the outcome and covariate information in the non-probability sample. The double robustness entails that the final estimator is consistent for the true value if either the probability of selection into the non-probability sample, referred to as the sampling score, or the outcome model is correctly specified, not necessarily both (a double robustness condition); see, e.g., Bang and Robins (2005), Tsiatis (2006), Cao et al. (2009), and Han and Wang (2013). To handle potentially high-dimensional covariates, our strategy separates the variable selection step and the estimation step for finite population mean to achieve two different goals.
In the first step, we select a set of variables that are important predictors of either the sampling score or the outcome model by penalized estimating equations. Following most of the empirical literature, we assume the sampling score follows a logistic regression model with the unknown parameterand the outcome follows a generalized linear model (accommodating different types of the outcome) with the unknown parameter . Importantly, we separate the estimating equations for and in order to achieve stability in variable selection under the double robustness condition. Specifically, we construct the estimating equation for by calibrating the weighted average of from Sample B, weighted by the inverse of the sampling score, to the design weighted average of from Sample A (i.e., a design estimate of population mean of ). We construct the estimating equation for by minimizing the standard least squared error loss under the outcome model. To establish the selection properties, we consider the “large , diverging
” framework. To the best of our knowledge, the asymptotic properties of penalized estimating estimation based on survey data have not been studied in the literature. Our major technical challenge is that under the finite population framework, the sampling indicator of Sample A may not be independent even under simple random sampling. To overcome this challenge, we construct martingale random variables with a weak dependence that allows applying Bernstein inequality. This construction is innovative and crucial in establishing our new selection consistency result.
In the second step, we consider a doubly robust estimator of , and re-estimate based on the joint set of variables selected from the first step. We propose using different estimating equations to estimate , derived by minimizing the asymptotic squared bias of . This estimation strategy is not new; see, e.g., Kim and Haziza (2014) for missing data analyses and Vermeulen and Vansteelandt (2015; 2016) for causal inference of treatment effects in low-dimensional data; however, we demonstrate its new role in high-dimensional data to mitigate the possible selection error in the first step. In essence, our strategy for estimating renders the first order term in the Taylor expansion of with respect to
to be exactly zero, and the remaining terms are negligible under regularity conditions. Therefore, the proposed estimator allows model misspecification of either the sampling score or the outcome model. Moreover, we propose a consistent variance estimator allowing for doubly robust inferences.
The paper proceeds as follows. Section 2 provides the basic setup of the paper. Section 3 presents the proposed two-step procedure for variable selection and doubly robust estimation of the finite population mean. Section 4 describes the computation algorithm for solving penalized estimating equations. Section 5 presents the theoretical properties for variable selection and doubly robust estimation. Section 6 reports simulation results that illustrate the finite-sample performance of the proposed method. In Section 7, we present an application to analyze a non-probability sample collected by the Pew Research Centre. Section 8 concludes with a discussion. We relegate all proofs to the supplementary material.
2 Basic Setup
2.1 Notation: Two Samples
Let be the index set of units for the finite population, with being the known population size. The finite population consists of . Let the parameter of interest be the finite population mean . We consider two data sources: one from a probability sample, referred to as Sample A, and the other one from a non-probability sample, referred to as Sample B. Table 1 illustrates the observed data structure. Sample A consists of observations with sample size where is known throughout Sample A, and Sample B consists of observations with sample size . We define and to be the indicators of selection to Sample A and Sample B, respectively. Although the non-probability sample contains rich information on , the sampling mechanism is unknown, and therefore we cannot compute the first-order inclusion probability for Horvitz–Thompson estimation. The naive estimators without adjusting for the sampling process are subject to selection biases (Meng, 2018). On the other hand, although the probability sample with sampling weights represents the finite population, it does not observe the study variable of interest.
|Sample weight||Covariate||Study Variable|
Sample A is a probability sample, and Sample B is a non-probability sample.
2.2 An Identification Assumption
Before presenting the proposed methodology for integrating the two data sources, we first discuss the identification assumption. Let be the conditional distribution of given in the superpopulation model that generates the finite population. We make the following primary assumption.
(i) The sampling indicator of Sample B and the response variable
of Sample B and the response variableis independent given ; i.e. , referred to as the sampling score , and (ii) for all , where .
Assumption 1 (i) implies that , denoted by , can be estimated based solely on Sample B. Assumption 1 (ii) specifies a lower bound of for the technicality in Section 5. A standard condition in the literature imposes a strict positivity in the sense that ; however, it implies that , which may be restrictive in survey sampling. Here, we relax this condition and allow , where can be strictly less than .
Assumption 1 is a key assumption for identification. Under Assumption 1, is identifiable based on Sample A by or Sample B by . However, this assumption is not verifiable from the observed data. To ensure this assumption holds, researchers often consider many potentially predictors for the sampling indicator or the outcome , resulting in a rich set of variables in .
2.3 Existing Estimators
In practice, the sampling score function and the outcome mean function are unknown and need to be estimated from the data. Let and be the posited models for and , respectively, where and are unknown parameters. First, under Assumption 1, we can obtain by fitting the outcome model based solely on . Second, following Valliant and Dever (2011), we can obtain by fitting the sampling score model based on the blended data , weighted by the design weights from Sample A. The resulting estimator is valid if the size of Sample B is relatively small (Valliant and Dever, 2011).
Given and , researchers have proposed different estimators for . We provide examples below and discuss their properties and limitations.
Example 1 (Inverse probability of sampling score weighting)
The inverse probability of sampling score weighting estimator is
The justification for relies on the correct specification of and the consistency of . If is misspecified or is inconsistent, is biased.
Example 2 (Calibration weighting)
The calibration weighting estimator is
The justification for relies on the linearity of the outcome model, i.e., for some , or the linearity of the inverse probability of sampling weight, i.e., for some (Fuller, 2009; Theorem 5.1). The linearity conditions are unlikely to hold for non-continuous variables. In these cases, is biased.
Example 3 (Outcome regression based on Sample A)
The outcome regression estimator is
The justification for relies on the correct specification of and the consistency of . If is misspecified or is inconsistent, is biased.
Example 4 (Doubly robust estimator)
The doubly robust estimator is
The estimator is doubly robust with fixed-dimensional (Chen et al., 2018), in the sense that it achieves the consistency if either or is correctly specified, but not necessarily both. The double robustness is attractive; therefore, we shall investigate the potential of in high-dimensional data.
3 Methodology in High-Dimensional Data
A major challenge arises in the presence of a large number of covariates, not all of them are necessary for making inference of the population mean of the outcome. This necessitates variable selection. For simplicity of exposition, we introduce the following notation. For any vector , denote the number of nonzero elements in as , the -norm as , the -norm as , and the -norm as . For any , let be the sub-vector of formed by elements of whose indexes are in . Let be the complement of . For any and matrix , let be the sub-matrix of formed by rows in and columns in . Following the literature on variable selection, we can first standardize the covariates so that approximately they have variances equal to one to stabilize the variable selection procedure. We make the following modeling assumptions.
Assumption 2 (Sampling score model)
a logistic regression model for ; i.e.,
for . Define to be the -dimensional
parameter that minimizes the Kullback-Leibler divergence
-dimensional parameter that minimizes the Kullback-Leibler divergence
Assumption 3 (Outcome model)
We assume a generalized linear regression model
We assume a generalized linear regression model for; i.e. for , where is a link function by an abuse the notation. Define
The models and are working models and they may be misspecified. If the sampling score model is correctly specified, . If the outcome model is correctly specified, .
The proposed procedure consists of two steps: the first step selects important variables in the sampling score model and the outcome model, and the second step focuses on doubly robust estimation of the population mean.
In the first step, we propose solving penalized estimating equations for variable selection. To select important variables in
, the traditional loss function under the logistic regression model,
is not feasible, because it requires the availability of the population information on . To overcome this difficulty, the key insight is that under Assumption 2,
Therefore, we define the estimating function for as
To select important variables in , under Assumption 1, we have . Therefore, we define the estimating function for as
Let be the joint estimating function for . When is large, following Johnson et al. (2008), we consider solving the penalized estimating function
for , where and are some continuous functions, is the element-wise product of and and is the element-wise product of and . We let , where is some penalization function. Although the same discussion applies to different non-concave penalty functions, we specify to be a folded-concave smoothly clipped absolute deviation (SCAD) penalty function (Fan and Lv, 2011). Accordingly, we have
for , where is the truncated linear function; i.e., if , , and if , . Following the suggestion of Fan and Lv (2011), we use . We select the variables if the corresponding estimates of (5) are nonzero in either the sampling score or the outcome model, indexed by .
To help understand the penalized estimating equation, we discuss two scenarios. If is large, then is zero, and therefore is not penalized. Whereas, if is small but nonzero, then is large, and therefore is penalized with a penalty term. The penalty term then forces to be zero and excludes the th element in from the final selected set of variables. The same discussion applies to and .
In the second step, we consider the estimator of the population mean in (4) with re-estimated based on . As we will show in Section 5, contains the true important variables in either the sampling score model or the outcome model with probability approaching one (the oracle property). Therefore, if either the sampling score model or the outcome model is correctly specified, the asymptotic bias of is zero; however, if both models are misspecified, the asymptotic bias of is
In order to minimize , we consider the joint estimating function
for estimating , constrained on .
The two steps use different estimating functions (5) and (7), respectively, for selection and estimation with the following advantages. First, (5) separates the selection for and in and , so it stabilizes the selection procedure if either the sampling score model or the outcome model is misspecified. Second, using the joint estimating function (7) for estimation leads to an attractive feature in the estimation of : this estimating strategy mitigates the possible first-step selection error and renders root- consistent if either the sampling probability or the outcome model is correctly specified in high-dimensional data. We relegate the details to Section 5.
In summary, our two-step procedure for variable selection and estimation is as follows. spacing
To facilitate joint selection of variables for the sampling score and outcome, solve the penalized joint estimating equations in (5), denoted by . Let and .
Let the set of variables for estimation be . Obtain the proposed estimator as
where and are obtained by solving the joint estimating equations (7) for and with and .
Variable selection circumvents the instability or infeasibility of direct estimation of with high-dimensional . Moreover, in Step 2 for estimation, we consider a union of covariates , where . Brookhart et al. (2006) and Shortreed and Ertefaie (2017) show that including variables that are related to the outcome in the propensity score model will increase the precision of the estimated average treatment effect without increasing bias. This implies that an efficient variable selection and estimation method should take into account both sampling-covariate and outcome-covariate relationships. As a result, may have a better performance than the oracle estimator which uses the true important variables in the sampling score and the outcome model. This is particularly true when one of the models is misspecified. Our simulation study in Section 6 demonstrates that with variable selection has a similar performance as the orcale estimator for the continuous outcome and outperforms the oracle estimator for the binary outcome.
In this section, we discuss the computation for solving the penalized estimating function (5). Following Johnson et al. (2008), we use an iterative algorithm that combines the Newton–Raphson algorithm for solving estimating equation and the minorization-maximization algorithm for non-convex penalty of Hunter and Li (2005).
First, by the minorization-maximization algorithm, the penalized estimator solving (5) satisfies
for is a predefined small number. In our implementation, we choose to be .
Second, we solve (9) by the Newton-Raphson algorithm. It may be challenging to implement the Newton-Raphson algorithm directly, because it involves inverting a large matrix. For ease and stability in those cases, we can use a coordinate decent algorithm (Friedman et al., 2007) by cycling through and updating each of the coordinates.
Following most of the empirical literature, we assume that follows a logistic regression model. Define for . We denote
Let start at an initial value . With the other coordinates fixed, the th Newton-Raphson update for (), the th element of , is
where and are the th diagonal elements in and , respectively. The procedure cycles through all the elements of and is repeated until convergence.
We use -fold cross-validation to select the tuning parameter . To be specific, we partition both samples into approximately equal sized subsets and pair subsets of Sample A and subsets of Sample B randomly. Of the pairs, we retain one single pair as the validation data and the remaining pairs as the training data. We fit the models based on the training data and estimate the loss function based on the validation data. We repeat the process times, with each of the pairs used exactly once as the validation data. Finally, we aggregate the estimated loss function. We select the tuning parameter as the one that minimizes the aggregated loss function over a pre-specified grid.
Because the weighting estimator uses the sampling score to calibrate the distribution of between Sample B and the target population, we use the following loss function for selecting :
where is the penalized estimator with the tuning parameter . We use the prediction error loss function for selecting :
where is the penalized estimator with the tuning parameter .
5 Asymptotic Results for Variable Selection and Estimation
We establish the asymptotic properties for the proposed double variable selection and doubly robust estimation. We can establish theoretical results for general sampling mechanisms for Sample A requiring specific regularity conditions. In this section, for technical convenience, we assume that Sample A is collected by simple random sampling or Poisson sampling with the following regularity conditions.
For all , , where .
Let the support of model parameters be
Define , , , and .
The following regularity conditions hold.
The parameter belongs to a compact subset in , and lies in the interior of the compact subset.
are fixed and uniformly bounded.
There exist constants and such that
are the minimum and the maximum eigenvalue of a matrix, respectively.
Let be the th residual. There exists a constant such that for all and some . There exist constants and such that for all .
, , and are uniformly bounded away from on for some .
and as .
, , , , , , as .
These assumptions are typical in the penalization literature. (A2) specifies a fixed design which is well suited under the finite population inference framework.
(A4) holds for Gaussian distribution, sub-Gaussian distribution, and so on. (A5) holds for common models. For example, for the linear regression model withthen
For the logistic regression model with , then
Under these models, (A1) and (A2) imply (A5). (A7) specifies the restrictions on the dimension of covariates and the dimension of the true nonzero coefficients . To gain insight, when the true model size is fixed, (A7) holds for , i.e., can be the same size as .
We establish the asymptotic properties of the penalized estimating equation procedure.
Results (12) and (13) imply that . Results (14) and (15) imply that with probability approaching to one, the penalized estimating equation procedure would not over-select irrelevant variables and estimate the true nonzero coefficients at the convergence rate, which is the so-called oracle property of variable selection.
Proof (Theorem 1). We provide a proof for a key step below and defer the complete proof till the supplementary material. The major technical hurdle for the proof is induced by the finite population inference framework, which does not exist in the conventional high-dimensional data setting. To help understand the problem, assume that Sample A is selected under simple random sampling. A major step in the proof is to show that Bernstein’s inequality holds for , where and . It is important to note that the ’s () are not independent random variables, because and are dependent for any . To overcome the technical challenge, we decompose