In the last decade, interest in machine learning methods has been growing in national statistical offices (NSOs). These data-driven methods provide flexible tools for obtaining accurate predictions. The increasing availability of data sources (e.g., big data sources and satellite information) provides a rich pool of potential predictors that may be used to obtain predictions at different stages of a survey. These stages include the nonresponse treatment stage (e.g., propensity score weighting and imputation) and the estimation stage (e.g., model-assisted estimation and small area estimation). The imputation stage is the focus of the current paper.
Item nonresponse refers to the presence of missing values for some, but not all, survey variables. Frequent causes of item nonresponse include refusal to answer a sensitive question (e.g., income) and edit failures. The most common way of treating item nonresponse in NSOs is to replace a missing value with a single imputed value, constructed on the basis of a set of explanatory variables available for both respondents and nonrespondents. A variety of imputation procedures are available, ranging from simple (e.g., mean, historical and ratio imputation) to more complex (e.g., nonparametric procedures); see, e.g., Chen_Haziza2019 for an overview of imputation procedures in surveys. Every imputation procedure makes some (implicit or explicit) assumptions about the distribution of the variable requiring imputation. This set of assumptions is often referred to as an imputation model. At the imputation stage, it is therefore important to identify and include in the model all the appropriate explanatory variables that are predictive of the variable $y$ requiring imputation, and to determine a suitable model describing the relationship between $y$ and the set of explanatory variables $\mathbf{x}$.
We distinguish parametric imputation procedures from nonparametric imputation procedures. In parametric imputation, the shape of the relationship between $y$ and $\mathbf{x}$ is predetermined; e.g., linear and generalized linear regression models. However, point estimators based on parametric imputation procedures may suffer from bias if the form of the model is misspecified or if the specified model fails to include interactions or predictors accounting for curvature. In contrast, with nonparametric methods, the shape of the relationship between $y$ and $\mathbf{x}$ is left unspecified. These methods have the ability to capture nonlinear trends in the data and tend to be robust to the non-inclusion of interactions or predictors accounting for curvature.
Commonly used nonparametric methods include kernel smoothing, local polynomial regression and spline-based regression models. While these methods provide some robustness against model misspecification, they tend to break down when the dimension of $\mathbf{x}$ is large, a problem known as the curse of dimensionality. To mitigate this problem, one may employ additive models (Hastie1986) or machine learning methods. However, both a theoretical treatment and an empirical comparison of machine learning imputation procedures in the context of missing survey data are currently lacking. In this paper, we aim to fill the latter gap by conducting an extensive simulation study that investigates the performance of several nonparametric and machine learning procedures in terms of bias and efficiency. To that end, we generated several finite populations with relationships between $y$ and $\mathbf{x}$ ranging from simple to complex, and generated the missing values according to several nonresponse mechanisms. We also considered both low-dimensional and high-dimensional settings. The simulation setup and the models are described in Section 4.
In our simulation, we compared the following procedures: the score method (little1986survey; Haziza2007), $K$ nearest-neighbour imputation (chen2000nearest), generalized additive models based on B-spline regression, regression trees (Breiman_Friedman_Olshen_Stone1984), random forests (breiman2001random), tree boosting methods (friedman_2001; chen2016), Bayesian additive regression trees (Chipman2010), support vector regression (Vapnik1998; Vapnik2000) and the Cubist algorithm (quinlan1992learning; quinlan1993combining). These methods are briefly described in Section 3.
2 The setup

Consider a finite population $U$ of size $N$. We are interested in estimating the population total, $t_y = \sum_{k \in U} y_k$, of a survey variable $y$. From $U$, we select a sample $s$, of size $n$, according to a sampling design $p(s)$ with first-order inclusion probabilities $\pi_k = P(k \in s)$.
A complete data estimator of $t_y$ is the well-known Horvitz-Thompson (HT) estimator:

$\widehat{t}_{\pi} = \sum_{k \in s} \frac{y_k}{\pi_k}. \qquad (1)$
Provided that $\pi_k > 0$ for all $k \in U$, the estimator (1) is design-unbiased for $t_y$.
In practice, the $y$-variable may be prone to missing values. Let $r_k$ be a response indicator such that $r_k = 1$ if $y_k$ is observed and $r_k = 0$, otherwise. Let $s_r$ denote the set of respondents, of size $n_r$, and $s_m$ the set of nonrespondents, of size $n_m$, such that $s_r \cup s_m = s$ and $s_r \cap s_m = \emptyset$. Available to the imputer are the data $(y_k, \mathbf{x}_k)$ for $k \in s_r$, where $\mathbf{x}_k$ denotes a vector of explanatory variables, as well as the values of the vector $\mathbf{x}_k$ for $k \in s_m$.
Let $y_k^*$ be the imputed value used to replace the missing value $y_k$. An imputed estimator of $t_y$ is then given by

$\widehat{t}_{imp} = \sum_{k \in s_r} \frac{y_k}{\pi_k} + \sum_{k \in s_m} \frac{y_k^*}{\pi_k}. \qquad (2)$
To construct the imputed values $y_k^*$, we postulate the following imputation model $m$:

$y_k = m(\mathbf{x}_k) + \epsilon_k, \qquad (3)$

where $m(\cdot)$ is an unknown function and the errors $\epsilon_k$ are independent with $E_m(\epsilon_k \mid \mathbf{x}_k) = 0$.
Often, the variance structure $V_m(\epsilon_k \mid \mathbf{x}_k)$ is assumed to be of the form $\sigma^2 c_k$, where $c_k$ is a known coefficient attached to unit $k$ and $\sigma^2$ is an unknown parameter.
The imputed values are generated from the empirical distribution of $y$ conditional on the set of explanatory variables $\mathbf{x}$, observed among the respondents. The validity of the imputed estimator (2) depends on whether or not the following condition holds:

$E(y_k \mid \mathbf{x}_k, r_k = 1) = E(y_k \mid \mathbf{x}_k, r_k = 0). \qquad (4)$
That is, the first moment of the distribution of $y$ given $\mathbf{x}$ is the same for both respondents and nonrespondents. A sufficient condition for (4) to hold is that the data are Missing At Random (Rubin1976).
Because the effective sample size $n_r$ is smaller than the initially planned sample size $n$, the imputed estimator (2) is generally less efficient than the complete data estimator (1). The additional variance is called the nonresponse variance. That is, the total variance of (2) is the sum of the sampling variance, arising from the selection of $s$, and the nonresponse variance.
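To fix ideas, the two estimators can be sketched in a few lines of numpy on toy data. All numbers below (the $y$-values, inclusion probabilities and response pattern) are illustrative assumptions, and the respondent mean is used only as a placeholder imputation rule:

```python
import numpy as np

# Toy sample of n = 6 units with inclusion probabilities pi_k (assumed values).
y = np.array([10.0, 12.0, 8.0, 15.0, 11.0, 9.0])
pi = np.array([0.2, 0.2, 0.25, 0.25, 0.5, 0.5])
r = np.array([1, 1, 0, 1, 0, 1])  # response indicators r_k

# Complete-data Horvitz-Thompson estimator (1): sum of y_k / pi_k.
t_ht = np.sum(y / pi)

# Imputed estimator (2): respondents keep y_k, nonrespondents get y_k^*
# (here, respondent-mean imputation as a stand-in for any imputation rule).
y_star = np.where(r == 1, y, y[r == 1].mean())
t_imp = np.sum(y_star / pi)
```

On this toy data, `t_ht` and `t_imp` differ only through the two imputed values, which is exactly the source of the nonresponse variance discussed above.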
3 A description of imputation methods
3.1 Parametric regression imputation
Deterministic parametric regression imputation assumes that the first moment (3) is given by

$m(\mathbf{x}_k) = f(\mathbf{x}_k^{\top} \boldsymbol{\beta}), \qquad (5)$
where $\boldsymbol{\beta}$ is a vector of coefficients to be estimated and $f(\cdot)$ is a predetermined function. An estimator $\widehat{\boldsymbol{\beta}}_r$ of $\boldsymbol{\beta}$ is obtained by solving the estimating equations based on the responding units:

$\sum_{k \in s_r} \frac{\omega_k}{c_k}\left\{y_k - f(\mathbf{x}_k^{\top}\boldsymbol{\beta})\right\}\mathbf{x}_k = \mathbf{0}, \qquad (6)$
where $\omega_k$ is a weight attached to element $k$. Common choices for $\omega_k$ include $\omega_k = 1$ and $\omega_k = 1/\pi_k$ (Chen_Haziza2019). The imputed value under parametric regression imputation is given by

$y_k^* = f(\mathbf{x}_k^{\top}\widehat{\boldsymbol{\beta}}_r), \quad k \in s_m. \qquad (7)$
In the linear case, $f$ is the identity function and the solution of (6) is

$\widehat{\boldsymbol{\beta}}_r = \left(\sum_{j \in s_r}\frac{\omega_j}{c_j}\mathbf{x}_j\mathbf{x}_j^{\top}\right)^{-1}\sum_{j \in s_r}\frac{\omega_j}{c_j}\mathbf{x}_j y_j. \qquad (8)$

The imputed value can then be written as a weighted sum of the respondent $y$-values as follows:

$y_k^* = \sum_{j \in s_r} W_{jk}\, y_j, \qquad (9)$
where $W_{jk} = \frac{\omega_j}{c_j}\,\mathbf{x}_j^{\top}\left(\sum_{\ell \in s_r}\frac{\omega_\ell}{c_\ell}\mathbf{x}_\ell\mathbf{x}_\ell^{\top}\right)^{-1}\mathbf{x}_k$. If the intercept is among the $\mathbf{x}$-variables, then $\sum_{j \in s_r} W_{jk} = 1$ for all $k \in s_m$.
Another important special case of (5) is the logistic regression model, $m(\mathbf{x}_k) = \exp(\mathbf{x}_k^{\top}\boldsymbol{\beta})/\{1 + \exp(\mathbf{x}_k^{\top}\boldsymbol{\beta})\}$, which can be used for modeling binary variables. An estimator of $\boldsymbol{\beta}$ is obtained by solving (6), which requires a numerical algorithm such as the Newton-Raphson procedure. Typically, a missing value of a binary variable is imputed by $y_k^* = \widetilde{y}_k$, where $\widetilde{y}_k$ is a realization of a Bernoulli variable with parameter $f(\mathbf{x}_k^{\top}\widehat{\boldsymbol{\beta}}_r)$.
Under deterministic or random parametric regression imputation, the imputed estimator is consistent for $t_y$, provided that the first moment (3) of the imputation model is correctly specified; see, e.g., Chen_Haziza2019. Variance estimation under deterministic parametric regression imputation has been considered in a number of research outputs; e.g., sarndal_1992, kim2009unified, haziza_2009, beaumont_bissonnette and hazizaVariance2020.
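The linear special case of (7)-(8) can be sketched in numpy. The data-generating values below are illustrative assumptions, and we take $\omega_k = 1$ and $c_k = 1$ (homoscedastic, unweighted fit) for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)  # assumed linear model
r = rng.random(n) < 0.7                            # ~70% respondents

# Design matrix with an intercept; omega_k = 1 and c_k = 1 assumed, so
# beta_hat_r in (8) reduces to ordinary least squares on the respondents.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X[r], y[r], rcond=None)

# Deterministic regression imputation (7) with f the identity function.
y_imp = y.copy()
y_imp[~r] = X[~r] @ beta_hat
```

With the intercept included, the implied weights $W_{jk}$ automatically sum to one, as noted above.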
3.2 Imputation classes: the score method
The score method (little1986survey; Haziza2007) consists of partitioning the sample $s$ into $G$ (say) imputation classes and imputing the missing values within each class, independently from one class to another. One implementation of the method is as follows:
For all $k \in s$, compute the preliminary values $\widehat{m}(\mathbf{x}_k) = \mathbf{x}_k^{\top}\widehat{\boldsymbol{\beta}}_r$, where $\widehat{\boldsymbol{\beta}}_r$ is given by (8).
Compute the empirical quantiles $q_1, \ldots, q_{G-1}$ of order $1/G, \ldots, (G-1)/G$ of the $\widehat{m}(\mathbf{x}_k)$-values.
Split the sample $s$ into $G$ classes $s^{(1)}, \ldots, s^{(G)}$, such that $s^{(g)} = \{k \in s:\, q_{g-1} \le \widehat{m}(\mathbf{x}_k) < q_g\}$, with $q_0 = -\infty$ and $q_G = +\infty$.
It is common practice to use either mean imputation or random hot-deck imputation within classes. For mean imputation, the imputed value for missing $y_k$ in the $g$th imputation class is given by

$y_k^* = \frac{\sum_{j \in s_r^{(g)}} \omega_j\, y_j}{\sum_{j \in s_r^{(g)}} \omega_j},$

where $s_r^{(g)}$ denotes the set of respondents in the $g$th class; the imputed values are the same for all nonrespondents $k$ in the $g$th class. For random hot-deck imputation, the imputed value is given by $y_k^* = y_j$, where the donor $j$ is selected at random from the set of donors belonging to the $g$th imputation class with probability $\omega_j / \sum_{\ell \in s_r^{(g)}} \omega_\ell$.
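The steps above can be sketched in numpy with mean imputation within classes. The population, response rate and number of classes $G$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, G = 300, 5
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)
r = rng.random(n) < 0.7  # respondents

# Step 1: preliminary scores from a linear fit on the respondents.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[r], y[r], rcond=None)
score = X @ beta

# Steps 2-3: split the sample into G classes at the quantiles of the scores.
edges = np.quantile(score, np.linspace(0, 1, G + 1))
cls = np.clip(np.searchsorted(edges, score, side="right") - 1, 0, G - 1)

# Mean imputation within each class (omega_k = 1 assumed).
y_imp = y.copy()
for g in range(G):
    in_g = cls == g
    y_imp[~r & in_g] = y[r & in_g].mean()
```

Random hot-deck imputation would instead draw a donor at random among the respondents of the class.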
3.3 $K$-nearest neighbours imputation
$K$-nearest neighbour ($K$NN) imputation is a nonparametric imputation method; i.e., no explicit assumption is made about the regression function relating $y$ and $\mathbf{x}$. $K$NN is one of the simplest and most widely used nonparametric methods. $K$NN imputation consists of replacing the missing value of a recipient by the weighted average of the $y$-values of its $K$ closest respondents, determined with respect to the $\mathbf{x}$-variables and a given distance function.
Nearest-neighbour (NN) imputation is the limiting case of $K$NN obtained with $K = 1$. NN is a donor imputation method belonging to the class of hot-deck procedures (chen2000nearest), since the missing values are replaced by a true, actual $y$-value recorded for a respondent from the same file. NN imputation is especially useful for imputing categorical or discrete $y$-variables; see also yang2019nearest.
Let $s_r^{(k)}$ be the set of the $K$ respondent units nearest to unit $k$. Any distance function in $\mathbb{R}^p$ may be used to measure the closeness between the vectors $\mathbf{x}_k$ and $\mathbf{x}_j$, for $j \in s_r$. In the simulation study presented in Section 4, we used the customary Euclidean distance. The $K$NN imputed value for missing $y_k$ is thus given by

$y_k^* = \frac{\sum_{j \in s_r^{(k)}} \omega_j\, y_j}{\sum_{j \in s_r^{(k)}} \omega_j},$
where $\omega_j$ is a weight attached to element $j$. The imputed value obtained with $K$NN can be written as a weighted sum of the respondent $y$-values, as in (9),
where $W_{jk} = \omega_j\, \mathbb{1}\{j \in s_r^{(k)}\} / \sum_{\ell \in s_r} \omega_\ell\, \mathbb{1}\{\ell \in s_r^{(k)}\}$ for $j \in s_r$. $K$NN imputation is a locally weighted procedure, since the respondents not lying close enough to unit $k$ with respect to the $\mathbf{x}$-variables are assigned a zero weight, $W_{jk} = 0$. The discontinuous indicator function in the expression of $W_{jk}$ can be replaced by a one-dimensional continuous kernel smoother which controls the weight size through a tuning parameter: the units lying farther from unit $k$ will have smaller weights than the units lying close to it (hastie_tibshirani_friedman_2011).
If $\omega_j = 1$ for all $j \in s_r$, then $W_{jk} = 1/K$ for all $j \in s_r^{(k)}$, and the imputed value becomes simply the average of the $y$-values of the $K$ nearest neighbours:

$y_k^* = \frac{1}{K} \sum_{j \in s_r^{(k)}} y_j.$
The imputed estimator under $K$NN imputation tends to be less efficient when the dimension $p$ of $\mathbf{x}$ is large. Indeed, as $p$ increases, it becomes more difficult to find enough respondents around the point at which we aim to make a prediction. This phenomenon is known as the curse of dimensionality; see (hastie_tibshirani_friedman_2011, Chap. 1) for a more in-depth discussion of the $K$NN procedure. Also, the procedure suffers from a nonnegligible model bias. Under mild assumptions, chen2000nearest and beaumont_bocci_2009 showed that the imputed estimator under the NN procedure is asymptotically unbiased for $t_y$, and suggested variance estimators.
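The simple-average version of $K$NN imputation (uniform weights $1/K$, Euclidean distance) can be sketched as follows; the toy data in the usage example are assumptions for illustration:

```python
import numpy as np

def knn_impute(X_resp, y_resp, X_miss, k=5):
    """Impute each missing unit by the mean y-value of its k nearest
    respondents (Euclidean distance on the x-variables, weights 1/k)."""
    imputed = np.empty(len(X_miss))
    for i, x0 in enumerate(X_miss):
        d = np.linalg.norm(X_resp - x0, axis=1)
        nn = np.argsort(d)[:k]          # indices of the k closest respondents
        imputed[i] = y_resp[nn].mean()  # simple average of their y-values
    return imputed

# Toy usage: one nonrespondent at x = 1.1, imputed from its 3 neighbours.
X_resp = np.array([[0.0], [1.0], [2.0], [10.0]])
y_resp = np.array([0.0, 1.0, 2.0, 10.0])
imputed = knn_impute(X_resp, y_resp, np.array([[1.1]]), k=3)
```

Replacing the simple average by a kernel-weighted average would give the smoothed variant mentioned above.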
3.4 B-splines and additive model nonparametric regression
Spline regression is a flexible nonparametric method for fitting the nonlinear function $m(\cdot)$ from (3). It can be viewed as a simple extension of linear models. For simplicity, we start with a univariate $x$-variable supported on the interval $[0, 1]$. A spline function of order $q$, with $K$ equidistant interior knots $\xi_1 < \cdots < \xi_K$, is a piecewise polynomial of degree $q - 1$ between consecutive knots, smoothly connected at the knots. These spline functions span a linear space of dimension $K + q$, with a basis given by the $B$-spline functions $B_1(\cdot), \ldots, B_{K+q}(\cdot)$, built recursively from the degree-zero functions
where $B_{j,0}(x) = 1$ if $\xi_j \le x < \xi_{j+1}$ and equal to zero, otherwise; see schumaker_1981 and dierckx_1993. The $B$-spline basis is appealing because the basis functions are strictly local: each function $B_j$ is nonzero only between a small number of consecutive knots (zhou_shen_wolfe_1998), which means that its support consists of a small, fixed, finite number of intervals between knots. The unknown function $m(\cdot)$ is then approximated by a linear combination of basis functions, with coefficients determined by a least squares criterion computed on the respondent data (gogahaziza_splines2019). The missing value $y_k$ is then imputed by $y_k^* = \widehat{m}(x_k)$, where
$\widehat{m}(x) = \mathbf{b}(x)^{\top}\widehat{\boldsymbol{\theta}}_r,$

with $\mathbf{b}(x) = (B_1(x), \ldots, B_{K+q}(x))^{\top}$ denoting the vector of $B$-spline basis functions, and $\widehat{\boldsymbol{\theta}}_r$ minimizes

$\sum_{k \in s_r} \omega_k \left\{y_k - \mathbf{b}(x_k)^{\top}\boldsymbol{\theta}\right\}^2. \qquad (10)$
The expression of $\widehat{\boldsymbol{\theta}}_r$ is similar to that of $\widehat{\boldsymbol{\beta}}_r$ obtained with linear regression imputation, given in (8). Unlike linear regression, the regressors used here are the $B$-spline functions, and their number varies as a function of the chosen number of knots $K$ and the order $q$ of the $B$-spline functions. The degree of the piecewise polynomial does not seem to have a great impact on the model fit; in practice, quadratic and cubic splines are typically used. Knots are usually placed at the quantiles of the $x$-values, and their number may have a great effect on the model fit: a large $K$ will lead to overfitting, in which case a penalization term should be added to (10), while a small value of $K$ may lead to underfitting. gogahaziza_splines2019 studied the properties of the imputed estimator under B-spline imputation and derived corresponding variance estimators.
The imputed value with $B$-spline regression can also be written as a weighted sum of the respondent $y$-values, similar to (9), with weights now given by $W_{jk} = \omega_j\,\mathbf{b}(x_j)^{\top}\left(\sum_{\ell \in s_r}\omega_\ell\,\mathbf{b}(x_\ell)\mathbf{b}(x_\ell)^{\top}\right)^{-1}\mathbf{b}(x_k)$. As in linear regression imputation, these weights do not depend on the $y$-values and sum to one. Unlike linear regression imputation, the weights are local, owing to the $B$-spline functions, ensuring more flexibility to model local nonlinear trends in the data.
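A minimal sketch of B-spline regression imputation, using the standard Cox-de Boor recursion to build the basis. The degree, knot placement and data below are illustrative assumptions (noise-free respondents, evaluation restricted to $[0, 1)$):

```python
import numpy as np

def bspline_basis(x, t, p):
    """B-spline design matrix via the Cox-de Boor recursion; `t` is an open
    knot vector (boundary knots repeated p+1 times), p the spline degree."""
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])   # degree-0 indicators
    for d in range(1, p + 1):
        Bn = np.zeros((len(x), len(t) - d - 1))
        for j in range(len(t) - d - 1):
            term = np.zeros(len(x))
            if t[j + d] > t[j]:
                term += (x - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                term += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
            Bn[:, j] = term
        B = Bn
    return B

# Cubic splines (degree 3) with 3 equidistant interior knots on [0, 1).
p = 3
t = np.r_[np.zeros(p + 1), [0.25, 0.5, 0.75], np.full(p + 1, 1.0)]

rng = np.random.default_rng(5)
x_resp = rng.uniform(0, 0.999, 200)   # respondent x-values (assumed)
y_resp = np.sin(2 * np.pi * x_resp)   # smooth trend to be recovered
theta, *_ = np.linalg.lstsq(bspline_basis(x_resp, t, p), y_resp, rcond=None)

x_miss = np.array([0.1, 0.4, 0.8])    # nonrespondent x-values
y_star = bspline_basis(x_miss, t, p) @ theta     # B-spline imputed values
unity = bspline_basis(x_miss, t, p).sum(axis=1)  # partition-of-unity check
```

The partition-of-unity property of the basis is what makes the implied weights sum to one, as noted above.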
We now consider the multivariate case but, for ease of presentation, we confine the discussion to the case of two predictors, $x_1$ and $x_2$. Additive models are the simplest way to model nonlinear trends in the data while considering several auxiliary variables (Hastie1986). The relationship between $y$ and $(x_1, x_2)$ is expressed as a linear combination of unknown smooth functions $m_1$ and $m_2$:

$y_k = \alpha + m_1(x_{1k}) + m_2(x_{2k}) + \epsilon_k,$
where the $\epsilon_k$ are independent errors with mean equal to zero. Possible interactions between $x_1$ and $x_2$ are not considered with additive models. The unknown functions $m_1$ and $m_2$ can be estimated by using two $B$-spline bases, $\mathbf{b}_1(\cdot)$ and $\mathbf{b}_2(\cdot)$, respectively, which leads to $\widehat{m}_1(x_1) = \mathbf{b}_1(x_1)^{\top}\widehat{\boldsymbol{\theta}}_1$ and $\widehat{m}_2(x_2) = \mathbf{b}_2(x_2)^{\top}\widehat{\boldsymbol{\theta}}_2$, where $\widehat{\boldsymbol{\theta}}_1$ and $\widehat{\boldsymbol{\theta}}_2$ are determined by a least squares criterion as before. In order to ensure identifiability, additional constraints, such as $\sum_{k \in s_r}\widehat{m}_1(x_{1k}) = \sum_{k \in s_r}\widehat{m}_2(x_{2k}) = 0$, are usually imposed. With these constraints, $(\widehat{\boldsymbol{\theta}}_1, \widehat{\boldsymbol{\theta}}_2)$ is simply obtained as a multiple regression coefficient estimator. The imputed value for missing $y_k$ is given by

$y_k^* = \widehat{\alpha} + \widehat{m}_1(x_{1k}) + \widehat{m}_2(x_{2k}).$
In practice, the backfitting algorithm is used to compute $\widehat{m}_1$ and $\widehat{m}_2$ iteratively (hastie_tibshirani_friedman_2011). It is worth pointing out that, when the number of auxiliary variables is large, the algorithm may not converge and additive models tend to break down.
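The backfitting idea can be sketched with numpy. For brevity, a cubic polynomial fit stands in for the B-spline smoothers; the data, component functions and number of iterations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = np.sin(np.pi * x1) + x2**2 + rng.normal(scale=0.05, size=n)

def smooth(x, resid):
    """Univariate smoother stand-in: cubic polynomial least-squares fit."""
    c = np.polyfit(x, resid, deg=3)
    return np.polyval(c, x)

# Backfitting: alternately fit each component to the partial residuals,
# centering each fit so the additive components are identifiable.
alpha = y.mean()
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(20):
    f1 = smooth(x1, y - alpha - f2); f1 -= f1.mean()
    f2 = smooth(x2, y - alpha - f1); f2 -= f2.mean()
fitted = alpha + f1 + f2
```

Imputed values for nonrespondents would be obtained by evaluating the fitted components at their $(x_1, x_2)$ values.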
3.5 Regression trees
Regression trees built through the CART algorithm were initially suggested by breiman1984classification. Tree-based methods are simple to use in practice for both continuous and categorical variables, and are useful for interpretation. They form a class of algorithms which recursively split the $p$-dimensional predictor space, the set of possible values of the $\mathbf{x}$-variables, into distinct and non-overlapping regions. The prediction of $m(\cdot)$ from the model (3) at a point $\mathbf{x}$ is the mean of the $y$-values for the observations falling in the same region as $\mathbf{x}$. When the number of $\mathbf{x}$-variables is not too large, the splitting algorithm is quite fast; otherwise, it may be time-consuming.
In a finite population sampling setting, toth2011building studied the consistency of regression trees. mcconville2019automated used regression trees in a model-assisted framework in the ideal case of 100% response. CreelD.&KrotkiK.2006 used tree-based methods to build imputation classes.
Here, we use regression trees as in CreelD.&KrotkiK.2006. To that end, we slightly adapt the original CART algorithm as well as the estimation procedure of $m(\cdot)$. The CART algorithm recursively searches for the splitting variable and the splitting position (i.e., the coordinates in the predictor space where to split) leading to the greatest possible reduction in the residual mean of squares before and after the split. More specifically, let $A$ be a region (or node) and let $n_A$ be the number of units belonging to $A$. A split in $A$ consists of finding a pair $(j, c)$, where $j$ is the variable coordinate, taking values between $1$ and $p$, and $c$ is the position of the split along the $j$th coordinate, within the limits of $A$. Let $\mathcal{C}_A$ be the set of all possible pairs $(j, c)$ in $A$. The splitting process is performed by searching for the best split $(j^*, c^*)$, in the sense that:
$(j^*, c^*) = \operatorname*{arg\,min}_{(j,c) \in \mathcal{C}_A}\left[\sum_{k:\, \mathbf{x}_k \in A,\, x_{kj} \le c}\left(y_k - \bar{y}_{A_L}\right)^2 + \sum_{k:\, \mathbf{x}_k \in A,\, x_{kj} > c}\left(y_k - \bar{y}_{A_R}\right)^2\right], \qquad (11)$

where $x_{kj}$ is the measure of the $j$th variable for the $k$th individual, and $\bar{y}_{A_L}$ and $\bar{y}_{A_R}$ denote the means of the respondent $y$-values falling in the left and right child regions, $A_L = \{\mathbf{x} \in A:\, x_j \le c\}$ and $A_R = \{\mathbf{x} \in A:\, x_j > c\}$. From (11), we note that the best split is the one that produces the tree with the smallest residual sum of squares (hastie_tibshirani_2015).
The missing $y_k$ is replaced by the weighted average of the respondent $y$-values falling into the same region as $\mathbf{x}_k$:

$y_k^* = \frac{\sum_{j \in s_r} \omega_j\, y_j\, \mathbb{1}\{\mathbf{x}_j \in A(\mathbf{x}_k)\}}{\sum_{j \in s_r} \omega_j\, \mathbb{1}\{\mathbf{x}_j \in A(\mathbf{x}_k)\}}, \qquad (12)$

where $A(\mathbf{x}_k)$ is the region containing the point $\mathbf{x}_k$, and $\mathbb{1}\{\cdot\}$ equals one if its argument holds and zero otherwise. With tree-based methods, the imputed value can also be written as a weighted sum of the respondent $y$-values, as in (9),
where $W_{jk} = \omega_j\, \mathbb{1}\{\mathbf{x}_j \in A(\mathbf{x}_k)\} / \sum_{\ell \in s_r} \omega_\ell\, \mathbb{1}\{\mathbf{x}_\ell \in A(\mathbf{x}_k)\}$. With regression trees, as with all tree-based methods, the non-overlapping regions obtained by means of the CART algorithm depend on the respondent data: the same set of $\mathbf{x}$-variables with a different set of respondents will lead to different non-overlapping regions. The resulting imputed estimator is similar to a post-stratified estimator based on adaptive post-strata.
Regression trees are simple to interpret and often exhibit a small model bias. However, they tend to overfit the data if each region contains too few elements. To overcome this drawback, regression trees can be pruned, meaning that superfluous splits (with respect to a penalized version of (11)) may be removed from the tree. Pruning a regression tree tends to decrease its model variance, but also adds an additional model bias. For more details, see hastie_tibshirani_friedman_2011. The performance of tree-based methods can be improved by using bagging and boosting, which are discussed next.
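The search for the best split in (11) can be sketched as an exhaustive scan over candidate pairs $(j, c)$; the toy data in the usage example are assumptions for illustration:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the split (j, c) minimizing the residual sum of
    squares around the two child means, as in the CART criterion."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:   # candidate positions along x_j
            left = X[:, j] <= c
            sse = ((y[left] - y[left].mean())**2).sum() + \
                  ((y[~left] - y[~left].mean())**2).sum()
            if sse < best[2]:
                best = (j, c, sse)
    return best

# Toy usage: a perfect split exists at x <= 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
j_best, c_best, sse_best = best_split(X, y)
```

A full CART tree applies this search recursively to each resulting child region until a stopping rule is met.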
3.6 Random forests
Random forests (breiman2001random) are an ensemble method which achieves better accuracy than single regression trees by creating a large number of different trees and combining them to produce more accurate predictions than a single model would. Random forests are especially efficient in complex settings, such as small sample sizes, high-dimensional predictor spaces and complex relationships. Random forests are very popular and have been used extensively in various fields since the paper of breiman2001random. Recently, their theoretical properties have been established by scornet2015consistency. In a survey sampling framework, some applications of random forests include tipton2013properties and de2018sample. In the ideal case of 100% response, model-assisted estimators based on random forests have been studied in dagdoug2020model. dagdoug_goga_haziza_RF_NR established the theoretical properties of estimators based on random forests in the context of imputation for item nonresponse and data integration.
There exist a number of random forest algorithms (see biau2016random for a discussion), but the main steps of the algorithm can be described as follows (dagdoug_goga_haziza_RF_NR):
Consider $B$ bootstrap data sets, obtained by selecting with replacement $n_r$ couples $(\mathbf{x}_k, y_k)$ from the respondent data. On each bootstrap data set, fit a regression tree, selecting at random, at each split, $p_0$ of the $p$ predictors as candidates for splitting.
The imputed value for missing $y_k$ is obtained by averaging, at the point $\mathbf{x}_k$, the $B$ regression tree predictions:

$y_k^* = \frac{1}{B}\sum_{b=1}^{B}\widehat{m}^{(b)}(\mathbf{x}_k),$

where $\widehat{m}^{(b)}(\cdot)$ denotes the prediction of the tree fitted on the $b$th bootstrap data set.
Although random forests are based on fully grown trees, the accuracy of the predictions is improved by considering bootstrap samples of units and model aggregation, a procedure called bagging, used in statistical learning for reducing variability. The number $B$ of regression trees should be large enough to ensure a good performance without harming the processing time; see scornet2017tuning. The second improvement brought by random forests is the random selection, at each split, of $p_0$ predictors, achieving decorrelated trees. For regression, the value of $p_0$ is typically chosen as $\lfloor p/3 \rfloor$ (hastie_tibshirani_friedman_2011). In random forest algorithms, a stopping criterion is usually specified so that the algorithm stops once a certain condition (e.g., on the minimum number of units in each terminal node) is met.
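The bagging-plus-random-feature-selection mechanism can be sketched in numpy. For brevity, depth-1 trees (stumps) with a few candidate split positions stand in for fully grown trees, and the data, $B$ and $p_0$ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(0, 1, size=(n, 2))
y = np.where(X[:, 0] > 0.5, 5.0, 1.0) + rng.normal(scale=0.2, size=n)

def fit_stump(X, y, feats):
    """Depth-1 tree on a random subset `feats` of predictors (best SSE split
    among a few quantile-based candidate positions)."""
    best = None
    for j in feats:
        for c in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= c
            sse = y[left].var() * left.sum() + y[~left].var() * (~left).sum()
            if best is None or sse < best[0]:
                best = (sse, j, c, y[left].mean(), y[~left].mean())
    return best[1:]

def predict_stump(stump, X):
    j, c, m_left, m_right = stump
    return np.where(X[:, j] <= c, m_left, m_right)

B, p0 = 50, 1  # number of bootstrap trees; predictors tried per split
stumps = []
for _ in range(B):
    idx = rng.integers(0, n, n)                 # bootstrap couples (x, y)
    feats = rng.choice(2, size=p0, replace=False)
    stumps.append(fit_stump(X[idx], y[idx], feats))

# Aggregated ("forest") prediction: average of the B tree predictions.
pred = np.mean([predict_stump(st, X) for st in stumps], axis=0)
```

Imputation would evaluate `pred` at the nonrespondent $\mathbf{x}$-values, as in the averaging formula above.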
3.7 Least squares tree boosting and other tree-boosting methods
Similarly to bagging, boosting (friedman_2001) is a procedure that can be applied to any statistical learning method to improve the accuracy of model predictions; it is typically used with tree-based methods. While bagging bootstraps sample units to obtain independent sets of units for creating many different predictions, boosting is an iterative method that starts with a weak fit (or weak learner) and improves it at each step of the algorithm by predicting the residuals of the prior models and adding the predictions together to form the final prediction.
To understand how boosting works, consider again a regression tree with $J$ non-overlapping regions $A_1, \ldots, A_J$, expressed as

$T(\mathbf{x}; \Theta) = \sum_{j=1}^{J}\gamma_j\, \mathbb{1}\{\mathbf{x} \in A_j\}, \qquad (13)$
where the parameter $\Theta = \{(A_j, \gamma_j)\}_{j=1}^{J}$ is obtained by minimizing

$\sum_{k \in s_r} L\left(y_k,\, T(\mathbf{x}_k; \Theta)\right), \qquad (14)$
with $L$ denoting a loss function, such as the quadratic loss function. With the latter, given a region $A_j$, estimating the constant $\gamma_j$ is usually straightforward, as $\widehat{\gamma}_j$ is the average of the $y$-values belonging to $A_j$. However, finding the regions and solving (14) in a traditional way may prove challenging and computationally intensive, as it requires optimizing over all the parameters jointly. To overcome this difficulty, one strategy is to use a greedy top-down recursive partitioning algorithm, as presented in Section 3.5. Another strategy consists of splitting the optimization problem (14) into many simple subproblems that can be solved rapidly. Boosting uses this second strategy and assumes that the unknown function $m(\cdot)$ has the following additive form:

$m(\mathbf{x}) = \sum_{b=1}^{B} T(\mathbf{x}; \Theta_b), \qquad (15)$
where the trees $T(\cdot\,; \Theta_b)$, for $b = 1, \ldots, B$, are determined iteratively by using the forward stagewise procedure (hastie_tibshirani_friedman_2011): at each step, a new tree is added to the expansion without modifying the coefficients and parameters of the trees already added. Each added tree, usually referred to as a weak learner, has a small size and slowly improves the estimation of $m(\cdot)$ in the areas where it does not perform well. For the quadratic loss function, the algorithm becomes:
Step 1: Initialize the algorithm with a constant value: $\widehat{m}_0(\mathbf{x}) = \bar{y}_r$, the mean of the respondent $y$-values.
Step 2: For $b = 1$ to $B$:
(a) Compute the residuals $e_k^{(b)} = y_k - \widehat{m}_{b-1}(\mathbf{x}_k)$, $k \in s_r$, of the current model $\widehat{m}_{b-1}$. (b) Fit a regression tree to these residuals, giving the region set $A_{jb}$ and constants $\gamma_{jb}$, $j = 1, \ldots, J$, of the next tree given the current model. This means that we fit a regression tree that best predicts the residual values of the previous model.
(c) Given the regions $A_{jb}$, the optimal constants $\widehat{\gamma}_{jb}$ are found as follows:

$\widehat{\gamma}_{jb} = \operatorname*{arg\,min}_{\gamma}\ \sum_{k \in s_r:\, \mathbf{x}_k \in A_{jb}} L\left(y_k,\, \widehat{m}_{b-1}(\mathbf{x}_k) + \gamma\right),$

and the model is updated as $\widehat{m}_b(\mathbf{x}) = \widehat{m}_{b-1}(\mathbf{x}) + \sum_{j=1}^{J}\widehat{\gamma}_{jb}\,\mathbb{1}\{\mathbf{x} \in A_{jb}\}$.
Step 3: Output $\widehat{m}(\cdot) = \widehat{m}_B(\cdot)$ and get the imputed value $y_k^* = \widehat{m}_B(\mathbf{x}_k)$ for $k \in s_m$.
The number of trees $B$ should not be too large and, for better performance, hastie_tibshirani_friedman_2011 recommend considering the same number of splits at each iteration, i.e., trees of the same size $J$. The value of $J$ reflects the level of dominant interaction between the $\mathbf{x}$-variables. The value $J = 2$ (one split) produces boosted models with only main effects and no interactions, whereas the value $J = 3$ allows for two-variable interactions. Empirical studies suggest that $4 \le J \le 8$ leads to good results. As in ridge regression, shrinkage is used with tree boosting. In this case, Step 2(c) of the above algorithm is replaced by a penalized version:

$\widehat{m}_b(\mathbf{x}) = \widehat{m}_{b-1}(\mathbf{x}) + \nu \sum_{j=1}^{J}\widehat{\gamma}_{jb}\,\mathbb{1}\{\mathbf{x} \in A_{jb}\},$

where the parameter $\nu \in (0, 1]$, called the learning rate, is used to shrink the contribution of each added tree; small values such as $\nu = 0.01$ or $\nu = 0.1$ are commonly used. Both $\nu$ and $B$ control the performance of the model prediction.
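Steps 1-3 with shrinkage can be sketched in numpy, using single-split trees ($J = 2$ stumps) as the weak learners. The data, candidate split positions, $\nu$ and $B$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

def fit_stump(x, y):
    """Single-split tree (J = 2 terminal nodes) minimizing squared loss
    over a grid of quantile-based candidate split positions."""
    best = None
    for c in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left = x <= c
        sse = y[left].var() * left.sum() + y[~left].var() * (~left).sum()
        if best is None or sse < best[0]:
            best = (sse, c, y[left].mean(), y[~left].mean())
    return best[1:]

nu, B = 0.1, 200           # learning rate and number of boosting rounds
F = np.full(n, y.mean())   # Step 1: initialize with a constant fit
stumps = []
for _ in range(B):
    resid = y - F                       # Step 2(a): current residuals
    c, m_left, m_right = fit_stump(x, resid)  # Step 2(b): weak learner
    F += nu * np.where(x <= c, m_left, m_right)  # Step 2(c), shrunken update
    stumps.append((c, m_left, m_right))
```

After the loop, `F` plays the role of $\widehat{m}_B$ evaluated at the respondent points; imputation would evaluate the stored stumps at the nonrespondent $x$-values.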
3.7.1 XGBoost

chen2016 suggested a scalable end-to-end tree boosting system, called XGBoost, which is extremely fast. Consider again a tree with the formal expression given in (13). This tree learning algorithm consists of minimizing the following objective function at the $b$-th iteration:

$\mathcal{L}^{(b)} = \sum_{k \in s_r} L\left(y_k,\, \widehat{m}_{b-1}(\mathbf{x}_k) + T(\mathbf{x}_k; \Theta_b)\right) + \Omega(T), \qquad (16)$

where the penalty function $\Omega(\cdot)$ penalizes large trees in order to avoid overfitting. The search problem is optimized by using a second-order Taylor approximation of $\mathcal{L}^{(b)}$; ignoring the constant terms, the optimization problem reduces to:

$\widetilde{\mathcal{L}}^{(b)} = \sum_{k \in s_r}\left[g_k\, T(\mathbf{x}_k; \Theta_b) + \frac{1}{2}\, h_k\, T(\mathbf{x}_k; \Theta_b)^2\right] + \Omega(T), \qquad (17)$

where $g_k$ and $h_k$ are the first and second order derivatives of the loss function computed at $\widehat{m}_{b-1}(\mathbf{x}_k)$. With the quadratic loss function $L(y, m) = (y - m)^2/2$, $g_k = \widehat{m}_{b-1}(\mathbf{x}_k) - y_k$ and $h_k = 1$. The new objective function (17) is a second-order polynomial with respect to the leaf constants $\gamma_j$, so the optimal constant for region $A_j$ is easily obtained as $\gamma_j^* = -G_j/(H_j + \lambda)$, where $G_j$ and $H_j$ denote the sums of the $g_k$ and $h_k$ over the units falling in $A_j$ and $\lambda$ is a penalization parameter, leading to the optimal value of the objective function $-\frac{1}{2}\sum_{j=1}^{J} G_j^2/(H_j + \lambda)$, up to the complexity penalty. This value is then used as a decision criterion in a greedy top-down recursive algorithm to find the optimal regions of the $b$-th tree to be added.
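The closed-form leaf weight can be checked numerically for one leaf. The toy values of $y$, the current predictions $F$ and the penalty parameter `lam` below are assumptions for illustration:

```python
import numpy as np

# Quadratic loss l(y, F) = (y - F)^2 / 2, so the derivatives at the current
# predictions F_k are g_k = F_k - y_k and h_k = 1.
y = np.array([3.0, 5.0, 4.0])
F = np.array([2.0, 2.0, 2.0])   # current model predictions in one leaf
g, h = F - y, np.ones_like(y)

lam = 1.0                        # assumed L2 penalty on leaf weights
G, H = g.sum(), h.sum()
w_opt = -G / (H + lam)           # optimal leaf weight from the 2nd-order objective
obj = -0.5 * G**2 / (H + lam)    # resulting objective value (constants dropped)
```

Comparing `obj` before and after a candidate split of the leaf is exactly the gain criterion used by the greedy tree-growing algorithm.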
3.7.2 Bayesian additive regression trees (BART)
Bayesian additive regression trees (BART; Chipman2010) are similar to boosting in the sense that the unknown regression function has an additive form, as in (15). While boosting is completely nonparametric, BART makes a Gaussian assumption on the residuals:

$y_k = \sum_{b=1}^{B} T(\mathbf{x}_k; \Theta_b) + \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2),$
where the regression function is assumed to be a sum of $B$ tree functions and $\Theta_b$ is the set of parameter values associated with the terminal nodes of each tree $T(\cdot\,; \Theta_b)$.
BART relies on a probabilistic Bayesian framework. A prior is specified for the parameters of the model, $\Theta_1, \ldots, \Theta_B$ and $\sigma^2$. The prior on the tree structure can be decomposed into three components:
The probability that a node at depth $d$ is nonterminal is given by $\alpha(1 + d)^{-\beta}$, for $\alpha \in (0, 1)$ and $\beta \ge 0$.
The distribution on the splitting variable assignments in each interior node is uniform.
The distribution of the splitting value conditional on the chosen splitting variable is also uniform.
For the prior on the terminal-node constants, a conjugate prior is chosen to make computations simpler; e.g., each constant is assumed to be normally distributed. Similarly, a conjugate prior is chosen for $\sigma^2$; e.g., the inverse chi-square distribution. To generate the posterior distribution, the authors suggest the use of a Gibbs sampler. For general guidelines about the choices of these parameters, see Chipman2010. The imputed value for missing $y_k$ is obtained as with the general boosting algorithm given in Section 3.7.
3.8 Cubist algorithm
Cubist is an updated implementation of the M5 algorithm, introduced by quinlan1992learning and quinlan1993combining. It is an algorithm based on regression trees and linear models, among other ingredients. Initially, Cubist was only available under a commercial license; in 2011, the code was released as open source. The algorithm proceeds as follows (kuhn_johnson, Chap. 8):
Create a regression tree using the following splitting criterion:

$\mathrm{SDR} = \mathrm{sd}(S) - \sum_{i=1}^{P} \frac{n_i}{n_S}\, \mathrm{sd}(S_i),$

where $S$ denotes the observed data in the node; $S_1, \ldots, S_P$ the subsets of $S$ after each of the splits; $\mathrm{sd}(\cdot)$ denotes the empirical standard deviation of the $y$-values; $P$ represents the number of subsets of the data after splitting; $n_S$ denotes the number of units in $S$; and $n_i$ the number of units in $S_i$.
In each node of the tree, a linear model is fitted between the survey variable $y$ and the auxiliary variables that have been used to split the tree in the nodes above.
In each node, a backward elimination procedure is performed using the adjusted error rate (AER) criterion:

$\mathrm{AER} = \frac{n^* + P'}{n^* - P'}\, \sum_{k}\left|y_k - \widehat{y}_k\right|,$

where $n^*$ is the number of units in the node used to build the linear model and $P'$ is the number of parameters in the model. Each model term is dropped in turn and the AER is computed. Terms are dropped from the model as long as the AER decreases.
Once the tree is fully grown, it is pruned by removing unnecessary splits. Starting at the terminal nodes, the AER is computed with and without each node. Whenever the node does not decrease the AER, it is pruned. This process is performed until no more nodes can be removed.
To avoid over-fitting, a smoothing procedure is performed. Let $\widehat{y}_{(c)}$ be the predicted value obtained by fitting the linear model in the $c$th child node and $\widehat{y}_{(p)}$ be the prediction obtained from the direct parent node. These predictions are combined using

$\widehat{y}_{smooth} = a\, \widehat{y}_{(c)} + (1 - a)\, \widehat{y}_{(p)},$

where $a = \left\{\mathrm{Var}(e_{(p)}) - \mathrm{Cov}(e_{(c)}, e_{(p)})\right\} / \mathrm{Var}(e_{(c)} - e_{(p)})$, with $e_{(c)} = y - \widehat{y}_{(c)}$ and $e_{(p)} = y - \widehat{y}_{(p)}$ denoting the residuals of the child and parent models, and $\mathrm{Var}(\cdot)$ and $\mathrm{Cov}(\cdot, \cdot)$ denote the empirical model variance and covariance, respectively.
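The combining weight $a$ can be sketched as the minimum-variance weight for pooling the two models' predictions, computed from their empirical residual variances and covariance; the synthetic residuals in the usage example are illustrative assumptions:

```python
import numpy as np

def smooth_weight(e_child, e_parent):
    """Minimum-variance combining weight a for
    y_smooth = a * y_child + (1 - a) * y_parent,
    computed from the empirical variances/covariance of the residuals."""
    C = np.cov(e_child, e_parent)
    var_c, var_p, cov = C[0, 0], C[1, 1], C[0, 1]
    # Note Var(e_c - e_p) = var_c + var_p - 2 cov, matching the formula above.
    return (var_p - cov) / (var_c + var_p - 2.0 * cov)

rng = np.random.default_rng(6)
e_child = rng.normal(0.0, 0.1, 1000)   # child model: small residuals
e_parent = rng.normal(0.0, 1.0, 1000)  # parent model: large residuals
a = smooth_weight(e_child, e_parent)   # weight close to 1: favour the child
```

When the child model fits much better than its parent, `a` approaches one and the smoothed prediction essentially keeps the child's fit.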
Cubist can be used as an ensemble model. Once the Cubist algorithm is fitted, the subsequent iterations of the algorithm use the previously trained model to define an adjusted response that the next iteration of the algorithm will use:

$y_k^{(m)} = 2\, y_k - \widehat{y}_k^{(m-1)},$

where $y_k^{(m)}$ is the value of the adjusted response for the $m$th iteration of the Cubist algorithm and $\widehat{y}_k^{(m-1)}$ is the prediction from the previous iteration.
The final imputed value for missing $y_k$ is derived using a nearest-neighbour rule: the model prediction at $\mathbf{x}_k$ is combined with the observed $y$-values of the nearest respondents in the training data.