Much of the data used to make informed decisions come from surveys using probability sample designs to select units for data collection. Indeed, governmental statistical agencies use sample data to produce estimates of official statistics consisting of population means and/or totals that are vital to inform policy makers and the public about the current state of the economy, public health, and environment among others topics. The quality of these estimates varies between samples and depends on the amount and quality of data provided by the respondents. This variability may be reduced and the estimator made more efficient if variables in an auxiliary dataset, associated with the study variable, are available for all units in the population.
Consider a finite population total where represents the value of the study variable for the th unit in the population indexed by the set . Assume auxiliary variables, are known for each . After a random sample is selected using a complex sample design, with inclusion probabilities , the total can be estimated using the data from the sampled units. A standard approach is to use the Horvitz-Thompson (HT) estimator
The HT estimator is a design unbiased estimator of
. Under a purely design-based framework, statistical properties, such as bias and variance, are measured with respect to repeated sampling of units from the population and auxiliary data do not impact the form of the estimator but can impactthrough construction of the sample design.
Auxiliary information can be incorporated more directly into the estimation procedure through a model-assisted framework. In this framework, the estimators are constructed to be asymptotically design unbiased and consistent regardless of model misspecification and are typically more efficient than estimators based solely on the sample data. Two common model-assisted approaches are calibration estimators  and generalized regression estimators . The calibration estimator is a weighted sum of the sampled study variable, similar to equation (1), but the weights, , are constructed so that they are close to the HT weights, , and reproduce (i.e. are calibrated to) known totals when applied to auxiliary variables
In contrast, the generalized regression estimator employs a model to predict for each . The predictions are obtained by assuming the finite population values, , are independent realizations from a superpopulation model, denoted by , with mean function where represents the expectation evaluated with respect to Then, is estimated by on the sampled observations. Denoting an estimator of as , the generalized regression estimator for is
The generalized regression estimator under a linear mean function can also be written as a weighted sum of the study variable, :
. where and are the HT estimators of . The sampling weights given by equation (3), , are calibrated to , and therefore, under the linear model, the generalized regression estimator is a calibration estimator.
The auxiliary data available for the target population of a survey often include categorical variables, often with a large number of categories. Complex interactions can exist between variables and the variables are usually not independent. This makes it challenging to model the study variable using standard parametric modeling approaches. For example, the Quarterly Census of Employment and Wages (QCEW) is the sampling frame for many establishment surveys at the US Bureau of Labor Statistics (BLS). QCEW is compiled from administrative records from the state unemployment insurance programs and contains several categorical variables, such as industry code, whether or not the establishment is part of a multi-establishment firm, ownership-type, the state and region in which the establishment resides, and a size class for the establishment.
A variety of linear models can be constructed from these categorical frame variables, ranging from an additive model to a model that includes all possible cross classifications. The generalized regression estimator based on an additive, linear model is equivalent to the generalized raking estimator  and the saturated, linear model results in the post-stratification estimator . While the additive, linear model would produce an estimator that is calibrated on each individual category of the auxiliary variables, it would miss potentially informative interactions. At the other extreme, calibrating on all possible interactions could produce significantly variable weights (sometimes negative weights), and could greatly increase the variance of the estimator. The weights produced by the estimator based on the additive, linear model may also be variable if one of the categorical variables has many categories. We will see that using a large number of variables with a linear model often leads to a poorly fit model, producing an estimator that is less efficient than a purely design-based estimator, such as the HT estimator.
Motivated by the ubiquity of categorical variables and interactions between variables in survey data,  developed regression trees, which are binary trees based on a recursive partitioning algorithm, as a method for analyzing survey data. Instead of including all categories and/or all interactions, regression tree models group the categories of a variable based on how they relate with the study variable and only include variables and interactions associated with the study variable.
Recursive partitioning algorithms sequentially partition the data into two groups based on an auxiliary variable and estimate the mean in each group. The groups are selected by finding the split that provides the greatest reduction in the mean square error at each split. This recursive splitting allows for complex interactive effects to easily be incorporated into the model. For instance, when splitting on a categorical variable, the categories of the variable are divided into the two groups that are most homogeneous with respect to the study variable. The algorithm will only continue splitting if significant reduction in the mean square error is achieved. Therefore, inclusion of a categorical variable does not require a split for each category unlike the linear regression model. This can substantially reduce the model size while still capturing interactions between categorical variables.
, hot deck imputation, and exploratory analysis . We consider finite population estimation and study the behavior of the generalized regression estimator with a regression tree working model. Other nonparametric models, such as local polynomial regression (; ), penalized splines (; 
), and neural networks (), have also been explored for model-assisted estimation. Generally, the model-assisted estimators based on a nonparametric model tend to be more efficient than their parametric counterparts when the relationship between the study variable and auxiliary variables is nonlinear. A survey of these techniques can be found in . Despite the development of these more sophisticated model-assisted estimators, they are rarely ever used by statistical agencies to produce official statistics in anything but the simplest calibration framework where a linear model is assumed. Since a regression tree can be framed as a linear model but overcomes the modeling issues often associated with the auxiliary data available to statistical agencies, we believe it has great potential as an estimation tool for official statistics.
Section 2 contains the development of the model assisted estimator using a design consistent regression tree model and shows how the estimator can be framed as a post-stratification estimator where the post-strata are given by the regression tree. This section also contains the statement of the main result that the model-assisted estimator is asymptotically design unbiased and consistent under general conditions on the sample design and population distribution. Using the BLS Occupational Employment Statistics Survey data, three estimators: the regression tree estimator, the linear regression estimator, which is a generalized regression estimator with a linear working model, and the Horvitz-Thompson estimator, are compared in Section 3. The regression tree estimator dominates the other estimators when much of the variability in the study variable can be explained by key interactions between the auxiliary variables. Section 4 discusses implementation of the estimator and possible extensions. Conditions and proof of the main result are contained in the appendix.
2 Proposed Estimator
Many methods have been considered for estimating . We propose estimating based on a recursive partitioning algorithm applied to . Let represent the partitioning boxes of the sample based on the algorithm and let if and 0, otherwise. For the estimator of is given by
the Horvitz-Thompson estimator of the population size in box . Notice is the Hajek (1971) estimator of the population mean for box .  show that is consistent for the superpopulation mean function, . The necessary regularity conditions for the recursive partitioning algorithm so that the estimator is consistent are given in the Appendix.
The regression tree estimator, is obtained by inserting equation (4) into the generalized regression estimator, given in equation (2). Since is random with respect to the design, is not design unbiased for , but is asymptotically design unbiased and consistent under general conditions on the sample design and the superpopulation model. The conditions and proofs to establish the following asymptotic result are presented in the Appendix. Under assumptions A1- A5, is asymptotically design unbiased in the sense that
and design consistent in the sense that
for all . The proof of the above result follows the method used in , but because of the dependence between the splits in a regression tree and the study variable, consistency of the model assisted estimator using a regression tree model does not follow directly from the consistency of the regression tree model shown in that paper. As a consequence, the proofs require adjusted rates, including a different sampling fraction, to those given in .
From equation (4) we see that for a given partition, , the mean estimator, , can be written as a linear regression estimator with indicator function covariates
Therefore, the regression tree estimator is also a post-stratification estimator where each box, , represents a post-strata. This also implies that the estimator is calibrated to the population total of each box. From this perspective, the tree provides a data-driven mechanism for selecting the post-stratification where none of the resulting post-strata are empty.
3 Simulation Study using Occupational Employment Statistics Survey Data
The BLS Occupational Employment Statistics survey (OES) is a semi-annual establishment survey collecting occupational employment and wage data for workers in non-farm industries. The survey data are used to publish annual occupational employment and wage estimates for over 800 occupations at the metropolitan and micropolitan statistical area (MSA) level and for specific industries at the national level. These estimates are used by the public and policy-makers to identify occupations that are growing or shrinking in the economy and to make labor projection forecasts.
Using a sample design that is stratified by area and industry, a sample of about 200,000 establishments are selected each May and November where establishments are selected with probability proportional to size within each stratum. Sample weights of the selected sample units are adjusted so that the weighted employment totals are calibrated to QCEW employment totals. See  for more details on the sample design of the OES.
The frame of establishments used by the OES is derived from the QCEW, and contains the size class of the MSA where the establishment is located, whether or not the establishment is part of a multi-establishment company, and the establishment’s industry and size-class, for each establishment in the population. Since the number of employees at an establishment with a specific occupation code varies between types of establishment, a model-assisted estimator which accounts for the establishment characteristics should be more efficient than an purely design-based estimator. However, usually there are only a few establishment-level variables or categories related to occupational employment and how these variables relate to employment numbers depends on the occupation. Therefore, the model and auxiliary variables used should vary for each occupation employment estimator.
Consider the regression tree model shown in Figure 1, relating the number of bartenders per establishment to the establishment characteristic variables available in the QCEW data. The first (and by far the most important) split identifies establishments with industry codes 71 (Arts, Entertainment, and Recreation) which include casinos, ski-resorts and amusement parks, and 72 (Accommodation and Food Services) which includes bars and restaurants, as having a higher average number of employees with the occupation of bartender than establishments in other industries. This is not surprising for this occupation, but you would not expect the same split for other occupations.
Figure 2 shows the tree model for elementary school teachers. The first split separates industry code 61 (Educational Services), which includes schools, from all other industries. Again, this is as expected, as is the fact that larger schools have a higher number of elementary school teachers than smaller schools.
While bartenders and elementary school teachers are predominately found in different industry sectors, they are both service occupations, so they are both found, to a much lesser extent, in some of the same service sector industry codes. Both are found, for example, in industry code 56 (Administrative and Support Services) and industry code 81 (Other Services). Industry code 56 includes convention centers which employ bartenders and also includes correctional and support facilities which occasionally employ elementary school teachers, while industry code 81 includes social clubs with bartenders and daycare centers which sometimes have elementary school teachers. Most of the following splits are on size of the establishments, where again (not surprisingly) establishments with more employees are expected to have more employees with the respective occupational code than establishments with fewer employees in the same industry group.
Treating the OES data as the target population, we conducted a simulation study to compare the performance of the regression tree estimator to a linear regression estimator and to a Horvitz-Thompson (HT) estimator. For each estimator, we estimated employment totals for a variety of occupations. The test was done over 1000 repeated samples of size from the 187,115 establishments in the OES dataset using a stratified design where establishments in larger size classes were sampled with higher probability than establishments in smaller size classes, resulting in an unequal probability sample design.
Since the regression tree model conducts variable selection at each split, it provides an automated method for choosing which variable and category combinations form the post-strata on which the estimator is calibrated. For a comparable contender, we also automated the calibration selection for the linear model by using a forward step-wise procedure to select variables for inclusion in the model. However, since the two variable selection procedures differ, the resulting calibrations also differ. The stepwise linear estimator is calibrated to all categories of the variables selected by the procedure while the regression tree estimator is calibrated to the sub-groups represented by the tree’s resulting partition. For example, an regression tree estimator based on the tree in Figure 2 would be calibrated to the total number of establishments in each of the 10 sub-groups defined by each end-node. A linear regression estimator with the same two variables (industry, size) would be calibrated to the totals of each of the 30 categories, but no interactions between categories. We also considered a stepwise linear estimator based on all possible interactions but the estimator performed very poorly and therefore is not included in the results.
For each sample, the models were fit to the sample data using weights equal to the inverse of their inclusion probabilities. We used the forward-stepwise variable selection procedure in R  for the linear regression models and the R package rpms  to build the regression trees. An estimate of the total employment for each occupation tested was produced for each estimator.
The empirical mean squared error (MSE) of the 1000 total estimates by sample size for each of the three estimators is shown in Figure 3 for four of the occupations that we tested: elementary school teachers, waiters and waitresses, bartenders, and sales-managers. The top two graphs in Figure 3 contain the test results for elementary school teachers and for waitstaff, respectively. Both graphs show a large increase in efficiency using the regression tree estimator (dotted-line) compared to using the linear regression estimator (dashed-line) and the HT estimator (solid line). This is in contrast to the results for bartender and sales-manager occupation codes, which are shown in the bottom two graphs in Figure 3, respectively. These graphs show no real gain in MSE using either of the model-assisted estimators compared to the HT estimator. It is worth noting that the top two occupation codes are large occupation classes with a total employment in our OES finite population of 205,352 for elementary school teachers and 213,206 for waitstaff, while the occupation codes bartender and sales-manager have much smaller total employment of 35,811 and 34,585 respectively.
While for all four occupation codes the regression tree estimator is either more efficient or has about the same efficiency as the HT estimator, the linear regression estimator is never more efficient than the HT estimate and for elementary school teachers and sales-managers, it is less efficient. The results for these four occupation codes are consistent with the results obtained for all the occupation codes tested, in that, the regression tree estimator was either more efficient or of similar efficiency as the HT estimator with efficiency gains occurring in larger occupations, while the linear regression estimator had either similar efficiency as the HT estimator or was occasionally much less efficient.
In this article we have presented the regression tree estimator for a finite population total and have developed the design consistency of the estimator under standard assumptions on the sample design and finite population. Following a model-assisted framework, the estimator is a generalized regression estimator where the working model is a regression tree. An implementation of the estimator will be available in an R package on The Comprehensive R Archive Network (CRAN).
The regression tree estimator is also a post-stratification estimator since a regression tree can be written as a linear model where each variable is an indicator function for the sequential splits to an end node of the tree. Therefore, the regression tree can be viewed as a data-driven technique for choosing the appropriate set of post-strata. These post-strata have the ability to capture complex interactions between the variables and as shown in the simulations, they can increase the efficiency of the model-assisted estimator. Additionally, the estimator is calibrated to the population totals of each post-strata.
Since the auxiliary data available for establishment surveys are often mostly categorical, we have focused on useful features of how regression trees handle categorical data, such as, collapsing categories into homogeneous sub-groups and capturing predictive interactions between specific categories. However, regression trees are also useful for quantitative auxiliary data, especially if the data have a non-linear relationship with the study variable.
There are a couple of extensions of the regression tree estimator to be considered in future work. If the study variable relates linearly with an auxiliary variable, then a simple linear regression model instead of a simple mean model could be fit to each of the end nodes. Another option would be to use a random forest, which is a collection of regression trees fit on subsets of the variables and/or data, in the generalized regression estimator since a random forest tends to have greater predictive power than a single regression tree.
Appendix: Assumptions and Proof of Theorem 2
Rate conditions: Define two functions of that will be used as rates of convergence. Let and be given functions bounded above 0 for all satisfying:
Assume we have a sequence of nested populations, . Let the sample, , be selected according to a sample design with sample size . Denote the first-order inclusion probabilities by and the second-order inclusion probabilities by . Let be an indicator function representing whether unit is or is not in the sample. Let and denote the expectation and the variance evaluated with respect to the sample design, We suppress the subscript in as well as in, , and for simplicity of notation. Assume:
For any set , with -probability 1.
For , where and for , where
The sampling fraction is given by Let and .
[Proof of Theorem 2]
By Markov’s Inequality, we can achieve asymptotic design unbiasedness and design consistency by showing that
Without loss of generality, assume the support of is . First, we define a population mean estimator of using the sample-based partition, ,
Proposition 1 of  prove the mean square error consistency of the sample estimator, to the superpopulation mean, . Within the proof of the result, mean square error consistency of the sample estimator to the finite population estimator using the sample-based partition is provided. In particular, under assumptions A1 -A5, Proposition 1 of  implies that .
By adding and subtracting
and applying the Triangle Inequality, we get
For the first term, we have
Applying the law of total expectation to the first term gives
- Bechtel et al.  Bechtel, L. T., Morris, D. S. and Thompson, K. J. (2012) Using classification trees to recommend hot deck imputation methods: A case study.
- BLS  BLS (2008) Bls handbook of methods, chapter 3, "occupational employment statistics". URL: https://www.bls.gov/opub/hom/pdf/homch3.pdf.
- Breidt et al.  Breidt, F. J., Claeskens, G. and Opsomer, J. D. (2005) Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92, 831–846.
- Breidt and Opsomer  Breidt, F. J. and Opsomer, J. D. (2000) Local polynomial regression estimators in survey sampling. Annals of Statistics, 28, 1026–1053.
- Breidt and Opsomer  — (2017) Model-assisted survey estimation with modern prediction techniques. Statistical Science, To Appear.
- Breiman  Breiman, L. (2001) Random forests. Machine learning, 45, 5–32.
- Deville and Särndal  Deville, J.-C. and Särndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382.
- Deville et al.  Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American statistical Association, 88, 1013–1020.
- Ellis and Thompson  Ellis, Y. and Thompson, K. J. (2015) Exploratory data analysis of economic census products: Methods and results. In Proceedings of the Section on Survey Methods.
- Hájek  Hájek, J. (1971) Comment on a paper by D. Basu. In Foundations of Statistical Inference (eds. V. P. Godambe and D. A. Sprott), 236. Toronto: Holt, Rinehart and Winston.
- Horvitz and Thompson  Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
- Lohr et al.  Lohr, S., Hsu, V. and Montaquila, J. (2015) Using classification and regression trees to model survey nonresponse. In Proceedings of the Survey Research Methods Section, 2071–2085.
- McConville and Breidt  McConville, K. S. and Breidt, F. J. (2013) Survey design asymptotics for the model-assisted penalised spline regression estimator. Journal of Nonparametric Statistics, 25, 745–763.
- Montanari and Ranalli  Montanari, G. E. and Ranalli, M. G. (2005) Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100, 1429–1442.
- Morgan and Sonquist  Morgan, J. N. and Sonquist, J. A. (1963) Problems in the analysis of survey data, and a proposal. Journal of the American statistical association, 58, 415–434.
- Phipps and Toth  Phipps, P. and Toth, D. (2012) Analyzing establishment nonresponse using an interpretable regression tree model with linked administrative data. The Annals of Applied Statistics, 772–794.
- R Core Team  R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
- Särndal et al.  Särndal, C. E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer-Verlag.
- Smith  Smith, T. M. F. (1991) Post-stratification. The Statistician, 315–323.
- Toth  Toth, D. (2017) rpms: Recursive Partitioning for Modeling Survey Data. R package version 0.2.1.
- Toth and Eltinge  Toth, D. and Eltinge, J. (2011) Building consistent regression trees from complex sample data. Journal of the American Statistical Association, 106, 1626–1636.