Automated Selection of Post-Strata using a Model-Assisted Regression Tree Estimator

by   Kelly S. McConville, et al.
Swarthmore College

Auxiliary information can increase the efficiency of survey estimators through an assisting model when the model captures some of the relationship between the auxiliary data and the study variables. Despite their superior properties, model-assisted estimators are rarely used in anything but their simplest form by statistical agencies to produce official statistics. This is due to the fact that the more complicated models that have been used in model-assisted estimation are often ill suited to the available auxiliary data. Under a model-assisted framework, we propose a regression tree estimator for a finite population total. Regression tree models are adept at handling the type of auxiliary data usually available in the sampling frame and provide a model that is easy to explain and justify. The estimator can be viewed as a post-stratification estimator where the post-strata are automatically selected by the recursive partitioning algorithm of the regression tree. We establish consistency of the regression tree estimator and compare its performance to other survey estimators using the US Bureau of Labor Statistics Occupational Employment Statistics Survey.



There are no comments yet.


page 1


Model-assisted estimation in high-dimensional settings for survey data

Model-assisted estimators have attracted a lot of attention in the last ...

Generalised regression estimation given imperfectly matched auxiliary data

Generalised regression estimation allows one to make use of available au...

An approximate Bayesian approach to regression estimation with many auxiliary variables

Model-assisted estimation with complex survey data is an important pract...

Model-assisted estimation through random forests in finite population sampling

Surveys are used to collect data on a subset of a finite population. Mos...

Design-unbiased statistical learning in survey sampling

Design-consistent model-assisted estimation has become the standard prac...

An Evaluation of Design-based Properties of Different Composite Estimators

For the last decades, the US Census Bureau has been using the AK composi...

Double-calibration estimators accounting for under-coverage and nonresponse in socio-economic surveys

Under-coverage and nonresponse problems are jointly present in most soci...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Much of the data used to make informed decisions come from surveys using probability sample designs to select units for data collection. Indeed, governmental statistical agencies use sample data to produce estimates of official statistics consisting of population means and/or totals that are vital to inform policy makers and the public about the current state of the economy, public health, and environment among others topics. The quality of these estimates varies between samples and depends on the amount and quality of data provided by the respondents. This variability may be reduced and the estimator made more efficient if variables in an auxiliary dataset, associated with the study variable, are available for all units in the population.

Consider a finite population total where represents the value of the study variable for the th unit in the population indexed by the set . Assume auxiliary variables, are known for each . After a random sample is selected using a complex sample design, with inclusion probabilities , the total can be estimated using the data from the sampled units. A standard approach is to use the Horvitz-Thompson (HT) estimator


[11], where

The HT estimator is a design unbiased estimator of

. Under a purely design-based framework, statistical properties, such as bias and variance, are measured with respect to repeated sampling of units from the population and auxiliary data do not impact the form of the estimator but can impact

through construction of the sample design.

Auxiliary information can be incorporated more directly into the estimation procedure through a model-assisted framework. In this framework, the estimators are constructed to be asymptotically design unbiased and consistent regardless of model misspecification and are typically more efficient than estimators based solely on the sample data. Two common model-assisted approaches are calibration estimators [7] and generalized regression estimators [18]. The calibration estimator is a weighted sum of the sampled study variable, similar to equation (1), but the weights, , are constructed so that they are close to the HT weights, , and reproduce (i.e. are calibrated to) known totals when applied to auxiliary variables

In contrast, the generalized regression estimator employs a model to predict for each . The predictions are obtained by assuming the finite population values, , are independent realizations from a superpopulation model, denoted by , with mean function where represents the expectation evaluated with respect to Then, is estimated by on the sampled observations. Denoting an estimator of as , the generalized regression estimator for is


The generalized regression estimator under a linear mean function can also be written as a weighted sum of the study variable, :


[18]. where and are the HT estimators of . The sampling weights given by equation (3), , are calibrated to , and therefore, under the linear model, the generalized regression estimator is a calibration estimator.

The auxiliary data available for the target population of a survey often include categorical variables, often with a large number of categories. Complex interactions can exist between variables and the variables are usually not independent. This makes it challenging to model the study variable using standard parametric modeling approaches. For example, the Quarterly Census of Employment and Wages (QCEW) is the sampling frame for many establishment surveys at the US Bureau of Labor Statistics (BLS). QCEW is compiled from administrative records from the state unemployment insurance programs and contains several categorical variables, such as industry code, whether or not the establishment is part of a multi-establishment firm, ownership-type, the state and region in which the establishment resides, and a size class for the establishment.

A variety of linear models can be constructed from these categorical frame variables, ranging from an additive model to a model that includes all possible cross classifications. The generalized regression estimator based on an additive, linear model is equivalent to the generalized raking estimator [8] and the saturated, linear model results in the post-stratification estimator [19]. While the additive, linear model would produce an estimator that is calibrated on each individual category of the auxiliary variables, it would miss potentially informative interactions. At the other extreme, calibrating on all possible interactions could produce significantly variable weights (sometimes negative weights), and could greatly increase the variance of the estimator. The weights produced by the estimator based on the additive, linear model may also be variable if one of the categorical variables has many categories. We will see that using a large number of variables with a linear model often leads to a poorly fit model, producing an estimator that is less efficient than a purely design-based estimator, such as the HT estimator.

Motivated by the ubiquity of categorical variables and interactions between variables in survey data, [15] developed regression trees, which are binary trees based on a recursive partitioning algorithm, as a method for analyzing survey data. Instead of including all categories and/or all interactions, regression tree models group the categories of a variable based on how they relate with the study variable and only include variables and interactions associated with the study variable.

Recursive partitioning algorithms sequentially partition the data into two groups based on an auxiliary variable and estimate the mean in each group. The groups are selected by finding the split that provides the greatest reduction in the mean square error at each split. This recursive splitting allows for complex interactive effects to easily be incorporated into the model. For instance, when splitting on a categorical variable, the categories of the variable are divided into the two groups that are most homogeneous with respect to the study variable. The algorithm will only continue splitting if significant reduction in the mean square error is achieved. Therefore, inclusion of a categorical variable does not require a split for each category unlike the linear regression model. This can substantially reduce the model size while still capturing interactions between categorical variables.

Regression trees have been applied to survey data in a variety of contexts such as analyzing nonresponse [16], nonresponse adjustment, [12]

, hot deck imputation

[1], and exploratory analysis [9]. We consider finite population estimation and study the behavior of the generalized regression estimator with a regression tree working model. Other nonparametric models, such as local polynomial regression ([4]; [14]), penalized splines ([3]; [13]

), and neural networks (

[14]), have also been explored for model-assisted estimation. Generally, the model-assisted estimators based on a nonparametric model tend to be more efficient than their parametric counterparts when the relationship between the study variable and auxiliary variables is nonlinear. A survey of these techniques can be found in [5]. Despite the development of these more sophisticated model-assisted estimators, they are rarely ever used by statistical agencies to produce official statistics in anything but the simplest calibration framework where a linear model is assumed. Since a regression tree can be framed as a linear model but overcomes the modeling issues often associated with the auxiliary data available to statistical agencies, we believe it has great potential as an estimation tool for official statistics.

Section 2 contains the development of the model assisted estimator using a design consistent regression tree model and shows how the estimator can be framed as a post-stratification estimator where the post-strata are given by the regression tree. This section also contains the statement of the main result that the model-assisted estimator is asymptotically design unbiased and consistent under general conditions on the sample design and population distribution. Using the BLS Occupational Employment Statistics Survey data, three estimators: the regression tree estimator, the linear regression estimator, which is a generalized regression estimator with a linear working model, and the Horvitz-Thompson estimator, are compared in Section 3. The regression tree estimator dominates the other estimators when much of the variability in the study variable can be explained by key interactions between the auxiliary variables. Section 4 discusses implementation of the estimator and possible extensions. Conditions and proof of the main result are contained in the appendix.

2 Proposed Estimator

Many methods have been considered for estimating . We propose estimating based on a recursive partitioning algorithm applied to . Let represent the partitioning boxes of the sample based on the algorithm and let if and 0, otherwise. For the estimator of is given by



the Horvitz-Thompson estimator of the population size in box . Notice is the Hajek (1971) estimator of the population mean for box . [21] show that is consistent for the superpopulation mean function, . The necessary regularity conditions for the recursive partitioning algorithm so that the estimator is consistent are given in the Appendix.

The regression tree estimator, is obtained by inserting equation (4) into the generalized regression estimator, given in equation (2). Since is random with respect to the design, is not design unbiased for , but is asymptotically design unbiased and consistent under general conditions on the sample design and the superpopulation model. The conditions and proofs to establish the following asymptotic result are presented in the Appendix. Under assumptions A1- A5, is asymptotically design unbiased in the sense that

and design consistent in the sense that

for all . The proof of the above result follows the method used in [21], but because of the dependence between the splits in a regression tree and the study variable, consistency of the model assisted estimator using a regression tree model does not follow directly from the consistency of the regression tree model shown in that paper. As a consequence, the proofs require adjusted rates, including a different sampling fraction, to those given in [21].

2.1 Post-Stratification

From equation (4) we see that for a given partition, , the mean estimator, , can be written as a linear regression estimator with indicator function covariates


Therefore, the regression tree estimator is also a post-stratification estimator where each box, , represents a post-strata. This also implies that the estimator is calibrated to the population total of each box. From this perspective, the tree provides a data-driven mechanism for selecting the post-stratification where none of the resulting post-strata are empty.

3 Simulation Study using Occupational Employment Statistics Survey Data

The BLS Occupational Employment Statistics survey (OES) is a semi-annual establishment survey collecting occupational employment and wage data for workers in non-farm industries. The survey data are used to publish annual occupational employment and wage estimates for over 800 occupations at the metropolitan and micropolitan statistical area (MSA) level and for specific industries at the national level. These estimates are used by the public and policy-makers to identify occupations that are growing or shrinking in the economy and to make labor projection forecasts.

Using a sample design that is stratified by area and industry, a sample of about 200,000 establishments are selected each May and November where establishments are selected with probability proportional to size within each stratum. Sample weights of the selected sample units are adjusted so that the weighted employment totals are calibrated to QCEW employment totals. See [2] for more details on the sample design of the OES.

The frame of establishments used by the OES is derived from the QCEW, and contains the size class of the MSA where the establishment is located, whether or not the establishment is part of a multi-establishment company, and the establishment’s industry and size-class, for each establishment in the population. Since the number of employees at an establishment with a specific occupation code varies between types of establishment, a model-assisted estimator which accounts for the establishment characteristics should be more efficient than an purely design-based estimator. However, usually there are only a few establishment-level variables or categories related to occupational employment and how these variables relate to employment numbers depends on the occupation. Therefore, the model and auxiliary variables used should vary for each occupation employment estimator.

Consider the regression tree model shown in Figure 1, relating the number of bartenders per establishment to the establishment characteristic variables available in the QCEW data. The first (and by far the most important) split identifies establishments with industry codes 71 (Arts, Entertainment, and Recreation) which include casinos, ski-resorts and amusement parks, and 72 (Accommodation and Food Services) which includes bars and restaurants, as having a higher average number of employees with the occupation of bartender than establishments in other industries. This is not surprising for this occupation, but you would not expect the same split for other occupations.

Figure 1: Tree model on full OES data set for predicting the number of bartenders per establishment. The survey-weighted average number of bartenders is given at each end node.

Figure 2 shows the tree model for elementary school teachers. The first split separates industry code 61 (Educational Services), which includes schools, from all other industries. Again, this is as expected, as is the fact that larger schools have a higher number of elementary school teachers than smaller schools.

Figure 2: Tree model on full OES data set for predicting the number of elementary school teachers (excluding special education) per establishment. The survey-weighted average number of elementary school teachers is given at each end node.

While bartenders and elementary school teachers are predominately found in different industry sectors, they are both service occupations, so they are both found, to a much lesser extent, in some of the same service sector industry codes. Both are found, for example, in industry code 56 (Administrative and Support Services) and industry code 81 (Other Services). Industry code 56 includes convention centers which employ bartenders and also includes correctional and support facilities which occasionally employ elementary school teachers, while industry code 81 includes social clubs with bartenders and daycare centers which sometimes have elementary school teachers. Most of the following splits are on size of the establishments, where again (not surprisingly) establishments with more employees are expected to have more employees with the respective occupational code than establishments with fewer employees in the same industry group.

Treating the OES data as the target population, we conducted a simulation study to compare the performance of the regression tree estimator to a linear regression estimator and to a Horvitz-Thompson (HT) estimator. For each estimator, we estimated employment totals for a variety of occupations. The test was done over 1000 repeated samples of size from the 187,115 establishments in the OES dataset using a stratified design where establishments in larger size classes were sampled with higher probability than establishments in smaller size classes, resulting in an unequal probability sample design.

Since the regression tree model conducts variable selection at each split, it provides an automated method for choosing which variable and category combinations form the post-strata on which the estimator is calibrated. For a comparable contender, we also automated the calibration selection for the linear model by using a forward step-wise procedure to select variables for inclusion in the model. However, since the two variable selection procedures differ, the resulting calibrations also differ. The stepwise linear estimator is calibrated to all categories of the variables selected by the procedure while the regression tree estimator is calibrated to the sub-groups represented by the tree’s resulting partition. For example, an regression tree estimator based on the tree in Figure 2 would be calibrated to the total number of establishments in each of the 10 sub-groups defined by each end-node. A linear regression estimator with the same two variables (industry, size) would be calibrated to the totals of each of the 30 categories, but no interactions between categories. We also considered a stepwise linear estimator based on all possible interactions but the estimator performed very poorly and therefore is not included in the results.

For each sample, the models were fit to the sample data using weights equal to the inverse of their inclusion probabilities. We used the forward-stepwise variable selection procedure in R [17] for the linear regression models and the R package rpms [20] to build the regression trees. An estimate of the total employment for each occupation tested was produced for each estimator.

Figure 3: Simulation results comparing the Horvitz-Thompson estimator (green solid-line), linear regression estimator (orange dashed-line) and the regression tree estimator (blue dotted-line) of employment total for the occupations: (top to bottom) elementary school teachers, waiters and waitresses, bartenders, and sales-managers. The graphs display the mean squared error of the estimates for the 1,000 repeated samples by sample size.

The empirical mean squared error (MSE) of the 1000 total estimates by sample size for each of the three estimators is shown in Figure 3 for four of the occupations that we tested: elementary school teachers, waiters and waitresses, bartenders, and sales-managers. The top two graphs in Figure 3 contain the test results for elementary school teachers and for waitstaff, respectively. Both graphs show a large increase in efficiency using the regression tree estimator (dotted-line) compared to using the linear regression estimator (dashed-line) and the HT estimator (solid line). This is in contrast to the results for bartender and sales-manager occupation codes, which are shown in the bottom two graphs in Figure 3, respectively. These graphs show no real gain in MSE using either of the model-assisted estimators compared to the HT estimator. It is worth noting that the top two occupation codes are large occupation classes with a total employment in our OES finite population of 205,352 for elementary school teachers and 213,206 for waitstaff, while the occupation codes bartender and sales-manager have much smaller total employment of 35,811 and 34,585 respectively.

While for all four occupation codes the regression tree estimator is either more efficient or has about the same efficiency as the HT estimator, the linear regression estimator is never more efficient than the HT estimate and for elementary school teachers and sales-managers, it is less efficient. The results for these four occupation codes are consistent with the results obtained for all the occupation codes tested, in that, the regression tree estimator was either more efficient or of similar efficiency as the HT estimator with efficiency gains occurring in larger occupations, while the linear regression estimator had either similar efficiency as the HT estimator or was occasionally much less efficient.

4 Conclusions

In this article we have presented the regression tree estimator for a finite population total and have developed the design consistency of the estimator under standard assumptions on the sample design and finite population. Following a model-assisted framework, the estimator is a generalized regression estimator where the working model is a regression tree. An implementation of the estimator will be available in an R package on The Comprehensive R Archive Network (CRAN).

The regression tree estimator is also a post-stratification estimator since a regression tree can be written as a linear model where each variable is an indicator function for the sequential splits to an end node of the tree. Therefore, the regression tree can be viewed as a data-driven technique for choosing the appropriate set of post-strata. These post-strata have the ability to capture complex interactions between the variables and as shown in the simulations, they can increase the efficiency of the model-assisted estimator. Additionally, the estimator is calibrated to the population totals of each post-strata.

Since the auxiliary data available for establishment surveys are often mostly categorical, we have focused on useful features of how regression trees handle categorical data, such as, collapsing categories into homogeneous sub-groups and capturing predictive interactions between specific categories. However, regression trees are also useful for quantitative auxiliary data, especially if the data have a non-linear relationship with the study variable.

There are a couple of extensions of the regression tree estimator to be considered in future work. If the study variable relates linearly with an auxiliary variable, then a simple linear regression model instead of a simple mean model could be fit to each of the end nodes. Another option would be to use a random forest

[6], which is a collection of regression trees fit on subsets of the variables and/or data, in the generalized regression estimator since a random forest tends to have greater predictive power than a single regression tree.

Appendix: Assumptions and Proof of Theorem 2

Rate conditions: Define two functions of that will be used as rates of convergence. Let and be given functions bounded above 0 for all satisfying:

  1. :

  2. :

  3. :

Assume we have a sequence of nested populations, . Let the sample, , be selected according to a sample design with sample size . Denote the first-order inclusion probabilities by and the second-order inclusion probabilities by . Let be an indicator function representing whether unit is or is not in the sample. Let and denote the expectation and the variance evaluated with respect to the sample design, We suppress the subscript in as well as in, , and for simplicity of notation. Assume:

  1. For any set , with -probability 1.

  2. For , where and for , where

  3. The sampling fraction is given by Let and .

[Proof of Theorem 2]

By Markov’s Inequality, we can achieve asymptotic design unbiasedness and design consistency by showing that

Without loss of generality, assume the support of is . First, we define a population mean estimator of using the sample-based partition, ,


Proposition 1 of [21] prove the mean square error consistency of the sample estimator, to the superpopulation mean, . Within the proof of the result, mean square error consistency of the sample estimator to the finite population estimator using the sample-based partition is provided. In particular, under assumptions A1 -A5, Proposition 1 of [21] implies that .

By adding and subtracting

and applying the Triangle Inequality, we get

For the first term, we have

Applying the law of total expectation to the first term gives

by conditions (A1), (A2), (A4) and (A5) and rate condition (R3). Similarly, the second term becomes

by conditions (A1) – (A5) and rate condition (R3).

By rewriting the mean functions and applying the Cauchy Schwarz Inequality to , we obtain

by Proposition 1 in Toth and Eltinge [21] and by Conditions (A2) and (A3).


  • Bechtel et al. [2012] Bechtel, L. T., Morris, D. S. and Thompson, K. J. (2012) Using classification trees to recommend hot deck imputation methods: A case study.
  • BLS [2008] BLS (2008) Bls handbook of methods, chapter 3, "occupational employment statistics". URL:
  • Breidt et al. [2005] Breidt, F. J., Claeskens, G. and Opsomer, J. D. (2005) Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92, 831–846.
  • Breidt and Opsomer [2000] Breidt, F. J. and Opsomer, J. D. (2000) Local polynomial regression estimators in survey sampling. Annals of Statistics, 28, 1026–1053.
  • Breidt and Opsomer [2017] — (2017) Model-assisted survey estimation with modern prediction techniques. Statistical Science, To Appear.
  • Breiman [2001] Breiman, L. (2001) Random forests. Machine learning, 45, 5–32.
  • Deville and Särndal [1992] Deville, J.-C. and Särndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382.
  • Deville et al. [1993] Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American statistical Association, 88, 1013–1020.
  • Ellis and Thompson [2015] Ellis, Y. and Thompson, K. J. (2015) Exploratory data analysis of economic census products: Methods and results. In Proceedings of the Section on Survey Methods.
  • Hájek [1971] Hájek, J. (1971) Comment on a paper by D. Basu. In Foundations of Statistical Inference (eds. V. P. Godambe and D. A. Sprott), 236. Toronto: Holt, Rinehart and Winston.
  • Horvitz and Thompson [1952] Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
  • Lohr et al. [2015] Lohr, S., Hsu, V. and Montaquila, J. (2015) Using classification and regression trees to model survey nonresponse. In Proceedings of the Survey Research Methods Section, 2071–2085.
  • McConville and Breidt [2013] McConville, K. S. and Breidt, F. J. (2013) Survey design asymptotics for the model-assisted penalised spline regression estimator. Journal of Nonparametric Statistics, 25, 745–763.
  • Montanari and Ranalli [2005] Montanari, G. E. and Ranalli, M. G. (2005) Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100, 1429–1442.
  • Morgan and Sonquist [1963] Morgan, J. N. and Sonquist, J. A. (1963) Problems in the analysis of survey data, and a proposal. Journal of the American statistical association, 58, 415–434.
  • Phipps and Toth [2012] Phipps, P. and Toth, D. (2012) Analyzing establishment nonresponse using an interpretable regression tree model with linked administrative data. The Annals of Applied Statistics, 772–794.
  • R Core Team [2016] R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL:
  • Särndal et al. [1992] Särndal, C. E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer-Verlag.
  • Smith [1991] Smith, T. M. F. (1991) Post-stratification. The Statistician, 315–323.
  • Toth [2017] Toth, D. (2017) rpms: Recursive Partitioning for Modeling Survey Data. R package version 0.2.1.
  • Toth and Eltinge [2011] Toth, D. and Eltinge, J. (2011) Building consistent regression trees from complex sample data. Journal of the American Statistical Association, 106, 1626–1636.