Having accurate and detailed information on social and economic conditions, summarised by appropriate indicators, is imperative for the efficient implementation of policies. The term detailed is used to signify information that extends beyond aggregate levels into highly disaggregated geographical and other domains (e.g. demographic groups). The term accurate refers to information that is estimated with appropriate level of precision and is comparable over space and time. Simply analysing data from national sample surveys is not enough for achieving the dual target of accurate and detailed information. This is mainly due to the reduction of sample sizes as the level of required detail increases. The achievement of this dual target demands appropriate model-based methodology collectively referred to as Small Area Estimation (SAE).
SAE methods can be broadly divided in two classes: first, area-level models (Fay_Heriot1979) assume that only aggregated data for the survey and for the auxiliary information is available. Second, unit-level models (Battese_etal1988) - further labelled as BHF - require access to the survey and to the auxiliary information on micro-level. A versatile extension of the BHF model is the EBP-approach by Molina_rao2010. The EBP is capable to estimate area-level means as well as other linear and non-linear indicators. Both classes (area-level and unit-level models) are predominantly regression-based models, where the hierarchical structure of observations is modelled by random effects. These linear mixed models (LMM) assume normality of random effects and error terms. Focusing on social and economic inequality data, the required assumptions for LMMs hardly meet empirical evidence. JiangRao2020 remind, that optimality results and predictive performance in model-based SAE are inevitably connected to the validity of model assumptions. Without theoretical and practical considerations regarding improperly met assumptions, estimates are potentially biased and mean squared error (MSE) estimates unreliable.
One strategy to prevent model-failure, is the assurance of normality by transforming the dependent variable (sugasawa2017transforming; Rojas_etal2019). For instance, Rojas_etal2019 generalize the EBP with a data-driven transformation on the dependent variable, such that normality assumptions can be met in transformed settings. Further details on how to obtain the most-likely transformation parameter that improves the performance of unit-level models are available in Rojas_etal2019 and sugasawa2019adaptively or from a more applied perspective in Tzavidis_etal2018. Apart from transformation strategies, another alternative is the use of models with less restrictive (parametric) assumptions to avoid model-failure. For instance, DialloRao2018 and Graf_etal2019 formulate the EBP under more flexible distributional assumptions. Semi- or non-parametric approaches for the estimations of area-level means were investigated among others by Opsomeretal2008. Opsomeretal2008 use penalized splines regression, treating the coefficients of spline components as additional random effects within the LMM setting.
A distinct methodological option to avoid the parametric assumptions of LMMs are the class of machine learning methods. These methods are not only limited to parametric models and ‘learn’ predictive relations from data, including higher order interactions between covariates, without explicit model assumptions(Hastie_etal2009; Varian2014). Among the broad class of machine learning methods, we focus on tree-based models and particularly on random forests (Breiman2001)
because they exhibit excellent predictive performance in the presence of outliers and implicitly solve problems of model-selection(Biau_Scornet2016). In general, the predictive perspective of (tree-based) machine learning methods conceptually transfers straight forward to the methodology of unit-level SAE-models: survey data is used to construct a model with predictive covariates. Subsequently, auxiliary information from a supplementary data source (usually census, register or administrative data) is utilized to obtain predictions over sampled and non-sampled areas. From a machine learning perspective, the survey data serves as an implicit training set to construct a proper model, while supplementary data is used to predict final indicators. Nevertheless, JiangRao2020 claim, that results from machine learning methods in SAE are harder to be interpreted and justified by SAE-practitioners, compared to LMM-alternatives.
We aim to fill this gap by providing a consistent framework enabling a coherent use of tree-based machine learning methods in SAE. In particular, we proposes a non-linear, data-driven, and semi-parametric alternative for the estimation of area-level means by using mixed effects random forests in the methodological tradition of SAE. Random effects are used to account for the hierarchical structure of the survey data. A non-parametric bootstrap estimator is introduced for assessing the uncertainty of the area-level estimates. Using design- and model-based simulations, we highlight strengths and weaknesses of random forests in the context of SAE, in comparison to existing (or ‘traditional’) SAE-methods. Thus, this paper aims to contribute towards the trend of diversifying the model-toolbox for SAE-practitioners and researchers, while simultaneously respecting the methodological and structural nature of SAE.
The general idea of tree-based methods in SAE is not entirely new. Anderson_etal2014 use district-level data from Peru to juxtapose the performance of LMM-based and tree-based methods for estimating population densities. Bilton_etal2017
use classification trees for categorical variables to incorporate auxiliary information from administrative data to survey data on household-poverty in Nepal. For a continuous variableDeMolinerGoga2018 estimate mean electricity consumption curves for sub-populations of households in France by using methods of LMMs and regression-based trees. MoConvilleetal2019 propose a regression-tree estimator for finite‐population totals, which can be viewed as an automatically-selected post‐stratification estimator. Dagdougetal2020 analyse theoretical properties of random forest in the context of complex survey data. Mendez2008 provides theoretical and empirical considerations for using random forests in the context of SAE and compares their performance with ‘traditional’ unit-level LMMs for the estimation of area-level means. Although we share the general idea of Mendez2008, the approach of this paper differs in several ways: first of all, we leave the random forests algorithm (Breiman2001)
unchanged and explicitly estimate random effects to account for the hierarchical structure of the data. Secondly, the proposed framework of this paper is more flexible and potentially extendable to model more complex hierarchical dependency structures as well as spatial and temporal correlations. Additionally, the extension to other methods of machine learning is possible, such as support vector machines or gradient boosted trees.
The paper is organized as follows: Section 2 provides a methodological introduction to random forests and introduces Mixed Effects Random Forests (MERF) based on Hajjem2014 as a method that effectively amalgamates random forests and the possibility to account for hierarchical dependencies of unit-level observations. Additionally, we motivate a general unit-level mixed model, treating LMMs in SAE as special cases. In Section 2.3, we discuss the construction of area-level mean-estimates. Random forests promote the flexibility of predictive models, but their lack of distributional assumptions complicates inferences. As a result, Section 3 proposes a non-parametric bootstrap-scheme for the estimation of the area-level MSE. In Section 4, we use model-based simulations under complex settings to extensively discuss and compare the performance of the proposed method for point- and MSE-estimates. We claim MERFs to be a valid alternative to existing methods for the estimation of SAE-means. In Section 5, we use household income data of the Mexican state Nuevo León to estimate area-level averages and corresponding uncertainty estimates. We highlight modelling and robustness properties of our proposed methods. Section 5.3 proceeds with a design-based simulation, which asses the quality of results in the application of Section 5.2. Furthermore, the design-based simulation contributes to a genuine demonstration of properties and advantages of MERFs in the context of SAE. Section 6 concludes and motivates further research.
2 Theory and method
In this section we propose a flexible, data-driven approach using random forests for the estimation of area-level means in the presence of unit-level survey data. The method requires a joint understanding of tree-based modelling techniques and concepts of SAE. We review the basic theory of random forest and discuss modifications to ensure their applicability to hierarchical data and subsequently to applications of SAE.
2.1 Review of random forests
Random forests combine individual decision trees(Breiman_etal1984)
to improve their joint predictive power, while simultaneously reducing their prediction variance(Biau_Scornet2016; Breiman2001). Breiman2001 extends his idea of Bagging (Breiman1996bagging) - which combines predictions from single trees through a bootstrap and aggregation procedure - to random forests that apply bootstrap aggregation on decorrelated trees. Note that the two tuning parameters of random forests are the number of trees (controlling the number of bootstrap replications) and the number of variables to be selected as candidates for each split (controlling the degree of decorrelation). Because the forest is a combination of decorrelated trees, each aiming to minimize the prediction MSE, the optimal estimator for the random forest regression function also minimizes the point-wise MSE. The minimizer under squared error loss in the regression context, is given by the conditional mean of target variable given the data (Efron_Hastie2016; Wager_Athey2018).
Random forests captivate with a lack of assumptions such as linearity or the distributional specification of model errors, however, observations are assumed to be independent. Applications of SAE are characterized by the use of hierarchical data. Ignoring the correlation between observations, generally results in inferior point-predictions and inferences. LMMs capture the dependencies between observations by random effects, while effects between covariates are modelled by linear fixed effects, resulting in an additive model of both terms. In the context of tree-based methods, Sela_Simonoff2012 propose a semi-parametric mixed model consisting of a random effects part and a fixed effects non-parametric tree-model. Hajjem_etal2011 propose a similar approach under the label of mixed effect regression trees (MERT). As the superior performance of random forests over regression trees transfers to dependent data, Hajjem2014 replace the fixed effects part in MERTs by a random forest, leading to mixed effects random forests (MERF). We scrutinize this approach and propose a general semi-parametric unit-level mixed model combining the flexibility of tree-based models with the structural advantages of linear mixed models in the next subsection.
2.2 Mixed effects random forests
We assume a finite population which is divided into disjunct areas , with population sizes , where specifies the areas and defines the population size. We assume to have a sample from this population. The number of sampled observations in area is given by , where the sample size is denoted by . In each sampled area we obtain individual observations ranging from .
We define the metric target variable for area as a vector of individual observations . Covariates are captured in the matrix of , where defines the number of covariates. defines the matrix of area-specific random effect specifiers, where describes the dimension of random effects. is the vector of random effects for area . is the vector of individual error terms. Observations between areas are assumed to be independent and and
are mutually independently normally distributed with the same variance-covariance matrixfor random effects of each area and for individual errors. A joint notation for all areas is as follows:
The goal is to identify a relation between covariates and the target variable , in order to predict values for non-sampled observations utilizing available supplementary covariates from census or register information across areas. We state a model consisting of two major parts: a fixed part and a linear part capturing dependencies by random effects. In the following, can be any parametric or non-parametric function that expresses the conditional mean of target variable given covariates :
Note that for each area the following model holds:
The covariance matrix of the observations is given by the block diagonal matrix , where . We introduce model (1) in general terms to potentially allow for modelling of complex covariance and dependency structures. However, for the rest of the paper we assume that correlations arises only due to between-area variation, i.e. for all areas . Note that the already mentioned LMM proposed by Battese_etal1988 for estimating area-level means results as a special case of (1) by setting to be the linear model , with regression parameters . Defining as a random forest, results in the MERF-approach proposed by Hajjem2014, which is the preferred specification throughout the rest of the paper.
Before we continue, we want to clarify consequences of distributional assumptions in (1) that mainly address the linear part of the model. The unit-level errors are assumed to follow . However, the assumption of normality does not affect the properties of the fixed part
and we do not require residuals to be normally distributed for the application of our proposed method. Nevertheless, for the random components part, we require a proper likelihood function to ensure that the adapted expectation-maximization (EM) algorithm (see below for further details) for parameter estimates converges towards a local maximum within the parameter space. A normality-based likelihood function is exploited, as it has two important properties: firstly, it facilitates the estimation of random effects due to the existence of a closed-form solution of the integral of the Gaussian likelihood function. Secondly, the maximum likelihood estimate for the variance of unit-level errors is given by the mean of the unit-level residual sum of squares. The estimation of the random effects could be also done in a non-parametric way by using discrete mixtures(Marino2016; Marino2019). However, the modification towards a fully non-parametric formulation of model (1) is subject to further research.
For fitting the model (1) we use an approach reminiscent of the EM-algorithm similar to Hajjem2014. In short, the MERF-algorithm subsequently estimates a) the forest function, assuming the random effects term to be correct and b) estimates the random effects part, assuming the Out-of-Bag-predictions (OOB-predictions) from the forest to be correct. OOB-predictions utilize the unused observations from the construction of each forest’s sub-tree (Breiman2001; Biau_Scornet2016). The proposed algorithm is as follows:
Initialize and set random components to zero.
Set . Update and :
Estimate using a random forest with dependent variable and covariates . Note that is the same function for all areas .
Get the OOB-predictions .
Fit a linear mixed model without intercept and restricted regression coefficient of 1 for :
Extract the variance components and and estimate the random effects by:
Repeat Step (2) until convergence is reached.
The convergence of the algorithm is assessed by the marginal change of the modified generalized log-likelihood (GLL) criterion:
In the linear case with , and for given variance components and , the maximization of the GLL-criterion is equivalent to the solution of so-called mixed model equations (Wu_Zhang2006) - leading to best linear unbiased predictor (BLUP): . For random forests, the corresponding estimator for known parameters and is given by:
Mathematical details of the derivations are provided in Appendix A. This result is in line with capitaine_etal2021, claiming that is obtained by taking the conditional expectation given the data and subsequently can thus be considered as the BLUP for the linear part of model (1).
The estimation of variance components in Step 2 (d) for and is obtained by taking the expectation of maximum likelihood estimators given the data. Although is a naive estimator within the discussed framework, it cannot be considered as a valid estimator for the variance of the unit-level errors . Breiman2001 maintains that the sum of squared residuals from OOB-predictions are a valid estimator for the squared prediction error of new individual observations. However, as an estimator of the residual variance under the model, is positively biased, as it includes uncertainty regarding the estimation of function . Following Mendez_Lohr2011 we use a bias-adjusted estimator for the residual variance from a random forest model (1) using a bootstrap bias-correction. The essential steps to obtain the corrected residual variance are summarized as follows:
Use the OOB-predictions from the final model after convergence of the algorithm.
Generate bootstrap samples , where the values are sampled with replacement from the centred marginal residuals .
Recompute using a random forest with as dependent variable.
Estimate the correction-term by:
The bias-corrected estimator for the residual variance is then given by:
2.3 Predicting small-area averages
The MERF-model (1) predicts the conditional mean on individual level of a metric dependent variable given unit-level auxiliary information. In the context of SAE, we are not interested in predictions on individual level, but in estimating indicators such as area-level means or area-level totals (Rao_Molina2015). Thus, we assume the same structural simplifications as the LMM proposed by Battese_etal1988 for estimating area-level means throughout the paper, i.e. , is a design-matrix of area-intercept indicators, is a vector of random effects, and variance-covariance matrix for random effects simplifies to .
Firstly, we use the fact that random forest estimates of the fixed part express the conditional mean on unit-level. We calculate the mean-estimator for each area based on available supplementary data sources (usually census or administrative data) by:
In cases of non-sampled areas, the proposed estimator for the area-level mean reduces to the fixed part from the random forest:
3 Estimation of uncertainty
The assessment of uncertainty of area-level indicators in SAE is crucial to analyse the quality of estimates. The area-level MSE is a conventional measure fulfilling this goal, but its calculation is a challenging task. For instance, for the unit-level LMM with block diagonal covariance matrices (Battese_etal1988), the exact MSE cannot be analytically derived with estimated variance components (Gonzalez_etal2008; Rao_Molina2015) and only partly-analytical approximations are available (Prasad_Rao1990; Datta_Lahiri2000). An alternative to estimate uncertainty of the area-level indicators are bootstrap-schemes (Hall_Maiti2006; Gonzalez_etal2008; Chambers_Chandra2013). In contrast, general statistical results for inference of random forests are rare, especially in comparison to the existing theory of inference using LMMs. The theoretical background of random forests grows, but mainly aims to quantify the uncertainty of individual predictions (Sexton_Laake2009; wager_etal2014; Wager_Athey2018; Athey_etal2019; Zhang2019). The extension of recent theoretical results, such as conditions for the consistency of unit-level predictions (Scornet_etal2015) or their asymptotic normality (Wager_Athey2018), towards area-level indicators is a conducive topic for further research.
In this paper, we propose a non-parametric random effect block (REB) bootstrap for estimating the MSE of the introduced area-level estimator given by equation (5). We aim to capture the dependence-structure of the data as well as the uncertainty introduced by the estimation of model (1). Our bootstrap-scheme builds on the non-parametric bootstrap introduced by Chambers_Chandra2013. The proposed REB bootstrap has two major advantages: firstly, empirical residuals depend only on the correct specification of the mean behaviour function of the model, thus the REB setting is lenient to specification errors regarding the covariance structure of the model. Secondly, the bootstrap within blocks ensures that the variability of residuals within each area is captured. We scale and centre the empirical residuals by the bias-corrected residual variance (4) in order to eliminate the uncertainty due to the estimation of from the naive residuals. The steps of the proposed bootstrap are as follows:
For given calculate the vector of marginal residuals and define .
Using the marginal residuals , compute level-2 residuals for each area by
and indicates the vector of level-2 residuals.
To replicate the hierarchical structure we use the marginal residuals and obtain the vector of level-1 residuals by . The residuals are scaled to the bias-corrected variance (4) and centred, denoted by . The level-2 residuals are also scaled to the estimated variance and centred, denoted by .
Sample independently with replacement from the scaled and centred level-1 and level-2 residuals:
Calculate the bootstrap population as and calculate the true bootstrap population area means as for all .
For each bootstrap population draw a bootstrap sample with the same as the original sample. Use the bootstrap sample to obtain estimates and as discussed in Section 2.2.
Calculate area-level means following Section 2.3 by
Using the bootstrap samples, the MSE estimator is obtained as follows:
4 Model-based simulation
This section marks the first step in the empirical assessment of the proposed method. The model-based simulation juxtaposes point estimates for the area-level mean from the mixed effects random forest model (1) with several competitors. In particular, we study the performance of MERFs compared to the BHF-model (Battese_etal1988), the EBP (Molina_rao2010) as well as the EBP under data-driven Box-Cox transformation (EBP-BC) by Rojas_etal2019. The BHF-model serves as an established baseline for the estimation of area-level means and the EBP and the EBP-BC conceptually build on the BHF-model. Differences in the performance of the EBP and the EBP-BC highlight advantages of data-driven transformations, while differences in the performance of the linear competitors and the MERF indicate advantages of semi-parametric and non-linear modelling. Overall, we aim to show, that our proposed methodology for point and uncertainty estimates performs comparably well to ‘traditional’ SAE methods and has comparative advantages in terms of robustness against model-failure.
The simulation-setting is characterized by a finite population of size with disjunct areas of equal size . We generate samples under stratified random sampling, utilizing the small areas as stratas, resulting in a sample size of . The area-specific sample sizes range from to sampled units with a median of and a mean of . The sample sizes are comparable to area-level sample sizes in the application in Section 5 and can thus be considered to be realistic.
We consider four scenarios denoted as Normal, Interaction, Normal-Par, Interaction-Par and repeat each scenario independently
times. The comparison of competing model-estimates under these four scenarios allows us to examine the performance under two major dimensions of model-misspecification: Firstly, the presence of skewed data delineated by non-normal error-terms and secondly, the presence of unknown non-linear interactions between covariates. ScenarioNormal provides a baseline under LMMs with normally distributed random effects and unit-level errors. As model-assumptions for LMMs are fully met, we aim to show, that MERFs perform comparably well to linear competitors in the reference scenario. Scenario Interaction shares its error-structure with Normal, but involves a complex model including quadratic terms and interactions. This scenario portrays advantages of semi-parametric and non-linear modelling methods protecting against model-failure. Working with inequality or income data, we often deal with skewed target variables. Thus, we use the Pareto distribution to mimic realistic income scenarios. Scenario Normal-Par uses the linear additive structure of LMMs and adds Pareto distributed unit-level errors. The resulting scenario, including a skewed target variable, is a classical example promoting the use of transformations assuring that assumptions of LMMs to be met. Finally, scenario Interaction-Par combines the two discussed dimensions of model misspecification, i.e. a non-Gaussian error-structure with complex interactions between covariates. We chose this scenario to emphasize the ability of MERFs to handle both complications simultaneously. Further details on the data-generating process for each scenario are provided in Table 1.
We evaluate point estimates for the area-level mean by the relative bias (RB) and the relative root mean squared error (RRMSE). As quality-criteria for the evaluation of the MSE-estimates, we choose the relative bias of RMSE (RB-RMSE) and the relative root mean squared error of the RMSE:
where is the estimated mean in area based on any of the methods mentioned above and defines the true mean for area in simulation round . is estimated by the proposed bootstrap in Section 3 and is the empirical root MSE over replications.
For the computational realization of the model-based simulation, we use R (R_language). The BHF estimates are realized from the sae-package (Molina_Marhuenda:2015) and the emdi-package (Kreutzmann_etal2019) is used for the EBP as well as the EBP under the data-driven Box-Cox transformation. For estimating the proposed MERF-approach, we use the packages randomForest (Liaw_Wiener2002) and lme4 (Bates_etal2015). We monitor the convergence of algorithm introduced in Section 2.2 with a precision of in relative difference of the GLL-criterion and set the number of split-candidates to , keeping the default of trees for each forest.
We start with a focus on the performance of point estimates. Figure 1 reports the empirical RMSE of each method under the four scenarios. As expected, in the Normal scenario, the BHF and the EBP perform on the same level and outperform the MERF estimator. The EBP with a data-driven transformation (EBP-BC) leads to similar results compared to the BHF and EBP. This shows that the data-driven transformation works as expected. A similar pattern appears in the results from the Normal-Par scenario, except that the EBP-BC reaches a lower overall RMSE due to its property of data-driven transformation and subsequently improved estimation under skewed data. As anticipated, a comparison of the performance of the MERF between the Normal and the Normal-Par scenario indicates, that the MERF exhibits robust behaviour under skewed data and subsequently regarding violations of the normal distribution of errors. For complex scenarios, i.e. Interaction and Interaction-Par, point estimates of the proposed MERF outperform the SAE methods based on LMMs. Nevertheless, the EBP-BC performs better in terms of lower RMSE values compared to the BHF and the EBP in both interaction scenarios. Overall, the results from Figure 1 indicate that the MERF performs comparably well to LMMs in simple scenarios, and outperforms ‘traditional’ SAE-models in the presence of unknown non-linear relations between covariates. Additionally, the robustness against model-misspecification of MERFs holds if distributional assumptions for LMMs are not met, i.e. in the presence of non-normally distributed errors and skewed data. Table 2 reports the corresponding values of RB and RRMSE for our discussed point estimates. The RB and the RRMSE from the MERF-method attest a competitively low level for all scenarios. Most interestingly, in complex scenarios (Interaction and Interaction-Par), a familiar result regarding the statistical properties of random forests appears: the RB is higher compared to the LMM-based models, but the enlarged RB is rewarded by a lower RRMSE for point estimates.
We scrutinize the performance of our proposed MSE-estimator on the four scenarios, examining whether the observed robustness against model-misspecification due to unknown complex interactions between covariates or skewed data for point estimates, also holds for our non-parametric bootstrap-scheme. For each scenario and each simulation round, we choose the parameter of bootstrap replications . From the comparison of RB-RMSE among the four scenarios provided in Table 3, we infer, that the proposed non-parametric bootstrap-procedure effectively handles scenarios that lead to model-misspecification in cases of (untransformed) LMMs. This is demonstrated by essentially unbiasedness in terms of mean and median values of RB-RMSE over areas of the MSE-estimator under all four scenarios: independently, whether the data generating process is characterized by complex interactions (Interaction), non-normal error terms (Normal-Par) or a combination of both problems (Interaction-Par). Additional information on the RB-RMSE regarding the non-parametric bootstrap is available in Figure 7 in the Appendix B.
From the results of Table 3 and the subsequent discussion, we cannot directly infer the area-wise tracking properties of the estimated RMSE against the empirical RMSE over our simulation rounds. Thus, Figure 2 provides additional intuition on the quality of our proposed non-parametric MSE-bootstrap estimator. Given the tracking properties in all four scenarios, we conclude that our MSE-estimates strongly correspond to the empirical RMSE. Furthermore, we do not observe systematic differences between the estimated and empirical MSE-estimates regarding different survey-sample sizes.
5 Application: Estimating household income for Mexican municipalities
In this section, we discuss the performance of our proposed method in the context of a genuine SAE example. Concretely, we apply the MERF-method proposed in Section 2.2 to estimate domain-level average household income for the Mexican state Nuevo León. Section 5.1 describes the data and Section 5.2 reports results. We end our empirical analysis by an additional design-based simulation enabling a profound discussion on the quality of point and MSE-estimates in Section 5.3.
5.1 Data description
Income inequality in Mexico is a research topic of timeless importance, particularly regarding the effectiveness of developmental and social policies (lambert_hyunmin2019). Although the level of income inequality in Mexico is comparable to other Latin American countries, it is one of the highest among other OECD countries (Oecd_21). Analysing average national income values is a common practice, but constitutes an inappropriate measure to monitor the efficacy of regional policy measures. Besides detailed disaggregated information, also suitable statistical methods are needed to quantify local developments. For the following application, we break down regional differences in average household per capita income of one of 32 Mexican states. Nuevo León is located in the North-East of Mexico and according to the (sub-national) Human Development Index (HDI), it is one of the most developed states of Mexico (smits_Permanyer2019). Nevertheless, the distribution of individual household income in Nuevo León is unequal and thus highly skewed. For instance, the Gini-coefficient of household income is comparable to the total Gini of Mexican household disposable income from 2012 which was (Oecd_21).
We use data from 2010 provided by CONEVAL (Consejo Nacional de Evaluación de la Política de Desarrollo Social), combining the Mexican household income and expenditure survey (Encuesta Nacional de Ingreso y Gastos de los Hogares, ENIGH) with a sample of census microdata by the National Institute of Statistics and Geography (Instituto Nacional de Estadística y Geografía). The dataset comprises income and socio-demographic data, equally measured by variables that are part of the survey as well as the census data. The target variable for the estimation of domain-level average household income in Section 5.2, is the total household per capita income (ictpc, measured in pesos), which is available in the survey but not in the census.
Nuevo León is divided into municipalities. While the census dataset in our example comprises information on households from all municipalities, the survey data includes information on households from municipalities, ranging from a minimum of to a maximum of households with a median of households. This leaves administrative divisions to be out-of-sample. Table 4 provides details on sample and census data properties.
|Min.||1st Qu.||Median||Mean||3rd Qu.||Max.|
|Survey area sizes||5.00||14.00||27.00||68.33||79.00||342.00|
|Census area sizes||76.00||454.50||642.00||1075.45||872.50||5904.00|
With respect to the design-based simulation in Section 5.3, we emphasize that we are in the fortunate position of having a variable that is highly correlated with the target variable ictpc in the application and that is available in the survey and in the census dataset: the variable inglabpc measures earned per capita income from work. Although inglabpc deviates from the desired income definition for our approach - as it covers only one aspect of total household income - it is effective to evaluate our method under a design-based simulation in Section 5.3. Furthermore, the design-based simulation implicitly assess the quality of our empirical results from Section 5.2 .
Using data from Nuevo León for the estimation of domain-level income averages, is an illustrative and realistic example and imposes several challenges on the proposed method of MERFs: first of all, about percent of households in the survey-sample are located in the capital Monterrey. Secondly, there exist more out-of sample domains than in-sample domains. Moreover, we are confronted with households reporting zero-incomes. Our intention in choosing this challenging example for the application part in Section 5.2 and the following design-based simulation in Section 5.3 are simple: we aim to show, that our proposed approaches for point and uncertainty estimates demonstrate a valid alternative to existing SAE-methods and are predominantly applicable for cases where ‘traditional’ methods perform poorly or even fail. Additionally, we aim to provide a clear-cut presentation and empirical assessment of MERFs for SAE, which requires a transparent discussion of advantages and potential weaknesses in demanding real-world examples.
5.2 Results and discussion
Direct estimates for average total household per capita income for Nuevo León are possible for 21 out of 51 domains. The use of model-based SAE methods incorporating covariate census data will not only lead to estimates for the remaining out-of-sample areas, but correspondingly improve the overall quality of estimates (Tzavidis_etal2018). As variable ictpc is highly skewed, we deduce potential issues of model-misspecification and suggest the use of the EBP-BC and the MERF. Given the theoretical discussion and the results in the model-based simulation in Section 4, we infer that the EBP-BC and the proposed method of the MERFs for SAE effectively handle non-normally distributed data. Moreover, we are particularly interested in differences between these two diverse SAE-models in the context of real-world applications. Figure 3 maps results from direct estimates, the MERF and the EBP-BC. Obviously, the model-based estimates from the MERF and the EBP-BC expand the perspective of regional disparities in average total household income per capita towards non-sampled regions. Furthermore, we identify three distinct clusters of income levels from our results in Figure 3 that had not been observable from the mapping of direct estimates: a low income cluster in the South of Nuevo León, a very high income cluster in the metropolitan area of the capital Monterrey and a group of middle-income areas between the North and the South of the state. This finding illustrates, the potential of model-based techniques to highlight patterns of regional income disparities and enable the mapping of reliable empirical evidence. Given the information provided by the three maps, we do not report major differences between the point estimates of MERFs and the EBP-BC.
Apart from mapping empirical results of domain averages, we are mainly interested in quality criteria, such as the coefficients of variation (CV) and the details of model-specification for the EBP-BC and the MERF. To obtain estimates of variances for the direct estimates, we use the calibrated bootstrap method (Alfons_Templ2013) as provided in the R package emdi (Kreutzmann_etal2019). For the model of the data-driven EBP-BC, we follow the approach of Rojas_etal2019 and use the Bayesian Information Criterion (BIC) to identify valid predictors for the target variable of ictpc. The resulting working-model includes variables determining occupation, sources of income, the socio-economic level and educational aspects of individual households. The identification of predictive covariates for MERFs highlights a conceptual difference to LMM-based methods. Due to the properties of the random forest algorithm (Breiman2001), random forests perform an implicit variable selection. The selected model for fixed effects in our case is characterized by an R-squared of about percent. The dilemma between predictive precision and the interpretability of random forest models can be mitigated by concepts such as variable importance plots (Greenwell_etal2020) or an analysis of partial dependence for influential covariates (Greenwell_2017). We provide graphical representations in Figure 8 and Figure 9 in the Appendix B. We monitor the convergence of the proposed MERF algorithm under a precision of in relative difference of the GLL criterion and keep the default of trees. A parameter optimization based on 5-fold cross-validation on the original survey-sample advices the use of variables at each split for the forest. For the MSE bootstrap-procedure, we use .
Figure 4 reports corresponding CVs for in- and out-of-sample domains. We observe a significant improvement for in-sample CVs of the EBP-BC and the MERF compared to the CVs for direct estimates. CVs of MERFs are slightly lower in median terms than the results for the EBP-BC. However, there exists one outlying area for MERFs. Going into details, the corresponding area of General Zaragoza features no obvious data-specific irregularities, such as extremely low sample size. Nevertheless, General Zaragoza is one of the poorest regions according to our analysis. In the design-based simulation in Section 5.3, we will pay special attention to differences between MERFs and the EBP-BC regarding their ability to handle comparably extreme estimates given a broad range of relatively high and relative low income areas.
Regarding the CVs for out-of-sample areas, we discover an evident advantage of CVs from our proposed MERF approach. From the scrutiny of individual CV values it remains unclear, whether the asset of improved results from MERFs roots in superior point estimates for domain-level averages or its relatively lower MSE-estimates. Figure 5 compares direct estimates to the model-based estimates for in and out-of-sample domains. Apparently, there exist no systematic differences between the estimates from the EBP-BC and the MERF. Thus, it appears as if the variance of MERF predictions is generally lower. This conjecture is in line with the theoretical properties of random forests (Breiman2001; Biau_Scornet2016). In-sample areas in Figure 5 are sorted by survey-sample sizes. In comparison to the direct estimates, predicted averages of the EBP-BC as well as of the MERF seem less extreme. The obvious irregularity in terms of high-income, is a distinct part of the Monterrey metropolitan area: San Pedro Garza García registers several headquarters of national and international corporations. This economic peculiarity apparently transfers to the income of its citizenship. Figure 3, underlines the existence of an apparent high-income cluster in this region. Overall, it is interesting to observe how reliable estimates on fine spatial resolution unveil patterns of regional income segregation. Our proposed method of MERFs provides useful results with remarkably higher accuracy than direct estimates and the EBP-BC for most of out-of-sample domains. The following design-based simulation will strengthen the reliability of results and enable an in-depth discussion of our methods for point and MSE-estimates.
5.3 Evaluation using design-based simulation
The design-based simulation allows to directly juxtapose the performance of the proposed MERF-approach to existing SAE-methods for the estimation of area-level means based on empirical data. In this sense, the design-based simulation adds not only insights to the results from the model-based simulation in Section 4, but also evaluates results from the example in the previous Section 5.3. We focus on area-level mean-estimates of household income from work inglabpc in the Mexican state Nuevo León. As we use the same data with a different target variable, sample and census data properties are similar to the previous example with details provided in Table 4. Implementing the design-based simulation, we sample independent samples from the fixed population of our census dataset. Each pseudo-survey-sample mirrors the characteristics of the original survey, as we keep the number of in-sample households similar to the original sample sizes and abstain from sampling out-of-sample municipalities. As a result, we use equally structured pseudo-survey-samples with equal overall sample size. True values, are defined as the domain-level averages of household incomes from work in the original census.
We consider the same methods as in the model-based simulation in Section 4. Comparably to Section 5.2, we use the same working-model for the BHF, the EBP and the EBP-BC and assume it to be fixed throughout the design-based simulation. For the EBP-BC and the MERF, we keep the parameters as already discussed in Section 5.2.
The discussion of results, starts with an investigation into the performance of point estimates. Figure 6 reports the average RMSE of the area-level mean-estimates for Nuevo León in total and with details on the in- and out-of sample areas. The corresponding summary statistics for Figure 6 are given in Table 8 in Appendix B. Regarding the total of areas, we observe no remarkable difference in the performance of the BHF and the EBP, whereas the EBP-BC has lower RMSE on average. The MERF point estimates indicate the lowest RMSEs among all areas resulting in an more than percent improvement compared to the BHF. Referring to the RMSE for in-sample areas, we see two different ways how the EBP-BC and the MERF deal with high and unbalanced variation in our true values for certain areas, ranging from to about pesos: overall, the MERF deals best in modelling the complex survey data and produces highly accurate estimates for the majority of in-sample areas. A closer look at the results, reveals however, that higher RMSE values due to overestimation mainly occur in two areas, both characterised by a very low level of income ( and pesos respectively). In contrast, we observe the in-sample behaviour of the EBP-BC, which clearly opposes its superior overall performance. The EBP-BC appears to balance extreme estimates by producing slightly worse estimates for each individual in-sample areas, than allowing for individually inferior estimates for specific ‘outlier’-areas. This behaviour is conceptually rooted in its data-driven transformation-approach. Nevertheless, this property enables the EBP-BC to identify a model, providing stable and precise estimates on the majority of areas, especially the 30 non-sampled areas. Given the data-scenario of Nuevo León, the performance on the out-of-sample areas, delineates each method’s quality and stability. In this case, the EBP-BC and MERF outperform the ‘traditional’ methods (BHF and EBP) in terms of lower RMSE. One advantage of the MERF is its adaptability and implicit model-selection that is rewarded in the presence of complex data-scenarios.
The findings from Figure 6, are strengthened by a discussion of mean and median values of RB and RRMSE in Table 5. Referring to all areas, the RB of the data-driven method of EBP-BC and the MERF is smaller in median values than the RB of BHF and the EBP. Respectively, the MERF shows the lowest median value of RB while mean-values lie in the same range with competing methods. The obvious difference between mean and median values, indicates the previously discussed existence of inferior estimates for specific regions due to the empirical properties of our underlying data. For the in-sample areas, the MERF performs superior to competing methods regarding the median values of RB and RRMSE. The close relation between the mean and median values of RB for the EBP-BC highlight the mentioned balancing-property of the EBP-BC. For the majority of areas in the model-based simulation, i.e. the non-sampled areas, the EBP-BC as well as the MERF exhibit a comparatively low level of median RB. Especially the MERF captivates by the lowest values in mean and median for RB and RRMSE compared to its competitors.
Finally, we focus on the performance of the proposed non-parametric MSE-bootstrap procedure. While, the model-based simulation in Section 4 indicates unbiasedness of the proposed bootstrap-scheme under all four scenarios, our results from the design-based simulation require a deeper discussion. Table 6 reports the results of RB-RMSE and the RRMSE-RMSE for the corresponding estimates. Figure 10 in the Appendix B visualizes details from Table 6. First of all, the values of RRMSE-RMSE are comparable to the most complex scenario in the model-based scenario in Section 4. The RB-RMSE for the in-sample areas indicates unbiasedness in median terms and an acceptable overestimation regarding the mean RB-RMSE. For out-of-sample areas, we face a moderate underestimation regarding the median value and over-estimation according to mean values. Nevertheless, Figure 10 in Appendix B reveals, that the mixed signal between mean and median in Table 6 is explained by a balanced mix of under- and over-estimation. Overall, the expectations towards the MSE-bootstrap procedure, given the challenging conditions of this design-based simulation, are met. Especially, the results for in-sample areas, combined with insights from the model-based simulation, indicate a solid and reliable performance of the proposed non-parametric bootstrap procedure. Although, the RB-RMSE for all areas is driven by the results from out-of-sample areas, the median RB-RMSE is acceptable. Apparently, the MSE-estimates mirror the high variation in sample sizes paired with high and dis-proportional variation of high-income and low-income regions between the in-sample and
out-of-sample areas. From an applied perspective, the MSE-estimates for out-of-sample areas are nevertheless practicable for the construction of confidence intervals, with a median coverage of.
6 Concluding remarks
In this paper, we explore the potential of tree-based machine learning methods for the estimation of SAE-means. In particular, we provide a solid framework easing the use of random forests for regression within the existing methodological framework of SAE. We highlight the potential of our approach to meet modern requirements of SAE, including the robustness of random forests against model-failure and the applicability for high-dimensional problems processing Big Data sources. The methodological part focusses on the MERF-procedure (Hajjem2014) and implicitly discusses a semi-parametric unit-level mixed model, treating LMM-based SAE-methods, such as the BHF and the EBP, as special cases. The model is fit by an algorithm resembling the EM-algorithm, allowing for flexibility in the specification to model fixed effects as well as random-effects. The proposed point estimator for area-level means is complemented by the non-parametric MSE-bootstrap-scheme, building on the REB-bootstrap by Chambers_Chandra2013 and the bias-corrected estimate for the residual variance by Mendez_Lohr2011. We evaluate the performance of point- and MSE-estimates compared to ‘traditional’ SAE-methods by model- and design-based simulations and provide a distinctive SAE example using income data from the Mexican state Nuevo Leòn in Section 5.2. The model-based simulation in Section 4 demonstrates the ability of point estimates to perform compatibly in classical scenarios and outperform ‘traditional’ methods in the existence non-linear interactions between covariates. The design-based simulation in Section 5.3 confirms the adequacy of MERFs for point estimates under searingly realistic conditions. The model- and design-based simulations indicate that the proposed approach is robust against distributional violations of normality for the random effects and for the unit-level error terms. Concerning our proposed MSE-bootstrap-scheme, we conclude its reliability based on the performance in the model-based simulation in Section 4. Furthermore, we obtain reasonable support for the performance in the application in Section 5.2 and the following design-based simulation in Section 5.3.
We motivate three major dimensions for further research, including theoretical work, aspects of generalizations and advanced applications using Big Data covariates: from a theoretical perspective, further research is needed to investigate the construction of a partial-analytical MSE for area-level means or the construction of an asymptotic MSE-estimator. A conducive strategy is the deduction of recent theoretical results, such as conditions for the consistency of unit-level predictions (Scornet_etal2015) or considerations of individual predictions intervals (wager_etal2014; Zhang2019), towards area-level indicators. Additionally, considerations concerning a fully non-parametric formulation of model (1) impose an interesting research direction. From a survey statistical perspective, our proposed method currently abstains from the use of survey weights which bears a risk if the assumption of non-informative sampling is violated. Nevertheless, there exist approaches incorporating weights into random forests (Winham_etal2013)
. The transfer of such ideas to the proposed method of MERFs is subject to ongoing research. Regarding additional generalizations of the proposed method, we aim to extend the use of MERFs towards the estimation of small area quantiles and other non-linear indicators, such as Gini-coefficients or Head Count Ratios. Furthermore, a generalization towards binary or count data is possible and left to further research. The semi-parametric composite formulation of model (1) allows for to adapt any functional form regarding the estimation of the conditional mean of given and technically transfers to other machine learning methods, such as gradient-boosted trees or support vector machines. In terms of advanced applications, we propose the use of MERFs for complex random effect and covariance-structures in empirical problems to the SAE-research community. Equally interesting is the use of high dimensional supplementary data, i.e. Big Data covariates, for the estimation of area-level means, that can be directly handled by the proposed MERF-framework.
The authors are grateful to CONEVAL for providing the data used in empirical work. The views set out in this paper are those of the authors and do not reflect the official opinion of CONEVAL. The numerical results are not official estimates and are only produced for illustrating the methods.
Additionally, the authors would like to thank the HPC Service of ZEDAT, Freie Universität Berlin, for computing time.
After convergence of the algorithm introduced in Section 2.2, we obtain an optimal non-parametric estimator for . The best predictor for the random effects for known parameters and must maximize the generalized log-likelihood criterion:
Finding a maximum for is equivalent to the problem of finding a maximizer for in the first term of the summation of the proposed GLL-criterion:
Reshaping leads to:
Now we derive the expression with respect to and set it to 0 in order to find the maximizer
The solution of the maximization problem is given by . Note, for , the optimality solution resembles the BLUP.
Appendix B: additional simulation results and model-diagnostics
|ictpc||Total household income per capita|
|escol_rel_hog||Average relative amount of schooling standardized|
|by age and sex of household members|
|bienes||Availability of goods in the household|
|jaesc||Average years of schooling of persons in the household|
|jnived||Formal education of the head of the household|
|actcom||Assets in the household|
|pcpering||Percentage of income earners in the household|
|jexp||Years of working experience of the head of the household|
|pcpering_2||Number of income earners in the household by household size|
|pcocup||Percentage of people employed in the household|
|Areas||Method||Min||1st Qu.||Median||Mean||3rd Qu.||Max|