1 Introduction
Having accurate and detailed information on social and economic conditions, summarised by appropriate indicators, is imperative for the efficient implementation of policies. The term detailed is used to signify information that extends beyond aggregate levels into highly disaggregated geographical and other domains (e.g. demographic groups). The term accurate refers to information that is estimated with appropriate level of precision and is comparable over space and time. Simply analysing data from national sample surveys is not enough for achieving the dual target of accurate and detailed information. This is mainly due to the reduction of sample sizes as the level of required detail increases. The achievement of this dual target demands appropriate modelbased methodology collectively referred to as Small Area Estimation (SAE).
SAE methods can be broadly divided in two classes: first, arealevel models (Fay_Heriot1979) assume that only aggregated data for the survey and for the auxiliary information is available. Second, unitlevel models (Battese_etal1988)  further labelled as BHF  require access to the survey and to the auxiliary information on microlevel. A versatile extension of the BHF model is the EBPapproach by Molina_rao2010. The EBP is capable to estimate arealevel means as well as other linear and nonlinear indicators. Both classes (arealevel and unitlevel models) are predominantly regressionbased models, where the hierarchical structure of observations is modelled by random effects. These linear mixed models (LMM) assume normality of random effects and error terms. Focusing on social and economic inequality data, the required assumptions for LMMs hardly meet empirical evidence. JiangRao2020 remind, that optimality results and predictive performance in modelbased SAE are inevitably connected to the validity of model assumptions. Without theoretical and practical considerations regarding improperly met assumptions, estimates are potentially biased and mean squared error (MSE) estimates unreliable.
One strategy to prevent modelfailure, is the assurance of normality by transforming the dependent variable (sugasawa2017transforming; Rojas_etal2019). For instance, Rojas_etal2019 generalize the EBP with a datadriven transformation on the dependent variable, such that normality assumptions can be met in transformed settings. Further details on how to obtain the mostlikely transformation parameter that improves the performance of unitlevel models are available in Rojas_etal2019 and sugasawa2019adaptively or from a more applied perspective in Tzavidis_etal2018. Apart from transformation strategies, another alternative is the use of models with less restrictive (parametric) assumptions to avoid modelfailure. For instance, DialloRao2018 and Graf_etal2019 formulate the EBP under more flexible distributional assumptions. Semi or nonparametric approaches for the estimations of arealevel means were investigated among others by Opsomeretal2008. Opsomeretal2008 use penalized splines regression, treating the coefficients of spline components as additional random effects within the LMM setting.
A distinct methodological option to avoid the parametric assumptions of LMMs are the class of machine learning methods. These methods are not only limited to parametric models and ‘learn’ predictive relations from data, including higher order interactions between covariates, without explicit model assumptions
(Hastie_etal2009; Varian2014). Among the broad class of machine learning methods, we focus on treebased models and particularly on random forests (Breiman2001)because they exhibit excellent predictive performance in the presence of outliers and implicitly solve problems of modelselection
(Biau_Scornet2016). In general, the predictive perspective of (treebased) machine learning methods conceptually transfers straight forward to the methodology of unitlevel SAEmodels: survey data is used to construct a model with predictive covariates. Subsequently, auxiliary information from a supplementary data source (usually census, register or administrative data) is utilized to obtain predictions over sampled and nonsampled areas. From a machine learning perspective, the survey data serves as an implicit training set to construct a proper model, while supplementary data is used to predict final indicators. Nevertheless, JiangRao2020 claim, that results from machine learning methods in SAE are harder to be interpreted and justified by SAEpractitioners, compared to LMMalternatives.We aim to fill this gap by providing a consistent framework enabling a coherent use of treebased machine learning methods in SAE. In particular, we proposes a nonlinear, datadriven, and semiparametric alternative for the estimation of arealevel means by using mixed effects random forests in the methodological tradition of SAE. Random effects are used to account for the hierarchical structure of the survey data. A nonparametric bootstrap estimator is introduced for assessing the uncertainty of the arealevel estimates. Using design and modelbased simulations, we highlight strengths and weaknesses of random forests in the context of SAE, in comparison to existing (or ‘traditional’) SAEmethods. Thus, this paper aims to contribute towards the trend of diversifying the modeltoolbox for SAEpractitioners and researchers, while simultaneously respecting the methodological and structural nature of SAE.
The general idea of treebased methods in SAE is not entirely new. Anderson_etal2014 use districtlevel data from Peru to juxtapose the performance of LMMbased and treebased methods for estimating population densities. Bilton_etal2017
use classification trees for categorical variables to incorporate auxiliary information from administrative data to survey data on householdpoverty in Nepal. For a continuous variable
DeMolinerGoga2018 estimate mean electricity consumption curves for subpopulations of households in France by using methods of LMMs and regressionbased trees. MoConvilleetal2019 propose a regressiontree estimator for finite‐population totals, which can be viewed as an automaticallyselected post‐stratification estimator. Dagdougetal2020 analyse theoretical properties of random forest in the context of complex survey data. Mendez2008 provides theoretical and empirical considerations for using random forests in the context of SAE and compares their performance with ‘traditional’ unitlevel LMMs for the estimation of arealevel means. Although we share the general idea of Mendez2008, the approach of this paper differs in several ways: first of all, we leave the random forests algorithm (Breiman2001)unchanged and explicitly estimate random effects to account for the hierarchical structure of the data. Secondly, the proposed framework of this paper is more flexible and potentially extendable to model more complex hierarchical dependency structures as well as spatial and temporal correlations. Additionally, the extension to other methods of machine learning is possible, such as support vector machines or gradient boosted trees.
The paper is organized as follows: Section 2 provides a methodological introduction to random forests and introduces Mixed Effects Random Forests (MERF) based on Hajjem2014 as a method that effectively amalgamates random forests and the possibility to account for hierarchical dependencies of unitlevel observations. Additionally, we motivate a general unitlevel mixed model, treating LMMs in SAE as special cases. In Section 2.3, we discuss the construction of arealevel meanestimates. Random forests promote the flexibility of predictive models, but their lack of distributional assumptions complicates inferences. As a result, Section 3 proposes a nonparametric bootstrapscheme for the estimation of the arealevel MSE. In Section 4, we use modelbased simulations under complex settings to extensively discuss and compare the performance of the proposed method for point and MSEestimates. We claim MERFs to be a valid alternative to existing methods for the estimation of SAEmeans. In Section 5, we use household income data of the Mexican state Nuevo León to estimate arealevel averages and corresponding uncertainty estimates. We highlight modelling and robustness properties of our proposed methods. Section 5.3 proceeds with a designbased simulation, which asses the quality of results in the application of Section 5.2. Furthermore, the designbased simulation contributes to a genuine demonstration of properties and advantages of MERFs in the context of SAE. Section 6 concludes and motivates further research.
2 Theory and method
In this section we propose a flexible, datadriven approach using random forests for the estimation of arealevel means in the presence of unitlevel survey data. The method requires a joint understanding of treebased modelling techniques and concepts of SAE. We review the basic theory of random forest and discuss modifications to ensure their applicability to hierarchical data and subsequently to applications of SAE.
2.1 Review of random forests
Random forests combine individual decision trees
(Breiman_etal1984)to improve their joint predictive power, while simultaneously reducing their prediction variance
(Biau_Scornet2016; Breiman2001). Breiman2001 extends his idea of Bagging (Breiman1996bagging)  which combines predictions from single trees through a bootstrap and aggregation procedure  to random forests that apply bootstrap aggregation on decorrelated trees. Note that the two tuning parameters of random forests are the number of trees (controlling the number of bootstrap replications) and the number of variables to be selected as candidates for each split (controlling the degree of decorrelation). Because the forest is a combination of decorrelated trees, each aiming to minimize the prediction MSE, the optimal estimator for the random forest regression function also minimizes the pointwise MSE. The minimizer under squared error loss in the regression context, is given by the conditional mean of target variable given the data (Efron_Hastie2016; Wager_Athey2018).Random forests captivate with a lack of assumptions such as linearity or the distributional specification of model errors, however, observations are assumed to be independent. Applications of SAE are characterized by the use of hierarchical data. Ignoring the correlation between observations, generally results in inferior pointpredictions and inferences. LMMs capture the dependencies between observations by random effects, while effects between covariates are modelled by linear fixed effects, resulting in an additive model of both terms. In the context of treebased methods, Sela_Simonoff2012 propose a semiparametric mixed model consisting of a random effects part and a fixed effects nonparametric treemodel. Hajjem_etal2011 propose a similar approach under the label of mixed effect regression trees (MERT). As the superior performance of random forests over regression trees transfers to dependent data, Hajjem2014 replace the fixed effects part in MERTs by a random forest, leading to mixed effects random forests (MERF). We scrutinize this approach and propose a general semiparametric unitlevel mixed model combining the flexibility of treebased models with the structural advantages of linear mixed models in the next subsection.
2.2 Mixed effects random forests
We assume a finite population which is divided into disjunct areas , with population sizes , where specifies the areas and defines the population size. We assume to have a sample from this population. The number of sampled observations in area is given by , where the sample size is denoted by . In each sampled area we obtain individual observations ranging from .
We define the metric target variable for area as a vector of individual observations . Covariates are captured in the matrix of , where defines the number of covariates. defines the matrix of areaspecific random effect specifiers, where describes the dimension of random effects. is the vector of random effects for area . is the vector of individual error terms. Observations between areas are assumed to be independent and and
are mutually independently normally distributed with the same variancecovariance matrix
for random effects of each area and for individual errors. A joint notation for all areas is as follows:The goal is to identify a relation between covariates and the target variable , in order to predict values for nonsampled observations utilizing available supplementary covariates from census or register information across areas. We state a model consisting of two major parts: a fixed part and a linear part capturing dependencies by random effects. In the following, can be any parametric or nonparametric function that expresses the conditional mean of target variable given covariates :
(1) 
where
Note that for each area the following model holds:
(2) 
The covariance matrix of the observations is given by the block diagonal matrix , where . We introduce model (1) in general terms to potentially allow for modelling of complex covariance and dependency structures. However, for the rest of the paper we assume that correlations arises only due to betweenarea variation, i.e. for all areas . Note that the already mentioned LMM proposed by Battese_etal1988 for estimating arealevel means results as a special case of (1) by setting to be the linear model , with regression parameters . Defining as a random forest, results in the MERFapproach proposed by Hajjem2014, which is the preferred specification throughout the rest of the paper.
Before we continue, we want to clarify consequences of distributional assumptions in (1) that mainly address the linear part of the model. The unitlevel errors are assumed to follow . However, the assumption of normality does not affect the properties of the fixed part
and we do not require residuals to be normally distributed for the application of our proposed method. Nevertheless, for the random components part, we require a proper likelihood function to ensure that the adapted expectationmaximization (EM) algorithm (see below for further details) for parameter estimates converges towards a local maximum within the parameter space. A normalitybased likelihood function is exploited, as it has two important properties: firstly, it facilitates the estimation of random effects due to the existence of a closedform solution of the integral of the Gaussian likelihood function. Secondly, the maximum likelihood estimate for the variance of unitlevel errors is given by the mean of the unitlevel residual sum of squares. The estimation of the random effects could be also done in a nonparametric way by using discrete mixtures
(Marino2016; Marino2019). However, the modification towards a fully nonparametric formulation of model (1) is subject to further research.For fitting the model (1) we use an approach reminiscent of the EMalgorithm similar to Hajjem2014. In short, the MERFalgorithm subsequently estimates a) the forest function, assuming the random effects term to be correct and b) estimates the random effects part, assuming the OutofBagpredictions (OOBpredictions) from the forest to be correct. OOBpredictions utilize the unused observations from the construction of each forest’s subtree (Breiman2001; Biau_Scornet2016). The proposed algorithm is as follows:

Initialize and set random components to zero.

Set . Update and :


Estimate using a random forest with dependent variable and covariates . Note that is the same function for all areas .

Get the OOBpredictions .

Fit a linear mixed model without intercept and restricted regression coefficient of 1 for :

Extract the variance components and and estimate the random effects by:


Repeat Step (2) until convergence is reached.
The convergence of the algorithm is assessed by the marginal change of the modified generalized loglikelihood (GLL) criterion:
In the linear case with , and for given variance components and , the maximization of the GLLcriterion is equivalent to the solution of socalled mixed model equations (Wu_Zhang2006)  leading to best linear unbiased predictor (BLUP): . For random forests, the corresponding estimator for known parameters and is given by:
(3) 
Mathematical details of the derivations are provided in Appendix A. This result is in line with capitaine_etal2021, claiming that is obtained by taking the conditional expectation given the data and subsequently can thus be considered as the BLUP for the linear part of model (1).
The estimation of variance components in Step 2 (d) for and is obtained by taking the expectation of maximum likelihood estimators given the data. Although is a naive estimator within the discussed framework, it cannot be considered as a valid estimator for the variance of the unitlevel errors . Breiman2001 maintains that the sum of squared residuals from OOBpredictions are a valid estimator for the squared prediction error of new individual observations. However, as an estimator of the residual variance under the model, is positively biased, as it includes uncertainty regarding the estimation of function . Following Mendez_Lohr2011 we use a biasadjusted estimator for the residual variance from a random forest model (1) using a bootstrap biascorrection. The essential steps to obtain the corrected residual variance are summarized as follows:

Use the OOBpredictions from the final model after convergence of the algorithm.

Generate bootstrap samples , where the values are sampled with replacement from the centred marginal residuals .

Recompute using a random forest with as dependent variable.

Estimate the correctionterm by:
The biascorrected estimator for the residual variance is then given by:
(4) 
2.3 Predicting smallarea averages
The MERFmodel (1) predicts the conditional mean on individual level of a metric dependent variable given unitlevel auxiliary information. In the context of SAE, we are not interested in predictions on individual level, but in estimating indicators such as arealevel means or arealevel totals (Rao_Molina2015). Thus, we assume the same structural simplifications as the LMM proposed by Battese_etal1988 for estimating arealevel means throughout the paper, i.e. , is a designmatrix of areaintercept indicators, is a vector of random effects, and variancecovariance matrix for random effects simplifies to .
Firstly, we use the fact that random forest estimates of the fixed part express the conditional mean on unitlevel. We calculate the meanestimator for each area based on available supplementary data sources (usually census or administrative data) by:
Secondly, we exploit the result (3) that is the BLUP for the linear part of the model (1). Therefore, the proposed estimator for the arealevel mean is given by:
(5) 
In cases of nonsampled areas, the proposed estimator for the arealevel mean reduces to the fixed part from the random forest:
3 Estimation of uncertainty
The assessment of uncertainty of arealevel indicators in SAE is crucial to analyse the quality of estimates. The arealevel MSE is a conventional measure fulfilling this goal, but its calculation is a challenging task. For instance, for the unitlevel LMM with block diagonal covariance matrices (Battese_etal1988), the exact MSE cannot be analytically derived with estimated variance components (Gonzalez_etal2008; Rao_Molina2015) and only partlyanalytical approximations are available (Prasad_Rao1990; Datta_Lahiri2000). An alternative to estimate uncertainty of the arealevel indicators are bootstrapschemes (Hall_Maiti2006; Gonzalez_etal2008; Chambers_Chandra2013). In contrast, general statistical results for inference of random forests are rare, especially in comparison to the existing theory of inference using LMMs. The theoretical background of random forests grows, but mainly aims to quantify the uncertainty of individual predictions (Sexton_Laake2009; wager_etal2014; Wager_Athey2018; Athey_etal2019; Zhang2019). The extension of recent theoretical results, such as conditions for the consistency of unitlevel predictions (Scornet_etal2015) or their asymptotic normality (Wager_Athey2018), towards arealevel indicators is a conducive topic for further research.
In this paper, we propose a nonparametric random effect block (REB) bootstrap for estimating the MSE of the introduced arealevel estimator given by equation (5). We aim to capture the dependencestructure of the data as well as the uncertainty introduced by the estimation of model (1). Our bootstrapscheme builds on the nonparametric bootstrap introduced by Chambers_Chandra2013. The proposed REB bootstrap has two major advantages: firstly, empirical residuals depend only on the correct specification of the mean behaviour function of the model, thus the REB setting is lenient to specification errors regarding the covariance structure of the model. Secondly, the bootstrap within blocks ensures that the variability of residuals within each area is captured. We scale and centre the empirical residuals by the biascorrected residual variance (4) in order to eliminate the uncertainty due to the estimation of from the naive residuals. The steps of the proposed bootstrap are as follows:

For given calculate the vector of marginal residuals and define .

Using the marginal residuals , compute level2 residuals for each area by
and indicates the vector of level2 residuals.

To replicate the hierarchical structure we use the marginal residuals and obtain the vector of level1 residuals by . The residuals are scaled to the biascorrected variance (4) and centred, denoted by . The level2 residuals are also scaled to the estimated variance and centred, denoted by .

For :

Sample independently with replacement from the scaled and centred level1 and level2 residuals:

Calculate the bootstrap population as and calculate the true bootstrap population area means as for all .

For each bootstrap population draw a bootstrap sample with the same as the original sample. Use the bootstrap sample to obtain estimates and as discussed in Section 2.2.

Calculate arealevel means following Section 2.3 by


Using the bootstrap samples, the MSE estimator is obtained as follows:
4 Modelbased simulation
This section marks the first step in the empirical assessment of the proposed method. The modelbased simulation juxtaposes point estimates for the arealevel mean from the mixed effects random forest model (1) with several competitors. In particular, we study the performance of MERFs compared to the BHFmodel (Battese_etal1988), the EBP (Molina_rao2010) as well as the EBP under datadriven BoxCox transformation (EBPBC) by Rojas_etal2019. The BHFmodel serves as an established baseline for the estimation of arealevel means and the EBP and the EBPBC conceptually build on the BHFmodel. Differences in the performance of the EBP and the EBPBC highlight advantages of datadriven transformations, while differences in the performance of the linear competitors and the MERF indicate advantages of semiparametric and nonlinear modelling. Overall, we aim to show, that our proposed methodology for point and uncertainty estimates performs comparably well to ‘traditional’ SAE methods and has comparative advantages in terms of robustness against modelfailure.
The simulationsetting is characterized by a finite population of size with disjunct areas of equal size . We generate samples under stratified random sampling, utilizing the small areas as stratas, resulting in a sample size of . The areaspecific sample sizes range from to sampled units with a median of and a mean of . The sample sizes are comparable to arealevel sample sizes in the application in Section 5 and can thus be considered to be realistic.
We consider four scenarios denoted as Normal, Interaction, NormalPar, InteractionPar and repeat each scenario independently
times. The comparison of competing modelestimates under these four scenarios allows us to examine the performance under two major dimensions of modelmisspecification: Firstly, the presence of skewed data delineated by nonnormal errorterms and secondly, the presence of unknown nonlinear interactions between covariates. Scenario
Normal provides a baseline under LMMs with normally distributed random effects and unitlevel errors. As modelassumptions for LMMs are fully met, we aim to show, that MERFs perform comparably well to linear competitors in the reference scenario. Scenario Interaction shares its errorstructure with Normal, but involves a complex model including quadratic terms and interactions. This scenario portrays advantages of semiparametric and nonlinear modelling methods protecting against modelfailure. Working with inequality or income data, we often deal with skewed target variables. Thus, we use the Pareto distribution to mimic realistic income scenarios. Scenario NormalPar uses the linear additive structure of LMMs and adds Pareto distributed unitlevel errors. The resulting scenario, including a skewed target variable, is a classical example promoting the use of transformations assuring that assumptions of LMMs to be met. Finally, scenario InteractionPar combines the two discussed dimensions of model misspecification, i.e. a nonGaussian errorstructure with complex interactions between covariates. We chose this scenario to emphasize the ability of MERFs to handle both complications simultaneously. Further details on the datagenerating process for each scenario are provided in Table 1.Scenario  Model  

Normal  
Interaction  
NormalPar  
InteractionPar 
We evaluate point estimates for the arealevel mean by the relative bias (RB) and the relative root mean squared error (RRMSE). As qualitycriteria for the evaluation of the MSEestimates, we choose the relative bias of RMSE (RBRMSE) and the relative root mean squared error of the RMSE:
where is the estimated mean in area based on any of the methods mentioned above and defines the true mean for area in simulation round . is estimated by the proposed bootstrap in Section 3 and is the empirical root MSE over replications.
For the computational realization of the modelbased simulation, we use R (R_language). The BHF estimates are realized from the saepackage (Molina_Marhuenda:2015) and the emdipackage (Kreutzmann_etal2019) is used for the EBP as well as the EBP under the datadriven BoxCox transformation. For estimating the proposed MERFapproach, we use the packages randomForest (Liaw_Wiener2002) and lme4 (Bates_etal2015). We monitor the convergence of algorithm introduced in Section 2.2 with a precision of in relative difference of the GLLcriterion and set the number of splitcandidates to , keeping the default of trees for each forest.
We start with a focus on the performance of point estimates. Figure 1 reports the empirical RMSE of each method under the four scenarios. As expected, in the Normal scenario, the BHF and the EBP perform on the same level and outperform the MERF estimator. The EBP with a datadriven transformation (EBPBC) leads to similar results compared to the BHF and EBP. This shows that the datadriven transformation works as expected. A similar pattern appears in the results from the NormalPar scenario, except that the EBPBC reaches a lower overall RMSE due to its property of datadriven transformation and subsequently improved estimation under skewed data. As anticipated, a comparison of the performance of the MERF between the Normal and the NormalPar scenario indicates, that the MERF exhibits robust behaviour under skewed data and subsequently regarding violations of the normal distribution of errors. For complex scenarios, i.e. Interaction and InteractionPar, point estimates of the proposed MERF outperform the SAE methods based on LMMs. Nevertheless, the EBPBC performs better in terms of lower RMSE values compared to the BHF and the EBP in both interaction scenarios. Overall, the results from Figure 1 indicate that the MERF performs comparably well to LMMs in simple scenarios, and outperforms ‘traditional’ SAEmodels in the presence of unknown nonlinear relations between covariates. Additionally, the robustness against modelmisspecification of MERFs holds if distributional assumptions for LMMs are not met, i.e. in the presence of nonnormally distributed errors and skewed data. Table 2 reports the corresponding values of RB and RRMSE for our discussed point estimates. The RB and the RRMSE from the MERFmethod attest a competitively low level for all scenarios. Most interestingly, in complex scenarios (Interaction and InteractionPar), a familiar result regarding the statistical properties of random forests appears: the RB is higher compared to the LMMbased models, but the enlarged RB is rewarded by a lower RRMSE for point estimates.
Normal  Interaction  NormalPar  InteractionPar  
Median  Mean  Median  Mean  Median  Mean  Median  Mean  
RB  
BHF    
EBP    
EBPBC  
MERF  
RRMSE  
BHF  
EBP  
EBPBC  
MERF 
We scrutinize the performance of our proposed MSEestimator on the four scenarios, examining whether the observed robustness against modelmisspecification due to unknown complex interactions between covariates or skewed data for point estimates, also holds for our nonparametric bootstrapscheme. For each scenario and each simulation round, we choose the parameter of bootstrap replications . From the comparison of RBRMSE among the four scenarios provided in Table 3, we infer, that the proposed nonparametric bootstrapprocedure effectively handles scenarios that lead to modelmisspecification in cases of (untransformed) LMMs. This is demonstrated by essentially unbiasedness in terms of mean and median values of RBRMSE over areas of the MSEestimator under all four scenarios: independently, whether the data generating process is characterized by complex interactions (Interaction), nonnormal error terms (NormalPar) or a combination of both problems (InteractionPar). Additional information on the RBRMSE regarding the nonparametric bootstrap is available in Figure 7 in the Appendix B.
Normal  Interaction  NormalPar  InteractionPar  

Median  Mean  Median  Mean  Median  Mean  Median  Mean  
RBRMSE      
RRMSERMSE 
From the results of Table 3 and the subsequent discussion, we cannot directly infer the areawise tracking properties of the estimated RMSE against the empirical RMSE over our simulation rounds. Thus, Figure 2 provides additional intuition on the quality of our proposed nonparametric MSEbootstrap estimator. Given the tracking properties in all four scenarios, we conclude that our MSEestimates strongly correspond to the empirical RMSE. Furthermore, we do not observe systematic differences between the estimated and empirical MSEestimates regarding different surveysample sizes.
5 Application: Estimating household income for Mexican municipalities
In this section, we discuss the performance of our proposed method in the context of a genuine SAE example. Concretely, we apply the MERFmethod proposed in Section 2.2 to estimate domainlevel average household income for the Mexican state Nuevo León. Section 5.1 describes the data and Section 5.2 reports results. We end our empirical analysis by an additional designbased simulation enabling a profound discussion on the quality of point and MSEestimates in Section 5.3.
5.1 Data description
Income inequality in Mexico is a research topic of timeless importance, particularly regarding the effectiveness of developmental and social policies (lambert_hyunmin2019). Although the level of income inequality in Mexico is comparable to other Latin American countries, it is one of the highest among other OECD countries (Oecd_21). Analysing average national income values is a common practice, but constitutes an inappropriate measure to monitor the efficacy of regional policy measures. Besides detailed disaggregated information, also suitable statistical methods are needed to quantify local developments. For the following application, we break down regional differences in average household per capita income of one of 32 Mexican states. Nuevo León is located in the NorthEast of Mexico and according to the (subnational) Human Development Index (HDI), it is one of the most developed states of Mexico (smits_Permanyer2019). Nevertheless, the distribution of individual household income in Nuevo León is unequal and thus highly skewed. For instance, the Ginicoefficient of household income is comparable to the total Gini of Mexican household disposable income from 2012 which was (Oecd_21).
We use data from 2010 provided by CONEVAL (Consejo Nacional de Evaluación de la Política de Desarrollo Social), combining the Mexican household income and expenditure survey (Encuesta Nacional de Ingreso y Gastos de los Hogares, ENIGH) with a sample of census microdata by the National Institute of Statistics and Geography (Instituto Nacional de Estadística y Geografía). The dataset comprises income and sociodemographic data, equally measured by variables that are part of the survey as well as the census data. The target variable for the estimation of domainlevel average household income in Section 5.2, is the total household per capita income (ictpc, measured in pesos), which is available in the survey but not in the census.
Nuevo León is divided into municipalities. While the census dataset in our example comprises information on households from all municipalities, the survey data includes information on households from municipalities, ranging from a minimum of to a maximum of households with a median of households. This leaves administrative divisions to be outofsample. Table 4 provides details on sample and census data properties.
Total  Insample  Outofsample  
51  21  30  
Min.  1st Qu.  Median  Mean  3rd Qu.  Max.  
Survey area sizes  5.00  14.00  27.00  68.33  79.00  342.00 
Census area sizes  76.00  454.50  642.00  1075.45  872.50  5904.00 
With respect to the designbased simulation in Section 5.3, we emphasize that we are in the fortunate position of having a variable that is highly correlated with the target variable ictpc in the application and that is available in the survey and in the census dataset: the variable inglabpc measures earned per capita income from work. Although inglabpc deviates from the desired income definition for our approach  as it covers only one aspect of total household income  it is effective to evaluate our method under a designbased simulation in Section 5.3. Furthermore, the designbased simulation implicitly assess the quality of our empirical results from Section 5.2 .
Using data from Nuevo León for the estimation of domainlevel income averages, is an illustrative and realistic example and imposes several challenges on the proposed method of MERFs: first of all, about percent of households in the surveysample are located in the capital Monterrey. Secondly, there exist more outof sample domains than insample domains. Moreover, we are confronted with households reporting zeroincomes. Our intention in choosing this challenging example for the application part in Section 5.2 and the following designbased simulation in Section 5.3 are simple: we aim to show, that our proposed approaches for point and uncertainty estimates demonstrate a valid alternative to existing SAEmethods and are predominantly applicable for cases where ‘traditional’ methods perform poorly or even fail. Additionally, we aim to provide a clearcut presentation and empirical assessment of MERFs for SAE, which requires a transparent discussion of advantages and potential weaknesses in demanding realworld examples.
5.2 Results and discussion
Direct estimates for average total household per capita income for Nuevo León are possible for 21 out of 51 domains. The use of modelbased SAE methods incorporating covariate census data will not only lead to estimates for the remaining outofsample areas, but correspondingly improve the overall quality of estimates (Tzavidis_etal2018). As variable ictpc is highly skewed, we deduce potential issues of modelmisspecification and suggest the use of the EBPBC and the MERF. Given the theoretical discussion and the results in the modelbased simulation in Section 4, we infer that the EBPBC and the proposed method of the MERFs for SAE effectively handle nonnormally distributed data. Moreover, we are particularly interested in differences between these two diverse SAEmodels in the context of realworld applications. Figure 3 maps results from direct estimates, the MERF and the EBPBC. Obviously, the modelbased estimates from the MERF and the EBPBC expand the perspective of regional disparities in average total household income per capita towards nonsampled regions. Furthermore, we identify three distinct clusters of income levels from our results in Figure 3 that had not been observable from the mapping of direct estimates: a low income cluster in the South of Nuevo León, a very high income cluster in the metropolitan area of the capital Monterrey and a group of middleincome areas between the North and the South of the state. This finding illustrates, the potential of modelbased techniques to highlight patterns of regional income disparities and enable the mapping of reliable empirical evidence. Given the information provided by the three maps, we do not report major differences between the point estimates of MERFs and the EBPBC.
Apart from mapping empirical results of domain averages, we are mainly interested in quality criteria, such as the coefficients of variation (CV) and the details of modelspecification for the EBPBC and the MERF. To obtain estimates of variances for the direct estimates, we use the calibrated bootstrap method (Alfons_Templ2013) as provided in the R package emdi (Kreutzmann_etal2019). For the model of the datadriven EBPBC, we follow the approach of Rojas_etal2019 and use the Bayesian Information Criterion (BIC) to identify valid predictors for the target variable of ictpc. The resulting workingmodel includes variables determining occupation, sources of income, the socioeconomic level and educational aspects of individual households. The identification of predictive covariates for MERFs highlights a conceptual difference to LMMbased methods. Due to the properties of the random forest algorithm (Breiman2001), random forests perform an implicit variable selection. The selected model for fixed effects in our case is characterized by an Rsquared of about percent. The dilemma between predictive precision and the interpretability of random forest models can be mitigated by concepts such as variable importance plots (Greenwell_etal2020) or an analysis of partial dependence for influential covariates (Greenwell_2017). We provide graphical representations in Figure 8 and Figure 9 in the Appendix B. We monitor the convergence of the proposed MERF algorithm under a precision of in relative difference of the GLL criterion and keep the default of trees. A parameter optimization based on 5fold crossvalidation on the original surveysample advices the use of variables at each split for the forest. For the MSE bootstrapprocedure, we use .
Figure 4 reports corresponding CVs for in and outofsample domains. We observe a significant improvement for insample CVs of the EBPBC and the MERF compared to the CVs for direct estimates. CVs of MERFs are slightly lower in median terms than the results for the EBPBC. However, there exists one outlying area for MERFs. Going into details, the corresponding area of General Zaragoza features no obvious dataspecific irregularities, such as extremely low sample size. Nevertheless, General Zaragoza is one of the poorest regions according to our analysis. In the designbased simulation in Section 5.3, we will pay special attention to differences between MERFs and the EBPBC regarding their ability to handle comparably extreme estimates given a broad range of relatively high and relative low income areas.
Regarding the CVs for outofsample areas, we discover an evident advantage of CVs from our proposed MERF approach. From the scrutiny of individual CV values it remains unclear, whether the asset of improved results from MERFs roots in superior point estimates for domainlevel averages or its relatively lower MSEestimates. Figure 5 compares direct estimates to the modelbased estimates for in and outofsample domains. Apparently, there exist no systematic differences between the estimates from the EBPBC and the MERF. Thus, it appears as if the variance of MERF predictions is generally lower. This conjecture is in line with the theoretical properties of random forests (Breiman2001; Biau_Scornet2016). Insample areas in Figure 5 are sorted by surveysample sizes. In comparison to the direct estimates, predicted averages of the EBPBC as well as of the MERF seem less extreme. The obvious irregularity in terms of highincome, is a distinct part of the Monterrey metropolitan area: San Pedro Garza García registers several headquarters of national and international corporations. This economic peculiarity apparently transfers to the income of its citizenship. Figure 3, underlines the existence of an apparent highincome cluster in this region. Overall, it is interesting to observe how reliable estimates on fine spatial resolution unveil patterns of regional income segregation. Our proposed method of MERFs provides useful results with remarkably higher accuracy than direct estimates and the EBPBC for most of outofsample domains. The following designbased simulation will strengthen the reliability of results and enable an indepth discussion of our methods for point and MSEestimates.
5.3 Evaluation using designbased simulation
The designbased simulation allows to directly juxtapose the performance of the proposed MERFapproach to existing SAEmethods for the estimation of arealevel means based on empirical data. In this sense, the designbased simulation adds not only insights to the results from the modelbased simulation in Section 4, but also evaluates results from the example in the previous Section 5.3. We focus on arealevel meanestimates of household income from work inglabpc in the Mexican state Nuevo León. As we use the same data with a different target variable, sample and census data properties are similar to the previous example with details provided in Table 4. Implementing the designbased simulation, we sample independent samples from the fixed population of our census dataset. Each pseudosurveysample mirrors the characteristics of the original survey, as we keep the number of insample households similar to the original sample sizes and abstain from sampling outofsample municipalities. As a result, we use equally structured pseudosurveysamples with equal overall sample size. True values, are defined as the domainlevel averages of household incomes from work in the original census.
We consider the same methods as in the modelbased simulation in Section 4. Comparably to Section 5.2, we use the same workingmodel for the BHF, the EBP and the EBPBC and assume it to be fixed throughout the designbased simulation. For the EBPBC and the MERF, we keep the parameters as already discussed in Section 5.2.
Total  Insample  Outofsample  
Median  Mean  Median  Mean  Median  Mean  
RB  
BHF  
EBP  
EBPBC  
MERF  
RRMSE  
BHF  
EBP  
EBPBC  
MERF 
The discussion of results, starts with an investigation into the performance of point estimates. Figure 6 reports the average RMSE of the arealevel meanestimates for Nuevo León in total and with details on the in and outof sample areas. The corresponding summary statistics for Figure 6 are given in Table 8 in Appendix B. Regarding the total of areas, we observe no remarkable difference in the performance of the BHF and the EBP, whereas the EBPBC has lower RMSE on average. The MERF point estimates indicate the lowest RMSEs among all areas resulting in an more than percent improvement compared to the BHF. Referring to the RMSE for insample areas, we see two different ways how the EBPBC and the MERF deal with high and unbalanced variation in our true values for certain areas, ranging from to about pesos: overall, the MERF deals best in modelling the complex survey data and produces highly accurate estimates for the majority of insample areas. A closer look at the results, reveals however, that higher RMSE values due to overestimation mainly occur in two areas, both characterised by a very low level of income ( and pesos respectively). In contrast, we observe the insample behaviour of the EBPBC, which clearly opposes its superior overall performance. The EBPBC appears to balance extreme estimates by producing slightly worse estimates for each individual insample areas, than allowing for individually inferior estimates for specific ‘outlier’areas. This behaviour is conceptually rooted in its datadriven transformationapproach. Nevertheless, this property enables the EBPBC to identify a model, providing stable and precise estimates on the majority of areas, especially the 30 nonsampled areas. Given the datascenario of Nuevo León, the performance on the outofsample areas, delineates each method’s quality and stability. In this case, the EBPBC and MERF outperform the ‘traditional’ methods (BHF and EBP) in terms of lower RMSE. One advantage of the MERF is its adaptability and implicit modelselection that is rewarded in the presence of complex datascenarios.
The findings from Figure 6, are strengthened by a discussion of mean and median values of RB and RRMSE in Table 5. Referring to all areas, the RB of the datadriven method of EBPBC and the MERF is smaller in median values than the RB of BHF and the EBP. Respectively, the MERF shows the lowest median value of RB while meanvalues lie in the same range with competing methods. The obvious difference between mean and median values, indicates the previously discussed existence of inferior estimates for specific regions due to the empirical properties of our underlying data. For the insample areas, the MERF performs superior to competing methods regarding the median values of RB and RRMSE. The close relation between the mean and median values of RB for the EBPBC highlight the mentioned balancingproperty of the EBPBC. For the majority of areas in the modelbased simulation, i.e. the nonsampled areas, the EBPBC as well as the MERF exhibit a comparatively low level of median RB. Especially the MERF captivates by the lowest values in mean and median for RB and RRMSE compared to its competitors.
Finally, we focus on the performance of the proposed nonparametric MSEbootstrap procedure. While, the modelbased simulation in Section 4 indicates unbiasedness of the proposed bootstrapscheme under all four scenarios, our results from the designbased simulation require a deeper discussion. Table 6 reports the results of RBRMSE and the RRMSERMSE for the corresponding estimates. Figure 10 in the Appendix B visualizes details from Table 6. First of all, the values of RRMSERMSE are comparable to the most complex scenario in the modelbased scenario in Section 4. The RBRMSE for the insample areas indicates unbiasedness in median terms and an acceptable overestimation regarding the mean RBRMSE. For outofsample areas, we face a moderate underestimation regarding the median value and overestimation according to mean values. Nevertheless, Figure 10 in Appendix B reveals, that the mixed signal between mean and median in Table 6 is explained by a balanced mix of under and overestimation. Overall, the expectations towards the MSEbootstrap procedure, given the challenging conditions of this designbased simulation, are met. Especially, the results for insample areas, combined with insights from the modelbased simulation, indicate a solid and reliable performance of the proposed nonparametric bootstrap procedure. Although, the RBRMSE for all areas is driven by the results from outofsample areas, the median RBRMSE is acceptable. Apparently, the MSEestimates mirror the high variation in sample sizes paired with high and disproportional variation of highincome and lowincome regions between the insample and
outofsample areas. From an applied perspective, the MSEestimates for outofsample areas are nevertheless practicable for the construction of confidence intervals, with a median coverage of
.Total  Insample  Outofsample  

Median  Mean  Median  Mean  Median  Mean  
RBRMSE      
RRMSERMSE 
6 Concluding remarks
In this paper, we explore the potential of treebased machine learning methods for the estimation of SAEmeans. In particular, we provide a solid framework easing the use of random forests for regression within the existing methodological framework of SAE. We highlight the potential of our approach to meet modern requirements of SAE, including the robustness of random forests against modelfailure and the applicability for highdimensional problems processing Big Data sources. The methodological part focusses on the MERFprocedure (Hajjem2014) and implicitly discusses a semiparametric unitlevel mixed model, treating LMMbased SAEmethods, such as the BHF and the EBP, as special cases. The model is fit by an algorithm resembling the EMalgorithm, allowing for flexibility in the specification to model fixed effects as well as randomeffects. The proposed point estimator for arealevel means is complemented by the nonparametric MSEbootstrapscheme, building on the REBbootstrap by Chambers_Chandra2013 and the biascorrected estimate for the residual variance by Mendez_Lohr2011. We evaluate the performance of point and MSEestimates compared to ‘traditional’ SAEmethods by model and designbased simulations and provide a distinctive SAE example using income data from the Mexican state Nuevo Leòn in Section 5.2. The modelbased simulation in Section 4 demonstrates the ability of point estimates to perform compatibly in classical scenarios and outperform ‘traditional’ methods in the existence nonlinear interactions between covariates. The designbased simulation in Section 5.3 confirms the adequacy of MERFs for point estimates under searingly realistic conditions. The model and designbased simulations indicate that the proposed approach is robust against distributional violations of normality for the random effects and for the unitlevel error terms. Concerning our proposed MSEbootstrapscheme, we conclude its reliability based on the performance in the modelbased simulation in Section 4. Furthermore, we obtain reasonable support for the performance in the application in Section 5.2 and the following designbased simulation in Section 5.3.
We motivate three major dimensions for further research, including theoretical work, aspects of generalizations and advanced applications using Big Data covariates: from a theoretical perspective, further research is needed to investigate the construction of a partialanalytical MSE for arealevel means or the construction of an asymptotic MSEestimator. A conducive strategy is the deduction of recent theoretical results, such as conditions for the consistency of unitlevel predictions (Scornet_etal2015) or considerations of individual predictions intervals (wager_etal2014; Zhang2019), towards arealevel indicators. Additionally, considerations concerning a fully nonparametric formulation of model (1) impose an interesting research direction. From a survey statistical perspective, our proposed method currently abstains from the use of survey weights which bears a risk if the assumption of noninformative sampling is violated. Nevertheless, there exist approaches incorporating weights into random forests (Winham_etal2013)
. The transfer of such ideas to the proposed method of MERFs is subject to ongoing research. Regarding additional generalizations of the proposed method, we aim to extend the use of MERFs towards the estimation of small area quantiles and other nonlinear indicators, such as Ginicoefficients or Head Count Ratios. Furthermore, a generalization towards binary or count data is possible and left to further research. The semiparametric composite formulation of model (
1) allows for to adapt any functional form regarding the estimation of the conditional mean of given and technically transfers to other machine learning methods, such as gradientboosted trees or support vector machines. In terms of advanced applications, we propose the use of MERFs for complex random effect and covariancestructures in empirical problems to the SAEresearch community. Equally interesting is the use of high dimensional supplementary data, i.e. Big Data covariates, for the estimation of arealevel means, that can be directly handled by the proposed MERFframework.Acknowledgements
The authors are grateful to CONEVAL for providing the data used in empirical work. The views set out in this paper are those of the authors and do not reflect the official opinion of CONEVAL. The numerical results are not official estimates and are only produced for illustrating the methods.
Additionally, the authors would like to thank the HPC Service of ZEDAT, Freie Universität Berlin, for computing time.
Appendix A
After convergence of the algorithm introduced in Section 2.2, we obtain an optimal nonparametric estimator for . The best predictor for the random effects for known parameters and must maximize the generalized loglikelihood criterion:
Finding a maximum for is equivalent to the problem of finding a maximizer for in the first term of the summation of the proposed GLLcriterion:
Reshaping leads to:
Now we derive the expression with respect to and set it to 0 in order to find the maximizer
The solution of the maximization problem is given by . Note, for , the optimality solution resembles the BLUP.
Appendix B: additional simulation results and modeldiagnostics
Variable name  Explanation 

ictpc  Total household income per capita 
escol_rel_hog  Average relative amount of schooling standardized 
by age and sex of household members  
bienes  Availability of goods in the household 
jaesc  Average years of schooling of persons in the household 
jnived  Formal education of the head of the household 
actcom  Assets in the household 
pcpering  Percentage of income earners in the household 
jexp  Years of working experience of the head of the household 
pcpering_2  Number of income earners in the household by household size 
pcocup  Percentage of people employed in the household 
jtocup  Occupation type 
Areas  Method  Min  1st Qu.  Median  Mean  3rd Qu.  Max 

Total  BHF  72.58  170.63  336.06  386.40  498.13  1351.82 
EBP  68.83  168.13  341.10  387.91  490.14  1342.92  
EBPBC  65.22  225.84  331.64  376.49  460.86  1094.08  
MERF  82.58  136.01  236.86  298.20  447.21  716.72  
Insample  BHF  111.93  139.13  246.62  301.90  349.26  978.49 
EBP  107.41  143.07  251.95  308.47  348.00  994.45  
EBPBC  145.48  212.64  308.43  314.08  353.02  705.81  
MERF  94.56  123.72  141.16  264.27  392.73  688.74  
Outofsample  BHF  72.58  224.35  422.90  445.55  541.31  1351.82 
EBP  68.83  248.89  412.91  443.52  547.12  1342.92  
EBPBC  65.22  234.98  348.83  420.17  494.83  1094.08  
MERF  82.58  151.24  272.97  321.95  493.85  716.72 