Household income is a very important measurement of the development of one region’s economy. It is of great practical interest to examine the effects of covariates on the household income. Since household income data are always collected from different survey centers in different regions, there are two challenges when analyzing household income data. For geographically distributed data, it is not desirable to fit a traditional regression model because the traditional regression model does not account for the spatial dependence among different regions. As a result, the first challenge for spatially dependent data such as household incomes from different regions is to build a suitable regression model. From Banerjee et al. 1 and Cressie 2
, there are different approaches for modelling spatially dependent data, such as the conditional autoregressive model (CAR), the simultaneous autoregressive model (SAR), and the linear regression model with spatial random effects. For the areal data, CAR and SAR are two widely used models. The study region is partitioned into a finite number of areal units with well-defined boundaries1. The spatial correlation structure depends on adjacency matrix of subareas. The CAR model is appropriate for situations with the first order dependency or a relatively local spatial autocorrelation, which assumes that a particular area is influenced by its neighbors. However, the SAR model is more suitable where there is the second order dependency or a more global spatial autocorrelation. The locations of the point reference data vary continuously over the study region. The spatial correlation structure depends on the distance between the locations. The most popular model for point reference data is the regression model with Gaussian spatial random effects 3. Another challenge for analyzing such kind of data is that there exist some missing covariates. Household income data are collected from surveys, so it is common for us to get incomplete data for some covariates. There is rich literature on building regression models with missing covariates. Zhao et al. 45
proposed methods for Bayesian inference of regression models with missing covariates. However, no existing literature deals with spatial data and missing covariates simultaneously.Seshadri 6 proposed a spatial averaging approach for modelling spatial response data only. Bae et al. 7, Xue et al. 8 and Collins et al. 9 also proposed some approaches for dealing with spatial missing data. However, they did not consider missing data model in their approaches. Besides, spatial random effects are not commonly used in missing covariate models to take account of spatial effects. Recently, Grantham et al. 10
built a joint hierarchical model for PM 2.5 and aerosol optical depth (AOD). To deal with missingness of AOD in spatial regression model, they assume informative missingness of AOD and build spatial regression model for AOD to interpolate AOD.
In this paper, we develop a Bayesian spatial regression model to deal with the spatially dependent data with missing covariates using the idea from Ibrahim et al. 5. We assume that the missing covariates are spatially dependent and build hierarchical spatial regression models for both the response variable and missing covariates. Furthermore, we propose the modified Deviance Information Criterion (mDIC) and the modified Logarithm of the Pseudo-Marginal Likelihood (mLPML). One of the main focus of this paper is on the examination of the impact of spatial effects in the missing covariates models on the spatial response model. Our proposed mDIC and mLPML criteria allow us to assess the fit of the spatial response data under covariates models with or without spatial effects. We further conduct extensive simulation studies to examine the empirical performance of the proposed criteria. Such investigation and assessment have not been carried out in the literature based on our best knowledge.
The remainder of this paper is organized as follows. In Section 2, the data from Chinese Health and Nutrition Survey (CHNS) 2011 are introduced as a motivating example. In Section 3, we develop the spatial regression model for the response variable, the model for missing covariates with spatial random effects, and the model for the missing data mechanism. Furthermore, Bayesian model assessment criteria including mDIC and mLPML are used for model comparison. An extensive simulation study is conducted in Section 4 to investigate empirical performance of the models proposed in Section 3. In Section 5, the proposed method is employed to analyze the real data set of CHNS 2011. Finally, we conclude the paper with a brief discussion in Section 6.
2 Motivating Example
Chinese Health and Nutrition Survey (CHNS), a project collaborated by the Carolina Population Center at the University of North Carolina and the National Institute for Nutrition and Health at the Chinese Center for Disease Control and Prevention, aims to examine the relationship between the social and economic transformation of Chinese society and the health and nutritional status of its population. As a geographically distributed data set, CHNS 2011 collected individual-, household- and community-specific information from 12 provinces in China. In this paper, household income from 12 provinces is selected as the spatial response variable, and the aim is to explore the spatial effects and the factors that may have impacts on this variable of interest.
2.1 Data Description
The data were collected from 12 provinces in China with a total sample size of 4346. Household income (hincome) is the response variable. Individual-level covariates include wage of head of the household (indwage), age of head of the household (age), proportion of urban area (urban), number of hours worked last year (WThour), family size (hhsize) and GDP per capita of the province (GDP). The units of hincome, indwage and GDP are CNY.
The sample sizes in different provinces as well as the summary information of the variables are shown in Table 1.
Among these covariates, indwage and WThour have missing values. The average percentages with only indwage or WThour missing are 22.50% and 5.34%, respectively, while the average percentage with both indwage and WThour missing is 22.30%. A summary of the missing patterns of these two covariates are given in Table 2.
2.2 Spatial Structure
In the CHNS 2011 data set, we do not have survey data for all the provinces in China. Also, the provinces included in this data set are not always neighbored with each other. Thus, we treat the CHNS 2011 data as point-referenced data such that the spatial dependence can be possibly and reasonably captured by the distance between two provinces especially when they are away from each other. The centroid latitudes and longitudes of these 12 provinces are given in Table 3. Figure 1 shows the map of mainland China. The provinces which are included in our study are marked in blue color.
Using the coordinates of 12 provinces, we can easily calculate the distance between two provinces. These distances are useful to construct covariance matrices of the spatial random effects in Section 5 below.
In this section, a spatial regression model with missing covariates is built hierarchically. A Gaussian spatial regression model for the response variable is built, after which, missing covariate distributions are built to take account of the missing covariates and covariate-specific spatial effects. In addition, a model capturing the missing data mechanism is also built. After introducing the model construction, posterior inference procedure and model assessment are presented.
3.1 The Spatial Regression Model for Responses
Suppose, we consider locations and observations at location . The spatial response variable at location is denoted by . A Gaussian stationary spatial process model is built for the spatial response variable. The general Gaussian stationary spatial process model can be written as in, for example, Cressie 3:
where is a matrix, is the number of covariates, is the
-dimensional vector withs, is an -dimensional vector of covariates, and is a dimensional vector of corresponding regression coefficients. The spatial random effect is a second-order stationary mean-zero process. To be more specific, conforms that , , and , where is a valid two-dimensional correlation function.
is the white noise process such that
, where “MVN” represents the multivariate normal distribution,is the identity matrix, and for . According to (1), the following spatial hierarchical model is built:
where is the response-specific spatial random effect, is a spatial correlation matrix based on distance and parameter . For the exponential spatial correlation kernel, the th entry of the correlation matrix is , where is the Euclidian distance between location and location , and is the range parameter for spatial correlation. A small value of means a strong spatial correlation, and a large value of means a weak spatial correlation.
3.2 The Spatial Regression Models for Missing Covariates
For survey data, it is common that the data for some covariates are not completely observed. For example, is the th covariate at location and has observations. If there are any missing values among those observations, i.e. if any one of the elements of is missing, is defined as a missing covariate at location . For the CHNS 2011 dataset discussed in Section 2, two missing covariates exist at all locations. Therefore, in this section, we assume that for all locations, among the covariates, the first covariates are missing covariates.
In the presence of missing covariates, a joint model for the missing covariates should be specified to take account of the uncertainty resulting from the missing values in the covariates. To be specific, for the th observation at location , the corresponding -dimensional missing covariate vector is , while the -dimensional complete covariate vector is denoted by . For missing covariate data, it is crucial to specify a model for the missing covariates . Given the spatial random effects, we assume , , are conditionally independent. In general settings, Lipsitz and Ibrahim 11 and Ibrahim et al. 12 specified the missing covariate distribution through a series of one-dimensional conditional distributions. In our case, since the covariates are also spatially distributed, covariate-specific spatial effects are also taken into account in the missing covariate model. We extend their model as
where , represents the spatial effect of covariate at location ,
is a vector of the standard deviations of the spatially structured random errors of the covariates,is a vector of the precisions of the independent random errors of the covariates, and the coefficients associated to the covariates are with being the indexing parameter vector for the th conditional distribution. For the covariate-specific spatial random effects , a multivariate normal distribution similar to (3) can be assumed. As in Grund et al. 13, here we assume the spatial random effects, ’s, of the missing covariates and of the response variable are independent. This assumption is reasonable since captures spatial dependence of the covariate , captures spatial dependence of the response variable, and the dependence between the response variable and the covariates is induced by the spatial regression model in (3).
gave some guidelines for specifying the sequence of one-dimensional conditional distributions. When the missing covariates are categorical, logistic regression for the conditional missing covariate distribution can be specified. Probit or complementary log-log links are also suitable to model categorical covariates. Ordinal regression models can be employed to model missing ordinal covariates. For count variables, we can model them via Poisson regression. And for continuous variables, normal regression, log-normal regression, and exponential regression can be considered.
In our extended model, covariate-specific spatial effects are considered in the model additionally. Conditional on the spatial effects, missing covariates can be modeled according to the above strategy. And for the spatial effects, the same stationary process structure in (3) can be used. In the motivating example, there are missing continuous covariates, and the spatial regression model for these two missing covariates can be written, for and , as
where denotes a vector of the other covariates except , and are the indexing parameter vectors for the distributions of and , respectively, and are the precision parameters of and , and are the standard deviations of and , and and are the corresponding range parameters for spatial correlations of and , which are different than defined in (3).
3.3 Models for Missing Data Mechanism
Assuming a corresponding missing indicator for each missing covariate, for observation we have the -dimensional missing indicator vector with if is observed and if is missing (
). The joint distribution ofcan also be written as the form of a product of one-dimensional conditional distributions, that is
for and , where parameterizes the missingness mechanism model with as a vector of indexing parameters for the th conditional distribution. For each one-dimensional conditional distributions of these binary missing indicators, it is common to build a logistic regression model for each of them.
In the missing data literature, missing data mechanism can be categorized as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) 15. When missingness does not depend on the covariates that are missing or observed, then the missing data mechanism is termed as MCAR. When missingness depends only on the observed covariates but not on the missing ones, the missing data mechanism is MAR. When neither MCAR nor MAR holds, the missing data mechanism is termed as MNAR.
For simplicity, in our case we assume that the missing data mechanism is MAR, which means that the missing data does not depend on the missing covariates. For missing covariates, the joint distribution of the missing indicators is written as
3.4 Inference Procedure
For the unknown parameters , where , we assume that they are independent a priori. For , the following prior distributions are assigned: , for ; ; ; ; , for ; ; ; ; and , for , where and are the dimensions of covariates in the missing covariate model and missing data mechanism model of the th missing covariate, respectively.
Note that , , , , , , , , , , and
are prespecified hyperparameters. In this article, we use, and , which lead to non-informative priors. With the above prior distributions, the posterior distribution of these unknown parameters based on the observed data with is given by
where refers to the spatial regression model for the response variable in (2), is defined in (4), and is defined in (3.3) . In equation (6), and refer to the distributions of and ’s, respectively, , and denotes the joint prior distribution of the unknown parameters. When a MAR missing data mechanism is assumed, the model for the missing data mechanism does not need to enter the posterior distribution.
The analytical form of the posterior distribution of is unavailable. Therefore, we carry out the posterior inference using the Markov chain Monte Carlo (MCMC) sampling algorithm to sample from the posterior distribution. Instead of sampling from the posterior distributions of the unknown parameters directly, MCMC samples from the full conditional distributions of the parameters with the remaining variables fixed to their current values are obtained. In this way, we can conduct inferences of the proposed model. In our case, spatial random effects are also regarded as unknown parameters, and then the algorithm samples these parameters in turn from their corresponding full conditional distributions.
3.5 Model Assessment
Since our main objective is to assess the fit of the spatial regression model for the response, we specify the following deviance function:
Therefore, we define a modified DIC (mDIC) for the response model as follows:
where , and are the posterior means of parameters and missing covariates. A smaller value of mDIC indicates a better model.
Let denote the observation data with the th subject response deleted. Following Hanson et al. 18, we consider a modified Conditional Predictive Ordinate (mCPO) for the th subject as
where and denotes the normalizing constant. In practice, a Monte Carlo estimate of mCPO using MCMC algorithms from the posterior distributions can be used. To be specific, letting , and () denote a MCMC sample of unknown parameters and missing covariates from the corresponding augmented posterior distribution, a Monte Carlo estimate of is given by
Then mLPML is given by
Similar to the conventional LPML, a larger value of mLPML indicates a more favorable model.
4 A Simulation Study
4.1 Simulation Description
In this simulation study, we randomly generated 20 locations in a space of . For each location, we generated 50 observations based on
where , , is i.i.d. generated from and . Covariate is independently generated from , is generated from , and is generated from , where , , , , and . For both spatial random effects, the th entry of is , where is the distance between and , , , and .
Missing data for are generated with a missing data mechanism that does not depend on , leading to the missing data to be MAR. As a result, the missing data mechanism can be ignored when estimating the parameters. Specifically, let if is observed and if is missing (. The joint distribution of is given by
where , and are the vectors of parameters corresponding to the distributions of and , respectively. We take logistic regression models for and . Thus,
In (13) and (14), and . One hunderd simulated datasets were generated in this study. The average percentages over the 100 simulated datasets with only missing or only missing are 32.82% and 39.27% respectively, while the average percentage with both and missing is 28.72%.
4.2 Simulation Results
According to Section 3, we set up the following model and fix the parameters related to to their true values. The spatial regression model for the response variable is given as
The models for the two missing covariates are given as
The true values of the model parameters are shown in Table 4
. In order to examine empirical performance of the posterior estimates, several assessment measures including average bias (Bias), average standard deviations (SD), mean square error (MSE) and coverage probability (CP) for each parameter are computed. Takingas an example, these measures are given as
where is the true value of and is the total number of simulated datasets while is the posterior mean of . is the estimated standard deviation of , and is the estimated 95% highest probability density (HPD) interval of computed from the th simulated dataset for . Bayesian estimates are obtained via JAGS19 and R20. With the thinning interval to be 20, 5,000 samples are kept for calculation after a burn-in of 10,000 samples. The results of these measures with all records, CC analysis, and model proposed above are shown in Table 4. The difference between “all records", “CC" and “” is on the datasets used to fit the proposed model. “All records" means using the whole dataset before generating the missing ones, “CC" means using the datasets excluding the missing records, and “” means using the datasets with missing values.
From Table 4, we can observe that the biases of the posterior estimates under CC are much greater than those under model . The 95% HPD intervals under are larger than those under CC. Thus, is more preferred than the CC analysis.
In order to assess the performance of the model comparison criteria proposed in Section 3.5, we set up several alternative models with the same response model as but with different missing covariate models as follows: