Bayesian Hierarchical Spatial Regression Models for Spatial Data in the Presence of Missing Covariates with Applications

07/05/2020 ∙ by Zhihua Ma, et al. ∙ 0

In many applications, survey data are collected from different survey centers in different regions. It happens that in some circumstances, response variables are completely observed while the covariates have missing values. In this paper, we propose a joint spatial regression model for the response variable and missing covariates via a sequence of one-dimensional conditional spatial regression models. We further construct a joint spatial model for missing covariate data mechanisms. The properties of the proposed models are examined and a Markov chain Monte Carlo sampling algorithm is used to sample from the posterior distribution. In addition, the Bayesian model comparison criteria, the modified Deviance Information Criterion (mDIC) and the modified Logarithm of the Pseudo-Marginal Likelihood (mLPML), are developed to assess the fit of spatial regression models for spatial data. Extensive simulation studies are carried out to examine empirical performance of the proposed methods. We further apply the proposed methodology to analyze a real data set from a Chinese Health and Nutrition Survey (CHNS) conducted in 2011.



There are no comments yet.


page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Household income is a very important measurement of the development of one region’s economy. It is of great practical interest to examine the effects of covariates on the household income. Since household income data are always collected from different survey centers in different regions, there are two challenges when analyzing household income data. For geographically distributed data, it is not desirable to fit a traditional regression model because the traditional regression model does not account for the spatial dependence among different regions. As a result, the first challenge for spatially dependent data such as household incomes from different regions is to build a suitable regression model. From Banerjee et al. 1 and Cressie 2

, there are different approaches for modelling spatially dependent data, such as the conditional autoregressive model (CAR), the simultaneous autoregressive model (SAR), and the linear regression model with spatial random effects. For the areal data, CAR and SAR are two widely used models. The study region is partitioned into a finite number of areal units with well-defined boundaries

1. The spatial correlation structure depends on adjacency matrix of subareas. The CAR model is appropriate for situations with the first order dependency or a relatively local spatial autocorrelation, which assumes that a particular area is influenced by its neighbors. However, the SAR model is more suitable where there is the second order dependency or a more global spatial autocorrelation. The locations of the point reference data vary continuously over the study region. The spatial correlation structure depends on the distance between the locations. The most popular model for point reference data is the regression model with Gaussian spatial random effects 3. Another challenge for analyzing such kind of data is that there exist some missing covariates. Household income data are collected from surveys, so it is common for us to get incomplete data for some covariates. There is rich literature on building regression models with missing covariates. Zhao et al. 4

used estimating equations for regression analysis in the presence of missing observations on one covariate.

Ibrahim et al. 5

proposed methods for Bayesian inference of regression models with missing covariates. However, no existing literature deals with spatial data and missing covariates simultaneously.

Seshadri 6 proposed a spatial averaging approach for modelling spatial response data only. Bae et al. 7, Xue et al. 8 and Collins et al. 9 also proposed some approaches for dealing with spatial missing data. However, they did not consider missing data model in their approaches. Besides, spatial random effects are not commonly used in missing covariate models to take account of spatial effects. Recently, Grantham et al. 10

built a joint hierarchical model for PM 2.5 and aerosol optical depth (AOD). To deal with missingness of AOD in spatial regression model, they assume informative missingness of AOD and build spatial regression model for AOD to interpolate AOD.

In this paper, we develop a Bayesian spatial regression model to deal with the spatially dependent data with missing covariates using the idea from Ibrahim et al. 5. We assume that the missing covariates are spatially dependent and build hierarchical spatial regression models for both the response variable and missing covariates. Furthermore, we propose the modified Deviance Information Criterion (mDIC) and the modified Logarithm of the Pseudo-Marginal Likelihood (mLPML). One of the main focus of this paper is on the examination of the impact of spatial effects in the missing covariates models on the spatial response model. Our proposed mDIC and mLPML criteria allow us to assess the fit of the spatial response data under covariates models with or without spatial effects. We further conduct extensive simulation studies to examine the empirical performance of the proposed criteria. Such investigation and assessment have not been carried out in the literature based on our best knowledge.

The remainder of this paper is organized as follows. In Section 2, the data from Chinese Health and Nutrition Survey (CHNS) 2011 are introduced as a motivating example. In Section 3, we develop the spatial regression model for the response variable, the model for missing covariates with spatial random effects, and the model for the missing data mechanism. Furthermore, Bayesian model assessment criteria including mDIC and mLPML are used for model comparison. An extensive simulation study is conducted in Section 4 to investigate empirical performance of the models proposed in Section 3. In Section 5, the proposed method is employed to analyze the real data set of CHNS 2011. Finally, we conclude the paper with a brief discussion in Section 6.

2 Motivating Example

Chinese Health and Nutrition Survey (CHNS), a project collaborated by the Carolina Population Center at the University of North Carolina and the National Institute for Nutrition and Health at the Chinese Center for Disease Control and Prevention, aims to examine the relationship between the social and economic transformation of Chinese society and the health and nutritional status of its population. As a geographically distributed data set, CHNS 2011 collected individual-, household- and community-specific information from 12 provinces in China. In this paper, household income from 12 provinces is selected as the spatial response variable, and the aim is to explore the spatial effects and the factors that may have impacts on this variable of interest.

2.1 Data Description

The data were collected from 12 provinces in China with a total sample size of 4346. Household income (hincome) is the response variable. Individual-level covariates include wage of head of the household (indwage), age of head of the household (age), proportion of urban area (urban), number of hours worked last year (WThour), family size (hhsize) and GDP per capita of the province (GDP). The units of hincome, indwage and GDP are CNY.

The sample sizes in different provinces as well as the summary information of the variables are shown in Table 1.

Beijing Liaoning Heilongjiang Shanghai Jiangsu Shandong
Sample size 415 395 396 424 412 399
hincome mean 75599.23 49426.97 46861.01 87455.34 61393.95 40999.05
sd 49926.75 47862.42 44386.13 68695.15 43495.38 40926.51
indwage mean 41029.76 25021.40 29590.38 41829.25 20894.34 20769.75
sd 41730.44 25584.54 40866.63 45704.46 20348.74 24941.24
age mean 49.30 56.40 51.15 56.28 59.31 56.01
sd 13.13 11.88 11.42 11.68 11.83 11.38
urban proportion 0.86 0.30 0.37 0.83 0.33 0.29
sd 0.34 0.46 0.48 0.38 0.47 0.46
WThour mean 44.21 38.15 27.78 41.43 35.86 43.09
sd 12.22 24.77 24.50 8.77 22.49 18.10
hhsize mean 2.80 2.85 2.62 3.20 3.22 3.01
sd 0.84 1.16 1.01 1.10 1.51 1.33
Henan Hubei Hunan Guangxi Guizhou Chongqing
Sample size 298 337 244 360 339 327
hincome mean 36782.92 50417.49 48163.62 37022.83 45388.52 41770.31
sd 42655.65 57341.40 43458.89 33374.35 52696.84 39894.29
indwage mean 16022.50 21769.90 27191.26 12122.19 22694.39 25977.72
sd 24142.06 28241.78 29676.95 14334.89 39639.95 34609.48
age mean 53.96 54.67 53.39 55.27 56.21 52.48
sd 12.18 10.38 12.44 12.43 12.58 11.52
urban proportion 0.37 0.33 0.41 0.29 0.33 0.54
sd 0.48 0.47 0.49 0.45 0.47 0.50
WThour mean 37.12 38.23 39.31 36.32 32.07 38.85
sd 23.11 18.53 17.01 21.36 18.95 18.97
hhsize mean 3.67 3.27 3.25 4.14 3.43 3.20
sd 1.49 1.48 1.35 1.76 1.43 1.11
Table 1: Sample size and summary information of the variables in each province

Among these covariates, indwage and WThour have missing values. The average percentages with only indwage or WThour missing are 22.50% and 5.34%, respectively, while the average percentage with both indwage and WThour missing is 22.30%. A summary of the missing patterns of these two covariates are given in Table 2.

Beijing Liaoning Heilongjiang Shanghai Jiangsu Shandong missing indwage only 1.20% 25.57% 41.92% 0.94% 18.20% 21.55% missing WThour only 4.82% 3.04% 4.55% 3.54% 7.52% 8.77% missing indwage and WThour 30.12% 29.37% 12.37% 43.40% 19.90% 26.32% Henan Hubei Hunan Guangxi Guizhou Chongqing missing indwage only 22.48% 32.94% 23.77% 28.33% 31.86% 29.05% missing WThour only 5.37% 5.93% 4.10% 6.94% 2.95% 6.12% missing indwage and WThour 12.75% 15.13% 19.67% 13.89% 17.40% 18.96%

Table 2: Missing percentages in each province

2.2 Spatial Structure

In the CHNS 2011 data set, we do not have survey data for all the provinces in China. Also, the provinces included in this data set are not always neighbored with each other. Thus, we treat the CHNS 2011 data as point-referenced data such that the spatial dependence can be possibly and reasonably captured by the distance between two provinces especially when they are away from each other. The centroid latitudes and longitudes of these 12 provinces are given in Table 3. Figure 1 shows the map of mainland China. The provinces which are included in our study are marked in blue color.

Province Beijing Liaoning Heilongjiang Shanghai Jiangsu Shandong Longitude 116.4107 122.6090 127.7824 121.4037 119.4554 118.1490 Latitude 40.1849 41.3037 47.8415 31.0846 32.9732 36.3512 Province Henan Hubei Hunan Guangxi Guizhou Chongqing Longitude 113.6136 112.2691 111.7083 108.7872 106.8738 107.8748 Latitude 33.8826 30.9760 27.6069 23.8279 26.8152 30.0587

Table 3: Centroid Coordinates of each province
Figure 1: China Map (Blue indicates the province which are included in the study)

Using the coordinates of 12 provinces, we can easily calculate the distance between two provinces. These distances are useful to construct covariance matrices of the spatial random effects in Section 5 below.

3 Methodology

In this section, a spatial regression model with missing covariates is built hierarchically. A Gaussian spatial regression model for the response variable is built, after which, missing covariate distributions are built to take account of the missing covariates and covariate-specific spatial effects. In addition, a model capturing the missing data mechanism is also built. After introducing the model construction, posterior inference procedure and model assessment are presented.

3.1 The Spatial Regression Model for Responses

Suppose, we consider locations and observations at location . The spatial response variable at location is denoted by . A Gaussian stationary spatial process model is built for the spatial response variable. The general Gaussian stationary spatial process model can be written as in, for example, Cressie 3:


where is a matrix, is the number of covariates, is the

-dimensional vector with

s, is an -dimensional vector of covariates, and is a dimensional vector of corresponding regression coefficients. The spatial random effect is a second-order stationary mean-zero process. To be more specific, conforms that , , and , where is a valid two-dimensional correlation function.

is the white noise process such that

, where “MVN” represents the multivariate normal distribution,

is the identity matrix, and for . According to (1), the following spatial hierarchical model is built:


where is the response-specific spatial random effect, is a spatial correlation matrix based on distance and parameter . For the exponential spatial correlation kernel, the th entry of the correlation matrix is , where is the Euclidian distance between location and location , and is the range parameter for spatial correlation. A small value of means a strong spatial correlation, and a large value of means a weak spatial correlation.

3.2 The Spatial Regression Models for Missing Covariates

For survey data, it is common that the data for some covariates are not completely observed. For example, is the th covariate at location and has observations. If there are any missing values among those observations, i.e. if any one of the elements of is missing, is defined as a missing covariate at location . For the CHNS 2011 dataset discussed in Section 2, two missing covariates exist at all locations. Therefore, in this section, we assume that for all locations, among the covariates, the first covariates are missing covariates.

In the presence of missing covariates, a joint model for the missing covariates should be specified to take account of the uncertainty resulting from the missing values in the covariates. To be specific, for the th observation at location , the corresponding -dimensional missing covariate vector is , while the -dimensional complete covariate vector is denoted by . For missing covariate data, it is crucial to specify a model for the missing covariates . Given the spatial random effects, we assume , , are conditionally independent. In general settings, Lipsitz and Ibrahim 11 and Ibrahim et al. 12 specified the missing covariate distribution through a series of one-dimensional conditional distributions. In our case, since the covariates are also spatially distributed, covariate-specific spatial effects are also taken into account in the missing covariate model. We extend their model as


where , represents the spatial effect of covariate at location ,

is a vector of the standard deviations of the spatially structured random errors of the covariates,

is a vector of the precisions of the independent random errors of the covariates, and the coefficients associated to the covariates are with being the indexing parameter vector for the th conditional distribution. For the covariate-specific spatial random effects , a multivariate normal distribution similar to (3) can be assumed. As in Grund et al. 13, here we assume the spatial random effects, ’s, of the missing covariates and of the response variable are independent. This assumption is reasonable since captures spatial dependence of the covariate , captures spatial dependence of the response variable, and the dependence between the response variable and the covariates is induced by the spatial regression model in (3).

There are many possibilities in (4), especially when is large. Chen and Ibrahim 14

gave some guidelines for specifying the sequence of one-dimensional conditional distributions. When the missing covariates are categorical, logistic regression for the conditional missing covariate distribution can be specified. Probit or complementary log-log links are also suitable to model categorical covariates. Ordinal regression models can be employed to model missing ordinal covariates. For count variables, we can model them via Poisson regression. And for continuous variables, normal regression, log-normal regression, and exponential regression can be considered.

In our extended model, covariate-specific spatial effects are considered in the model additionally. Conditional on the spatial effects, missing covariates can be modeled according to the above strategy. And for the spatial effects, the same stationary process structure in (3) can be used. In the motivating example, there are missing continuous covariates, and the spatial regression model for these two missing covariates can be written, for and , as

where denotes a vector of the other covariates except , and are the indexing parameter vectors for the distributions of and , respectively, and are the precision parameters of and , and are the standard deviations of and , and and are the corresponding range parameters for spatial correlations of and , which are different than defined in (3).

3.3 Models for Missing Data Mechanism

Assuming a corresponding missing indicator for each missing covariate, for observation we have the -dimensional missing indicator vector with if is observed and if is missing (

). The joint distribution of

can also be written as the form of a product of one-dimensional conditional distributions, that is


for and , where parameterizes the missingness mechanism model with as a vector of indexing parameters for the th conditional distribution. For each one-dimensional conditional distributions of these binary missing indicators, it is common to build a logistic regression model for each of them.

In the missing data literature, missing data mechanism can be categorized as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) 15. When missingness does not depend on the covariates that are missing or observed, then the missing data mechanism is termed as MCAR. When missingness depends only on the observed covariates but not on the missing ones, the missing data mechanism is MAR. When neither MCAR nor MAR holds, the missing data mechanism is termed as MNAR.

For simplicity, in our case we assume that the missing data mechanism is MAR, which means that the missing data does not depend on the missing covariates. For missing covariates, the joint distribution of the missing indicators is written as

3.4 Inference Procedure

For the unknown parameters , where , we assume that they are independent a priori. For , the following prior distributions are assigned: , for ; ; ; ; , for ; ; ; ; and , for , where and are the dimensions of covariates in the missing covariate model and missing data mechanism model of the th missing covariate, respectively.

Note that , , , , , , , , , , and

are prespecified hyperparameters. In this article, we use

, and , which lead to non-informative priors. With the above prior distributions, the posterior distribution of these unknown parameters based on the observed data with is given by


where refers to the spatial regression model for the response variable in (2), is defined in (4), and is defined in (3.3) . In equation (6), and refer to the distributions of and ’s, respectively, , and denotes the joint prior distribution of the unknown parameters. When a MAR missing data mechanism is assumed, the model for the missing data mechanism does not need to enter the posterior distribution.

The analytical form of the posterior distribution of is unavailable. Therefore, we carry out the posterior inference using the Markov chain Monte Carlo (MCMC) sampling algorithm to sample from the posterior distribution. Instead of sampling from the posterior distributions of the unknown parameters directly, MCMC samples from the full conditional distributions of the parameters with the remaining variables fixed to their current values are obtained. In this way, we can conduct inferences of the proposed model. In our case, spatial random effects are also regarded as unknown parameters, and then the algorithm samples these parameters in turn from their corresponding full conditional distributions.

3.5 Model Assessment

Within the Bayesian framework, the Deviance Information Criterion (DIC) 16 and the Logarithm of the Pseudo-Marginal Likelihood (LPML) 17 are two well-known Bayesian criteria for model comparison.

Since our main objective is to assess the fit of the spatial regression model for the response, we specify the following deviance function:


Therefore, we define a modified DIC (mDIC) for the response model as follows:


where , and are the posterior means of parameters and missing covariates. A smaller value of mDIC indicates a better model.

Let denote the observation data with the th subject response deleted. Following Hanson et al. 18, we consider a modified Conditional Predictive Ordinate (mCPO) for the th subject as


where and denotes the normalizing constant. In practice, a Monte Carlo estimate of mCPO using MCMC algorithms from the posterior distributions can be used. To be specific, letting , and () denote a MCMC sample of unknown parameters and missing covariates from the corresponding augmented posterior distribution, a Monte Carlo estimate of is given by


Then mLPML is given by


Similar to the conventional LPML, a larger value of mLPML indicates a more favorable model.

4 A Simulation Study

4.1 Simulation Description

In this simulation study, we randomly generated 20 locations in a space of . For each location, we generated 50 observations based on

where , , is i.i.d. generated from and . Covariate is independently generated from , is generated from , and is generated from , where , , , , and . For both spatial random effects, the th entry of is , where is the distance between and , , , and .

Missing data for are generated with a missing data mechanism that does not depend on , leading to the missing data to be MAR. As a result, the missing data mechanism can be ignored when estimating the parameters. Specifically, let if is observed and if is missing (. The joint distribution of is given by


where , and are the vectors of parameters corresponding to the distributions of and , respectively. We take logistic regression models for and . Thus,




In (13) and (14), and . One hunderd simulated datasets were generated in this study. The average percentages over the 100 simulated datasets with only missing or only missing are 32.82% and 39.27% respectively, while the average percentage with both and missing is 28.72%.

4.2 Simulation Results

According to Section 3, we set up the following model and fix the parameters related to to their true values. The spatial regression model for the response variable is given as

The models for the two missing covariates are given as

The true values of the model parameters are shown in Table 4

. In order to examine empirical performance of the posterior estimates, several assessment measures including average bias (Bias), average standard deviations (SD), mean square error (MSE) and coverage probability (CP) for each parameter are computed. Taking

as an example, these measures are given as

where is the true value of and is the total number of simulated datasets while is the posterior mean of . is the estimated standard deviation of , and is the estimated 95% highest probability density (HPD) interval of computed from the th simulated dataset for . Bayesian estimates are obtained via JAGS19 and R20. With the thinning interval to be 20, 5,000 samples are kept for calculation after a burn-in of 10,000 samples. The results of these measures with all records, CC analysis, and model proposed above are shown in Table 4. The difference between “all records", “CC" and “” is on the datasets used to fit the proposed model. “All records" means using the whole dataset before generating the missing ones, “CC" means using the datasets excluding the missing records, and “” means using the datasets with missing values.

True value All records CC Bias SD MSE CP Bias SD MSE CP 1 -0.0495 0.4600 0.2140 0.97 0.2152 0.4874 0.2838 0.94 1.5 -0.0048 0.0710 0.0051 0.97 -0.0471 0.0975 0.0117 0.89 1 0.0008 0.0313 0.0010 0.95 -0.0230 0.0451 0.0026 0.91 2 -0.0014 0.0462 0.0021 0.94 -0.0279 0.0687 0.0055 0.87 1 -0.0017 0.0410 0.0017 0.94 0.0313 0.0695 0.0058 0.90 0 -0.0375 0.4903 0.2418 0.96 0.1626 0.4844 0.2611 0.93 0 0.0078 0.0460 0.0022 0.96 -0.0167 0.0585 0.0037 0.93 2 -0.0063 0.0314 0.0010 0.95 -0.0975 0.0498 0.0120 0.49 1 -0.0011 0.0452 0.0020 0.94 0.0141 0.0573 0.0035 0.97 0 -0.0992 0.4364 0.2003 0.97 0.5441 0.3190 0.3979 0.86 1 0.0029 0.0318 0.0010 0.94 -0.2997 0.0702 0.0948 0.00 1 0.0038 0.0422 0.0018 0.94 0.3301 0.1138 0.1219 0.00 True value Bias SD MSE CP 1 -0.0178 0.5091 0.2595 0.92 1.5 0.0036 0.0916 0.0084 0.93 1 0.0011 0.0411 0.0017 0.92 2 -0.0037 0.0630 0.0040 0.92 1 0.0042 0.0656 0.0043 0.92 0 -0.0386 0.4999 0.2514 0.94 0 0.0012 0.0549 0.0030 0.95 2 0.0023 0.0371 0.0014 0.98 1 -0.0130 0.0555 0.0033 0.95 0 0.0197 0.4379 0.1921 0.94 1 -0.0004 0.0334 0.0011 0.91 1 0.0056 0.0505 0.0026 0.93

Table 4: Simulation results of assessment measures with all records, CC, and model

From Table 4, we can observe that the biases of the posterior estimates under CC are much greater than those under model . The 95% HPD intervals under are larger than those under CC. Thus, is more preferred than the CC analysis.

In order to assess the performance of the model comparison criteria proposed in Section 3.5, we set up several alternative models with the same response model as but with different missing covariate models as follows: