Bayesian modelling for spatially misaligned health areal data: a multiple membership approach

04/11/2020
by   Marco Gramatica, et al.
Queen Mary University of London
0

Diabetes prevalence is on the rise in the UK, and for public health strategy, estimation of relative disease risk and subsequent mapping is important. We consider an application to London data on diabetes prevalence and mortality. In order to improve the estimation of relative risks we analyse jointly prevalence and mortality data to ensure borrowing strength over the two outcomes. The available data involves two spatial frameworks, areas (middle level super output areas, MSOAs), and general practices (GPs) recruiting patients from several areas. This raises a spatial misalignment issue that we deal with by employing the multiple membership principle. Specifically we translate area spatial effects to explain GP practice prevalence according to proportions of GP populations resident in different areas. A sparse implementation in Stan of both the MCAR and GMCAR allows the comparison of these bivariate priors as well as exploring the different implications for the mapping patterns for both outcomes. The necessary causal precedence of diabetes prevalence over mortality allows a specific conditionality assumption in the GMCAR, not always present in the context of disease mapping.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 4

page 9

page 10

page 16

page 17

08/12/2019

Estimation of the excess mortality in chronic diseases from prevalence and incidence data

Aggregated health data such as claims data from health insurances become...
02/14/2020

Understanding the effects of dichotomization of continuous outcomes on geostatistical inference

Diagnosis is often based on the exceedance or not of continuous health i...
10/22/2020

Statistical methods for linking geostatistical maps and transmission models: Application to lymphatic filariasis in East Africa

Infectious diseases remain one of the major causes of human mortality an...
01/11/2021

Where you live matters: a spatial analysis of COVID-19 mortality

The COVID-19 pandemic has caused   2 million fatalities. Significant pro...
01/02/2020

Detecting Areas of Potential High Prevalence of Chagas in Argentina

A map of potential prevalence of Chagas disease (ChD) with high spatial ...
12/27/2017

New ways of estimating excess mortality of chronic diseases: Insights from the illness-death model

Recently, we have shown that the age-specific prevalence of a disease ca...
10/08/2021

Quantifying Inequality in Underreported Medical Conditions

Estimating the prevalence of a medical condition, or the proportion of t...

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction and motivating example

When dealing with socio-economic or health data it is common for individuals to be classified according to nested hierarchical structures. When it is necessary to account for level specific effects, hierarchical models are better suited than standard regression techniques

(Fielding and Goldstein, 2006). However, the most common setup of such models requires lower level units to belong to only one class in the hierarchy. When this perfect nesting is absent, as in cross classification (e.g. students classified by area of residence and school) an alternative modelling strategy is needed. Issues of multiple membership may then occur under one (or more) crossed categories (Fielding and Goldstein, 2006). As an educational example, this structure can occur for pupils who attended more than one school during their educational career; when modelling random effects for a particular outcome, it might be useful to account for the different impacts that different schools might have had on the individual student outcome. This is where multiple membership (MM) strategies become an alternative to a classical hierarchical model: by choosing appropriate weights and averaging the random effects, e.g. over different schools attended, it is possible to account for the different memberships of a specific unit at a certain level (Browne et al., 2001). A mixed structure is also exemplified by patients that are classified by both their area of residence and the medical general practitioner (GP) practice they are registered at.

In this paper, the main problem under consideration is to jointly estimate relative risk of diabetes mortality and prevalence where data for the former comes from census Middle Layer Super Output Areas (MSOA), while the latter is collected for GP registers. Generally, in the context of areal disease mapping it is common to model outcomes (e.g. mortality, prevalence) with a generalized linear model (GLM) (Lawson, 2018) using area counts as observations and conditional autoregressive priors (CAR, (Besag et al., 1991) see section 2) to account for spatially structured residuals. Additionally, expected disease counts or disease prevalence, would be included as offsets in the model, to account for different population structures among the areas, and known risk factors as covariates. This method can be applied to two areal outcomes leading to a bivariate model. However, this modelling strategy is based on the two outcomes sharing the same areal framework. This is not the case at hand, as residents of a specific MSOA can be registered at a GP practice outside it, making it impossible to attribute all the patients of a practice to the area it is in.

The dataset used here comprises 130 GP practices in three Clinical Commissioning Groups (CCGs) in outer northern east London (ONEL). MSOAs were included (in ONEL and beyond) if their residents were among the largest contributors to the selected GP populations. This procedure results in 95 MSOAs, 88 of which are actually in ONEL. With this data we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices.

To the best of the authors’ knowledge, there are no previous Bayesian studies of the spatial causal relationship between diabetes prevalence and mortality. One aim of the study is to examine the strength of this relationship (if any) at a small area scale. This relationship is attenuated at individual patient level because diabetes as a disease can be followed as a cause by death from many other conditions, thus making more difficult to list it as cause of death. Consequently, McEwen et al. (2006) reports that of nearly 12,000 diabetic patients in the US, diabetes was recorded on a minority (39%) of death certificates, and as the underlying cause of death for only 10% of decedents with diabetes. There is also the issue of accurate recording: many studies report that diabetes is an underreported cause of death. Stokes and Preston (2017) report that the proportion of deaths with diabetes assigned as the underlying cause of death (3.3–3.7%) severely understates the contribution of diabetes to mortality in the United States.

Dealing with multiple membership in this example is an illustration of spatial misalignment (Banerjee et al., 2014). In order to deal with the misalignment problem, present in the study here, our approach involves multiple membership (MM) modelling of prevalence data, making MSOA spatial random effects for prevalence indirectly identified by GP prevalence data. Additionally, since causal precedence is evident given that prevalence necessarily precedes mortality, a specific modelling approach will then be evaluated, the GMCAR from Jin et al. (2005), in order to embed the causal information in the estimation process.

Therefore, the main contribution of this paper is methodological: the incorporation of the multiple membership approach into a spatial prior. Previous applications of multiple membership have been in crossed multilevel applications (e.g. educational careers for pupils within schools) with unstructured priors. Another significant contribution is computational, as to the best of our knowledge this is the first implementation of a bivariate CAR spatial prior into Rstan (Carpenter et al., 2017; Stan Development Team, 2018)

, a probabilistic programming platform which does full Bayesian inference using Hamiltonian Monte Carlo (HMC). Previous Rstan implementations or CAR priors have been univariate

(Morris et al., 2019; Joseph, 2016). Lastly, there is novelty in the application of the methods to the specific disease studied, and in showing how spatial priors can be applied to distinct health outcomes (e.g. prevalence, mortality) where causal precedence is clear.

The paper is organised as follows. We briefly described the motivating dataset in Section 1.1. In Section 2 we review conditional autoregressive priors for modelling of bivariate count data and in Section 3 we incorporate the multiple membership approach into a spatial prior. We discuss the implementation of this approach in Stan in Section 3.1. A simulation study is included in Section 4 and the results on our motivating dataset on diabetes prevalence and mortality are presented in Section 5.

1.1 Data structure

Data on diabetes mortality is available from census Middle Layer Super Output Areas (MSOA), while data on diabetes prevalence is available in GP registers. With this data we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices.

The dataset comprises all 130 GP practices in three Clinical Commissioning Groups (CCGs) in outer northern east London (ONEL): Barking and Dagenham, Havering and Red-bridge. MSOAs were included if their residents were among the largest contributors to the selected GPs populations up to the 90th percentile. This procedure results in 95 MSOAs, 88 of which are actually in ONEL.

The misalignement must be modelled carefully as it is significant: on average 43.4 MSOAs contribute to the population of a single GP (median 41 and range 14 to 80), and the average contribution of each MSOA to the population of a single GP is 2.3% (median 0.19 %) ranging between 0.0059% to 96 % (see Figure 1 as an example for a GP practice).

Mortality data by MSOA include death counts from 2013 to 2017, consequently October 2015 was chosen as midpoint for retrieving prevalence data, the number of registered patients by GP and the cross reference for MSOA and GP populations. In addition to that, two covariates recorded at the MSOA level have been included, namely the proportion of South Asian residents from the 2011 census 222Which include the following ethnicities: Indian, Pakistani and Bangladeshi, taken from Table KS201EW on the NOMIS database from the British Office for National Statistics and the 2015 index of multiple deprivation (IMD). The former variable has been included due to the higher prevalence of diabetes among the South Asian population (Hall et al., 2010), while the latter is a control variable for socio-economic status.

For the purpose of estimating areal prevalence and mortality, it is necessary to take into account population structure, therefore expected counts for each area are computed using national rates for mortality and prevalence, together with the number of residents by age. This gives

(1)

where is the area, the outcome, are the population age groups, are the national mortality and prevalence rates for diabetes broken down by age group together with the corresponding population (Blangiardo and Cameletti, 2015).

Figure 1: Example of contributing MSOAs populations for a specific GP practice, identified by a black dot on the map.

2 Modelling and conditional autoregressive priors (CAR)

Generalised linear models (GLM) with spatially correlated random effects are commonly used in spatial data analysis. We consider a spatial GLM with Gaussian CAR random effects where a region is partitioned in n non-overlapping areas. For each area

we can observe a random variable representing the outcome of interest

and a set of p non-random covariates . Define as known offset values, and as the relative disease risk (Blangiardo and Cameletti, 2015).

The methods in this paper apply to all GLMs but, without loss of generality, in this paper we will model count data using the Negative Binomial distribution with mean

for each observation, a common overdispersion parameter

and a logarithmic link function. The Negative Binomial allows explicit modelling of overdispersion, when the variance in the outcome is larger than the average outcome

(Coly et al., 2019).

We can then write the GLM with the CAR term , defined below, as:

(2)

To model explicitly the overdispersion, we set so it follows that

(3)

where ; , and .

The set of areal units on which data are recorded can form a regular lattice or differ largely in both shape and size. In either case such data typically exhibit spatial autocorrelation, with observations from areal units close together tending to have similar values. By assuming a Gaussian Markov random Field (GMRF) on the lattice (Rue and Held, 2005) it is possible to define through the following full conditionals,

(4)

where is the number of neighbours for area i, is an element of matrix B defined below, and , a propriety parameter, is usually interpreted as a smoothing parameter. Thanks to Brook’s Lemma (Rue and Held, 2005)

it is guaranteed that the joint distribution is properly defined and it is given by

(5)

where , with is the number of neighbours for area i and W is the adjacency matrix such that and if i and j are neighbours. This setup can give rise to different prior choices that require different conditions to achieve properness (Congdon, 2014).

Building from (Mardia, 1988) we can specify a -dimensional multivariate spatial model. However, general conditions for positive definiteness and invertibility for the covariance matrix are difficult to assess. Therefore, we use the following formulation by Gelfand and Vounatsou (2003)

(6)

where is a matrix that controls the variance and covariance between outcomes in the same area and is therefore required to be symmetric positive definite. This model will be denoted as . An MCAR prior for spatial random effects allows estimates of relative risks to borrow strength from each other. This is a reasonable assumption for our motivating example, as diabetes prevalence and mortality are correlated. The MCAR on spatial effects can capture such relationship and improve the estimate for both outcomes.

When considering prevalence and mortality for the same disease, given the necessary precedence of prevalence on mortality, it is possible to draw a causal link between the two phenomena. In order to embed that information in the model we can use the GMCAR (Jin et al., 2005), a specification of the previously mentioned MCAR for the bivariate case, that is based on estimating the spatial effects for one outcome conditionally on the other, as follows.

(7)

In our context, would be the spatial random effect for mortality given the spatial structure of prevalence, while would be the spatial structure for prevalence. The parameter is the smoothing parameter associated with the conditional distribution of , is similar for the marginal distribution of , and and scale the precision of and , respectively. As the conditional expected value is linear in and , these two parameters model the linear relationship between the two diseases risks. The parameter represents an area specific spatial dependence, while refers to the neighbour effect between the two outcomes. We will refer to this model as . A particularly interesting feature of model (7) is that for the bivariate case by setting and we obtain the of equation (6).

3 Multiple membership for areal misalignment

The issue of misalignment arises from the discrepancy in the way mortality and prevalence data are collected. Firstly, patients from different and non neighbouring MSOAs can register at the same GP (i.e. the mapping from MSOAs to GPs is not injective) and, secondly, individuals from the same MSOA can register at different GPs.

Thus, we propose to address this type of misalignment by treating it as a multiple membership problem (MM) (Browne et al., 2001). We obtain practice specific weights based on the relative contribution of each MSOA (in terms of population affiliate) to each practice. Relevant weights are for those MSOAs that contribute to 90% or more of the population of the GP practice.

For each practice different MSOAs contribute with a different weight, so the idea is that the prevalence relative risk for each GP is then computed as an average of MSOA level risks. More explicitly, we observe that for each GP j there are contributing MSOAs. The full model becomes

(8)
(9)
(10)
(11)
(12)
(13)
(14)

where the spatial random effects are modelled as either

(15)

or

(16)

and priors are defined for parameters . The index indicates that MSOA i contributes to GP j with its corresponding weight . Such weights are determined by the share of patients living in MSOA and registered at GP . The covariates coefficients with indexing the outcome (1 for mortality and 2 for prevalence) and the covariate.

3.1 Implementation

To the best of our knowledge, this is the first implementation of GMCAR and MCAR models in RStan (Carpenter et al., 2017). Moreover, we also implement the multiple membership approach within those two models. The code is available on github at https://github.com/silvialiverani/GMCARMM.

All models can be derived from the GMCAR model with multiple membership by setting the relevant parameters to zero: we can obtain an MCAR model by setting the parameter and we can exclude covariates by setting equal to zero.

We implemented a QR parametrisation (section 1.2 of Stan Development Team (2018)) for the estimation of the parameters , which exploits the fact that any design matrix

can be decomposed using the thin QR decomposition into an orthogonal matrix

and an upper-triangular matrix , i.e. . This parametrisation performs much better in practice, often both in terms of wall time and in terms of effective sample size, as it makes it easier for the Monte Carlo algorithm to move in the space when an informative prior on the location of is not available. Moreover, before obtaining the QR decomposition we have also implemented a min-max normalisation of the covariates, which does not affect the distribution of but does affect .

Sampling for both covariance matrices of the CAR component was coded using unit precision, i.e. setting in equation (7), and rescaling the and . Additionally, since calculating the determinant for those covariance matrices would be extremely computationally expensive, Joseph (2016)

implemented an eigenvalue decomposition in Stan, as suggested by

Jin et al. (2005), reducing computational burden significantly.

We used flat priors for the parameters while for the other parameters we used independent, proper, weakly informative prior distributions, as recommended by Gelman et al. (2013). A list of the prior distributions is on the left hand side of Table 1.

Rstan Priors
;
; ; ; ;
Flat prior
; Half-Normal(25)
;
                 
R2OpenBUGS Priors
;
; ; ; ;
; Half-Normal(25)
;
Table 1:

Prior distributions and hyperparameters. The

distribution parametrization is in terms of shape and rate, and the Half-Normal in terms of .

Convergence was assessed with the diagnostic (Gelman and Rubin, 1992), and accordingly satisfactory convergence is achieved for values below 1.01 (Vehtari et al., 2019).

3.2 Model comparison

To choose among competing models, we used the deviance information criterion, or DIC (Spiegelhalter et al., 2002)

. This criterion is based on the posterior distribution of the deviance statistics. For simplicity of notation, we will refer to the parameters of the Negative Binomial with the vector

so we can write its saturated deviance as:

(17)

We compute both the average posterior deviance and the deviance of averages of parameters’ posterior . For the former, when all aspects of the model are assumed true, the approximation holds (Spiegelhalter et al., 2002), where n is the number of observations. Consequently, this can also be used as a measure of fit. Computing the posterior deviances allows to easily compute also the effective number of parameters, .

Moreover, lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability (TAP), or

-value, of the test quantity (Gelman et al., 2013). For each observation and each outcome we can estimate:

(18)

where the simulated at each HMC iteration form the likelihood conditional on the current sampled parameters. The probabilities are estimated by the proportion of times the events or occur throughout the chain. In case of an adequate model the ’s are expected to concentrate around 0.5 and away from the tails of the distribution, so by evaluating the tails of the estimated TAP’s we can see how frequently the replicated values over (or under) predict the observed ones. Specifically we consider the proportion of TAP’s that higher than 0.95 and lower than 0.05, as well as higher than 0.90 and lower than 0.1.

We also evaluate approximate leave-one-out cross-validation (LOO) (Vehtari et al., 2017)

, comparing models (say A and B) by computing mean and standard error of the difference in

elpd:

(19)

where is the sample variance, i.e.

4 Simulation study

To validate our implementation we coded the GMCAR and MCAR models with multiple membership in BUGS and compared it with the Rstan runs. The BUGS (Bayesian inference Using Gibbs Sampling) project (Lunn et al., 2000)

is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. Altough BUGS is a very popular software for Bayesian hierarchical modelling, it is important to note that it is no longer maintained.

We compared our Rstan implementation of the models above to an implementation in R2OpenBUGS (Sturtz et al., 2005).

We generated two datasets: one with covariates and the other without, both using GMCAR to simulate spatially varying effects. Hence, we estimated all four models with both BUGS and Stan. We chose realistic parameter values in our simulation, in line with the posterior means of the results on real data in Section 5, as well as the true ONEL spatial lattice of 95 MSOAs, the offsets and true covariates mentioned in Section 1.1 to generate new observations for mortality and prevalence.

Direct implementation of the Negative Binomial, as parametrised in equation (3), is not available in BUGS. Therefore we implemented it using the equivalent formulation via a Poisson-Gamma mixture (Hilbe, 2011). Table 1 shows the priors and parameters for both implementations. Both Half-Normal and QR reparametrization are not available in BUGS so for the precision parameters we used weak Gamma priors and for covariates coefficients very vague Normal priors.

The results shown in Tables 2 and 3 were obtained by running 4 chains each with 50% warm-up and 6000 iterations. Are these the details for the Stan or BUGS implementation? Give details for both. Also, mention in writing any additional convergence check that you have done, even if not shown in the tables. Add also comments on the computational performance: although these runs had different number of iterations, how long did they take? What computer power did you use?

The results show that RStan reports stronger convergence, i.e.

closer to 1. All 95% posterior credible intervals contain the true values, but in general RStan estimates smaller intervals.

GMCARMM
True mean sd 2.5% 97.5% Rhat
0.40 0.49 0.28 0.03 0.96 1.00
0.40 0.49 0.28 0.02 0.97 1.00
0.90 0.69 0.23 0.11 0.97 1.09
0.90 0.69 0.23 0.13 0.97 1.00
6.00 7.41 3.07 2.67 14.52 1.01
6.00 7.25 2.90 2.74 13.86 1.00
4.00 5.60 2.25 2.12 10.88 1.03
4.00 6.04 2.59 2.19 12.09 1.00
0.30 0.75 0.41 -0.07 1.55 1.05
0.30 0.71 0.45 -0.12 1.64 1.00
0.50 0.86 0.21 0.48 1.31 1.13
0.50 0.90 0.24 0.50 1.45 1.00
20.00 27.72 13.16 9.86 59.84 1.01
20.00 27.62 13.13 10.05 60.45 1.00
10.00 10.06 1.30 7.68 12.77 1.00
10.00 10.05 1.28 7.67 12.74 1.00
0.50 0.54 0.20 0.16 0.92 1.03
0.50 0.55 0.18 0.18 0.88 1.00
1.30 1.36 0.05 1.28 1.46 1.01
1.30 1.37 0.04 1.27 1.45 1.00
MCARMM
True mean sd 2.5% 97.5% Rhat
0.40 0.40 0.25 0.02 0.90 1.00
0.40 0.40 0.25 0.02 0.88 1.00
6.00 6.40 2.77 2.32 12.84 1.01
6.00 6.24 2.73 2.25 12.65 1.00
4.00 5.75 2.73 1.82 12.14 1.01
4.00 5.60 2.74 1.82 12.15 1.00
0.30 -0.05 0.63 -1.18 1.37 1.01
0.30 -0.18 0.57 -1.31 1.00 1.00
20.00 24.15 11.70 9.08 52.65 1.00
20.00 24.39 11.76 9.25 54.01 1.00
10.00 9.38 1.28 7.09 12.09 1.00
10.00 9.42 1.29 7.11 12.18 1.00
0.50 0.47 0.07 0.34 0.60 1.01
0.50 0.48 0.06 0.36 0.60 1.00
1.30 1.33 0.05 1.24 1.42 1.00
1.30 1.33 0.05 1.25 1.42 1.00
Table 2: Comparison of BUGS (grey lines) and Rstan (white lines) simulation results, models without covariates.
GMCARMM with Covariates
True mean sd 2.5% 97.5% Rhat
0.40 0.57 0.29 0.03 0.98 1.01
0.40 0.55 0.29 0.03 0.98 1.00
0.20 0.45 0.28 0.02 0.96 1.00
0.20 0.44 0.28 0.02 0.95 1.00
6.00 5.53 2.64 1.77 11.61 1.03
6.00 5.74 2.77 1.84 12.28 1.00
4.00 3.77 2.58 0.98 10.58 1.01
4.00 4.32 2.71 1.04 11.16 1.00
0.30 0.12 0.54 -0.90 1.18 1.04
0.30 -0.01 0.61 -1.18 1.31 1.00
0.50 0.04 0.07 -0.10 0.19 1.02
0.50 0.14 0.28 -0.44 0.70 1.00
20.00 21.37 11.92 6.70 52.75 1.01
20.00 21.86 11.81 7.07 52.63 1.00
10.00 9.74 1.61 7.07 13.25 1.00
10.00 9.56 1.52 6.95 12.91 1.00
-0.30 -0.54 0.20 -0.94 -0.15 1.01
-0.30 -0.48 0.19 -0.87 -0.10 1.00
-0.50 -0.45 0.13 -0.68 -0.17 1.02
-0.50 -0.47 0.12 -0.69 -0.21 1.00
0.30 0.87 0.26 0.34 1.38 1.04
0.30 0.78 0.26 0.25 1.28 1.00
0.50 0.61 0.32 -0.05 1.21 1.01
0.50 0.55 0.31 -0.05 1.17 1.00
1.00 1.22 0.16 0.89 1.53 1.00
1.00 1.22 0.16 0.90 1.52 1.00
1.00 0.98 0.23 0.51 1.39 1.01
1.00 1.02 0.20 0.61 1.41 1.00
MCARMM with Covariates
True mean sd 2.5% 97.5% Rhat
0.40 0.45 0.27 0.02 0.93 1.00
0.40 0.45 0.26 0.03 0.92 1.00
6.00 6.44 2.82 2.14 12.89 1.01
6.00 6.43 2.87 2.17 13.13 1.00
4.00 3.19 1.76 1.02 7.70 1.04
4.00 3.67 2.15 1.12 9.31 1.00
0.30 0.08 0.42 -0.73 0.99 1.08
0.30 0.15 0.46 -0.73 1.13 1.00
20.00 23.58 12.46 8.14 55.31 1.01
20.00 24.25 12.48 8.15 55.76 1.00
10.00 12.49 1.97 9.04 16.68 1.01
10.00 12.27 1.92 8.89 16.41 1.00
-0.30 -0.32 0.15 -0.62 -0.03 1.03
-0.30 -0.31 0.15 -0.61 -0.02 1.00
-0.50 -0.37 0.13 -0.61 -0.10 1.02
-0.50 -0.37 0.12 -0.60 -0.12 1.00
0.30 0.43 0.22 -0.03 0.85 1.04
0.30 0.43 0.22 -0.01 0.85 1.00
0.50 0.37 0.27 -0.23 0.89 1.07
0.50 0.36 0.27 -0.17 0.88 1.00
1.00 1.14 0.16 0.82 1.45 1.00
1.00 1.14 0.15 0.84 1.44 1.00
1.00 0.83 0.22 0.38 1.24 1.02
1.00 0.84 0.21 0.42 1.23 1.00
Table 3: Comparison of BUGS (grey lines) and Rstan (white lines) simulation results, models with covariates.

5 Analysis of diabetes data

As detailed in Section 1.1, we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices. The dataset comprises 130 GP practices and the 95 MSOAs whose residents were among the largest contributors to the selected GPs populations.

To analyse the dataset we ran 4 chains each with 6000 iterations of HMC and 50% warm-up using RStan. The models we considered are the GMCARMM and the MCARMM both including and excluding covariates. Table 4 reports the estimated posterior parameters.

mean se_mean sd 2.5% 5% 50% 95% 97.5% Rhat
GMCAR 0.43 0.01 0.27 0.02 0.04 0.41 0.90 0.94 1.00
GMCAR-C 0.43 0.01 0.27 0.02 0.03 0.41 0.90 0.95 1.00
GMCAR 0.97 0.00 0.03 0.88 0.91 0.98 1.00 1.00 1.01
GMCAR-C 0.36 0.00 0.24 0.01 0.03 0.33 0.81 0.88 1.00
MCAR 0.96 0.00 0.04 0.86 0.89 0.98 1.00 1.00 1.01
MCAR-C 0.33 0.01 0.23 0.01 0.03 0.29 0.76 0.83 1.00
GMCAR 5.85 0.05 2.83 1.85 2.19 5.36 11.19 12.41 1.00
GMCAR-C 6.07 0.05 2.80 1.91 2.31 5.62 11.36 12.58 1.00
MCAR 6.94 0.06 2.91 2.44 2.94 6.55 12.24 13.73 1.00
MCAR-C 5.85 0.04 2.82 1.90 2.20 5.39 11.13 12.54 1.00
GMCAR 3.56 0.04 1.24 1.58 1.81 3.41 5.79 6.42 1.01
GMCAR-C 9.75 0.04 3.04 4.66 5.26 9.48 15.18 16.39 1.00
MCAR 3.35 0.05 1.22 1.51 1.69 3.18 5.59 6.19 1.00
MCAR-C 9.19 0.04 3.06 4.02 4.74 8.91 14.66 16.13 1.00
GMCAR -0.14 0.02 0.48 -1.07 -0.92 -0.15 0.68 0.82 1.01
GMCAR-C 0.24 0.04 1.09 -1.88 -1.56 0.26 1.99 2.31 1.00
MCAR 0.54 0.01 0.26 0.04 0.13 0.54 0.96 1.05 1.01
MCAR-C 0.31 0.02 0.90 -1.51 -1.22 0.34 1.71 2.00 1.00
GMCAR 0.18 0.00 0.11 -0.04 -0.01 0.18 0.35 0.39 1.01
GMCAR-C 0.04 0.02 0.44 -0.82 -0.69 0.03 0.76 0.90 1.00
GMCAR 20.43 0.16 11.30 6.57 7.57 17.55 42.21 49.32 1.00
GMCAR-C 20.60 0.14 11.22 6.71 7.82 17.89 42.51 49.92 1.00
MCAR 18.33 0.18 10.65 5.85 6.82 15.51 39.41 46.29 1.00
MCAR-C 19.87 0.16 11.17 6.37 7.33 17.27 41.04 48.01 1.00
GMCAR 16.56 0.04 2.70 11.98 12.57 16.38 21.35 22.48 1.00
GMCAR-C 17.80 0.03 2.66 12.99 13.80 17.65 22.33 23.77 1.00
MCAR 16.69 0.05 2.65 12.03 12.68 16.51 21.29 22.40 1.00
MCAR-C 17.84 0.03 2.73 13.03 13.71 17.62 22.54 23.66 1.00
GMCAR 0.19 0.01 0.16 -0.09 -0.04 0.18 0.49 0.57 1.02
GMCAR-C -0.35 0.00 0.17 -0.70 -0.63 -0.35 -0.07 -0.01 1.00
MCAR 0.12 0.01 0.20 -0.27 -0.19 0.11 0.46 0.57 1.01
MCAR-C -0.36 0.00 0.15 -0.67 -0.62 -0.36 -0.12 -0.07 1.00
GMCAR 0.12 0.01 0.18 -0.23 -0.14 0.11 0.43 0.51 1.04
GMCAR-C -0.63 0.00 0.08 -0.78 -0.76 -0.63 -0.50 -0.48 1.00
MCAR 0.00 0.02 0.19 -0.41 -0.32 0.00 0.31 0.41 1.02
MCAR-C -0.63 0.00 0.08 -0.79 -0.76 -0.63 -0.50 -0.48 1.00
GMCAR-C 0.51 0.00 0.23 0.04 0.12 0.51 0.90 0.96 1.00
MCAR-C 0.52 0.00 0.21 0.11 0.17 0.51 0.87 0.93 1.00
GMCAR-C 0.65 0.00 0.29 0.09 0.18 0.65 1.12 1.20 1.00
MCAR-C 0.67 0.00 0.27 0.15 0.24 0.67 1.10 1.19 1.00
GMCAR-C 0.86 0.00 0.10 0.67 0.70 0.86 1.02 1.05 1.00
MCAR-C 0.86 0.00 0.10 0.66 0.70 0.86 1.02 1.06 1.00
GMCAR-C 0.83 0.00 0.13 0.57 0.61 0.83 1.04 1.07 1.00
MCAR-C 0.82 0.00 0.13 0.57 0.61 0.82 1.04 1.08 1.00
Table 4: Summary statistics for the estimated parameter for the four models on the real data (‘-C’ stands for models with covariates). : South Asian population coefficient, : IMD coefficient for outcome

Table 4 reports summary statistics of the marginal posterior samples for the parameters of the models.

DIC Mean(TAP)
GMCAR 78.12 94.64 16.52 111.16 412.9 0.63
GMCAR (covariates) 74.68 93.78 19.09 112.87 413.7 0.63
MCAR 83.40 98.15 14.75 112.9 418.0 0.56
MCAR (covariates) 78.91 95.17 16.26 111.43 412.6 0.62
Table 5: Fit measures summary for all four models for Mortality (95 areas)
DIC Mean(TAP)
GMCAR 121.30 141.14 19.84 160.98 1524.2 0.33
GMCAR (covariates) 131.62 142.45 10.83 153.28 1513.0 0.32
MCAR 121.23 141.46 20.23 161.69 1523.6 0.33
MCAR (covariates) 131.12 142.34 11.22 153.56 1509.9 0.31
Table 6: Fit measures summary for all four models for Prevalence (130 GP practices)
Mortality
Model A Model B Mean se
GMCAR MCAR 2.5 1.2
GMCAR-C MCAR-C -0.6 0.4
GMCAR GMCAR-C 0.4 1.5
MCAR MCAR-C -2.3 1.3
Prevalence
Model A Model B Mean se
GMCAR MCAR -0.3 0.6
GMCAR-C MCAR-C -2.5 1.7
GMCAR GMCAR-C -5.6 4.3
MCAR MCAR-C -8.4 2.5
Table 7: Comparison of the mean and standard error of the elpd differences: . Positive value indicates better predictive performance for model A, negative values indicate better performance for model B (‘-C’ stands for models with covariates).

Tables 5 and 6 compare model fit using the DIC and LOO criteria. It can be seen that there are relatively small differences on these criteria between comparable models (i.e. models with covariates and models without covariates). The average posterior deviances are approximately close to the number of observations, suggesting a good fit of the model to the data. We compared LOO for GMCARMM and MCARMM models using differences in elpd as described in Section 3.2. This shows that the GMCARMM mortality model is preferable to the MCARMM one, showing a difference of 2.5 (on the elpd scale) and a standard error of 1.2. Also, the addition of covariates shows a significant improvement for the MCARMM prevalence model with an average elpd difference of 8.4 (standard error 2.5). Other model comparisons in Table 7 do not show significant differences in leave-one-out cross validation prediction.

Mortality Prevalence
GMCAR 0.042 0.08 0.015 0.023
GMCAR-C 0.032 0.074 0.015 0.023
MCAR 0.042 0.084 0.015 0.015
MCAR-C 0.042 0.074 0.015 0.023
Table 8: Proportion of tail area probabilities in different intervals for the four models and both outcomes (‘-C’ stands for models with covariates).

Table 5 and 6 show that the estimated posterior mean values of TAP never exceed 0.63 and are never below 0.31, as expected with adequately fitting models. Additionally, in Table 8 we can clearly see that it is very unlikely for the replicated values across all models to be sampled systematically over or under the dataset values. Figures 2 and 3 display densities for the replicated data at each iteration with the actual data.

(a)
(b)
(c)
(d)
Figure 2: Densities for mortality and replicated data at each replication
(a)
(b)
(c)
(d)
Figure 3: Densities for prevalence and replicated data at each replication

All models fit the data adequately, but there are important differences in the results when covariates are included in the models. Consider the bridging parameters and . The former accounts for within area correlation between spatial effects for different outcomes, and the latter represents between area association (i.e a form of spatial lag effect). Only for the MCAR without covariates shows an entirely positive 95% credible interval (0.04, 1.05). The results in Table 7 show a strong within area association between the two outcomes. For the GMCAR models, for the GMCAR without covariates is close to significance with a 90% credible interval spanning between -0.01 and 0.35. So the association between outcomes remains positive but shows up as via a neighbourhood feedback effect with regards to surrounding areas.

Posterior samples for the coefficients on the covariates, show a strong positive impact, as expected, for both the proportion of South Asian population and the deprivation index IMD, in accounting for both prevalence and mortality variations. Covariate effects are stronger on prevalence variations. The inclusion of such covariates generally improves fit, the exception being the GMCAR model for mortality.

With regards to measures of spatial dependence (the parameters) these are considerably affected by the presence or not of covariates, especially when adding covariates improves fit. For prevalence, inclusion of covariates leads to reduced DIC and LOO-IC and the parameters are correspondingly reduced; the covariates account for spatial dependence formerly in the residuals. For mortality impacts of covariates are significant (albeit smaller than those for prevalence) but fit is not improved when they are included.

(a)
(b)
Figure 4: Estimated mortality relative risks for the area of interest according to the four models.
(a)
(b)
Figure 5: Estimated prevalence relative risks for the area of interest according to the four models.

Figure 4 shows the posterior means for the estimated mortality relative rates, i.e. of equation (8). Figure 5 plots the posterior means of the estimated relative risks for prevalence without considering multiple membership, i.e. as in equation (8). Both maps compare the two CAR models with and without the inclusion of covariates, and it is evident that different models give significantly different results.

6 Conclusions

No previous studies have investigated the bivariate spatial relationship between diabetes mortality and prevalence for small areas. As mentioned above, recording of diabetes deaths data can be problematic, with possible undercounting. Despite these issues, the models applied in the above analysis show strong impacts of acknowledged area risk factors (area deprivation and ethnic population mix) on both outcomes. In models not including covariates, spatial association and dependence parameters are mostly in line with broader epidemiological knowledge: e.g. a positive association between the two outcomes.

The analysis is complicated by the fact that prevalence is not observed directly for areas but for a different set of spatial units: general practitioner catchment areas. We have used a form of multiple membership principle to transfer information between these spatial frameworks. Posterior checks confirm that this approach is successful.

References

  • Assunção (2003) Assunção, R. M. (2003). Space varying coefficient models for small area data. Environmetrics 14(5), 453–473.
  • Banerjee et al. (2014) Banerjee, S., B. P. Carlin, and A. E. Gelfand (2014). Hierarchical modeling and analysis for spatial data. Chapman and Hall/CRC.
  • Besag et al. (1991) Besag, J., J. York, and A. Mollié (1991, mar). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43(1), 1–20.
  • Blangiardo and Cameletti (2015) Blangiardo, M. and M. Cameletti (2015, mar). Spatial and Spatio-temporal Bayesian Models with R - INLA. John Wiley & Sons.
  • Browne et al. (2001) Browne, W. J., H. Goldstein, and J. Rasbash (2001, jul). Multiple membership multiple classification (MMMC) models. Statistical Modelling: An International Journal 1(2), 103–124.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017, jan). Stan : A Probabilistic Programming Language. Journal of Statistical Software 76(1), 1–32.
  • Coburn (2007) Coburn, T. C. (2007, sep). Hierarchical Modeling and Analysis for Spatial Data, Volume 39. CRC Press.
  • Coly et al. (2019) Coly, S., M. Garrido, D. Abrial, and A.-F. Yao (2019, sep). Bayesian hierarchical models for disease mapping applied to contagious pathologies. bioRxiv, 766071.
  • Congdon (2014) Congdon, P. (2014). Applied Bayesian modelling. John Wiley & Sons.
  • Eberly and Carlin (2000) Eberly, L. E. and B. P. Carlin (2000, sep). Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Statistics in Medicine 19(17-18), 2279–2294.
  • Fielding and Goldstein (2006) Fielding, A. and H. Goldstein (2006). Cross-Classified and Multiple Membership Structures in Multilevel Models: An Introduction and Review. Technical report.
  • Gamerman and Lopes (2006) Gamerman, D. and H. F. Lopes (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference.
  • Gamerman et al. (2003) Gamerman, D., A. R. Moreira, and H. Rue (2003, mar). Space-varying regression models: Specifications and simulation. Computational Statistics and Data Analysis 42(3), 513–533.
  • Gelfand and Vounatsou (2003) Gelfand, A. E. and P. Vounatsou (2003, jan).

    Proper multivariate conditional autoregressive models for spatial data analysis.

    Biostatistics 4(1), 11–15.
  • Gelman et al. (2013) Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian data analysis. CRC Press.
  • Gelman and Rubin (1992) Gelman, A. and D. B. Rubin (1992, jan). Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472.
  • Greco and Trivisano (2009) Greco, F. P. and C. Trivisano (2009, may). A multivariate CAR model for improving the estimation of relative risks. Statistics in Medicine 28(12), 1707–1724.
  • Hall et al. (2010) Hall, L. M. L., C. N. Moran, G. R. Milne, J. Wilson, N. G. MacFarlane, N. G. Forouhi, N. Hariharan, I. P. Salt, N. Sattar, and J. M. R. Gill (2010, dec). Fat Oxidation, Fitness and Skeletal Muscle Expression of Oxidative/Lipid Metabolism Genes in South Asians: Implications for Insulin Resistance? PLoS ONE 5(12), e14197.
  • Hilbe (2011) Hilbe, J. M. (2011, jan). Negative binomial regression, second edition. Cambridge University Press.
  • Jin et al. (2005) Jin, X., B. P. Carlin, and S. Banerjee (2005, dec). Generalized hierarchical multivariate CAR models for areal data. Biometrics 61(4), 950–961.
  • Joseph (2016) Joseph, M. (2016). Exact sparse CAR models in Stan.
  • Lawson (2018) Lawson, A. (2018, may). Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology, Third Edition. CRC Press.
  • Lawson et al. (2012) Lawson, A. B., J. Choi, B. Cai, M. Hossain, R. S. Kirby, and J. Liu (2012, sep). Bayesian 2-Stage Space-Time Mixture Modeling With Spatial Misalignment of the Exposure in Small Area Health Data. Journal of Agricultural, Biological, and Environmental Statistics 17(3), 417–441.
  • Lee (2011) Lee, D. (2011, jun). A comparison of conditional autoregressive models used in Bayesian disease mapping. Spatial and Spatio-temporal Epidemiology 2(2), 79–89.
  • Lunn et al. (2000) Lunn, D. J., A. Thomas, N. Best, and D. Spiegelhalter (2000). WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing 10(4), 325–337.
  • Manda et al. (2012) Manda, S., R. Feltbower, and M. Gilthorpe (2012, jan). Review and empirical comparison of joint mapping of multiple diseases. Southern African Journal of Epidemiology and Infection 27(4), 169–182.
  • Mardia (1988) Mardia, K. (1988, feb). Multi-dimensional multivariate Gaussian Markov random fields with application to image processing.

    Journal of Multivariate Analysis

     24(2), 265–284.
  • McCullagh and Nelder (1989) McCullagh, P. and J. A. Nelder (1989, aug). Generalized Linear Models, Second Edition, Chapman & Hall/CRC, Boca Raton, FL. CRC Press.
  • McEwen et al. (2006) McEwen, L. N., C. Kim, M. Haan, D. Ghosh, P. M. Lantz, C. M. Mangione, M. M. Safford, D. Marrero, T. J. Thompson, and W. H. Herman (2006, feb). Diabetes reporting as a cause of death: Results from the translating research into action for diabetes (TRIAD) study. Diabetes Care 29(2), 247–253.
  • Morris et al. (2019) Morris, M., K. Wheeler-Martin, D. Simpson, S. J. Mooney, A. Gelman, and C. DiMaggio (2019, nov). Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in stan. Spatial and Spatio-temporal Epidemiology 31, 100301.
  • Rue and Held (2005) Rue, H. and L. Held (2005). Gaussian Markov random fields : theory and applications. Chapman & Hall/CRC.
  • Snow (1855) Snow, J. (1855). On the Mode of Communication of Cholera. John Churchill, New Burlington Street, England.
  • Spiegelhalter et al. (2002) Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002, oct). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), 583–639.
  • Stan Development Team (2018) Stan Development Team (2018). Stan Modeling Language Users Guide and Reference Manual.
  • Stokes and Preston (2017) Stokes, A. and S. H. Preston (2017, jan). Deaths Attributable to Diabetes in the United States: Comparison of Data Sources and Estimation Approaches. PLOS ONE 12(1), e0170219.
  • Sturtz et al. (2005) Sturtz, S., U. Ligges, and A. Gelman (2005, jan). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software 12, 1–16.
  • Vehtari et al. (2017) Vehtari, A., A. Gelman, and J. Gabry (2017, sep). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5), 1413–1432.
  • Vehtari et al. (2019) Vehtari, A., A. Gelman, D. Simpson, B. Carpenter, and P.-C. Bürkner (2019). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC.
  • Wakefield and Shaddick (2006) Wakefield, J. and G. Shaddick (2006, dec). Health-exposure modeling and the ecological fallacy. Biostatistics 7(3), 438–455.