Spatial Modeling for Correlated Cancers Using Bivariate Directed Graphs

by   Leiwen Gao, et al.
Johns Hopkins University

Disease maps are an important tool in cancer epidemiology used for the analysis of geographical variations in disease rates and the investigation of environmental risk factors underlying spatial patterns. Cancer maps help epidemiologists highlight geographic areas with high and low prevalence, incidence, or mortality rates of cancers, and the variability of such rates over a spatial domain. When more than one cancer is of interest, the models must also capture the inherent or endemic association between the diseases in addition to the spatial association. This article develops interpretable and easily implementable spatial autocorrelation models for two or more cancers. The article builds upon recent developments in univariate disease mapping that have shown the use of mathematical structures such as directed acyclic graphs to capture spatial association for a single cancer, estimating inherent or endemic association for two cancers in addition to the association over space (clustering) for each of the cancers. The method builds a Bayesian hierarchical model where the spatial effects are introduced as latent random effects for each cancer. We analyze the relationship between incidence rates of esophagus and lung cancer extracted from the Surveillance, Epidemiology, and End Results (SEER) Program. Our analysis shows statistically significant association between the county-wise incidence rates of lung and esophagus cancer across California. The bivariate directed acyclic graphical model performs better than competing bivariate spatial models in the existing literature.



There are no comments yet.


page 8

page 11

page 12


Hierarchical Multivariate Directed Acyclic Graph Auto-Regressive (MDAGAR) models for spatial diseases mapping

Disease mapping is an important statistical tool used by epidemiologists...

AI-Decision Support System Interface Using Cancer Related Data for Lung Cancer Prognosis

Until the beginning of 2021, lung cancer is known to be the most common ...

Bayesian Models for Multivariate Difference Boundary Detection in Areal Data

Regional aggregates of health outcomes over delineated administrative un...

Failure of complex systems, cascading disasters, and the onset of disease

Complex systems can fail through different routes, often progressing thr...

A Meta-Analysis: Air Quality and Lung Cancer Mortality

Lung cancer has become one of the most epidemic and fatal malignant neop...

Statistical approaches using longitudinal biomarkers for disease early detection: A comparison of methodologies

Early detection of clinical outcomes such as cancer may be predicted bas...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Disease mapping, which refers to techniques for mapping and analysis of geographical variations in disease rates and the investigation of environmental risk factors underlying these patterns, has long been an important tool in cancer epidemiology (1). Disease maps are used to highlight geographic areas with high and low prevalence, incidence, or mortality rates of cancers, and the variability of such rates over a spatial domain. They can also be used to detect “hot-spots” or spatial clusters which may arise due to common environmental, demographic, or cultural effects shared by neighboring regions. Maps of crude incidence or mortality rates can be misleading when the population sizes for some of the units are small, which results in large variability in the estimated rates, and makes it difficult to distinguish chance variability from genuine differences. The correct geographic allocation of health care resources can be greatly enhanced by deployment of statistical models that allow a more accurate depiction of true disease rates and their relation to explanatory variables (covariates). Many tasks critical for successful cancer surveillance and control require new inferential methods to handle these complex and often spatially indexed data sets. Since local sample sizes within each spatial region are too low for design-based solutions to attain desired levels of statistical precision (2), much recent work in disease-mapping has been carried out within the context of Bayesian hierarchical models (3). The body of scientific literature on modern methods for geographic disease mapping is too vast to be reviewed here. Comprehensive reviews of prevalent statistical disease mapping methods and their implementation using available software can be found, among several other sources (4, 5, 6, 7).

Statistical models for mapping a single disease have employed probability distributions such as Markov random fields or MRFs

(8) that introduce dependence using the adjacency information among the different regions on a map. Two popular examples are the Conditional Autoregression (CAR) and Simultaneous Autoregression (SAR) models (9, 10, 11, 12) for further discussions on CAR and SAR models. More recently, Directed Acyclic Graphical Autoregressive (DAGAR) models that employ directed acyclic graphs have been developed as a preferred alternative to CAR or SAR models (13). A specific motivation for DAGAR models is that they impart greater interpretability to the spatial autocorrelation parameter.

In this article, we will perform joint spatial mapping of two different types of cancers. Joint modeling is appropriate when different diseases have been observed over the same spatial units and when the diseases themselves are related to each other, say because they share the same set of spatially distributed risk factors or the presence of one disease in a spatial unit may encourage or inhibit the presence of the second disease in the same spatial unit. Put differently, we seek models to capture the spatial association for each disease as well as the association between the diseases. There is, by now, a substantial literature on multivariate disease mapping (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24). These articles have demonstrated, theoretically and empirically, the benefits of jointly modeling several potentially related cancers, as opposed to modeling them independently. While it has been assertively demonstrated that independent models for cancers can lead to biased results because of unaccounted associations among the cancers, the current literature is largely based on using CAR models for spatial mapping. Our proposed bivariate DAGAR model for modeling two diseases over the same spatial region will help epidemiologists and spatial analysts better interpret the association among the cancers.

The balance of this article proceeds by developing a class of bivariate DAGAR models, conducting some disease mapping for two different cancers, and summarizing with some concluding remarks.

2 Methods

Our approach will be to construct a probability model for each disease using the distribution specified by DAGAR. We will extend the univariate DAGAR to a bivariate model by modeling the distribution of one disease as a univariate DAGAR and the conditional distribution of the second disease given the first also as a DAGAR. In this sense, our bivariate DAGAR is analogous to the bivariate CAR models (19). We develop notations and briefly discuss the univariate DAGAR in the next section, followed by the bivariate extension in the following section.

2.1 DAGAR for modeling a single disease

We consider a geographic map of our region of interest (e.g., a particular state) delineated by distinct administrative regions (e.g., counties or ZIP codes) with clear boundaries separating them. Let be a vector consisting of spatially associated random effects corresponding to each region. We develop a spatially correlated model using a directed acyclic graph. The geographic map provides us with a list of neighbors for each region. Neighbors can be defined by the user. Common definitions include when two regions share a common boundary or if their centers are within a certain fixed distance, although the model and resulting distribution theory hold for any fixed set of neighbors. The data structure for the geographic map and its neighbors is defined as a graph, denoted , where the regions are indexed by an ordered set and form the vertices of the graph and

is the collection of edges between the vertices, i.e., the collection of ordered pairs

such that and are geographic neighbors based upon some specified definition.

The DAGAR model specifies , where is a spatial precision matrix that depends only upon a spatial autocorrelation parameter and is a positive scale parameter. To describe , we define neighbor sets , where , i.e. the set excluding the region indexed by , and . Thus, includes geographic neighbors of region that precede in the ordered set . The precision matrix , where is a strictly lower-triangular matrix with entries and is a diagonal matrix with diagonal elements such that


where is the number of members in . The above definition of is consistent with the lower-triangular structure of because for any . The derivation of and as functions of a spatial correlation parameter

is based upon forming local autoregressive models on embedded spanning trees of subgraphs of


2.2 A bivariate DAGAR (BDAGAR) model

We now extend the DAGAR to the bivariate case, where we jointly model two cancers across regions. Let be the spatial random effect vector for disease , where refers to the spatial random effect for disease in region . We will build a hierarchical model,


where denotes a normal density with mean and precision matrix . The precision matrices for are the DAGAR precision matrices formed with the entries of and described in [1] with . Therefore, in [2] we model as a univariate DAGAR and conditional on also as a DAGAR. Each disease has its own distribution and there are two spatial autocorrelation parameters ( and ) corresponding to the two diseases. This ensures that spatial associations specific to each disease will be captured.

The matrix models the association between the two diseases. We use a parametric form , where is the binary adjacency matrix of the geographic map, i.e., if and

otherwise. The joint distribution of

is now derived from [2] as , where the precision matrix is


and the covariance matrix is


We call a normal distribution with the above precision, or covariance, matrix, the BDAGAR model. The interpretation of

and is clear: measures the spatial association for the first cancer, while is the residual spatial correlation in the second cancer after accounting for the first cancer. Similarly, is the spatial precision parameter for the first cancer, while is the residual precision for the second cancer after accounting for the first.

2.3 Model Implementation

Let be our outcome of interest corresponding to cancer in region . We will assume that is a continuous variable, e.g., incidence rates, that is related to a set of explanatory variables through the regression model,


where is a vector of explanatory variables specific to cancer within region , is the slopes corresponding to cancer , ’s are the spatial effects that collectively follow the bivariate DAGAR distribution described in Section 2.2, and capture additional heterogeneity and variability independent of spatial variation, where

is the residual variance for cancer

. The regression model is extended to the following specific Bayesian hierarchical framework with the posterior distribution proportional to


where , , and , and

is the inverse-gamma distribution with shape and rate parameters

and , respectively.

We sample the parameters from the posterior distribution in [2.3

] using Markov chain Monte Carlo (MCMC) with Gibbs sampling and random walk metropolis

(25) as implemented in the rjags package within the R statistical computing environment. To compare and assess models, we use the Widely Applicable Information Criterion (WAIC) (26, 27), which is computed as

where is the expected log point-wise predictive density for a new dataset and is the estimated effective number of parameters, which is sum of posterior variance of the log predictive density for each data point. WAIC is easy to compute using posterior samples.

3 Results

We analyze a data set extracted from the SEERStat database using the SEERStat statistical software (28). We consider 2 cancers, lung and esophagus, where the outcome is the crude incidence rates per 100,000 population in the years from 2012 to 2016 across 58 counties in California, USA. County-level explanatory variables for each cancer are available in the same years and include percentages of residents younger than 18 years old (young), older than 65 years old (old), with education level below high school (edu) , percentages of unemployed residents (unemp), black residents (black), male residents (male), uninsured residents (uninsure), and percentages of families below the poverty threshold (poverty).

We analyzed this data set using the Bayesian hierarchical model [2.3]. The county-level maps of the raw incidence rates per 100,000 population for the two cancers are shown in Figure 1

. The maps exhibit the evidence of correlation across space and between cancers. Cutoffs for the different levels of incidence rates are quantiles for each cancer. For both lung and esophagus cancer, in general, incidence rates are higher in counties located in the northern areas than those in southern part. The four counties in the center including Amador, Calaveras, Tuolumne and Mariposa have relatively high incidence rates compared to the neighboring counties. Overall, counties with similar levels of incidence rates tend to depict some spatial clustering.

For our analysis, we specified the following prior distribution,


where denotes the Uniform density over and is the BDAGAR precision matrix of given in [3].

We fit the BDAGAR model using the two different cancer orders, i.e. and the reverse ordering . We will refer to these orderings simply as and , respectively. Table 1 presents measures for model fit using the WAIC. We also compare BDAGAR with the “Generalized Multivariate Conditional Autoregression (GMCAR)” models (19). In both BDAGAR and GMCAR models, the conditional order has a smaller WAIC (hence better fit to the data) than the reverse ordering. Meanwhile, within each order, BDAGAR seems to excel over the GMCAR with lower scores in both model fit and effective number of parameters, as seen in the values of and , respectively. The preference of WAIC for is also corroborated by the posterior distribution of and from BDAGAR shown in Figure 2. In , the parameter has posterior median of and a credible interval . This shows significant negative values that offset part of the significant positive effect of with a median of and a credible interval of . For , is significantly positive with a median of and credible interval of , while tends to be positive with a median of but with a credible interval that includes . Consequently, we present the following results and analysis for which seems to be the preferred model.

Table 2 summarizes the parameter estimates from the BDAGAR model corresponding to . For fixed effects, the increasing percentage of residents younger than 18 years old significantly reduces the incidence rate for esophagus cancer, while the percentage of residents older than 65 years old has a significantly opposite effect for both esophagus and lung cancer. The increase in the percentage of unemployment also augments the incidence rate of lung cancer significantly. Turning to spatial correlations, measures the residual spatial correlation (posterior mean ) for esophagus cancer after accounting for the explanatory variables and measures the spatial correlation (posterior mean ) for lung cancer after accounting for the explanatory variables and also the effect of esophagus cancer. The small point estimates and narrower credible interval for indicate greater confidence in weaker spatial correlation for esophagus cancer; the moderate value of and a wider credible interval suggest higher spatial correlation for lung cancer. Turning to the spatial precision of random effects for each cancer, the estimates of are indicative of esophagus cancer having larger variability, although we must keep in mind that is the conditional marginal precision for lung cancer after accounting for esophagus cancer and, therefore, may not be directly comparable to .

Figure 3 shows the estimated correlation between lung and esophagus cancer in each of 58 counties. This map also seems to be consistent with the estimates of . Correlations between lung and esophagus cancers in all counties are significantly positive with large means at around which are due to the highly positive values in . This indicates that esophagus cancer is highly correlated with lung cancer. However, in general, the correlation between the two cancers increases slightly from the center to marginal areas, especially for those with fewer counties in the neighborhood.

Finally, Figure 4 provides further visual corroboration of the goodness of fit for the BDAGAR mode corresponding to . Here, we see that the posterior mean of the incidence rates for lung and esophagus cancer are very consistent with the raw incidence rates shown in Figure 1.

Figure 1: Maps of incidence rates per 100,000 population for lung and esophagus cancer in California.
Model lppd WAIC
BDAGAR (esophagus lung) -273.87 44.62 636.99
BDAGAR (lung esophagus) -158.25 50.27 417.05
GMCAR (esophagus lung) -282.76 48.23 661.97
GMCAR (lung esophagus) -158.76 52.33 422.18
Table 1: Model comparison using WAIC statistics for cancer data analysis
Figure 2: Posterior samples of linking parameters from BDAGAR model.
Parameters Esophagus cancer Lung cancer
intercept 15.40 (0.45, 30.36) 0.71 (-52.61, 55.17)
young -0.24 (-0.48, -0.01) -0.93 (-2.16, 0.31)
old 0.23 (0.06, 0.41) 2.93 (1.91, 3.98)
edu -0.02 (-0.15, 0.10) -0.39 (-1.22, 0.44)
unemp 0.12 (-0.05, 0.29) 1.41 (0.22, 2.61)
black 0.16 (-0.09, 0.41) 0.87 (-0.84, 2.54)
male -0.06 (-0.23, 0.10) 0.01 (-1.15, 1.16)
uninsure -0.22 (-0.45, 0.02) 0.24 (-0.78, 1.24)
poverty 0.38 (-0.31, 1.07) 0.47 (-3.98, 5.01)
2.31 (1.55, 3.40) 0.88 (0.17, 3.54)
2.00 (0.76, 3.84) 19.72 (2.25, 56.18)
0.11 (0.00, 0.31) 0.47 (0.02, 0.97)
Table 2: Parameter estimates (posterior means) for the California cancer incidence rate data from BDAGAR model. The numbers inside braces indicates the lower and upper bounds for the 95 credible intervals.
Figure 3: Estimated correlation between lung and esophagus cancer in each of 58 counties of California.
Figure 4: Maps of posterior mean incidence rates per 100,000 population for lung and esophagus cancer in California

4 Discussion

We have extended a recently proposed class of DAGAR models (13) for univariate disease mapping to bivariate “BDAGAR” models that can be applied to estimate spatial correlations for two correlated cancers. The BDAGAR model retains the interpretation of DAGAR models clearly separating the spatial correlation for each cancer from any inherent or endemic association between the two cancers. The BDAGAR model can still be efficiently computed using MCMC algorithms. Our analysis of incidence rates from lung and esophagus cancer demonstrates the efficiency of BDAGAR and its improved performance, as measured by WAIC, over existing alternatives such as the GMCAR models.

While we have restricted our attention only to cancer incidence rates, BDAGAR models can also be used with time-to-event data to investigate geographical patterns in the hazard function. For example, each patient in a study may provide multiple survival times from the onset of each of two cancers along with his or her county of residence. The BDAGAR model can become an excellent alternative to CAR and different MCAR models in spatial survival analysis (29, 17, 30, 31, 21).

Finally, the BDAGAR models developed here proceeds from conditional specifications. Concerns may arise over the ordering of the variables in the hierarchical approach. While in the case of a few cancers, such as 2 in our case, one can evaluate models arising from the different orders, this strategy will become cumbersome with several cancers. For instance, even with cancers, we will have different models that will need to be evaluated and compared. This becomes impractical. A joint modeling approach, analogous to order-free MCAR models as in (20)

, can build rich spatial structures from linear transformations of simpler latent variables. For instance, we can develop alternate multivariate DAGAR, or MDAGAR models, using

, where is a suitably specified square matrix and is a latent vector whose components follow independent univariate DAGAR distributions. Note that by modeling the joint distribution, the incompatibility of conditional model building (i.e., different joint distributions for different orderings) is avoided. However, the issue of the identifiability of is raised, and careful specification of its structure is needed. These approaches will be further investigated elsewhere.


The work of the first and second authors have been supported in part by the Division of Mathematical Sciences (DMS) of the National Science Foundation (NSF) under grant 1916349 and by the National Institute of Environmental Health Sciences (NIEHS) under grants R01ES030210 and 5R01ES027027.


  • (1) Koch T. Cartographies of disease: maps, mapping, and medicine. Esri Press Redlands, CA; 2005.
  • (2) Schaible WL. Indirect estimators in US federal programs. vol. 108. Springer Science & Business Media; 2013.
  • (3) Banerjee S, Carlin BP, Gelfand AE. Hierarchical modeling and analysis for spatial data. CRC Press, Boca Raton, FL; 2014.
  • (4) Best N, Richardson S, Thomson A. A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research. 2005;14(1):35–59.
  • (5) Waller L, Carlin B. Handbook Of Spatial Statistics. P. Diggle, M. Fuentes, AE Gelfand, and P. Guttorp, Boca Raton, FL: Taylor and Franci. USA: Chapman and Hall CRC Press; 2010.
  • (6) Waller LA, Gotway CA. Applied spatial statistics for public health data. vol. 368. John Wiley & Sons; 2004.
  • (7) Lawson AB. Statistical methods in spatial epidemiology. John Wiley & Sons; 2013.
  • (8) Rua H, Held L. Gaussian Markov Random Fields : Theory and Applications. Monographs on statistics and applied probability. Chapman and Hall/CRC Press, Boca Raton, FL; 2005. Available from:
  • (9) Besag J. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological). 1974;36(2):192–225.
  • (10) Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics. 1991;43(1):1–20.
  • (11) Anselin L. Lagrange multiplier test diagnostics for spatial dependence and spatial heterogeneity. Geographical Analysis. 1988;20(1):1–17.
  • (12) Kissling WD, Carl G. Spatial autocorrelation and the selection of simultaneous autoregressive models. Global Ecology and Biogeography. 2008;17(1):59–71.
  • (13) Datta A, Banerjee S, Hodges JS, et al. Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models. Bayesian Analysis. 2018;.
  • (14) Knorr-Held L, Best NG. A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2001;164(1):73–85.
  • (15) Kim H, Sun D, Tsutakawa RK. A bivariate Bayes method for improving the estimates of mortality rates with a twofold conditional autoregressive model. Journal of the American Statistical Association. 2001;96(456):1506–1521.
  • (16) Gelfand AE, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4(1):11–15.
  • (17) Carlin BP, Banerjee S. Hierarchical multivariate CAR models for spatio-temporally correlated survival data. Bayesian Statistics. 2003;7(7):45–63.
  • (18) Held L, Natário I, Fenton SE, et al. Towards joint disease mapping. Statistical Methods in Medical Research. 2005;14(1):61–82.
  • (19) Jin X, Carlin BP, Banerjee S. Generalized hierarchical multivariate CAR models for areal data. Biometrics. 2005;61(4):950–961.
  • (20) Jin X, Banerjee S, Carlin BP. Order-free co-regionalized areal data models with application to multiple-disease mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(5):817–838.
  • (21) Diva U, Dey DK, Banerjee S. Parametric models for spatially correlated survival data for individuals with multiple cancers. Statistics in Medicine. 2008;27(12):2127–2144.
  • (22) Zhang Y, Hodges JS, Banerjee S. Smoothed ANOVA with spatial effects as a competitor to MCAR in multivariate spatial smoothing. The Annals of Applied Statistics. 2009;3(4):1805.
  • (23) Martinez-Beneito MA. A general modelling framework for multivariate disease mapping. Biometrika. 2013;100(3):539–553.
  • (24) Marí-Dell’Olmo M, Martinez-Beneito MA, Gotsens M, et al. A smoothed ANOVA model for multivariate ecological regression. Stochastic Environmental Research and Risk Assessment. 2014;28(3):695–706.
  • (25) Gamerman D, Lopes HF.

    Markov chain Monte Carlo: stochastic simulation for Bayesian inference.

    Chapman and Hall/CRC; 2006.
  • (26) Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory.

    Journal of Machine Learning Research. 2010;11(Dec):3571–3594.

  • (27) Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Statistics and Computing. 2014;24(6):997–1016.
  • (28) Step 1: Calculating Age-adjusted Rates - SEER*Stat Tutorials;. Available from:
  • (29) Banerjee S, Wall MM, Carlin BP. Frailty modeling for spatially correlated survival data, with application to infant mortality in Minnesota. Biostatistics. 2003;4(1):123–142.
  • (30) Banerjee S, Dey DK.

    Semiparametric proportional odds models for spatially correlated survival data.

    Lifetime Data Analysis. 2005;11(2):175–191.
  • (31) Cooner F, Banerjee S, McBean AM. Modelling geographically referenced survival data with a cure fraction. Statistical Methods in Medical Research. 2006;15(4):307–324.