In recent years a considerable amount of research has been carried out to understand the complex pathogenesis of Type 2 Diabetes (T2D) as well as its role as a cardiovascular risk factor. Its prevalence is increasing world wide at an alarming rate and some regions like China and India are considered of particular interest [5, 13], as they have been considered at the epicenter of this epidemic.
The American Diabetes Association (ADA) International Expert Committee recently proposed the inclusion of glycated hemoglobin (HbA1c) as a diagnostic test for T2D. The possibility of a simpler and efficient diagnosis is attractive. However, some relevant findings have been reported and they suggest that a deeper understanding of this tool is desirable.
One question that has arised in the literature is whether the HbA1c can provide more information about atherosclerosis compared to that provided by a 2-hour postchallenge glucose (PG2-H) measurement. 
, evaluated the association between hyperglycemic markers and cardiovascular risk, which has been reported as non consistent for non-diabetic subjects, in a population from Shanghai. In particular, using linear regression models, they analysed the association between HbA1c and PG2-H with carotid intima-media thickness (CIMT), a well accepted estimate of carotid atherosclerosis. They reported that the population of normoglycemic subjects they analysed had a higher association between HbA1c and CIMT than between PG2-H and CIMT, suggesting HbA1c could be more informative of cardiovascular risk., exploring the UK Biobank using Cox models to asses risk, also found that HbA1c adds minimally to cardiovascular risk prediction in prediabetic population.
In Mexico, the National Survey for Health and Nutrition (ENSANUT) reported increasing prevalence rates of T2D, from in 2000 to in 2012. There is evidence that suggests that up to of the excess of diabetes observed on Mexican population could be attributed to environmental factors . The need to gather evidence based on scientific information to combat the T2D epidemic in this population is evident.
A robust description of the relationship between cardiovascular risk and hyperglycemic markers, controlled by relevant variables, is desirable for a better understanding of T2D.
An appealing technique to study the dependence among random variables is Kendall’s tau, which measures the ordinal association between variables and is defined by copula functions. Copulas are multivariate distributions with uniform marginals that link joint distributions with their univariate marginals. When the dependence among random variables is influenced by predictors, then the predictor dependent relation can be described by conditional copula functions. From here, a complete characterization of the association structure between the random variables can be given by conditional Kendall’s tau.
Bayesian nonparametric (BNP) models have been widely used, and their advantages, compared to their parametric counterparts, are abound. These models relax standard parametric assumptions, such as the choice of a particular family of distributions or the number of components that a mixture model requires. Hence, providing total flexibility and letting the data structure guide the inference. Due to its strengths, the use of BNP models for unconditional and conditional copula function estimation has increased during the last years.  proposed a BNP model for unconditional copula density estimation based on infinite mixtures of Gaussian copulas illustrating the flexibility of the BNP approach.  modeled the dependency between random variables by first modeling the marginals as infinite mixtures, and then linking them through a random Bernstein polynomial copula function.  modeled the copula function using Dirichlet polya trees. In the context of predictor-dependent copula function modeling,  extended the model proposed by  and defined an infinite mixture of conditional copula functions. The latter is flexible, but relies on the selection of a callibration function that relates the correlation structure of the copula function with the predictors.  proposed to model conditional copulas by flexibly modeling the calibration function with a Gaussian process prior. Though appealing, this still requires the selection of a specific copula, which leads to a model selection problem.
To overcome the aforementioned drawbacks, in this article, we present a fully BNP model for conditional copula density estimation. Namely, a prior distribution is placed on the family of conditional copula functions, here denoted by . Under such framework, predictor dependent copulas flexibly vary across predictors, thus the choice of a particular copula function is avoided and there is no need of a specific calibration function. Here, we consider a predictor-dependent stick-breaking prior distribution as proposed by , who extended the well known stick-breaking process prior proposed by  to accommodate for covariate dependency. The predictor dependence among stick-breaking prior, in such a framework enters through the corresponding weights and/or point masses. Fully BNP models based on the dependent stick-breaking and other extensions have been proposed by , , , , , among others.
Here, we present an approach that enebles the assessment of HbA1c as a cardiovascular risk marker in the Mexico City Diabetes Study (MCDS), controlled by relevant variables (). This study follows a cohort of low income elder urban participants, for whom surprising findings have been reported,  described subjects from this cohort who are normoglycemic (from an Oral Glucose Tolerance Test (OGTT)), despite the fact that they presented elevated values of HbA1c (). These results suggest that it is important to achieve a better understanding of the behavior of HbA1c as a potential cardiovascular risk marker in the MCDS, for varying levels of relevant covariates. To provide a flexible model to describe the conditional relation between hyperglycemic markers, such as HbA1c, and cardiovascular risk, we propose to model conditional copula functions as predictor dependent infinite mixtures of Gaussian copula functions with weights and correlation functions depending on predictors. Conditional copulas provide the complete characterization of conditional Kendall’s tau, which complements and gains better understanding of such relationship.
2 Background and Methods
Let and be measurements of cardiovascular risk and hyperglycemic markers of a normoglycemic patient , respectively. We consider the thickness of CIMT as cardiovascular risk marker, and HbA1c and PG2-H as hyperglycemic markers. In order to determine which hyperglycemic marker shows a stronger relationship with cardiovascular risk, we model the dependence structure between each one and the CIMT. Additionally, we consider that such dependence structure is affected by the triglycerides level, body mass index (BMI), age, and gender of patient
, encoded in a predictor vector.
It is well known that copula functions, introduced in the statistical literature by 
, describe the dependency structure between two or more random variables. Roughly speaking, copulas link the univariate marginal distribution function of random variables with their joint cumulative distribution function (c.d.f.). Since our interest is to model the dependency structure between pairs of random variables, in what follows we give a brief description of two-dimensional copulas.
A two-dimensional copula function, denoted , is a c.d.f. on such that , , for every , and for every set . According to Sklar’s theorem, the joint c.d.f., denoted , of two random variables, say and , with univariate marginals and , respectively, can be linked through a copula function, which is unique, when and are continuous. The relation is as follows:
. When the bivariate c.d.f. has a probability density function (p.d.f.), denoted, it holds that , where is the copula density and and are the p.d.f.s of and , repectively. A formal definition of copula functions, their properties, and more, can be found in .
Given the practicability for covariate inclusion, here we concentrate on the family of Gaussian copulas. Let
denote the c.d.f. of a bivariate normal distribution with zero mean vector and correlation parameterand let denote the inverse c.d.f. of a standard normal distribution, with . The Gaussian copula function, denoted , of the bivariate vector , with correlation parameter , is given by . The associated Gaussian copula density, denoted , is given by
where is the
-dimensional identity matrix and
Copula functions play a key role in the study of the association between random variables. In particular, one scale-invariant measure of association is Kendall’s tau, denoted . Specifically, Kendall’s tau measures the concordance between random variables, is defined by means of copulas, and given by
Our interest is to model the dependence structure between cardiovascular risk and hyperglycemic markers, controlled by relevant variables. Thus, we need the dependence structure to be dynamic for varying values of such variables. There is a rich literature on conditional copula functions and measures of association. See , , , , , , , , among others.
Conditional joint distributions are completely described by the conditional marginals and a conditional copula function, which is unique when the conditional marginals are continuous. According to Sklar’s theorem the relation is as follows , where is the predictor dependent copula function that describes the dependence structure between two random variables for varying values of the predictor . This, in turn, directly defines predictor dependent measures of association such as conditional Kendall’s tau, given by
As stated before, the purpose here is to robustly model predictor dependent copula functions to describe the predictor dependent association structure between cardiovascular risk and hyperglycemic markers, controlled by relevant covariates. Dependent BNP models provide a flexible tool for such data modeling framework.
We consider a dependent Dirichlet process mixture of bivariate Gaussian copula densities model with predictor dependent mixing distribution. Specifically, we utilize a dependent stick-breaking prior distribution for the set of predictor-dependent random measures. The model can be written as
where is defined as in (1) and the set has a dependent stick-breaking process prior such that, for each , the predictor dependent random measure is given by , with , where the stick-breaking variables, , and correlations, , are defined by transformations of real valued stochastic processes. Note that here, both weights and correlations are dynamic. To define and we exploit the relation between Gaussian processes and linear regression models, and consider suitable transformations. Note that the model in (3) can be equivalently written as
Under this approach the copula function is flexibly modeled through mixtures of bivariate Gaussian copula densities that vary and accommodate for different values of the predictor. We consider defined by the logistic transformation of the linear predictor, i.e., , with , . Similarly, the processes are defined by , with , where denotes a -dimensional normal distribution with parameters and a -dimensional positive definite matrix. A sufficient and necessary condition for the weights to add up to one in (4) is that for every , , see Lemma 1 in .
In a related approach, proposed by , only one random measure having a Dirichlet process prior is considered, and the predictor dependence is driven by a pre-specified calibration function. The fundamental difference is that in our approach, a prior distribution is placed on the collection of predictor dependent copula functions, allowing them to flexible vary for different values of the predictor.
The conditional association structure between variables of interest can be described by the covariate dependent Kendall’s tau, which as defined in (2), depends on the dynamic copula functions. An estimate of Kendall’s tau is given by
denotes the conditional copula estimate given by the predictor dependent mean of the posterior predictive copula function. Here, with an equally spaced grid of values for covariate , , and is an equally spaced grid of pairs in .
3 Model validation with simulated data
We illustrate the performance of the model in two simulation scenarios. For each simulation scenario, one sample of size 250 was generated from the bivariate joint distribution defined by a copula function and standard normal marginals. In order to have data points that closely approximate samples and from the copula densities, we consider the pseudo observations and , where denote the rank of among , , and denote the pairs generated from the joint distribution, as discussed by .
Under both scenarios, uniform on values were generated for the predictor. The first simulation scenario, named Scenario I, is defined by a predictor dependent elliptical copula with correlation parameter . Kendall’s tau for Scenario I is an increasing positive function of the predictor that varies from 0.006 to 0.910 for values of the predictor between 0.010 and 0.990. The second simulation scenario, named Scenario II, is defined by a mixture of two predictor dependent copula functions: a copula with correlation parameter and a Gumbel copula with parameter , and predictor dependent weights. Under this scenario, Kendall’s tau varies between 0.0 and 0.015 showing a very low positive association between the random variables. For more details on the simulation scenarios, see the Supplementary Material.
Under both simulated scenarios the specification of the model is completed by considering a truncation level of for the random measure. Since only one covariate is simulated,
, and the hyperparameters were set toand , where denotes the identity matrix. See the Supplementary Material for more details on the selection of these hyperparameters.
The Markov chain Monte Carlo (MCMC) posterior sampling scheme sequentially updates the coefficients of the linear predictors definingand with a multivariate slice sampler step. For a complete detail on the sampling scheme see the Supplementary Material. A single Markov chain of size 110,000 was generated for each simulated data set. Inferences are based on a reduced chain of size 10,000 obtained by saving one every 10 iterations, after a burn-in period of 10,000. Convergence of the posterior sample was evaluated with standard convergence test as implemented in the BOA R library .
Figure 1 displays the predictor dependent Kendall’s tau estimate (posterior median), its corresponding 95% credible bands, and the true predictor dependent Kendall’s tau for Scenario I and Scenario II. For both cases, the model is able to recover the dynamic association structure of the random variables. For Scenario I, Kendall’s tau is correctly estimated as a positive increasing function with increasing association structure, while for Scenario II Kendall’s tau is estimated as the almost constants function of the predictor it is, with very low values. Also, the true values of Kendall’s tau lie within the 95% credible band for every value of the predictor. See the Supplementary Material for additional results displaying the contour and surface plots of the copula density estimates and true copula density for selected values of the predictor.
The MCDS () study began in 1989 with the identification of a homogeneous low-income site. All men and non pregnant women aged 35 to 64 years were defined as eligible. The research protocol, informed consent, procedures, and methods were approved by the Institutional Review Board of the Center for Studies in Diabetes and all participants gave informed consent.
At baseline (1989 to 1990), a total of participants were interviewed and examined (from a total of eligible individuals). The final cohort (interviewed and examined) was composed of men and women. During the last phase (2008 to 2009), for a total of participants, in addition to the standardized exam and OGTT, an HbA1c measurement was included. T2D was diagnosed using ADA criteria: fasting plasma glucose concentration mg/dL and/or a 2-hour plasma glucose concentration mg/dL after a standard 75-g glucose load. Participants who self-reported a history of diabetes and were taking oral glucose-lowering agents were considered to have T2D, regardless of their plasma glucose values. Pre-diabetes (PD) was diagnosed when an individual had a fasting plasma glucose of to mg/dL and/or a 2-hour postglucose load between and mg/dL.
To contribute in the assessment of potential use of HbA1c as a cardiovascular risk marker, we explore the association between HbA1c and CIMT, which is related to atherosclerosis, controlled by triglycerides, BMI, age, and gender, and compare these results with those obtained from analyzing the association between PG2-H and CIMT, controlled by the same set of covariates. In our analysis, we consider the subset of normoglycemic individuals from the MCDS at the last phase of the MCDS who had the CIMT measured in phase 3 and who had measurements for all the considered covariates. Since the CIMT is prone to measurement error, values lower than the first quartile and larger than the third quartile were excluded. The resulting data sets containindividuals for each analysis. So, for each subject we observe pairs and the covariate , where denotes CIMT, denotes one of the hyperglycemic markers, and denotes the corresponding covariate measurement, which includes an intercept and the measurements of triglycerides, BMI, age (discretized into four categories), and gender. As for the simulated data sets, we consider the pseudo observations and as described in Section 3, in our analysis.
|Female sex||145 (58.9)|
For both real data sets the specification of the model is completed by considering a truncation level of for the random measure. Since the predictor age was discretized into four categories the real data sets contain predictors. Here, the mean hyperparameters were set to and the covariances and were defined by block diagonal matrices of the form
where and denote the design matrices based on the continuous and discrete covariates, respectively (). The scaling constants and , , control the linear predictor so that the stick-breaking variables range between 0.02 and 0.99 and the correlations range between -0.70 and 1.00, for every value of the predictor, respectively. For more details on the selection of these hyperparameters, see the Supplementary Material.
As in the simulation scenario, the MCMC posterior sampling scheme sequentially updates the coefficients of the linear predictors defining and with a slice sampler step. A single Markov chain of size 300,000 was generated for each real data set and inferences are based on a reduced chain of size 5,000 obtained after discarding the first 200,000 iterations and saving one every 20 iterations. Again, convergence of the posterior sample was evaluated with standard convergence test as implemented in the BOA R library .
Figures 2 - 4 display conditional Kendall’s tau estimate (posterior median) for normoglycemic participants between CIMT and PG2-H, for triglycerides in a range between and three different levels of BMI, namely and . They show that the association between PG2-H and CIMT is a slightly decreasing positive function of the triglycerides levels. Although the association is relatively small it is slightly bigger for older patients and this behavior is stable across different levels of BMI.
In contrast, as shown in Figures 5 - 7, the association between HbA1c and CIMT is a positive, moderately decreasing function of the triglycerides, with very small values for BMI level of 22. However, for BMI levels of 27 and 32 it shows an increasing association with a bigger slope for BMI levels of 32 and stable across age. This would suggest that if any of these hyperglycemic markers should be considered as an indicator of cardiovascular risk, the HbA1c could be a better indicator than the PG2-H for this population. This result is consistent with the findings reported by .
Understanding T2D and its relationship with other conditions is an important and complex task. Rates of T2D are increasing world wide and in particular, for the Mexican population. The assessment of cardiovascular risk through hyperglycemic markers is relevant and HbA1c is an accessible marker which has shown to be a stronger indicator than PG2-H, in normoglycemic subjects of the MCDS.
In this article we have presented a general and flexible model for describing the association structure between hyperglycemic markers and cardiovascular risk, controlled by triglycerides, BMI, age, and gender. Our methodology is based on a fully BNP approach for modeling conditional copula densities, and as a byproduct the conditional association structure is described by the conditional Kendall’s tau. This way, a direct understanding of this association, adjusted by known risk factors, is feasible without the need to define thresholds in either, the CIMT, HbA1c or PG2-H. While small, the behavior of the association suggests a moderate contribution of these hyperglycemic markers to cardiovascular risk assessment, when used as such HbA1c performs better.
6 Data Accessibility and Software
The MCDS data set analyzed in this paper will be publicly available after acceptance, together with the full code and documentation.
Claudia Wehrhahn was partially funded by award NSF-DMS 1622444 and CONACyT Project 241195. Ruth Fuentes-García would like to acknowledge the support of PAPIIT IN118720, UNAM. Ramsés H. Mena gratefully thanks the hospitality of the University of Bath, this work was conducted while he held a Global Chair Visiting position at that institution. Fabrizio Leisen acknowledges the support by the European Community’s Seventh Framework Programme [FP7/2007-2013] under Grant agreement No: 630677. Maria Elena González-Villalpando and Clicerio González-Villalpando want to express their gratitude to all the participants in the community, who allowed the Mexico City Diabetes Study investigators to enter their home. This generous attitude is recognized.
Semiparametric estimation of conditional copulas.
Journal of Multivariate Analysis110, pp. 43–73. Cited by: §2.
-  (2011) Dependence calibration in conditional copulas: a nonparametric approach. Biometrics 67 (2), pp. 445–453. Cited by: §2.
-  (2017) Fully nonparametric regression for bounded data using dependent bernstein polynomials. Journal of the American Statistical Association 112 (518), pp. 806–825. Cited by: §1.
-  (2014) Copula based factorization in bayesian multivariate infinite mixture models. Journal of Multivariate Analysis 127, pp. 200–213. Cited by: §1.
-  (2012) The worldwide epidemiology of type 2 diabetes mellitus-present and future perspectives. Nature reviews endocrinology 8 (4), pp. 228–236. Cited by: §1.
-  (2009) Bayesian nonparametric non-proportional hazards survival modelling. Biometrics 65, pp. 762–771. Cited by: §1.
-  (2004) An ANOVA model for dependent random measures. Journal of the American Statistical Association 99, pp. 205–215. Cited by: §1.
-  (2005) Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association 100, pp. 1021–1035. Cited by: §1.
-  (2009) Estimating copula densities through wavelets. Insurance: Mathematics and Economics 44 (2), pp. 170–181. Cited by: §3.
-  (2012) Multivariate and functional covariates and conditional copulas. Electronic Journal of Statistics 6, pp. 1273–1306. Cited by: §2.
-  (2011) Conditional copulas, association measures and their applications. Computational Statistics & Data Analysis 55 (5), pp. 1919–1932. Cited by: §2.
-  (1999) The mexico city diabetes study: a population-based approach to the study of genetic and environmental interactions in the pathogenesis of obesity and diabetes. Nutrition reviews 57 (5), pp. 71–77. Cited by: §1, §4.
-  (2014) Risk factors associated to diabetes in mexican population and phenotype of the individuals who will convert to diabetes. Salud Pública de México 56, pp. 317–322. Cited by: §1, §1.
-  (2010) Efficient estimation of a semiparametric dynamic copula model. Computational Statistics & Data Analysis 54 (11), pp. 2609–2627. Cited by: §2.
-  (2011) Glycated hemoglobin a1c, fasting plasma glucose, and two-hour postchallenge plasma glucose levels in relation to carotid intima-media thickness in chinese with normal glucose tolerance. The Journal of Clinical Endocrinology & Metabolism 96 (9), pp. E1461–E1465. Cited by: §1, §4.
-  (2001) Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, pp. 161–173. Cited by: §2.
-  (2002) A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statistics in medicine 21 (21), pp. 3197–3217. Cited by: §2.
-  (2017) Bayesian nonparametric conditional copula estimation of twin data. Journal of the Royal Statistical Society: Series C (Applied Statistics). Cited by: §1, §2.
-  (2018) Bayesian inference for conditional copulas using gaussian process single index models. Computational Statistics & Data Analysis 122, pp. 115–134. Cited by: §1.
-  (2017) Diabetic by hba1c, normal by ogtt: a frequent finding in the mexico city diabetes study. Journal of the Endocrine Society 1 (10), pp. 1247–1258. Cited by: §1.
-  (2000) Dependent Dirichlet processes. Technical report Department of Statistics, The Ohio State University. Cited by: §1.
-  (1999) Introduction. In An Introduction to Copulas, pp. 1–4. Cited by: §2.
-  (2018) A nonparametric bayesian approach to copula estimation. Journal of Statistical Computation and Simulation 88 (6), pp. 1081–1105. Cited by: §1.
-  (2006) Modelling asymmetric exchange rate dependence. International economic review 47 (2), pp. 527–556. Cited by: §2.
-  (2011) Nonparametric bayesian models through probit stick-breaking processes. Bayesian analysis (Online) 6 (1). Cited by: §1.
-  (1994) A constructive definition of Dirichlet prior. Statistica Sinica 2, pp. 639–650. Cited by: §1.
-  (1959) Fonctions de répartition à n dimensions et leurs marges. publications de l’institut de statistique de l’université de paris. Cited by: §2.
-  (2007) BOA: An R package for MCMC output convergence assessment and posterior inference. Journal of Statistical Software 21, pp. 1–37. Cited by: §3, §4.
-  (2011) Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics 38 (4), pp. 766–780. Cited by: §2.
-  (2020) Glycated hemoglobin, prediabetes, and the links to cardiovascular disease: data from uk biobank. Diabetes Care 43 (2), pp. 440–445. Cited by: §1.
-  (2015) Bayesian nonparametric estimation of a copula. Journal of Statistical Computation and Simulation 85 (1), pp. 103–116. Cited by: §1.
On assessing prior distributions and bayesian regression analysis with g-prior distributions. Bayesian inference and decision techniques. Cited by: §4.