Fully integrative data analysis of NMR metabolic fingerprints with comprehensive patient data: a case report based on the German Chronic Kidney Disease (GCKD) study

10/08/2018 ∙ by Helena U. Zacharias, et al. ∙ 0

Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To that end, it is necessary to integrate omics data with other data types such as clinical, phenotypic, and demographic parameters of categorical or continuous nature. Here, we exemplify this data integration issue for a study on chronic kidney disease (CKD), where complex clinical and demographic parameters were assessed together with one-dimensional (1D) 1H NMR metabolic fingerprints. Routine analysis screens for associations of single metabolic features with clinical parameters, which requires confounding variables typically chosen by expert knowledge to be taken into account. This knowledge can be incomplete or unavailable. The results of this article are manifold. We introduce a framework for data integration that intrinsically adjusts for confounding variables. We give its mathematical and algorithmic foundation, provide a state-of-the-art implementation, and give several sanity checks. In particular, we show that the discovered associations remain significant after variable adjustment based on expert knowledge. In contrast, we illustrate that the discovery of associations in routine analysis can be biased by incorrect or incomplete expert knowledge in univariate screening approaches. Finally, we exemplify how our data integration approach reveals important associations between CKD comorbidities and metabolites. Moreover, we evaluate the predictive performance of the estimated models on independent validation data and contrast the results with a naive screening approach.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The advent of new omics technologies, offering high coverage at an affordable price, has changed the landscape of large-scale clinical studies. More and more population-based trials collect not only information about phenotypes, traditional clinical parameters, and demographic variables, but also extensive omics data, e.g. KORA [1, 2], and TwinsUK [3]. This typically results in a large set of complex patient data, which needs to be statistically analyzed.

One example of such a large-scale, population-based study is the German Chronic Kidney Disease (GCKD) study, comprising about 5200 chronic kidney disease (CKD) patients. CKD constitutes one of the largest burdens on the world’s health system [4], with about 200 cases per million per year in many countries [5], and a global occurrence of 50% in high-risk subpopulations [6]. As it progresses, CKD leads to a large number of adverse clinical symptoms, and in some cases may progress to complete renal failure, called end-stage renal disease (ESRD) [7]. It is linked to elevated all-cause and cardiovascular mortality, acute kidney injury (AKI), cognitive decline, anemia, mineral and bone disorders, and fractures [4, 7, 8]. Moreover, it is often accompanied by numerous other systemic diseases, such as cardiovascular disease, hypertension, obesity, and diabetes [6, 9, 10].

The GCKD study was designed to improve our understanding of the causes, course, and risk factors of progressive kidney insufficiency [11]. To that end, demographic, phenotypic, and clinical parameters of approx. 5200 CKD patients were assessed [11, 12]. Additionally, NMR metabolic fingerprints of plasma specimens collected at the study baseline were acquired for almost the complete cohort. Metabolic markers are expected to better reflect the current state of the kidney than, e.g., genomic, transcriptomic, or proteomic markers [13]. In total, approximately 900 parameters of diverse data sources, which potentially have complex dependencies amongst each other, have been assembled. Here, we aim to identify, in the best case yet unknown, relationships between these complex layers of information. Particularly, we aim to uncover metabolic features related to CKD and its diverse comorbidities.

Data integration tries to detect such relationships. However, it presents both a conceptual and a practical challenge. Conceptually, we are faced with complex, often synergistic effects between variables within and across different layers of information, where the variables are potentially of different data types. Practically, we have to deal with a large-scale dataset covering close to thousand variables and several thousand measurements, leaving the researcher with a lot of possible hypotheses that require extensive validation. The practical and conceptual issues are inseparably connected. Assuming more complex and realistic models requires more parameters and their estimation becomes computationally more and more challenging.

Numerous methods have been proposed to analyze complex datasets, ranging from metabolome-wide association studies and multivariate statistical analysis to data-driven network-based approaches [14, 15]

. Metabolome-wide association studies are applied routinely to identify associations between metabolites and phenotypes, but they inherently ignore complex, multivariate relationships between variables. Approaches based on probabilistic graphical models can reveal complex relationships, but they are usually limited to one specific data type, e.g., Gaussian Graphical Models (GGMs) for continuous (Gaussian) variables or discrete Markov Random Fields (dMRF) for categorical variables

[16]. More recently, probabilistic graphical models have been extended to include different data types simultaneously [16, 17], although such methods have not yet entered biomedical research.

Univariate screening approaches ignore effects of confounding variables. These can be either incorporated by correcting the data or by adapting the statistical test. Formally, this corresponds to estimating the partial correlation coefficient , which denotes the partial correlation between and given the confounders . If , then there is an association between and given , where the size and sign of reflects the strength and sign of an association, respectively. An inherent problem in the estimation of is that it is not a priori clear which variables to include in . To solve this issue, we need to estimate , where includes all assembled variables except and , ideally containing the complete confounder set.

Here, we estimate , where , , and are the comprehensive patient data assembled within the GCKD study. The included variables are either continuous or categorical. Thus this analysis requires the estimation of a so-called Mixed Graphical Model (MGM) [16, 17]. We first show how this integrative analysis reveals known relationships between variables. Second, we illustrate that the discovery of associations in univariate screening approaches is biased by incorrect or incomplete expert knowledge. Third, we demonstrate that our data integration approach overcomes this issue. Finally we give an example, where important associations between CKD comorbidities and metabolites are revealed. Here, we show that the discovered associations remain significant after variable adjustment based on expert knowledge. Throughout the article, we substantiate our findings by evaluating the predictive performance of the estimated model on validation data.

2 Materials and methods

2.1 Mixed Graphical Models

MGMs are undirected probabilistic graphical models, where the conditional dependencies between different variables, the so-called nodes or vertices, are represented as edges in a network. Thus, two nodes are connected by an edge only if their association or interaction cannot be explained by any other node in the graphical model, or equivalently, if their partial correlation coefficient is unequal zero. Consequently, probabilistic graphical models eliminate spurious associations between variables and can potentially reveal new associations adjusted for all other variables in the dataset.

2.2 Algorithmic implementation

For the current dataset, which included variables and patients (measurements), we needed to estimate, in total, edge weights.

Therefore, we implemented an efficient algorithm to estimate MGMs as described in the supplementary methods section 6.5. Here, we used the pseudo-log-likelihood method of [17], made use of proximal algorithms to include a LASSO penalty on the parameters [18], and we performed the proximal gradient descent using Nesterov acceleration and adaptive restarts [19, 20]. The respective Python scripts to estimate MGMs are available upon request.

2.3 Data integration workflow

A schematic illustration of our data integration workflow is shown in Figure 1. Starting point is the GCKD study, Figure 1a. Here, we included a total of 3705 GCKD study participants. The remaining study participants had to be excluded due to missing demographic, clinical, and/or metabolic data. We generated a training set of 2470 study participants, corresponding to 2/3 of the complete cohort, and a test set of 1235 study participants, respectively, compare to Figure 1b. For the current study, we included 17 clinical chemistry parameters, 73 demographic parameters, and 46 different drug treatments, all assessed at the study baseline. Each of the three different information layers is represented by a blue, orange, and cyan schematic data matrix in Figure 1b, respectively. These parameters, their distributions, assessment, as well as transformation information are given in Supplementary Table S1. For each study participant, a one-dimensional (1D) H nuclear magnetic resonance (NMR) spectrum of a baseline EDTA plasma specimen was acquired. In total, 743 metabolic features were extracted from the spectra, as described in the supplementary methods section 6.4. This information layer is represented by the red data matrix in Figure 1b. In summary, the complete variable space included 879 variables, with 768 continuous and 111 discrete variables, respectively.

Figure 1: Scheme of the Mixed Graphical Model (MGM) data integration approach. (a) Of the data ascertained from the GCKD study population, (b) a total of 17 clinical chemistry parameters (blue), 73 demographic parameters (orange), 46 drug treatment parameters (cyan), and 743 NMR spectral features (red) were chosen. The complete dataset was split into a training and a test cohort, respectively. The first (c) was used to estimate an MGM, modeling all conditional dependencies between all variables, whereas the latter (d) was used for MGM model validation. In the network representation of the estimated MGM, blue nodes represent clinical chemistry parameters, orange nodes represent demographic variables, cyan nodes represent drug treatment information, and red nodes correspond to NMR buckets. Continuous variables are represented as circles and discrete variables as rectangles. Positive and negative associations are shown as blue and red edges, respectively. The strength of the association, i.e., the weight of the corresponding coefficient, is encoded by the edge width.

Next, the MGM was estimated on the training data, Figure 1c, and validated on independent test data, Figure 1d. Here and throughout the article, clinical chemistry parameters are represented as blue nodes, demographic variables as orange nodes, drug treatment information as cyan nodes, and NMR buckets as red nodes. Positive and negative associations are shown as blue and red edges, respectively, where the edge width encodes the absolute edge strength. Continuous variables are shown as circles and discrete variables as rectangles.

3 Results

3.1 MGM data integration reveals known associations and combines them to reliable prediction models

First, we will present several sanity checks for our MGM approach. We will illustrate how the MGM reveals known associations in the context of the two renal function markers urinary albumin-to-creatinine ratio and the glomerular filtration rate, for two frequent comorbidities, i.e. diabetes and gout, and for a lifestyle factor, i.e. alcohol consumption.

Urinary albumin-to-creatinine ratio

The urinary albumin-to-creatinine ratio (UACR) and the glomerular filtration rate (GFR) are the two most important markers of renal function routinely assessed in CKD patients. Figure 2a visualizes the first order neighborhood of UACR. The first order neighborhood of a node comprises all other nodes that have a direct association/edge with this particular node. The MGM trained on the training set connects UACR with urinary albumin (ua) by a strong positive edge, and with urinary creatinine (ucrea) by a strong negative edge, respectively. This correctly reflects the definition of UACR, .

A third positive edge appears between ua and ucrea. This edge can be explained within the conditional dependence framework: two variables are conditionally dependent on each other, if both variables simultaneously change upon holding all other variables fixed. Let the value of UACR be fixed and change the value of ua. In order to fulfill the requirement that the UACR stays fixed, urinary creatinine has to change accordingly to compensate the changes of urinary albumin and, thus, a positive edge appears between ua and ucrea.

We further observe positive edges between severe proteinuria (proteinuria 300mg/dL) and ua as well as UACR, respectively, correctly stating that patients with high urinary levels of albumin and accordingly high UACR values are diagnosed with proteinuria. Correspondingly, absent or mild proteinuria (proteinuria 30mg/dL) is negatively associated with urinary albumin and UACR. Here, intermediate proteinuria (30mg/dL proteinuria 300mg/dL) was considered as the baseline.

In a second step, we validated the predictive performance of the detected neighborhood. For this purpose, we used the respective edge weights as a linear signature that predicts UACR. As shown in Figure 3a, the identified neighborhood is almost perfectly predictive of UACR on independent test data with a correlation coefficient between true and predicted UACR values of .

Figure 2: (a) First order neighborhood of urinary albumin-to-creatinine-ratio (UACR). The first order neighborhood of a node comprises all other nodes which have a direct association/edge with this particular node. Positive associations are represented as blue, negative associations as red edges, respectively. The strength of the estimated association is encoded by the edge width. The edges are ordered according to their strength in a clock-wise manner for positive, and in an anti-clock-wise manner for negative associations, respectively. UACR is positively associated with urinary albumin (ua) (edge weight = 22.85) and negatively associated with urinary creatinine (ucrea) (edge weight = -7.19), respectively. The positive edge between ua and ucrea (edge weight = 7.31) can be explained within the conditional dependence framework (see text). Two levels of proteinuria are connected to both UACR and ua, respectively. High level of protein in the patient’s urine (proteinuria 300mg/dL) is positively correlated, whereas proteinuria of less than 30mg/dL (proteinuria 30mg/dL) is negatively correlated with UACR (edge weight = 0.573 and edge weight = -0.564, respectively) and ua (edge weight = 2.873 and edge weight = -3.53, respectively), respectively. Moderate proteinuria (30mg/dL proteinuria 300mg/dL) was considered as the baseline. (b) First order neighborhood of eGFR values estimated by the CKD-EPI equation based on serum creatinine (). is strongly negatively associated with serum creatinine () (edge weight = -11.185), strongly positively associated with male gender () (edge weight = 7.514), and negatively associated with age () (edge weight = -2.713). Negative associations are revealed between and serum cystatin C values () (edge weight = -0.764), and the NMR bucket at 3.045ppm (edge weight = -0.374), corresponding to creatinine, respectively. A complete list of all abbreviations for the clinical parameters can be found in Supplementary Table S1.
Glomerular filtration rate (GFR)

The GFR defines the fluid volume filtered by the glomeruli per unit time. Several formulas are available to estimate the GFR. Here, we included estimated GFR (eGFR) values calculated using the CKD-EPI equation [21].

CKD-EPI takes as input serum creatinine (), age, gender, and race. In Figure 2b, we show the first order neighborhood of eGFR, which correctly identified age, gender, and serum creatinine. Race was not included as a variable, as all GCKD study participants were Caucasian, and, thus, the respective connection could not be observed.

In addition, we observed associations of eGFR with cystatin C (cysC), and NMR buckets assigned to creatinine ( ppm, ppm, and ppm). Serum cystatin C, besides serum creatinine, is used as an endogenous marker to estimate GFR, e.g. in the CKD-EPI equation based on cystatin C [22], which explains the found association between cystatin C and eGFR.

In Figure 3b, we show how the neighborhood of eGFR predicts eGFR on test data. We observe that values agree with a correlation coefficient of .

Figure 3: (a) The diagram shows the predictions of UACR on the -axis (in standard units [su]) based on the neighbors of UACR, compare to Figure 2a, on independent test data compared to the true values plotted on the -axis. (b) Shows the analogous plot for eGFR. In both scenarios, predictions agree almost perfectly with the true values as indicated by the correlation coefficients between true and predicted values given in the lower right corners. The receiver operating characteristic (ROC) curve for predicting elevated blood sugar based on its neighborhood, as shown in Figure 4a, is shown in (c). The -axis here represents the false, whereas the

-axis represents the true positive rate, respectively. The dashed line gives the diagonal, corresponding to the predictive performance of a randomly generated model. In the lower right corner, the area under the ROC curve (AUC), an indicator of the predictive power of a classifier, is given. A perfect classifier would achieve an AUC of 1 on independent test data, whereas a randomly generated classifier with no predictive power would achieve an AUC of 0.5, respectively. (d) shows the ROC curve for the neighborhood model of the medical diagnosis of a patient as being diabetic.

Diabetes

One of the most common comorbidities of CKD is type-2 diabetes mellitus (T2DM). Untreated T2DM is characterized by high blood glucose. In this study, blood sugar is included as the discrete variable “elevated blood sugar yes/no” (bl_sug).

The MGM correctly identifies NMR buckets corresponding to D-glucose signals, namely 3.785 ppm, 3.865 ppm, 3.875 ppm, and 3.765 ppm, in the first-order neighborhood of “elevated blood sugar yes/no”, Figure 4a. The three strongest connections are between elevated blood sugar and diabetes medications (med_dm), the classification as diabetes patient (diabetic), and diabetic nephropathy (diab_neph), respectively.

In Figure 4b, we show the first-order neighborhood of , which exhibits a strong connection to diabetes medications (med_dm) and glycated hemoglobin (HbA1c) (hba1c). This reflects the definition of a diabetic patient within this study: a patient is defined as diabetic, if he had an HbA1c value 6.5% or took at least one medication with an active component classified as “A10”, i.e. insulin and other diabetes medications. HbA1c reflects the two to three month average plasma glucose level [23] and is here associated with elevated blood sugar independently of diagnosed diabetes.

Other strong connections were observed between blood sugar and retinal laser therapy due to diabetes (ret_las), as well as classification as T2DM patient (dm_typ2). All three variables are positively connected to diabetic nephropathy (diab_neph), as a study participant is stated to suffer from diabetic nephropathy, if he/she suffers from type-1 or type-2 DM or from other diabetic nephropathies.

The corresponding neighborhoods of bl_sug and diabetic are highly predictive. The areas under the receiver operating characteristic (AUC-ROC) curves are and on the test data, as shown in Figure 3c and d, respectively.

Figure 4: (a) First order neighborhood of elevated blood sugar (bl_sug) and (b) classification as diabetic patient (diabetic). Strong associations can be observed between bl_sug and diabetes medications () (edge weight = 1.52), (edge weight = 1.27), and diabetic nephropathy () (edge weight = 1.15), respectively. Other strong associations are present between and med_dm (edge weight = 2.89), and the HbA1c value () (edge weight = 2.366), respectively, as well as between 2 NMR buckets at 3.785ppm (edge weight = 0.136) and 3.865ppm (edge weight = 0.134), both corresponding to D-glucose, and bl_sug, and between and classification as type-2 DM patient (dm_typ2) (edge weight = 7.1) and retinal laser therapy due to diabetes () (edge weight = 0.47), respectively.
Gout

Gout is a common comorbidity in CKD patients [24, 25]. The first order neighborhood of gout () as shown in Figure 5 comprises a number of clinical and metabolic variables. This phenotype exhibits strong positive associations with the NMR buckets at 8.115 ppm and 8.125 ppm corresponding to unidentified small peptides. An NMR bucket at 3.565ppm, which could be assigned to glycine, is strongly negatively associated with gout.

Glycine has been reported to negatively correlate with gout in other metabolic studies [26]. It is involved in purine metabolism and is a precursor of uric acid [26]. Decreased plasma levels of glycine in patients with gout might be caused by the increased production of uric acid in these patients [26]. Note that uric acid has also been positively associated with gout in our study.

Interesting associations between gout and clinical variables are present for high alcohol consumption (positive association), analgetic nephropathy () (positive association), bmi (positive association), microscopic polyangiitis () (negative association), anti-dementia medication () (positive association), waist-hip ratio () (positive association), and Morbus Wegener (negative association). Alcohol consumption, age (here positively associated with gout), male gender (here weakly positively associated with gout), loop diuretica (here positively associated), and obesity are known risk factors of gout [27, 28].

The first order neighborhood of gout is predictive on independent test data with an AUC of 0.748, compare to Figure 6a.

Alcohol consumption

The assessment of alcohol consumption in the GCKD cohort was based on a questionnaire about weekly alcohol intake. Study participants, who had stated to consume alcohol not more than twice a week were classified as “little or normal alcohol consumption”, here treated as the baseline level, and all other study participants were classified as “high alcohol consumption”. Supplementary Figure 12 displays the first order neighborhood of “high alcohol consumption”. Interestingly, high alcohol consumption is strongly positively associated with male gender, acute renal failure (), high density lipoprotein (), an NMR bucket at 1.145 ppm, containing broad lipid signals mainly identified as low density lipoprotein (LDL), two unidentified NMR buckets at 0.835 and 2.755 ppm, and an NMR bucket at 2.675 ppm, identified as citric acid (negatively correlated). Further, it is positively associated with an NMR bucket at 1.925 ppm, identified as acetate (major signal) and minor contributions from arginine and lysine, and an NMR bucket at 3.275 ppm, identified as trimethylamine-N-oxide (TMAO) and minor contributions from D-glucose and betaine.

Levels of large and medium-sized HDL as well as large LDL particles have been reported to be elevated by alcohol consumption [29, 30]. It is also reported that alcohol consumption is generally higher in males than in females [30]. Citric acid has been reported to negatively correlate with alcohol consumption in other studies [31].

As shown in Figure 6b, the first order neighborhood of high alcohol consumption is predictive on independent test data with an AUC of 0.796.

Figure 5: First order neighborhood of gout (). Strong positive associations between this phenotype and the NMR bucket at 8.115 ppm (edge weight = 0.272), corresponding to unidentified small peptides, alcohol (edge weight = 0.202), 8.125 ppm (edge weight = 0.194) (unidentified small peptides), as well as analgetic nephropathy (an_neph) (edge weight = 0.169), and strong negative associations with the NMR bucket at 3.565 ppm (edge weight = -0.16), identified as glycine, can be observed. Gout is also connected to bmi (edge weight = 0.15), anti-dementia medication (med_antidemenz) (edge weight = 0.14), waist-hip ratio () (edge weight = 0.13), and Morbus Wegener (morb_weg) (edge weight = -0.13).
Figure 6: Receiver operating characteristics (ROC) curves for predicting (a) gout, (b) high alcohol consumption, (c) cardiac arrhythmia (), and (d) cardiac infarction () based on their respective first order neighborhoods on independent test data. The corresponding area under the ROC curve (AUC) values are given in the lower right corners.

3.2 Associations in univariate screening approaches can be biased by incorrect or incomplete expert knowledge, while MGMs intrinsically account for confounding variables

Here, we compare two different approaches to screen for associations, i.e. a standard univariate regression and the MGM data integration approach. We will illustrate that the associations discovered by the MGM are more robust for post hoc confounder adjustment than univariately screened associations.

3.2.1 Univariate screening

We first screened for univariate associations between variables by regressing each variable on every single remaining variable in the dataset. We performed linear or logistic univariate regression analysis, depending on whether the response variable followed a Gaussian or a binomial distribution, respectively. For each response variable, we ranked the univariate predictor variables according to their association strength in terms of significance, here

. The predictor with the largest was considered as the “top association”. Figure 7a column “top assoc.” shows the distribution of of all top associations.

Next, we repeated the regression analysis, but now in a multivariate fashion, simulating a regression analysis with confounder adjustment. We regressed each variable on its ”top associated” variable and on the next five most significant predictors simultaneously, and recorded the of the ”top associated” variable. The corresponding distribution is shown in Figure 7b “top assoc.”. We observe that the associations appear substantially weaker after variable adjustment. This trend is also visible from Figure 9a, where we contrast adjusted (y-axis) with unadjusted (x-axis) . Here, we obtained almost exclusively values in the upper half plane, see also Figure 10a. Finally, we regressed each response variable on its ”top associated” predictor and on five additional variables randomly drawn out of the next top 10 univariate predictors. The corresponding distribution can be found in Figure 7c “top assoc.”. Again, of the top associations in this multivariate regression scenario appear much weaker than in the univariate screening of Figure 7a.

3.2.2 Associations in MGMs are intrinsically corrected for confounding variables

We performed an analogous analysis for our MGM approach, but now, we ranked the predictors of each response variable according to their edge weights (from largest to lowest absolute value). The predictor with the largest absolute coefficient was considered as the “top neighbor”. In a post-hoc analysis, we regressed each response variable on its top MGM neighbor, and show the distribution of corresponding in Figure 7a, column “top neighbor”. Figure 7b, column “top neighbor” shows for these top MGM neighbors, but now adjusted for the next top 5 neighbors in the respective MGM neighborhood. We observe that the significance decreases, as previously for the univariate screening, albeit, this effect is less pronounced (Figure 7b). If we adjust for the same randomly drawn features as for the univariate approach, we obtain the results in Figure 7c, which show the same trend. For Figure 8a to c, we subtracted the left values (“top assoc.”) from the right values (“top neighbor”) in Figure 7a to c, respectively. In addition, we contrast those differences with their respective rank as red points. The -axis now corresponds to the rank percentiles, starting with the most negative difference at , and the highest positive difference at , respectively.

In Figure 8a, all differences are at or below zero. Approx. 42 of the difference values are negative (see green highlighted area), and the remaining differences are zero. After confounder adjustment, Figure 8b, the differences are predominantly positive. Considering the rank distribution, 70 of the difference values are positive (highlighted by the violet area), and only 30

are negative (highlighted by the green area). This indicates, on average, a higher significance of the top features selected by the MGM after variable adjustment. Figure,

8c shows the corresponding results after an adjustment for the same randomly selected features for “top assoc.” and “top neighbor”. Now, 30 of the differences are positive, only 12 are negative, and most of them are equal to zero.

Figure 7: Effects of variable adjustment on univariate or MGM association analysis in the training set. (a) “top assoc.” shows the distribution of derived from a univariate regression. Here, we calculated -values between all possible pairs of variables and collected all top associations. “top neigh.” shows the analogous distribution, where the top feature was selected by largest absolute edge weight in the MGM neighborhood. (b) The corresponding plot, where the -values were corrected by the top five confounder variables of the univariate and MGM screening, respectively. (c) The corresponding plots, where we adjusted for the same five randomly selected features for both methods.
Figure 8: (a) to (c) show the differences between “top neigh.” and “top assoc.” in Figure 7a to c, respectively: (a) shows the of the MGM approach minus those of the univariate screening in Figure 7a, (b) shows the corresponding plot after adjusting for the respective top confounders, as shown in Figure 7b, and (c) shows the corresponding plot after adjusting for the randomly selected confounders, as shown in Figure 7c. The red points in each figure contrast the values on the y-axis with their respective rank. On the x-axis, the highest positive difference corresponds to and the most negative to . The green shaded areas correspond to rank percentiles of negative, the violet shaded areas correspond to rank percentiles of positive differences, respectively.

3.2.3 MGMs can reveal associations which remain hidden or underestimated in univariate approaches

In Figure 9b we contrast adjusted with unadjusted values of for the MGM approach. We observe that several variables are stronger associated with each other after variable adjustment compared to the unadjusted scenario. This is in contrast to the univariate approach, Figure 9a, where almost all values were located in the upper half plane. A comparison between unadjusted and randomly adjusted for the univariate and the MGM approach can be found in Supplementary Figure 13.

An interesting example, how an association is revealed in a multivariate context, can be observed in the context of gout. One of the most important associations in the neighborhood of gout is alcohol, which was only ranked at position in the univariate screening. Gender, in contrast, whose association is on rank in a univariate screening, appears only at position in the MGM. Thus, we can speculate that the prevalence between gout and male gender can be explained to a large extend by the higher alcohol consumption in males.

Plots corresponding to Figures 7, 8, and 9 for the validation data are shown in Supplementary Figures 14, 15, and 16, and strongly support our observations.

Figure 9: Smooth scatter plot of adjusted -values (x-axis) versus unadjusted (univariate) -values (y-axis) for the univariate screening (a), and the MGM (b) in the training set. The red rectangle marks an excerpt shown in detail in Figure 10.
Figure 10: The region marked by a red rectangle in Figure 9 for the univariate screening (a), and for screening by the MGM (b) in the training set. Adjusted -values are shown on the x-axis, unadjusted -values on the y-axis.

3.3 TMAO is associated with cardiac infarction and cardiac arrhythmia

Cardiovascular diseases and complications are typical comorbidities and outcomes in patients with CKD. Here, we focus on the first order neighborhood of two common phenotypes in CKD patients, i.e., cardiac arrhythmia (card_arr) and cardiac infarction (card_inf), shown in Figure 11a and b, respectively. Both variables are explained by a number of demographic and drug information parameters.

Patients who were diagnosed with cardiac arrhythmia, compare to Figure 11a, received vitamin K antagonists () (positively associated), had experienced a heart failure (card_ins) (positively associated) and mitral valve insufficiency (mit_ins) (positively associated), experienced angina pectoris (br_pain) (positively associated), and dyspnea during physical strain () and during the night () (both positively associated), got a diagnosis of other heart valve anomalies () (positively associated), and received antithrombotic drugs () (positively associated).

Patients, who suffered a cardiac infarction, see Figure 11b, were likely (positively associated) to have received a diagnosis of coronary angiopathy (cor_ves_enl), to have undergone cardiac surgery (), were less likely to have been diagnosed with aortic valve stenosis () (negatively associated), suffered from CKD due to acute renal failure events (acute_fail) (positively associated), were likely to have experienced angina pectoris () and heart failure (card_ins), and to have received antiplatelet therapy (). They were also likely to have underwent a catheter angiography of peripheral arteries including an angioplasty of a peripheral artery (), to suffer from mitral valve insufficiency (), or a stroke (), and they also had lower serum cholesterol levels () (negatively associated).

Most interestingly, both cardiac infarction and cardiac arrhythmia were positively associated with an NMR bucket at 3.275 ppm, which could be identified as trimethylamine-N-oxide (TMAO) and minor signals from D-glucose and betaine.

Increased plasma levels of TMAO have just recently been described as a marker for atrial fibrillation independent of hypertension, BMI, smoking, diabetes, or intake of total choline [32], and have been associated with a higher incidence of major adverse cardiovascular events (death, myocardial infarction, or stroke) [33].

Figure 11: First order neighborhood of (a) cardiac arrhythmia () and (b) cardiac infarction (). is strongly connected to vitamin K antagonists () (edge weight = 1.50), heart failure (card_ins) (edge weight = 1.0), mitral valve insufficiency (mit_ins) (edge weight = 0.52), angina pectoris (br_pain) (edge weight = 0.39), dyspnea during physical strain () (edge weight = 0.38) and during the night () (edge weight = 0.26), other heart valve anomalies () (edge weight = 0.18), anti thrombotic drugs () (edge weight = 0.17), underwent temporary dialysis (temp_dial), and were positively associated with an NMR bucket at 3.275ppm (edge weight = 0.12), identified as trimethylamine-N-oxide (TMAO) and minor signals of D-glucose and betaine. is strongly connected to coronary angiopathy () (edge weight = 1.80), underwent cardiac surgery () (edge weight = 1.32), aortic valve stenosis () (edge weight = -0.65), acute renal failure (acute_fail) (edge weight = 0.55), angina pectoris () (edge weight = 0.49), heart failure () (edge weight = 0.43), antiplatelet therapy () (edge weight = 0.41), underwent catheter angiography of peripheral arteries including angioplasty of a peripheral artery () (edge weight = 0.21), mitral valve insufficiency () (edge weight = 0.20), stroke () (edge weight = 0.19), serum cholesterol levels () (edge weight = -0.16), anti thrombotic drugs (med_antipl) (edge weight = 0.16), and an NMR bucket at 3.275ppm (edge weight = 0.14).

Supplementary Figure 17 displays the distribution of NMR signal intensities at 3.275 ppm for both patients without and with cardiac arrhythmia for training and test data, respectively. Comparing both distributions by a Student’s -test yields a -value of 5.3 and 6.8 for the training and test cohort, respectively. To explore possible confounding effects, we adjusted the NMR bucket intensities of 3.275 ppm for hypertension, BMI, smoking, and diabetes, which have been reported as confounders in [32]. Intake of total choline was not accessed in the GCKD study and could therefore not be used for confounder adjustment. The corresponding boxplots for the residuals of NMR bucket intensities at 3.275 ppm are shown in Supplementary Figure 17(b) and (d), respectively. The -values comparing the cardiac arrhythmia with the non-cardiac arrhythmia group are still significant with values of 2.3 and 0.012 for training and test set, respectively.

The first order neighborhoods of cardiac arrhythmia and cardiac infarction are highly predictive on independent test data with AUCs of 0.80 and 0.93, respectively, compare to Figure 6 (c) and (d).

4 Discussion

We have shown that MGMs are a versatile tool for the integration of both continuous (Gaussian) and categorical variables. Here, we used data collected from patients that suffered from CKD and included variables comprising clinical and demographic parameters, medications, and blood plasma metabolites assessed by NMR spectroscopy. We could reveal several known relationships, such as the definitions of UACR, eGFR, and diabetes. More interestingly, we observed complex associations between plasma metabolites and comorbidities like gout and cardiac disorders. Moreover, we could validate those associations on test data.

Emerging datasets provide more and more layers of information, which makes data analysis increasingly complex. This high complexity leaves the researcher with an overwhelming amount of possible hypotheses, which require extensive tests and validations. Thus, there is an urgent need for automated tools that hint towards interesting observations. Here, MGMs are predestined, because they condense the available data into graphs that researchers can easily read and interpret. Moreover, MGMs consider the whole system of variables and measurements simultaneously. As a consequence, they correct for confounding factors, such as age, gender, and medications, automatically. Thus, if the MGM observes an interesting association, it is likely not a bystander effect of other variables if those are part of the analysis itself. We showed this in a scenario where we corrected for artificially selected confounders. Particularly, we could demonstrate that confounder adjustment can be necessary to reveal associations, as we exemplified for the association of gout with high alcohol consumption. Finally, we illustrated that the MGM discovered an association of TMAO with cardiac arrhythmia, which remains significant after adjustment for variables selected by expert knowledge.

MGMs are undirected graphical models. Thus, they do not include information about causality. In the context of Gaussian graphical models, there are plenty of works that estimate causal relations from observational data only [34, 35, 36]. Combining those methods with MGMs could further strengthen our data integration approach.

In summary, we tested MGMs for the integrative analysis of categorical and continuous variables in the context of CKD. We illustrated its application, provided software to estimate MGMs, and exemplified their interpretation. Moreover, we have reported interesting associations between plasma metabolites and the CKD comorbidities gout and cardiac disorders.

5 Acknowledgement

This work was supported in part by the German Federal Ministry of Education and Research (BMBF grant no. 01 ER 0821).

GCKD investigators are: University of Erlangen-Nürnberg, Germany: Kai-Uwe Eckardt, Stephanie Titze, Hans-Ulrich Prokosch, Barbara Bärthlein, André Reis, Arif B. Ekici, Olaf Gefeller, Karl F. Hilgers, Silvia Hübner, Susanne Avendao, Dinah Becker-Grosspitsch, Nina Hauck, Susanne A. Seuchter, Birgit Hausknecht, Marion Rittmeier, Anke Weigel, Andreas Beck, Thomas Ganslandt, Sabine Knispel, Thomas Dressel, Martina Malzer
Technical University of Aachen, Germany: Jürgen Floege, Frank Eitner, Georg Schlieper, Katharina Findeisen, Elfriede Arweiler, Sabine Ernst, Mario Unger, Stefan Lipski
Charité, Humboldt-University of Berlin, Germany: Elke Schaeffner, Seema Baid-Agrawal, Kerstin Petzold, Ralf Schindler
University of Freiburg, Germany: Anna Köttgen, Ulla Schultheiss, Simone Meder, Erna Mitsch, Ursula Reinhard, Gerd Walz
Hannover Medical School, Germany: Hermann Haller, Johan Lorenzen, Jan T. Kielstein, Petra Otto
University of Heidelberg, Germany: Claudia Sommerer, Claudia Föllinger, Martin Zeier
University of Jena, Germany: Gunter Wolf, Martin Busch, Katharina Paul, Lisett Dittrich
Ludwig-Maximilians University of München, Germany: Thomas Sitter, Robert Hilge, Claudia Blank
University of Würzburg, Germany: Christoph Wanner, Vera Krane, Daniel Schmiedeke, Sebastian Toncar, Daniela Cavitt, Karina Schönowsky, Antje Börner-Klein
Medical University of Innsbruck, Austria: Florian Kronenberg, Julia Raschenberger, Barbara Kollerits, Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner
University of Regensburg, Germany: Peter J. Oefner, Wolfram Gronwald, Helena Zacharias
Department of Medical Biometry, Informatics and Epidemiology (IMBIE), University of Bonn: Matthias Schmid

References

  • [1] Rolf Holle, Michael Happich, Hannelore Löwel, Heinz-Erich Wichmann, et al. Kora-a research platform for population based health research. Das Gesundheitswesen, 67(S 01):19–25, 2005.
  • [2] Thomas Illig, Christian Gieger, Guangju Zhai, Werner Römisch-Margl, Rui Wang-Sattler, Cornelia Prehn, Elisabeth Altmaier, Gabi Kastenmüller, Bernet S Kato, Hans-Werner Mewes, et al. A genome-wide perspective of genetic variation in human metabolism. Nature genetics, 42(2):137, 2010.
  • [3] Alireza Moayyeri, Christopher J Hammond, Ana M Valdes, and Timothy D Spector. Cohort profile: Twinsuk and healthy ageing twin study. International journal of epidemiology, 42(1):76–85, 2012.
  • [4] Vivekanand Jha, Guillermo Garcia-Garcia, Kunitoshi Iseki, Zuo Li, Saraladevi Naicker, Brett Plattner, Rajiv Saran, Angela Yee-Moon Wang, and Chih-Wei Yang. Chronic kidney disease: global dimension and perspectives. The Lancet, 382(9888):260–272, 2013.
  • [5] Andrew S Levey and Josef Coresh. Chronic kidney disease. The Lancet, 379(9811):165–180, 2012.
  • [6] Kai-Uwe Eckardt, Josef Coresh, Olivier Devuyst, Richard J Johnson, Anna Köttgen, Andrew S Levey, and Adeera Levin. Evolving importance of kidney disease: from subspecialty to global health burden. The Lancet, 382(9887):158–169, 2013.
  • [7] Ulrich Kuhlmann. Nephrologie: Pathophysiologie-Klinik-Nierenersatzverfahren; 252 Tabellen. Georg Thieme Verlag, 2008.
  • [8] KDW Group and others. Kidney Disease: Improving Global Outcomes (KDIGO) 2012 clinical practice guideline for the evaluation and management of chronic kidney disease. Kidney International, 3:1–150, 2013.
  • [9] Lakhmir S Chawla, Paul W Eggers, Robert A Star, and Paul L Kimmel. Acute kidney injury and chronic kidney disease as interconnected syndromes. New England Journal of Medicine, 371(1):58–66, 2014.
  • [10] John F O’Toole and John R Sedor. Kidney disease: new technologies translate mechanisms to cure. The Journal of clinical investigation, 124(6):2294–2298, 2014.
  • [11] Kai-Uwe Eckardt, Barbara Bärthlein, Seema Baid-Agrawal, Andreas Beck, Martin Busch, Frank Eitner, Arif B Ekici, Jürgen Floege, Olaf Gefeller, Hermann Haller, et al. The German chronic kidney disease (GCKD) study: design and methods. Nephrology Dialysis Transplantation, 27(4):1454–1460, 2011.
  • [12] Stephanie Titze, Matthias Schmid, Anna Köttgen, Martin Busch, Jürgen Floege, Christoph Wanner, Florian Kronenberg, Kai-Uwe Eckardt, GCKD Study Investigators, Kai-Uwe Eckardt, et al. Disease burden and risk profile in referred patients with moderate chronic kidney disease: composition of the German Chronic Kidney Disease (GCKD) cohort. Nephrology Dialysis Transplantation, 30(3):441–451, 2014.
  • [13] David S Wishart. Metabolomics in monitoring kidney transplants. Current opinion in nephrology and hypertension, 15(6):637–642, 2006.
  • [14] Jan Krumsiek, Jörg Bartel, and Fabian J Theis. Computational approaches for systems metabolomics. Current opinion in biotechnology, 39:198–206, 2016.
  • [15] Jonas Zierer, Cristina Menni, Gabi Kastenmüller, and Tim D Spector. Integration of ’omics’ data in aging research: from biomarkers to systems biology. Aging cell, 14(6):933–944, 2015.
  • [16] Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
  • [17] Jason D Lee and Trevor J Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.
  • [18] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
  • [19] Yurii Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
  • [20] Brendan O’Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3):715–732, 2015.
  • [21] Andrew S Levey, Lesley A Stevens, Christopher H Schmid, Yaping Lucy Zhang, Alejandro F Castro, Harold I Feldman, John W Kusek, Paul Eggers, Frederick Van Lente, Tom Greene, et al. A new equation to estimate glomerular filtration rate. Annals of internal medicine, 150(9):604–612, 2009.
  • [22] Lesley A Stevens, Josef Coresh, Christopher H Schmid, Harold I Feldman, Marc Froissart, John Kusek, Jerome Rossert, Frederick Van Lente, Robert D Bruce III, Yaping Lucy Zhang, et al. Estimating gfr using serum cystatin c alone and in combination with serum creatinine: a pooled analysis of 3,418 individuals with ckd. American journal of kidney diseases, 51(3):395–406, 2008.
  • [23] Shariq I Sherwani, Haseeb A Khan, Aishah Ekhzaimy, Afshan Masood, and Meena K Sakharkar. Significance of hba1c test in diagnosis and prognosis of diabetic patients. Biomarker insights, 11:BMI–S38440, 2016.
  • [24] Ana Beatriz Vargas-Santos and Tuhina Neogi. Management of gout and hyperuricemia in ckd. American Journal of Kidney Diseases, 70(3):422–439, 2017.
  • [25] Jiaojiao Jing, Jan T Kielstein, Ulla T Schultheiss, Thomas Sitter, Stephanie I Titze, Elke S Schaeffner, Mara McAdams-DeMarco, Florian Kronenberg, Kai-Uwe Eckardt, Anna Köttgen, et al. Prevalence and correlates of gout in a large cohort of patients with chronic kidney disease: the german chronic kidney disease (gckd) study. Nephrology Dialysis Transplantation, 30(4):613–621, 2014.
  • [26] MH Mahbub, Natsu Yamaguchi, Hidekazu Takahashi, Ryosuke Hase, Hiroki Amano, Mikiko Kobayashi-Miura, Hideyuki Kanda, Yasuyuki Fujita, Hiroshi Yamamoto, Mai Yamamoto, et al. Alteration in plasma free amino acid levels and its association with gout. Environmental health and preventive medicine, 22(1):7, 2017.
  • [27] Jasvinder A Singh, Supriya G Reddy, and Joseph Kundukulam. Risk factors for gout and prevention: a systematic review of the literature. Current opinion in rheumatology, 23(2):192, 2011.
  • [28] Kenneth G Saag and Hyon Choi. Epidemiology, risk factors, and lifestyle modifications for gout. Arthritis research & therapy, 8(1):S2, 2006.
  • [29] Elizabeth R De Oliveira e Silva, MonnieMcGee Harper, Cynthia E Seidman, Jonathan D Smith, Jan L Breslow, Eliot A Brinton, et al. Alcohol consumption raises hdl cholesterol levels by increasing the transport rate of apolipoproteins ai and a-ii. Circulation, 2000.
  • [30] Kenneth J Mukamal, Rachel H Mackey, Lewis H Kuller, Russell P Tracy, Richard A Kronmal, Murray A Mittleman, and David S Siscovick. Alcohol consumption and lipoprotein subclasses in older adults. The Journal of Clinical Endocrinology & Metabolism, 92(7):2559–2566, 2007.
  • [31] Peter Würtz, Sarah Cook, Qin Wang, Mika Tiainen, Tuulia Tynkkynen, Antti J Kangas, Pasi Soininen, Jaana Laitinen, Jorma Viikari, Mika Kähönen, et al. Metabolic profiling of alcohol consumption in 9778 young adults. International journal of epidemiology, 45(5):1493–1506, 2016.
  • [32] Gard FT Svingen, Hui Zuo, Per M Ueland, Reinhard Seifert, Kjetil H Løland, Eva R Pedersen, Peter M Schuster, Therese Karlsson, Grethe S Tell, Hall Schartum-Hansen, et al. Increased plasma trimethylamine-n-oxide is associated with incident atrial fibrillation. International journal of cardiology, 267:100–106, 2018.
  • [33] WH Wilson Tang, Zeneng Wang, Bruce S Levison, Robert A Koeth, Earl B Britt, Xiaoming Fu, Yuping Wu, and Stanley L Hazen. Intestinal microbial metabolism of phosphatidylcholine and cardiovascular risk. New England Journal of Medicine, 368(17):1575–1584, 2013.
  • [34] Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the pc-algorithm.

    Journal of Machine Learning Research

    , 8(Mar):613–636, 2007.
  • [35] Marloes H Maathuis, Markus Kalisch, Peter Bühlmann, et al. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
  • [36] Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247, 2010.
  • [37] Jens Wallmeier, Claudia Samol, Lisa Ellmann, Helena U Zacharias, Franziska C Vogl, Muriel Garcia, Katja Dettmer, Peter J Oefner, Wolfram Gronwald, and GCKD Study Investigators. Quantification of metabolites by nmr spectroscopy in the presence of protein. Journal of proteome research, 16(4):1784–1796, 2017.
  • [38] Helena U Zacharias, Jochen Hochrein, Matthias S Klein, Claudia Samol, Peter J Oefner, and Wolfram Gronwald. Current experimental, bioinformatic and statistical methods used in nmr based metabolomics. Current Metabolomics, 1(3):253–268, 2013.
  • [39] Jason Lee and Trevor Hastie. Structure learning of mixed graphical models. In Carlos M. Carvalho and Pradeep Ravikumar, editors,

    Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics

    , volume 31 of Proceedings of Machine Learning Research, pages 388–396, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.
  • [40] Jiahua Chen and Zehua Chen. Extended bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
  • [41] Rina Foygel and Mathias Drton. Extended bayesian information criteria for gaussian graphical models. In Advances in neural information processing systems, pages 604–612, 2010.
  • [42] Sacha Epskamp, Angélique OJ Cramer, Lourens J Waldorp, Verena D Schmittmann, Denny Borsboom, et al. qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(4):1–18, 2012.
  • [43] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006.
  • [44] Georg Heinze and Meinhard Ploner.

    logistf: Firth’s Bias-Reduced Logistic Regression

    , 2016.
    R package version 1.22.

6 Supplementary materials and methods

6.1 Cohort description

The study cohort comprises 3705 participants of the German Chronic Kidney Disease (GCKD) study, whose detailed baseline clinical and demographic characteristics are described elsewhere [11, 12]. Written declarations of consent had been obtained from all study participants before inclusion. From each patient, one EDTA-plasma specimen had been collected at the baseline time-point and stored at -80C until NMR measurement.

6.2 Clinical and demographic variables

For this study, we included 17 clinical chemistry parameters measured by SYNLAB International GmbH (Munich, Germany) and Central Lab, University Hospital Erlangen, Germany [12], 73 demographic parameters including age, sex, disease history and lifestyle factors, and 46 different drug treatments, resulting in a total of 136 clinical variables. Supplementary Table S1 provides a list of all included variables, their corresponding distributions, as well as assessment information.

6.3 NMR spectroscopy

For NMR measurements, 400 L of unfiltered EDTA-plasma were mixed with 200 L of 0.1 mol/L phosphate buffer at pH 7.4, 50 L of 0.75% (w) 3-trimethylsilyl-2,2,3,3-tetradeuteropropionate (TSP) dissolved in deuterium oxide, and 10 L of 81.97 mmol/L formic acid (all from Sigma-Aldrich, Taufkirchen, Germany), the latter serving as internal standard for referencing and quantification [37]. NMR experiments were carried out on a 600 MHz Bruker Avance III (Bruker BioSpin GmbH, Rheinstetten, Germany) employing a triple-resonance (H, C P, H lock) cryogenic probe equipped with z-gradients and an automatic cooled sample changer. Since the presence of broad protein signals might hinder the interpretation of metabolite signals, 1D H spectra employing a Carr-Purcell-Meiboom-Gill (CPMG) pulse-sequence for the suppression of macromolecular signals were acquired for each EDTA-plasma specimen. For each 1D H CPMG spectrum, 128 scans were collected into 73728 data points employing presaturation during relaxation for water suppression. Four dummy scans were acquired prior to measurement, the spectral width was 20.02 ppm, the relaxation delay was 4 s and the acquisition time amounted to 3.07 s. The filtering delay amounted to 0.08 s, resulting in a total acquisition time of about 16 min per sample.
NMR signals were assigned to known metabolites by comparison with reference spectra of pure compounds acquired under equal experimental conditions employing the Bruker Biofluid Reference Compound Database BBIOREFCODE 2-0-3 [38].

6.4 Data preprocessing

All continuous clinical and demographic variables except for age, systolic and diastolic blood pressure were log

transformed to remove heteroscedasticity. Recoding of discrete variables is detailed in Supplementary Table S1. In summary, we included 25 continuous and 111 discrete clinical and demographic variables, respectively.


All NMR spectra were referenced with respect to the formic acid signal at 8.463 ppm. Since signal positions between spectra may be subject to minor shifts due to slight differences in pH, salt concentration, and/or temperature between samples, equidistant binning was employed to compensate for these effects. The spectral region from 9.5 - 0.5 ppm was evenly split into 900 bins of 0.01 ppm width using Amix VIEWER 3.9.13 (Bruker BioSpin GmbH, Rheinstetten, Germany). All spectral data were scaled relative to the formic acid signal from 8.5 - 8.45 ppm in order to reduce variations in spectrometer performance. NMR buckets corresponding to the remainders of the suppressed water signal as well as minor contributions from urea (6.0 - 4.5 ppm), formic acid (8.47 - 8.45 ppm), as well as dimethylamine (2.73 - 2.72 ppm) have been excluded prior to data analysis, resulting in a total number of 743 NMR buckets.
For further analysis, data were imported into version 3.2.1 (Development Core Team 2009). To minimize heteroscedasticity of the NMR data, bucket intensities were log transformed and additionally subjected to mean-value subtraction.
All continuous variables were scaled to standard units.

6.5 Mixed Graphical Models

6.5.1 Probability density function

Integration of all discrete and continuous variables is achieved by estimating the conditional dependencies between them by a Mixed Graphical Model (MGM) [16]. MGMs are undirected probabilistic graphical models, where each node corresponds to one variable, and the edges between two nodes represent a conditional dependency between them given all other variables in the graphical model. If there exists no edge between two nodes, these two variables are conditionally independent of each other given all other variables in the MGM.

Lee and Hastie [39]

proposed the following probability density function

to describe a pairwise graphical model on continuous and discrete variables:

(1)

Here, denotes the th of continuous variables, denotes the th of discrete variables with states, represents the continuous - continuous edge, and the continuous node potential, respectively,

is the continuous - discrete edge potential, represented as a vector of size

, is the discrete - discrete edge potential, and summarizes the whole parameter space. describes conditional dependencies between two continuous variables and , corresponding to the probability density function of a Gaussian Graphical Model (GGM) [39]. If , no edge between and appears in the MGM, indicating an independency between these two nodes conditioned against all other nodes in the graphical model. Analogously, , which corresponds to a discrete pairwise Markov Random Field [39], describes conditional dependencies between two discrete variables and with and states, respectively. If all entries of the matrix are equal to zero, the two discrete variables are conditionally independent given all others. Finally, gives the conditional dependencies between a continuous variable and a discrete variable , represented by a vector of size . If all entries of this vector are equal to zero, the two variables are conditionally independent given all other variables in the MGM. In summary, if the whole parameter space is determined, we are able to fully describe all conditional dependencies between all variables in the considered system, here the population under investigation in the GCKD study, which can be represented in a network.

6.5.2 Pseudo-log-likelihood, LASSO penalty, and algorithmic implementation

A maximum likelihood estimate (MLE) of Eq. (6.5.1) requires a summation over all discrete states, which is numerically only feasible for small numbers of discrete variables. As an alternative, we use the pseudo-likelihood method of [39], where the negative pseudo log-likelihood is given by

(2)

Here, is the conditional distribution of a Gaussian variable , and is the conditional distribution of a discrete variable with states:

(3)

Note, that the pseudo-log-likelihood method does not require a summation over the whole state space, which makes it computationally more tractable than MLE.

We augmented the pseuod-log-likelihood estimate of (2) by a LASSO penalty to calibrate overfitting

(4)

Here, we used the definitions

(5)

The weights , , and were chosen as

(6)

where

is the standard deviation of the continuous variable

and and [39]. For the multinomial variables, our parameter space is redundant [39]. This was taken into account by multiplying the penalty factors which correspond to baseline levels by a factor of 10. This forces their couplings to zero.

Optimization problem Eq. (4) is convex [39] and can be efficiently solved by proximal algorithms. We implemented our software in Python, with computationally expensive parts written in C. Here, we used the Python package apgpy accessed under https://github.com/bodono/apgpy, which performs the proximal gradient descent with Nesterov acceleration and adaptive restarts [20].

6.5.3 Calibration of the penalty parameter

We chose the penalty parameter according to the Extended Bayesian Information Criterion [40, 41]

(7)

where is a parameter that can be chosen in the interval , is the number of measurements and is the number of variables. E is the set of edge parameters unequal to zero, and the corresponding maximized pseudo-log-likelihood. For calculating both and E we removed the baseline levels from the adjacency matrix. The minimum of Eq. (7) determines the penalty parameter , which we estimated by line search. This line search sweeps the -sequence , with . Throughout the article, we discuss the model , which corresponds to the sparsest model according to EBIC. For visualization, we used the R packages qgraph [42] and igraph [43], where we transformed the edge weights by the formula .

6.6 Association analyses

Analysis were performed using the statistical analysis software R version 3.3.3. For logistic regression, we used the R package logistf [44].

7 Supplementary Figures

Figure 12: First order neighborhood of high alcohol consumption (alcohol). Strong associations are observed between high alcohol consumption and male gender (edge weight = 0.682), acute kidney failure () (edge weight = 0.267), high density lipoprotein () values (edge weight = 0.211), gout (edge weight = 0.202), an NMR bucket at 1.145 ppm (edge weight = 0.19), containing broad lipid signals mainly identified as low density lipoprotein (LDL), an unidentified NMR bucket at 0.835 ppm (edge weight = 0.184), age (edge weight = 0.179), an unidentified NMR bucket at 2.755 ppm (edge weight = -0.138), an NMR bucket at 2.675 ppm (edge weight = -0.134), identified as citric acid, an NMR bucket at 1.925 ppm (edge weight = 0.131), identified as acetate (major signal) and minor contributions from arginine and lysine, and an NMR bucket at 3.275 ppm (edge weight = 0.127), identified as trimethylamine-N-oxide (TMAO) and minor contributions from D-glucose and betaine.
Figure 13: Smooth scatter plot of randomly adjusted -values (x-axis) versus unadjusted (univariate) -values (y-axis) for the univariate screening (a), and the MGM (b) in the test set.
Figure 14: Effects of variable adjustment on univariate or MGM association analysis in the test set. (a) “top assoc.” shows the distribution of derived from a univariate regression. Here, we calculated -values between all possible pairs of variables and collected all top associations. (a) “top neigh.” shows the analogous distribution, where the top feature was selected by largest absolute regression weight in the MGM neighborhood. (b) The corresponding plot, where the -values were corrected by the top five confounder variables of the univariate and MGM screening, respectively. (c) The corresponding plot, where we adjusted for the same five randomly selected features for both methods.
Figure 15: (a) to (c) show the differences between “top neigh.” and “top assoc.” in Figure 14a to c, respectively: (a) shows the of the MGM approach minus those of the univariate screening in Figure 14a, (b) shows the corresponding plot after adjusting for the respective top confounders, as shown in Figure 14b, and (c) shows the corresponding plot for adjusting for the randomly selected confounders, as shown in Figure 14c. The red point in each figure contrast the values on the y-axis with their respective rank. On the x-axis, the highest positive difference correspond to and the most negative to . The green shaded areas correspond to rank percentiles of negative, the violet shaded areas correspond to rank percentiles of positive differences.
Figure 16: Smooth scatter plot of adjusted -values (x-axis) versus unadjusted (univariate) -values (y-axis) for the univariate screening (a), and the MGM (b) in the test set.
Figure 17: Boxplots comparing NMR signal intensities at 3.275ppm, which comprises NMR signals from trimethylamine-N-oxide (TMAO) (major), as well as minor signals from D-glucose and betaine. Log-transformed NMR signal intensities for patients with () and without cardiac arrhythmia () for both training (a) and test data (c), respectively, are shown. The residuals of log-transformed NMR signal intensities after adjustment for BMI, systolic blood pressure, smoking status, and diabetes status for both training (b) and test data (d) are shown. -values of corresponding -tests are displayed above the boxes.