1 Introduction
Cardiovascular diseases (CVD), including atherosclerotic CVD (ASCVD), are multifactorial, arising from a combination of genetic, biological, and environmental factors. A growing body of research demonstrates that a better understanding of the pathobiology of CVD and other complex diseases necessitates analysis techniques that go beyond established traditional factors to ones that integrate clinical and demographic data with molecular and/or functional data. For instance, in a recent CVD study, Benson et al. (2018) integrated genomics data with CVD-risk-associated proteins in human plasma and identified many genetic variants (not previously described in the literature) which, on average, accounted for 3 times the amount of variation in CVD explained by common clinical factors such as age, sex, and diabetes mellitus. Given that environmental risk factors for ASCVD (e.g., age, gender, hypertension) account for only half of all cases of ASCVD (Bartels et al., 2012), identifying other biomarkers contributing to ASCVD remains a critical need. It is also recognized that leveraging existing sources of biological information, including gene pathways, can guide the selection of clinically meaningful biomarkers and reveal pathways associated with disease risk. In this study, we build models for associating multiple omics data and simultaneously predicting a clinical outcome while considering prior functional information. The models are motivated by, and are used for, the integrative analysis of multi-omics and clinical data from the Emory Predictive Health Institute, where the goal is to identify genetic variants, genes, and gene pathways potentially contributing to 10-year ASCVD risk in healthy patients.
Our approach builds and extends recent research on joint integrative analysis and prediction/classification methods (Kan et al., 2015; Luo et al., 2016; Li and Li, 2018; Safo et al., 2019)
. Unlike two-step approaches (integrative analysis followed by prediction or classification), these joint methods link the problem of assessing associations between multiple omics data to the problem of predicting or classifying a clinical outcome. The goal is then to identify linear combinations of variables from each omics data type that are associated with each other and are also predictive of a clinical outcome. For instance, the authors in
(Luo et al., 2016; Li and Li, 2018; Safo et al., 2019) combined canonical correlation analysis with classification or regression methods for the purpose of predicting a binary or continuous response. Luo et al. (2016) used a regression formulation of canonical correlation analysis (CCA) and proposed a joint method for obtaining canonical correlation vectors and predicting an outcome from the canonical correlation vectors. The authors in
(Safo et al., 2019) proposed a joint association and classification method that combines CCA and linear discriminant analysis, and uses the normalized Laplacian of a graph to smooth the rows of the discriminant vectors from each omics data type. This approach encourages predictors that are connected and behave similarly to be selected or neglected together. These joint association and prediction methods have largely been developed from a frequentist perspective. While there exist Bayesian factor analysis or CCA methods to integrate multi-omics data types (Mo et al., 2018; Klami et al., 2013), we are not aware of any Bayesian method that has been developed for joint association and prediction of multiple data types. Mo et al. (2018) extended iCluster into a Bayesian framework, and Klami et al. (2013) developed a Bayesian CCA model by imposing group-wise sparsity for inter-battery factor analysis (IBFA). In the present work, we extend current frequentist approaches for integrative analysis and prediction to the Bayesian setup. Our contributions are multifold. First, we adopt a factor analysis (FA) approach to integrate multiple data types, as done in (Shen et al., 2009, 2013; Klami et al., 2013)
. This allows us to simultaneously reduce the dimension of each data type to a shared component with a much smaller number of features (or components). On the other hand, our formulation of the problem differs from existing methods in that it is defined in a Bayesian hierarchical setting, which has the flexibility to incorporate other functional information through prior distributions. Our approach uses ideas from Bayesian sparse group selection to simultaneously identify active components and important features within components using two nested layers of binary latent indicators. Second, our approach is more general; we are able to account for the active shared components as well as the individual latent components of each data type. Third, we incorporate, in a unified procedure, clinical responses that are associated with the shared components, allowing us to evaluate the predictive performance of the shared components. Moreover, our formulation makes it easy to include other covariates without enforcing sparsity on their corresponding coefficients. Including other available covariates may inform the shared components, which in turn may result in better predictive performance of the shared components. Fourth, unlike most integrative methods that use FA, our prior distributions can be extended to incorporate external grouping information; this allows us to determine the most important groups of features that are associated with important components. Incorporating prior knowledge about grouping information has the potential to identify functionally meaningful variables (or networks of variables) within each data type for improved prediction of the response variable. For estimation, we implement a Markov chain Monte Carlo (MCMC) algorithm for posterior inference that employs partially collapsed Gibbs sampling
(van Dyk and Park, 2008) to sample the latent variables and the loadings.

1.1 Motivating Application
There is an increased interest in identifying “non-traditional” risk factors (e.g., genes, genetic variants) that can predict CVD and ASCVD. This is partly because CVD has become one of the costliest chronic diseases and continues to be the leading cause of death in the U.S. (AHA, 2016). It is projected that nearly half of the U.S. population will have some form of cardiovascular disease by 2035, costing the economy about billion per day in medical costs (AHA, 2016). There have been significant efforts to understand the risk factors associated with ASCVD, but research suggests that the environmental risk factors for ASCVD (e.g., age, gender, hypertension) account for only half of all cases of ASCVD (Bartels et al., 2012). Identifying risk factors of ASCVD and CVD unexplained by traditional risk factors is important, as such factors can potentially serve as targets for treatment.
We integrate gene expression, genomics, and/or clinical data from the Emory University and Georgia Tech Predictive Health Institute (PHI) study to identify potential biomarkers contributing to 10-year ASCVD risk among healthy patients. The PHI study, which began in 2005, is a longitudinal study of healthy employees of Emory University and Georgia Tech aimed at collecting health factors that could be used to recognize, maintain, and optimize health rather than to treat disease. The goal of the PHI, the need to identify other biomarkers of atherosclerotic cardiovascular disease, and the data available from the PHI (multi-omics, clinical, and demographic data) motivate our development of Bayesian methods for associating multiple omics data and simultaneously predicting a clinical outcome while accommodating prior functional information and clinical covariates.
The rest of the paper is organized as follows. In Section 2, we present the formulation of our modeling approach. In Section 3, we describe our MCMC algorithm, explain how we evaluate prediction performance, and set the hyperparameters of the prior distributions. In Section 4, we evaluate and compare our methodology using simulated data. Section 5 presents the application to atherosclerosis using both mRNA expression and SNP data. Finally, we conclude in Section 6.

2 Model formulation
Our primary goal is to define an integrative approach that combines multiple data types and incorporates clinical covariates, response variables, and external grouping information. In Section 2.1, we introduce the factor analysis model that relates the data types. We then adopt a Bayesian variable selection technique in Section 2.2 to select active components and their corresponding important features. We show how we incorporate grouping information in Section 2.3.
2.1 Factor analysis model for integrating multiple data types and clinical outcome
We assume that (omics) data types are available: , where is the measure of subject on the feature for data type . A factor analysis model for the data types can be written as
(1)  
Here,
is the identity matrix of size
. is a coefficient matrix for data type with each row representing the factor loadings for components . is a matrix of latent variables that connects the sets of models, thus inducing dependencies among the data types. The latent variable is intended to explain the correlations across the data types, and is the error that explains the remaining variability unique to each data type. The integrated data matrix is then multivariate normal with mean zero and covariance matrix , where and is the diagonal block matrix of , . As is commonly done in factor analysis, we assume , where is the variance of feature
in data set . Model (1) is also called the Gaussian latent variable model. This model has been used extensively for integrative clustering of multiple genomic datasets (Shen et al., 2009, 2013; Chalise et al., 2014). In this paper, we are interested in examining the association between the two sets of features and performing predictive modeling of the response in an integrative way. Hence, in addition to model (1), we treat the clinical outcome as another variable set, which is related to the feature matrices through the latent variables . Thus, the clinical outcome equation can be written as , where is a coefficient vector of the effect that component , , has on the outcome, is the intercept, and is the error term. We also assume that there is another data type which corresponds to the set of clinical covariates.
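As a concrete illustration of the generative structure in model (1), the sketch below simulates two data types that share a common latent matrix, together with an outcome driven by the first two components. All dimensions, variances, and coefficient values here are assumptions chosen for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 4                 # subjects and latent components (assumed)
p = [30, 20]                  # number of features per data type (assumed)

U = rng.normal(size=(n, r))                    # shared latent factors
A = [rng.normal(size=(pm, r)) for pm in p]     # factor loadings per data type
X = [U @ Am.T + rng.normal(scale=0.1, size=(n, Am.shape[0])) for Am in A]

theta = np.array([1.0, -1.0, 0.0, 0.0])        # only components 1-2 affect y
y = U @ theta + rng.normal(scale=0.1, size=n)
```

Because both data types load on the same latent matrix `U`, features across `X[0]` and `X[1]` are correlated; this is exactly the dependence the latent variables are meant to capture.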
2.2 Bayesian variable selection using latent binary variables
To determine active components for each data type (including the outcome), we define a binary latent variable vector, , given by if component is active in data type and 0 otherwise, for . We assume that for , for every , so that all components are potentially active for the clinical covariates.
We are not only interested in selecting the active components; we also want to identify the variables that contribute significantly to active components. Unlike the single layer of indicators used in the group-wise sparsity for inter-battery factor analysis (IBFA) (Klami et al., 2013), we use two nested layers of binary indicators to denote whether a component or a variable is selected. A similar idea was recently used for Bayesian sparse group selection in regression models (Chen et al., 2016). At the component level, for every , indicates which components are active for data type . At the variable level, is an indicator vector for the variables in the th component. Specifically, if feature is selected in component in data type . The prior distribution for the component indicator,
, is the Bernoulli distribution with parameter
. In the th component, the prior distribution of is dependent on the indicator and is defined as
(2)
where
is the Bernoulli distribution with probability
. Thus, if the th component is not active, then for all . We note that for every and , i.e., the response variable and clinical covariates will always be included in the model. To infer a sparse matrix , that is, whether each entry in is important or not, we impose the following sparsity-inducing prior on the matrix elements given the indicators and . Thus, if the th variable in the active th component is selected, then the prior of
is a normal distribution with zero mean and variance
. Otherwise the coefficient is equal to 0. When , the feature is the outcome and . The prior distribution of the residual variance is an inverse Gamma distribution,
. Finally, we assume that and follow a Beta distribution . Our method encompasses three different scenarios: i) each component is shared across all data types (including the response), i.e., for every and
. This scenario only accounts for the joint variation between the data types, and is assumed in standard CCA methods for high dimensional data integration such as
(Witten and Tibshirani, 2009; Luo et al., 2016; Safo et al., 2018); ii) none of the components is shared across data types, i.e., . This scenario would be similar to standard factor analysis applied to each data set independently, and accounts only for the individual variation explained by each data set; iii) only some of the components are shared across the data sets. Hence, our method can capture both the joint structure and the individual structure of the data sets. Our method thus encompasses the Bayesian CCA and JIVE (Joint and Individual Variation Explained) methods introduced by Klami et al. (2013) and Lock et al. (2013), respectively. For instance, assume we have other data types (besides the response and clinical covariates), components, the first two components are shared (i.e., , for every ), and components 3 and 4 are only associated with data types 1 and 2, respectively (i.e., and ). The models for data types 1 and 2 can then be written as and , where (resp. ) is the matrix of the first two components of (resp. ). In this example, is the shared component, and and can be considered the individual latent structures for data types 1 and 2, respectively.

2.3 Incorporating grouping information through prior distributions
In this section, we present how we incorporate known group structures of the data types through prior distributions. For that, we introduce a design/loading matrix consisting of
columns of dummy variables coding for group membership, for every
. More specifically, if marker from data type belongs to group , and 0 otherwise. Here we assume that this matrix can be retrieved from external scientific knowledge. For instance, genes can be grouped using KEGG pathway information or other pathway/network information (Qiu, 2013). Similarly, SNPs can be grouped using their corresponding genes. In order to incorporate the grouping information , we add another layer to the prior of the factor loadings . More formally, we incorporate the grouping information through the prior distributions of as follows
(4)  
(5) 
where denotes the th row of the matrix and denotes the gamma distribution with shape and rate . The vector represents the coefficient effects of the groups. The intercept can be regarded as a global shrinkage hyperparameter determining the baseline level of shrinkage. From equation (5), each is independently assigned a gamma distribution with expected value . By integrating out , the marginal prior of a nonzero loading is defined by
(6) 
The parameter is assumed to be nonnegative in order to reduce the amount of shrinkage attributable to the th important group. Hence, higher values give more weight to the group information by increasing the effect of features belonging to group (see Figure 2.1 for an illustration of the marginal prior (6) of a loading). A similar prior construction was considered by Rockova and Lesaffre (2014) to incorporate grouping information in the context of regression models.
To determine the important groups that contribute to the formation of active components, we define another latent binary matrix variable , where if group (or pathway) contributes to active component , and 0 otherwise. The prior of is then dependent on and is defined as
We finally assume that the intercept also follows a Gamma distribution . Correlations between significant features within the same group are captured through the prior of the penalty parameter (see prior (5)). For instance, if features 1 and 2 both belong to the same group, the joint marginal distribution of (after integrating out ) is
Figure 2.2 shows contour plots of for four combinations of , with and . As expected, as increases and/or decreases, and tend to have a stronger correlation, translating into a higher probability of taking similar values.
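To see how a group effect modulates shrinkage, the following Monte Carlo sketch samples loadings from a gamma-mixed normal prior in the spirit of (4)-(6). The shape/rate parameterization and all numeric values are assumptions for illustration, not the paper's exact specification; the point is that a larger group effect lowers the expected precision, so loadings in an "important" group are shrunk less:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_loadings(group_effect, lam0=0.0, size=200_000):
    # Precision of a loading: lambda ~ Gamma(shape=2, rate=exp(lam0 + group_effect)).
    # A larger group_effect lowers E[lambda], i.e., weakens shrinkage (assumed form).
    rate = np.exp(lam0 + group_effect)
    lam = rng.gamma(shape=2.0, scale=1.0 / rate, size=size)
    # Loading given its precision: N(0, 1/lambda); marginally a scaled t.
    return rng.normal(scale=np.sqrt(1.0 / lam))

baseline = sample_loadings(group_effect=0.0)   # group carries no extra evidence
boosted = sample_loadings(group_effect=2.0)    # hypothetical "important" group
print(baseline.std(), boosted.std())           # boosted loadings spread more widely
```

Integrating the precision out yields a heavy-tailed (scaled-t) marginal for each nonzero loading, which is what Figure 2.1 illustrates for prior (6).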
3 Posterior inference and prediction
For posterior inference, our primary interest is in the estimation of the active component and feature selection indicator variables
, , and the loadings . We implement a Markov chain Monte Carlo (MCMC) algorithm for posterior inference that employs partially collapsed Gibbs sampling (van Dyk and Park, 2008) to sample the binary matrices and the loadings. Specifically, to sample given the other parameters, we integrate out all the loadings and use a Metropolis-Hastings step. We then sample the loadings from their full conditional distributions. In Section 7 of the web supplementary material, we present the conditional distributions used to update all the parameters. We outline the steps of our MCMC algorithm below.
Initialize the parameters with 1, with 0.1, and the other parameters by their prior distributions.

Sample , using a Metropolis-Hastings step as described in Section 7.1 of the web supplementary material.

Sample and simultaneously using a Metropolis-Hastings step as described in Section 7.5 of the web supplementary material.
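The flavor of a partially collapsed update can be illustrated on a much simpler spike-and-slab regression: each inclusion indicator is sampled with its coefficient integrated out (collapsed), and the coefficient is then redrawn from its full conditional. This is only an analogy to the sampler above; the model, priors, and all values below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def spike_slab_gibbs(X, y, n_iter=500, tau2=10.0, sigma2=1.0, prior_p=0.5):
    """Gibbs sampler: gamma_j is updated with beta_j marginalized out,
    then beta_j is redrawn from its full conditional (partially collapsed)."""
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=int)
    incl = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # residual excluding j
            s = X[:, j] @ X[:, j]
            v_post = 1.0 / (s / sigma2 + 1.0 / tau2)
            m_post = v_post * (X[:, j] @ r) / sigma2
            # log Bayes factor for gamma_j = 1 vs 0, with beta_j integrated out
            log_bf = 0.5 * np.log(v_post / tau2) + 0.5 * m_post**2 / v_post
            p1 = 1.0 / (1.0 + (1 - prior_p) / prior_p * np.exp(-log_bf))
            gamma[j] = rng.random() < p1
            beta[j] = rng.normal(m_post, np.sqrt(v_post)) if gamma[j] else 0.0
        incl += gamma
    return incl / n_iter   # marginal posterior inclusion probabilities

# toy data: only the first predictor matters
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(size=n)
incl = spike_slab_gibbs(X, y)
print(incl)   # first entry should be near 1, the rest near 0
```

Marginalizing the coefficient before updating the indicator is what makes the indicator chain mix well; sampling the coefficient afterwards restores the joint draw, as in van Dyk and Park (2008).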
Prediction: Given a model , , we estimate the loadings for as , the posterior mode of , where and are the means of the values sampled in the course of the MCMC algorithm. Let , be the vector of features of a new individual. The latent value of a new individual for a given model , , is estimated as , where , is the diagonal matrix with elements on the diagonal, and . Hence the predictive response is computed using a Bayesian model averaging approach (Hoeting et al., 1999) as follows where is the posterior mean of the intercept .
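For a single model, the prediction step amounts to computing the posterior mean of the latent components for a new subject from the estimated loadings and residual variances, and plugging it into the outcome equation. The sketch below uses assumed values for the loadings, residual variances, and outcome coefficients (the paper additionally averages such predictions over models via BMA):

```python
import numpy as np

rng = np.random.default_rng(3)
p, r = 30, 4
A = rng.normal(size=(p, r))          # stand-in for estimated loadings
psi = np.full(p, 0.5)                # stand-in residual variances
alpha, theta = 0.2, np.array([1.0, -1.0, 0.0, 0.0])

def predict(x_new):
    # E[u | x] for the factor model x = A u + e with u ~ N(0, I), e ~ N(0, Psi):
    # (I + A' Psi^{-1} A)^{-1} A' Psi^{-1} x, then plug into the outcome equation.
    Ait = A.T / psi                          # A' Psi^{-1}
    M = np.linalg.inv(np.eye(r) + Ait @ A)   # posterior covariance of u
    u_hat = M @ Ait @ x_new
    return alpha + theta @ u_hat

u_true = rng.normal(size=r)
x = A @ u_true + rng.normal(scale=np.sqrt(psi))
print(predict(x), alpha + theta @ u_true)    # prediction tracks the truth
```

With many informative features per component, the posterior mean of the latent vector concentrates around the truth, so the plug-in prediction is accurate; model averaging then weights such predictions by posterior model probability.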
Hyperparameter settings
For both simulations and the observed data analysis, we set the hyperparameters as follows. We set as the maximum number of active components. We note that a larger number can be set, as our approach is able to select the active components for each data type through the indicators . In Tables 8.4 and 8.5 of the web supplementary material, we present a sensitivity analysis for the parameter ; we obtain similar results with larger . We set
, the prior probability to select features for a specific component as we expect that a small number of features are active in every component (refer to Tables
8.2 and 8.3 in the web supplementary material for a sensitivity analysis of ). We set , the hyperparameters of (the prior probability to select a feature) and (the prior probability to select a group), which each follow a beta distribution . We set to have a vague gamma prior on . We set , the scale hyperparameter of . Last, we set , the hyperparameters of the group effects (when ), which follow a gamma distribution .

4 Simulation studies
We consider three main scenarios to assess the performance of the proposed methods. In each scenario, there are two data types and simulated according to equation (1), and a single continuous response . We simulate each scenario to have latent components. The scenarios differ in how many components are shared across data types. In the first scenario, the four latent components are shared across both data types, and the first two shared components affect the response. In the second scenario, none of the components is shared across the two data types: components 1 and 2 are unique to data type 1, components 3 and 4 are unique to data type 2, and components 1 and 3 are associated with the response. In the third scenario, two of the components (1 and 2) are shared, component 3 is unique to data type 1, and component 4 is unique to data type 2; the response is associated with components 1, 3, and 4. We partition the covariance matrix of each data type into signal and noise: signal variables are correlated and contribute to the response, while noise variables are uncorrelated and unimportant. In each scenario, we generate 20 Monte Carlo data sets for each data type.
4.1 Scenario One: Each component is shared across all data types
This scenario is assumed in standard CCA methods for high-dimensional data integration.
We simulate two data types , , and , according to equation (1) as follows. Without loss of generality, we let the first variables form the networks or groups in and . Within each data type there are main variables, each connected to variables. The resulting network has variables and edges, and and singletons in and respectively.
The network structure in each data type is captured by the covariance matrices
Here, is block diagonal with blocks of size , between-block correlation , and within each block there is a compound symmetric submatrix with correlation describing the correlation structure of the connected variables. The correlation between a main and a connecting variable is . We let the and singletons form one group that does not contribute to the outcome or to the association between and . and are prespecified true sparse factor loading matrices that describe the variables that are important and associated with the outcome. We consider different prespecifications of and to capture the relationship between networks, and the variables and networks that are associated with the outcome. Figures 4.1(a)-4.1(d) are image plots of the loadings, , for Settings 1, 2, 4, and 5.
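A simplified sketch of this covariance construction: compound symmetric blocks for the groups and uncorrelated singletons, assuming zero between-block correlation. The dimensions and the correlation value are placeholders, not the paper's exact settings:

```python
import numpy as np

def network_cov(n_groups=10, block_size=10, rho=0.7, n_singletons=400):
    """Block-diagonal covariance: each group is compound symmetric with
    within-block correlation rho; singleton (noise) variables are
    uncorrelated with everything."""
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    p = n_groups * block_size + n_singletons
    sigma = np.eye(p)
    for g in range(n_groups):
        i = g * block_size
        sigma[i:i + block_size, i:i + block_size] = block
    return sigma

sigma = network_cov()
rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=50)
```

A compound symmetric block of size 10 with rho = 0.7 has eigenvalues 1 + 9 rho and 1 - rho, both positive, so the full matrix is a valid covariance.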
Setting 1: All groups contribute to correlation between and . In this scenario, we let all groups in and contribute to the association between and . We generate the first 100 entries of and for each as independent and identically distributed (i.i.d.
) random samples from the uniform distribution on the set
, but for the 10 main variables, we multiply the effect size by 2. The remaining entries in for each are set to 0. Similarly for . The entries in are randomly generated from . The data matrices , , and are generated, respectively, as . Here, we set , so represents observations from four latent components. We set so that the response is only related to the first two shared components. The entries in , , and are all generated as random samples from .
Setting 2: First three groups in and contribute to correlation between . In this scenario, we consider the situation where only some networks contribute to the association between the two data types. We let the first groups in and , corresponding to the first 30 variables in and , contribute to the association between and . The entries in
and are generated following Setting 1.
Setting 3: Some variables within the groups contribute to correlation between . This scenario follows Setting 1, but we let only some variables within each of the 10 groups contribute to the association between and .
We generate the entries of and following Setting 1 but we randomly set at most 5 coefficients within each group to zero.
Setting 4: Some variables and some groups in and contribute to correlation between . This Setting follows Setting 2, but we let only some variables within the first three groups contribute to the association between and . We generate the entries of and following Setting 1, but we randomly set at most 5 coefficients within each group to zero.
Setting 5: Non-overlapping components; all 100 variables contribute to correlation between and . In Settings 1-4, we considered situations where the latent components overlapped with respect to the significant variables. In this Setting, we consider non-overlapping components. Specifically, data are generated according to Setting 1, but the first 25 signal variables are loaded on component 1, the next 25 variables on component 2, and so on; the last 25 signal variables are loaded on component 4. Similar to Setting 1, there are 10 groups or networks comprising the 100 signal variables. This Setting tests our method's ability to identify all 4 components, all 10 groups, and all 100 important variables.
4.1.1 Competing Methods
We compare the performance of our methods with association-only and joint association and prediction methods. For the association-only methods, we consider sparse CCA [SELPCCA] (Safo et al., 2018) and Fused sparse CCA [FusedCCA] (Safo et al., 2018). FusedCCA incorporates prior structural information in sparse CCA by allowing variables that are connected to be selected or neglected together. We run FusedCCA and SELPCCA using the Matlab code provided by the authors. For FusedCCA, we assume that the variables within each group are all connected, since the method takes as input the connections between variables. For the joint association and prediction methods, we consider the CCA regression (CCAReg) method of Luo et al. (2016), implemented in the R package CVR. Once the canonical variates are extracted, they are used to fit a predictive model for the response and to compute the test MSE.
4.1.2 Evaluation Criteria
To evaluate the effectiveness of the methods, we fit the models on the training data and then compare the methods on their feature selection and prediction performance on an independently generated test set of the same size. For feature selection, we report the number of predictors in the true active set that are not selected (false negatives), the number of noise predictors that are selected (false positives), and the F-score (the harmonic mean of precision and recall). For prediction, we report the mean squared error (MSE) on the test set. We use 20 Monte Carlo datasets and report averages and standard errors.
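These criteria can be computed as follows; the helper name and the index-set representation of "selected features" are illustrative choices, not the paper's code:

```python
import numpy as np

def selection_metrics(true_active, selected, y_true=None, y_pred=None):
    """False negatives, false positives, F-score, and (optionally) test MSE.
    Feature sets are given as collections of feature indices."""
    true_active, selected = set(true_active), set(selected)
    tp = len(true_active & selected)
    fn = len(true_active - selected)          # true features missed
    fp = len(selected - true_active)          # noise features selected
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(true_active) if true_active else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    mse = (np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
           if y_true is not None else None)
    return {"FN": fn, "FP": fp, "F": f_score, "MSE": mse}

print(selection_metrics({0, 1, 2, 3}, {0, 1, 4}))  # FN=2, FP=1, F=4/7
```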
4.1.3 Simulation results
In Table 4.2
, we report averages of the evaluation measures for Scenario One. We first compare our method with network information incorporated (BIPNet) and without (BIP). In Settings 1-4, both methods identify one component as important and related to the response. This is not surprising, since all 100 important variables overlap on each component in these settings, leading our method to identify only one component. In Setting 5, where important variables do not overlap on the four latent components, the proposed methods identify at least two components in 13 out of 20 simulated data sets. In general, compared to BIP, BIPNet has lower false negatives, lower false positive rates, competitive or higher harmonic means, and lower MSEs. These results demonstrate the benefit of incorporating group information in feature selection and prediction. We observe suboptimal performance for both BIPNet and BIP in Settings 4 and 5, more so in Setting 5, where important variables do not overlap on the latent components. We also assess the group selection performance of BIPNet by computing the AUC based on the marginal posterior probabilities of group membership
’s. Table 4.1 shows that most of the groups in Settings 1, 2, and 5 were identified correctly. However, the method had difficulty selecting groups in cases where only a small number of features contribute to the association between data types (Settings 3 and 4). We next compare the proposed methods to the other existing methods, for which we estimate both the first and the first four canonical variates. We report the results from the first canonical variates here, and the results for the first four canonical variates in the online supplementary material, Section 8. For feature selection, a variable is deemed selected if it has at least one nonzero coefficient in any of the four estimated components. We observe that for SELPCCA, CCAReg, and FusedCCA in general, the first canonical variates from each data type result in lower false negatives, lower false positives, higher harmonic means, and competitive MSEs for Settings 1-4 compared to results from the first four canonical variates (see web supplementary material, Section 8). In Setting 5, where signal variables do not overlap on the four true latent components, we expect the first four canonical variates to have better feature selection and prediction results, and that is what we observe in Table 4.2. Across Settings 1-4, BIP and BIPNet had lower false negatives, lower or comparable false positives, higher harmonic means, and competitive MSEs when compared to the two-step methods (association followed by prediction). The feature selection performance of the proposed methods was also better than that of CCAReg, a joint association and prediction method, although CCAReg had lower MSEs in general. BIPNet and BIP had higher false negatives in Setting 5 compared to Settings 1-4, but still lower than CCAReg and SELPCCA.
These simulation results suggest that incorporating group information in joint association and prediction methods results in better performance. Although the performance of the proposed methods is superior to that of existing methods in most Settings, our findings suggest that the proposed methods tend to be better at selecting most of the true signals in settings where the latent components overlap with regard to important variables. In non-overlapping settings, our methods tend to select only some of the true signals (resulting in higher false negatives). However, the proposed methods tend to have lower false positives, which suggests that we are better at not selecting noise variables. This feature is especially appealing for high-dimensional data problems, where noise variables tend to be more numerous.
4.2 Scenario Two: None of the components is shared across data types
We consider the scenario where there is no shared component across the two data types. This scenario is similar to standard factor analysis or principal component analysis applied to each data type independently, and accounts only for the individual variation explained by each data type. The entries in are generated from . The data matrices are generated, respectively, as . Here, , is a matrix of the first two latent components in . Similarly for . We consider two cases: overlapping and non-overlapping components. In the overlapping case, the true sparse loading matrix has 100 signal variables loaded on the first two components, with fifty variables on each component. This is also true for . In the non-overlapping case, the 100 signal variables in (similarly ) do not overlap; there are 50 variables in each component. We set so that the response is only related to the first and third components. The entries in , , and are all generated as random samples from
4.2.1 Results
Table 4.3 shows the averages of the evaluation measures for Scenario Two. As before, we estimate both the first and the first four canonical variates for the other methods, and report results using subscripts 1 and 4, respectively. From Table 4.3, compared to SELPCCA, an association-only method that does not utilize prior information, BIPNet and BIP have lower false negatives, lower false positives, higher harmonic means, and lower MSEs in the overlapping case. These findings generally hold true when compared to FusedCCA, an association-only method that considers variable-variable interactions in sparse CCA. The proposed methods also yield better results in both feature selection and prediction when compared to CCAReg, a joint association and prediction method. The results in the non-overlapping case in this Scenario are consistent with the results from Setting 5 in Scenario One. For the overlapping case, our methods were able to identify the first (i.e., ) and last two (i.e., ) components 17 and 18 times, respectively, out of 20 simulated data sets (i.e., 85% and 90%). For the non-overlapping case, the rates were 75%, 65%, 85%, and 65% for components 1, 2, 3, and 4, respectively.
4.3 Scenario Three: Some shared components and individual variations
We consider the scenario where some components are shared across the two data types, and both shared and individual components predict the outcome. As before, the entries of the latent score matrix are generated from the same distribution, and the data matrices are generated from the latent components together with their loading matrices.
Here, the first two latent components are shared across the two data types, the third latent component is unique to the first data type, and the fourth is unique to the second data type. We consider two cases: overlapping shared components and non-overlapping components. In the overlapping case, the true sparse loading matrix for the first data type has 50 signal variables (corresponding to the first 50 variables) loaded on each of the first two components, 50 signal variables on the third component, and no signal variables on the fourth component. The same holds for the second data type, except that the 50 signal variables load on the fourth component and none on the third. In the non-overlapping case, the 50 signal variables do not overlap: the first 25 variables are important in the first component, and variables 26 to 50 are important in the second component. The coefficient vector is set so that the response is related only to the first shared latent component and to the individual latent components unique to each data type. The remaining entries are all generated as random samples from the same distribution.
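This shared-plus-individual structure can be sketched as follows; as before, the dimensions, the standard normal draws, and the unit loading values are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 100, 200, 4  # illustrative sizes

U = rng.standard_normal((n, K))  # latent scores; i.i.d. N(0,1) assumed

# Components 0 and 1 are shared across the two data types;
# component 2 is unique to data type 1, component 3 to data type 2.
W1 = np.zeros((p, K))
W2 = np.zeros((p, K))
W1[:50, 0] = W1[:50, 1] = 1.0   # 50 signal variables on each shared component
W1[50:100, 2] = 1.0             # 50 signals on the individual component of type 1
W2[:50, 0] = W2[:50, 1] = 1.0
W2[50:100, 3] = 1.0             # 50 signals on the individual component of type 2

X1 = U @ W1.T + rng.standard_normal((n, p))
X2 = U @ W2.T + rng.standard_normal((n, p))

# Response driven by the first shared component and both individual components.
beta = np.array([1.0, 0.0, 1.0, 1.0])
y = U @ beta + rng.standard_normal(n)
```

Because column 3 of W1 (and column 2 of W2) is zero, each data type sees only the shared components plus its own individual component.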
4.3.1 Results
This scenario tests the proposed methods' ability to identify both shared and individual components, as well as the important variables loaded on these components. Table 4.4 shows the averages of the evaluation measures for Scenario Three. As before, we report the first and the first four canonical variates using subscripts 1 and 4, respectively. Similar to Setting 5 in Scenario One, the results from the first four canonical variates are generally better than those from the first canonical variate, so we compare our methods to these results. From Table 4.4, we observe that the proposed methods have lower false negatives, lower false positives, and higher harmonic means compared to the other methods. The prediction performances of the proposed methods are suboptimal. Consistent with the results from the other scenarios, the proposed methods have higher false negatives in the non-overlapping case, although they are still better than CCAReg and SELPCCA. Meanwhile, the false positives are consistent in both the overlapping and non-overlapping cases, which again indicates that the methods are able to ignore the noise variables while selecting some of the true signals. In the overlapping case, our methods detected components 1 and 2 in all 20 simulated data sets; components 3 and 4 were identified 15 and 10 times, respectively. For the non-overlapping case, components 1, 2, 3, and 4 were identified 75%, 25%, 25%, and 25% of the time, respectively. These findings, together with those from Scenarios One and Two, underscore the benefit of considering group information in joint association and prediction methods.
|  | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
| --- | --- | --- | --- | --- | --- |
| Data type 1 | 1 (0) | 0.86 (0.04) | 0.58 (0.03) | 0.64 (0.06) | 1 (0) |
| Data type 2 | 1 (0) | 0.81 (0.05) | 0.56 (0.03) | 0.62 (0.06) | 1 (0) |
Simulation results for Settings S1–S5 (Scenario One).

| Method | Setting | FNR1 | FNR2 | FPR1 | FPR2 | F11 | F12 | MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BIPnet | S1 | 0 (0) | 0 (0) | 0.02 (0.02) | 0 (0) | 100 (0.00) | 100 (0) | 2.09 (0.04) |
| BIP | S1 | 0 (0) | 0 (0) | 0.14 (0.06) | 0.16 (0.04) | 99.73 (0.12) | 99.68 (0.08) | 2.16 (0.04) |
| SELPCCA | S1 | 0 (0) | 0 (0) | 0.4 (0.09) | 0.5 (0.12) | 99.21 (0.17) | 99.02 (0.23) | 2.11 (0.04) |
| CCAReg | S1 | 90.25 (1.73) | 90.40 (1.87) | 1.81 (3.32) | 1.71 (3.35) | 15.34 (1.87) | 14.98 (2.01) | 2.01 (0.05) |
| FusedCCA | S1 | 0.00 (0.00) | 0.00 (0.00) | 7.88 (0.76) | 10.40 (0.78) | 86.69 (1.17) | 83.05 (1.08) | 2.14 (0.04) |
| BIPnet | S2 | 0 (0) | 0 (0) | 0.21 (0.15) | 0.22 (0.21) | 98.57 (0.98) | 98.67 (1.25) | 2.15 (0.05) |
| BIP | S2 | 0 (0) | 0 (0) | 0.61 (0.28) | 0.45 (0.19) | 96.14 (1.71) | 96.98 (1.3) | 2.2 (0.05) |
| SELPCCA | S2 | 22.00 (8.75) | 16.33 (7.51) | 0.02 (0.01) | 0.04 (0.02) | 80.17 (7.81) | 85.91 (6.37) | 2.13 (0.04) |
| CCAReg | S2 | 85.83 (1.87) | 86.00 (1.79) | 1.54 (0.63) | 1.27 (0.47) | 19.52 (1.77) | 19.76 (1.47) | 2.03 (0.04) |
| FusedCCA | S2 | 0.00 (0.00) | 0.00 (0.00) | 17.14 (0.33) | 17.13 (0.39) | 42.79 (0.47) | 42.83 (0.53) | 2.21 (0.04) |
| BIPnet | S3 | 0 (0) | 0 (0) | 0.89 (0.11) | 1.24 (0.12) | 96.88 (0.39) | 95.69 (0.41) | 2.08 (0.04) |
| BIP | S3 | 0 (0) | 0 (0) | 0.88 (0.14) | 0.73 (0.11) | 96.93 (0.47) | 97.43 (0.4) | 2.1 (0.05) |
| SELPCCA | S3 | 8.50 (5.86) | 8.42 (5.81) | 0.16 (0.06) | 0.14 (0.05) | 92.02 (5.07) | 92.20 (5.04) | 2.05 (0.04) |
| CCAReg | S3 | 83.00 (2.50) | 82.92 (2.60) | 2.42 (0.68) | 2.51 (0.74) | 23.21 (2.71) | 23.02 (2.87) | 1.98 (0.04) |
| FusedCCA | S3 | 0.00 (0.00) | 0.00 (0.00) | 12.70 (3.01) | 13.89 (3.23) | 71.51 (2.31) | 69.80 (2.43) | 2.18 (0.04) |
| BIPnet | S4 | 0 (0) | 0 (0) | 2.81 (0.28) | 3.07 (0.34) | 74.73 (1.9) | 73.34 (2.26) | 2.24 (0.05) |
| BIP | S4 | 0 (0) | 0 (0) | 1.82 (0.34) | 1.99 (0.36) | 83.12 (2.82) | 81.91 (2.92) | 2.27 (0.04) |
| SELPCCA | S4 | 51.32 (9.50) | 42.89 (9.42) | 0.10 (0.10) | 0.00 (0.0) | 54.36 (8.08) | 63.42 (8.15) | 2.16 (0.04) |
| CCAReg | S4 | 77.63 (1.87) | 78.68 (1.85) | 3.12 (0.68) | 2.71 (0.67) | 22.80 (1.15) | 23.49 (1.43) | 2.01 (0.05) |
| FusedCCA | S4 | 0.00 (0.00) | 0.00 (0.00) | 19.51 (2.22) | 19.64 (2.53) | 30.39 (1.03) | 30.55 (1.07) | 2.58 (0.06) |
| BIPnet | S5 | 61.05 (2.82) | 59.75 (3.12) | 0.05 (0.03) | 0 (0) | 54.91 (2.78) | 56.12 (3.03) | 2.35 (0.21) |
| BIP | S5 | 63 (2.65) | 60.35 (2.48) | 0.04 (0.03) | 0.06 (0.03) | 52.89 (2.9) | 55.79 (2.63) | 2.64 (0.22) |
| SELPCCA | S5 | 93.40 (1.55) | 93.30 (1.85) | 0.06 (0.06) | 0.00 (0.00) | 11.60 (2.52) | 11.57 (3.00) | 2.72 (0.13) |
| CCAReg | S5 | 88.85 (1.81) | 87.95 (1.89) | 1.44 (0.52) | 1.40 (0.50) | 19.83 (2.30) | 21.51 (2.44) | 2.26 (0.14) |
| FusedCCA | S5 | 0.00 (0.00) | 0.00 (0.00) | 3.47 (0.81) | 2.31 (0.55) | 93.89 (1.37) | 95.77 (0.97) | 2.72 (0.11) |
Table 4.3: Simulation results for Scenario Two (no shared components). Subscripts (1) and (4) denote results using the first and the first four canonical variates, respectively.

| Method | FNR1 | FNR2 | FPR1 | FPR2 | F11 | F12 | MSE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Overlap |  |  |  |  |  |  |  |
| BIPnet | 0 (0) | 0.05 (0.05) | 0.05 (0.03) | 0.04 (0.02) | 99.90 (0.07) | 99.90 (0.05) | 2.12 (0.04) |
| BIP | 0 (0) | 1.25 (0.84) | 0.21 (0.06) | 0.15 (0.05) | 99.58 (0.13) | 99.04 (0.44) | 2.16 (0.06) |
| CCAReg (1) | 71.80 (5.56) | 75.55 (5.04) | 8.81 (2.85) | 11.40 (3.32) | 29.28 (3.57) | 25.05 (3.06) | 2.19 (0.04) |
| CCAReg (4) | 74.85 (2.63) | 74.05 (2.65) | 10.20 (1.77) | 9.78 (1.68) | 28.98 (1.77) | 30.04 (1.85) | 2.32 (0.06) |
| SELPCCA (1) | 68.75 (7.23) | 93.45 (1.71) | 0.64 (0.42) | 1.69 (0.42) | 39.71 (7.21) | 13.81 (2.76) | 2.37 (0.05) |
| SELPCCA (4) | 62.70 (2.86) | 56.45 (3.42) | 4.71 (0.39) | 4.26 (0.47) | 46.65 (3.10) | 52.64 (3.83) | 2.23 (0.04) |
| FusedCCA (1) | 0.00 (0.00) | 0.00 (0.00) | 6.30 (3.21) | 10.19 (3.92) | 92.18 (3.21) | 87.30 (3.61) | 2.24 (0.07) |
| FusedCCA (4) | 0.00 (0.00) | 0.00 (0.00) | 33.06 (4.80) | 41.81 (4.36) | 64.40 (3.93) | 57.00 (2.89) | 2.18 (0.04) |
| No Overlap |  |  |  |  |  |  |  |
| BIPnet | 29.85 (6.02) | 43.85 (5.66) | 0.01 (0.01) | 0.02 (0.02) | 79.48 (4.41) | 68.76 (4.58) | 2.00 (0.09) |
| BIP | 34.2 (8.3) | 50.35 (8.58) | 0.11 (0.05) | 0.22 (0.07) | 72.38 (7.13) | 57.57 (7.6) | 2.16 (0.06) |
| CCAReg (1) | 78.65 (3.91) | 76.80 (4.29) | 6.96 (2.44) | 6.31 (2.40) | 25.41 (2.72) | 27.58 (3.01) | 2.27 (0.08) |
| CCAReg (4) | 69.20 (2.70) | 67.95 (2.79) | 8.25 (1.72) | 8.33 (1.66) | 36.79 (1.41) | 37.85 (1.42) | 1.77 (0.04) |
| SELPCCA (1) | 87.35 (1.48) | 87.90 (1.89) | 0.36 (0.16) | 0.31 (0.09) | 21.59 (2.31) | 20.46 (2.80) | 2.44 (0.07) |
| SELPCCA (4) | 59.35 (2.81) | 57.55 (1.69) | 4.51 (1.84) | 3.10 (1.02) | 51.08 (2.49) | 55.28 (2.32) | 2.02 (0.07) |
| FusedCCA (1) | 0.00 (0.00) | 0.00 (0.00) | 1.45 (0.65) | 5.53 (1.63) | 97.44 (1.08) | 91.35 (2.35) | 2.23 (0.07) |
| FusedCCA (4) | 0.00 (0.00) | 0.00 (0.00) | 25.26 (3.44) | 30.21 (4.70) | 69.11 (3.16) | 66.53 (3.88) | 1.84 (0.04) |
Table 4.4: Simulation results for Scenario Three (shared and individual components). Subscripts (1) and (4) denote results using the first and the first four canonical variates, respectively.

| Method | FNR1 | FNR2 | FPR1 | FPR2 | F11 | F12 | MSE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Overlap |  |  |  |  |  |  |  |
| BIPnet | 30.2 (4.7) | 23.25 (5.22) | 0.01 (0.01) | 0.05 (0.02) | 80.52 (3.18) | 84.78 (3.49) | 2.90 (0.14) |
| BIP | 31.55 (4.73) | 14.5 (5.02) | 0.12 (0.04) | 0.11 (0.05) | 79.34 (3.14) | 90.2 (3.41) | 3.26 (0.5) |
| CCAReg (1) | 82.30 (2.57) | 82.10 (2.95) | 4.50 (1.29) | 4.25 (1.35) | 24.23 (2.52) | 24.45 (2.69) | 3.11 (0.05) |
| CCAReg (4) | 72.60 (2.66) | 74.60 (2.10) | 9.38 (1.98) | 9.40 (2.13) | 32.13 (1.49) | 30.74 (1.24) | 2.38 (0.05) |
| SELPCCA (1) | 83.55 (4.42) | 93.05 (2.28) | 0.01 (0.01) | 0 (0) | 24.18 (5.61) | 11.82 (2.92) | 3.26 (0.05) |
| SELPCCA (4) | 61.65 (3.67) | 63.00 (2.07) | 5.05 (1.94) | 2.43 (0.34) | 47.73 (1.96) | 50.07 (2.56) | 2.52 (0.07) |
| FusedCCA (1) | 0.00 (0.00) | 0.00 (0.00) | 5.15 (0.55) | 7.00 (0.69) | 90.83 (0.89) | 87.96 (1.04) | 3.19 (0.06) |
| FusedCCA (4) | 0.00 (0.00) | 0.00 (0.00) | 32.28 (3.54) | 24.59 (3.62) | 64.00 (3.01) | 69.97 (3.22) | 2.37 (0.04) |
| No Overlap |  |  |  |  |  |  |  |
| BIPnet | 51.75 (4.76) | 43.85 (6.06) | 0.04 (0.02) | 0.05 (0.03) | 62.28 (4.52) | 67.9 (5.33) | 3.42 (0.48) |
| BIP | 56.7 (5.76) | 41.05 (5.67) | 0.1 (0.03) | 0.14 (0.05) | 56.14 (5.4) | 70.46 (5.07) | 3.26 (0.5) |
| CCAReg (1) | 88.60 (7.92) | 87.10 (2.24) | 2.67 (0.62) | 2.00 (0.52) | 17.38 (2.68) | 19.72 (2.99) | 3.00 (0.09) |
| CCAReg (4) | 76.80 (2.61) | 73.60 (2.92) | 7.90 (1.79) | 7.49 (1.93) | 28.88 (1.23) | 32.62 (1.37) | 2.06 (0.06) |
| SELPCCA (1) | 94.85 (1.64) | 92.15 (2.09) | 0.40 (0.40) | 0.01 (0.01) | 8.57 (2.22) | 13.30 (3.36) | 3.46 (0.08) |
| SELPCCA (4) | 67.90 (2.39) | 64.55 (1.73) | 2.60 (1.25) | 0.98 (0.23) | 44.53 (1.81) | 50.49 (1.98) | 2.35 (0.05) |
| FusedCCA (1) | 0.00 (0.00) | 0.00 (0.00) | 6.10 (1.66) | 5.11 (1.11) | 90.35 (2.21) | 91.37 (1.71) | 3.18 (0.08) |
| FusedCCA (4) | 0.00 (0.00) | 0.00 (0.00) | 32.13 (4.47) | 24.84 (5.27) | 64.33 (3.44) | 72.40 (4.36) | 2.50 (0.07) |
5 Application to atherosclerosis disease
We apply the proposed methods to analyze gene expression, genetic, and clinical data from the PHI study. The main goals of our analyses are to: 1) identify genetic variants, genes, and gene pathways that are potentially predictive of 10-year atherosclerotic cardiovascular disease (ASCVD) risk, and 2) illustrate the use of the shared components, or scores, in discriminating subjects at high vs low risk of developing ASCVD in 10 years.
There were 340 patients with gene expression and SNP data, as well as the clinical covariates needed to calculate their ASCVD score. The following clinical covariates were included as a third data set: age, gender, BMI, systolic blood pressure, low-density lipoprotein (LDL), and triglycerides. The gene expression and SNP data were each standardized so that every feature has mean zero and standard deviation one. Details of the data preprocessing are provided in web supplementary material Section 10.1. To obtain the group structure of the genes, we performed a network analysis using Ingenuity Pathway Analysis (IPA), a software program that analyzes gene expression patterns using a built-in database based on the scientific literature (IPA Ingenuity Web Site, www.ingenuity.com). By mapping our set of genes to the IPA gene set, we identified genes within gene networks. To obtain the group structure of the SNPs, we identified genes near the SNPs on the genome using the R package biomaRt; the genes found represent the groups for the SNP data. The lists of groups for both data types are shown in the web supplementary material, Section 9.
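The SNP-grouping step amounts to an interval lookup: each SNP is assigned to the group of any gene whose (padded) genomic span contains it. The sketch below illustrates this in Python; the gene coordinates and the 10 kb window are hypothetical values made up for illustration (the paper obtains the real annotation via biomaRt, not from a hand-written table):

```python
# Hypothetical gene annotation: (chrom, start, end, gene).
# These coordinates are illustrative only, not real Ensembl positions.
genes = [
    ("chr5", 131_600_000, 131_680_000, "SLC22A4"),
    ("chr5", 131_700_000, 131_790_000, "SLC22A5"),
    ("chr16", 67_900_000, 67_980_000, "PSMB10"),
]

def group_snps(snps, genes, window=10_000):
    """Assign each SNP (chrom, pos, rsid) to every gene whose body,
    padded by `window` bp on each side, contains it; one group per gene."""
    groups = {}
    for chrom, pos, rsid in snps:
        for g_chrom, start, end, gene in genes:
            if chrom == g_chrom and start - window <= pos <= end + window:
                groups.setdefault(gene, []).append(rsid)
    return groups
```

A SNP falling near no annotated gene is simply left ungrouped, and a SNP in an overlapping region can belong to more than one group.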
We divided each data type into training and testing sets, ran the analyses on the training data to estimate the latent scores (shared components) and the loading matrices for each data type, and predicted the outcome on the testing data. For the other methods, the training data were used to identify the optimal tuning parameters and to estimate the loading matrices. We then computed the training and test scores using the combined scores from both the gene expression and SNP data. We fit a linear regression model using the scores from the training data, predicted the test outcome using the test scores, and computed the test MSE.
To reduce the variability in our findings due to the random split of the data, we obtained twenty random splits, repeated the above process, and computed the average test MSE and the variables selected. We further assessed the stability of the variables selected by considering variables that were selected at least 12 times (60%) out of the twenty random splits.
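The repeated-split evaluation described above can be sketched as follows. This is a generic numpy implementation (ordinary least squares on low-dimensional scores, averaged test MSE over random splits, and a stability filter at 12 of 20 splits); the split proportion is an assumption, as the extracted text omits the training/testing sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def split_fit_mse(scores, y, n_splits=20, train_frac=0.7):
    """Repeat random train/test splits: fit OLS of y on the (low-dimensional)
    scores using the training part, and return the average test MSE."""
    n = len(y)
    mses = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        n_tr = int(train_frac * n)
        tr, te = idx[:n_tr], idx[n_tr:]
        Xtr = np.column_stack([np.ones(len(tr)), scores[tr]])
        Xte = np.column_stack([np.ones(len(te)), scores[te]])
        coef, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
        mses.append(np.mean((y[te] - Xte @ coef) ** 2))
    return float(np.mean(mses))

# Stability filter: keep variables selected in at least 12 of the 20 splits.
selection_counts = np.array([20, 15, 12, 11, 3, 0])  # hypothetical counts
stable = np.flatnonzero(selection_counts >= 12)
```

Averaging over splits reduces the dependence of both the MSE and the selected-variable lists on any single random partition of the data.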
Prediction error and genes and genetic variants selected: To illustrate the benefit of including covariates, we also considered the following additional models: (i) our BIPnet model using both molecular data and clinical data (age, gender, body mass index, systolic blood pressure, low-density lipoprotein (LDL), and triglycerides) [BIPnet + Cov]; (ii) our BIP model using both clinical and molecular data [BIP + Cov]; and (iii) sparse PCA on the stacked omics and clinical data [SPCA + Cov]. For the sparse PCA, we used the method of Witten et al. (2009) with ℓ1-norm regularization. We were not able to incorporate clinical data in the integrative analysis methods under comparison since they are applicable to only two data sets. Figure 5.1 shows the marginal posterior probabilities (MPP) of the 4 components for each data type (SNPs, mRNA, and response) for one random split of the data. It shows that for each of the 4 methods, only one component is active (i.e., has high MPP) for gene expression, and two components are active for the SNP data. None of the components is active for the response variable for the methods without clinical data (BIPnet and BIP). However, for the methods that incorporate covariates (BIPnet + Cov, BIP + Cov), we can clearly identify at least one active component for the response variable. This result shows the importance of incorporating clinical data in our approach.
Table 5.1 gives the average test mean squared error and the numbers of genes and genetic variants identified by the methods over twenty random splits of the data. For FusedCCA, we assumed that all genes within a group are connected (similarly for SNPs). We computed the first and the first four canonical variates for FusedCCA, SELPCCA, and CCAReg using the training data set, and predicted the test ASCVD score using the combined scores from both the gene expression and SNP data. BIPnet + Cov and BIP + Cov have better prediction performance (lower MSEs) compared to BIPnet, BIP, and the other methods. We also observe that BIPnet and BIPnet + Cov tend to be more stable than BIP and BIP + Cov (125 genes and 55 SNPs vs. 0 genes and 54 SNPs selected at least 12 times). Together, these findings demonstrate the merit of combining molecular data, prior biological knowledge, and clinical covariates in integrative analysis and prediction methods.
We further assessed the biological implications of the groups of genes and SNPs identified by the proposed methods. In total, 125 genes and 55 SNPs were selected (MPP > 0.5) at least 12 times out of the 20 random splits for both BIPnet and BIPnet + Cov (see Figures 10.1 and 10.2 in the supplementary material for a random split). Networks 1 and 10 (refer to web supplementary material Table 9.2 for group descriptions) were identified 18 times (90%) out of the 20 random splits of the data. The cell-to-cell signaling and interaction that characterizes network 10 has led to novel therapeutic strategies in cardiovascular diseases (Shaw et al., 2012). The genes PSMB10 and UQCRQ and the SNPs rs1050152 and rs2631367 were the top two selected genes (highest averaged MPPs) and SNPs, respectively, across the 20 random splits. The gene PSMB10, a member of the ubiquitin-proteasome system, encodes the proteasome subunit beta type-10 protein in humans. PSMB10 is reported to play an essential role in maintaining cardiac protein homeostasis (Li et al., 2018), with a compromised proteasome likely contributing to the pathogenesis of cardiovascular diseases (Wang and Hill, 2015). The SNPs rs1050152 and rs2631367, found in the genes SLC22A4 and SLC22A5, belong to the gene family SLC22A, which has been implicated in cardiovascular diseases. Other genes and SNPs with high average MPP are reported in the web supplementary material in Tables 9.1 and 9.3. These could be explored for their potential roles in cardiovascular (including atherosclerosis) disease risk.
Discriminating between high- vs low-risk ASCVD using the shared components:
We illustrate the use of the shared components, or scores, in discriminating subjects at high vs low risk of developing ASCVD in 10 years. For this purpose, we dichotomized the response into two categories: low-risk and high-risk patients. An ASCVD score less than or equal to the median ASCVD score was considered low-risk, and a score above the median was considered high-risk. We computed the AUC for response classification using a simple logistic regression model with the most significant component associated with the response (i.e., the component with the highest MPP for our methods) as covariate. For the other methods, the AUC was calculated using the first canonical correlation variates from the combined scores. Figure 5.2 shows that the components selected by our methods applied to the omics and clinical data discriminate the two risk groups clearly, compared to our methods applied to the omics data only. The estimated AUCs for the omics plus clinical covariates are much higher than those obtained from omics-only applications (e.g., AUC is 1.00 for the training set and 0.94 for the test set using BIPnet + Cov). When compared to the two-step methods CCAReg and SELPCCA, the estimated AUCs from our methods are considerably higher. These findings demonstrate the importance of one-step methods like the one we propose here, and the merit of incorporating both clinical and molecular data in integrative analysis and prediction models.

Table 5.1: Average test MSE and numbers of genes and SNPs selected over twenty random splits of the data. Subscripts (1) and (4) denote results using the first and the first four canonical variates, respectively.

| Method | MSE | Average # Genes | Average # SNPs | Genes selected ≥ 12 times | SNPs selected ≥ 12 times |
| --- | --- | --- | --- | --- | --- |
| BIPnet | 1.362 (0.033) | 128.35 | 113.65 | 125 | 55 |
| BIPnet + Cov | 0.758 (0.125) | 122 | 95.8 | 125 | 55 |
| BIP | 1.363 (0.032) | 68.7 | 88.8 | 0 | 54 |
| BIP + Cov | 0.766 (0.153) | 54.7 | 82.5 | 0 | 54 |
| SPCA + Cov | 1.367 (0.035) | 0 | 5.55 | 0 | 0 |
| SPCA + Cov | 1.382 (0.036) | 23.95 | 132.60 | 0 | 113 |
| SELPCCA (1) | 1.373 (0.035) | 52.35 | 30.35 | 1 | 23 |
| SELPCCA (4) | 1.377 (0.036) | 310.65 | 170.60 | 174 | 102 |
| CCAReg (1) | 1.371 (0.008) | 48.85 | 43.55 | 0 | 0 |
| CCAReg (4) | 1.382 (0.034) | 11.30 | 12.85 | 0 | 0 |
| FusedCCA (1) | 1.370 (0.032) | 560.2 | 402.40 | 561 | 398 |
| FusedCCA (4) | 1.379 (0.034) | 561 | 418.05 | 561 | 421 |
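The median dichotomization and AUC computation described above can be sketched as follows. This generic numpy implementation uses the rank-sum (Mann-Whitney) form of the AUC; since a univariate logistic regression produces a monotone transform of its single covariate, the AUC of the fitted probabilities equals the AUC of the raw component score:

```python
import numpy as np

def dichotomize(ascvd):
    """High-risk (True) if the ASCVD score exceeds the median, as in the text."""
    ascvd = np.asarray(ascvd, float)
    return ascvd > np.median(ascvd)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic:
    the fraction of (positive, negative) pairs ranked correctly,
    with ties counting one half."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

An AUC of 1.0 means the component score separates the two risk groups perfectly, while 0.5 corresponds to no discrimination.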
6 Conclusion
We have presented a Bayesian hierarchical modeling framework that integrates, in a unified procedure, multi-omics data, a response variable, and clinical covariates. We also extended the methods to incorporate grouping information of features through prior distributions. The numerical experiments described in this manuscript show that our approach outperforms competing methods, mostly in terms of variable selection performance. The data analysis demonstrated the merit of integrating omics data, clinical covariates, and prior biological knowledge while simultaneously predicting a clinical outcome. The identification of important genes or SNPs, and of their corresponding important groups, eases the biological interpretation of our findings.
Several extensions of our model are worth investigating. First, using latent variable formulations, we can naturally extend our approach to other types of response variables, such as survival times, binary or multinomial responses, or mixed response outcomes (e.g., both continuous and binary). Second, the proposed method is only applicable to complete data and does not allow for missing values. A future project could extend the current methods to the scenario where data are missing, using Bayesian imputation methods or the approach proposed by Chekouo et al. (2017) for missing blocks of data.

Acknowledgements
We are grateful to the Emory Predictive Health Institute for providing us with the gene expression, SNP, and clinical data. Sandra Safo is partly supported by NIH grant 1KL2TR002492-02, and Thierry Chekouo is partially supported by NSERC Discovery Grant RGPIN-2019-04810. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NSERC.
Supplemental Material
In the online Supplemental Materials, we provide details on sampling the model parameters and present additional results from the simulated data, in particular sensitivity analysis results. We also provide a detailed description of the groups for each of the data types (SNPs and gene expression). We have created an R package called BIPnet (GNU-zipped tar file) implementing the methods. Its R and C source code, along with the PDF user manual, are submitted with this manuscript. The package also contains the functions used to generate the simulated data. The data used for the analyses may be requested from the Emory Predictive Health Institute.
Supplementary materials for “Bayesian Integrative Analysis and Prediction with Application to Atherosclerosis Cardiovascular Disease”
7 Posterior inference
We present below the conditional distributions used to update the parameters.
7.1 Sampling and
Equation (1) in the main paper can also be written as where is the column vector of with , and is a submatrix of with columns ’s verifying . After integrating , we have
where . In what follows, we remove the superscript to simplify the formulas. We then have
(7)  
where and