Modern technology allows us to overcome the challenges of assembling a multi-view dataset with a powerful combination of flexibility and low cost. Human complex diseases are typically expressed by the aberrant interplay of multiple biological sources and environmental factors. While single biomedical data view such as genome (whole genome sequencing), transcriptome (RNA sequencing), epigenome (e.g., DNA methylation), proteome, metabolome, lipidome, and medical images offers an essential framework for detecting biological variants contributing to complex diseases, an in-depth understanding of biological mechanisms by examining only single data may not suffice. Most biomedical data analysis methods are applicable to single or dual-views of datasets, without making full use of multi-view datasets. In addition, multi-view biomedical datasets are often complementary to each other and correspond to different feature spaces [11, 30, 23, 34, 17]. Thus, their comprehensive and systematic analysis has the potential to uncover unknown biological mechanisms of complex diseases (Figure 1). With the advent of recent technologies that make assembling of multi-view biomedical datasets (e.g., multi-omics and imaging scans) of a complex diseases possible, development of efficient analytic frameworks for such datasets is becoming an active area of scientific inquiry [29, 35, 20, 53, 41]
. For an effective diagnosis of a complex disease, different types of medical imaging techniques, including computer tomography (CT), ultrasound, functional magnetic resonance imaging (fMRI), structural MRI (sMRI), positron emission tomography (PET) scans, diffusion tensor imaging (DTI), and X-Ray are used[46, 3, 9, 16, 12].
The high demand for characterizing complex diseases in biomedical science and the available technologies justify the need for more effective biomedical data integration approaches. Linear data integration approaches including supervised, unsupervised, semi-supervised learning, multi-omics factor analysis, and highlight matrix factorization methods are used to gain a deeper, more holistic understanding of the biological mechanism for complex diseases[39, 30, 6, 36, 18]. These are widely used and validated approaches but they perform poorly when data structure is non-linear and data comes from a multi-modal distribution . As a consequence, non-linear integrated approaches (e.g., kernel based machine) are an important feature in comprehensive analysis of multi-view biomedical datasets [31, 8, 26]. The positive definite kernel based machine approach can overcome the non-linearity problem as well as that of dimensionality of multi-view biomedical datasets [52, 57, 1, 2, 42]. This approach can be useful in effective integration of multi-view biomedical data.
In this study, we introduce a generalized kernel machine approach to identify higher-order composite effects (GKMAHCE) in multi-view datasets. While several kernel machine approaches have been employed to identify marginal, interaction effects of datasets with continuous data, the underlying challenge remains of how to make a generalized kernel machine based model to estimate marginal, interaction, and composite effects of multi-view dataset with categorical data, which is often the case in complex diseases[55, 7, 56, 21, 10]. This proposed approach considers multi-view data as predictor variables to allow more thorough and comprehensive modelling of complex traits.
Here, we compared the performance of our proposed method with that of existing methods using synthesized datasets and real multi-view adolescence brain development and osteoporosis datasets, with brain imaging and five- omics data. To realize adolescence brain development, sex differences in human brain are essential to understand their anatomical foundations in the brain with three view datasets: genomic, non-genomic and fMRI datasets . Osteoporosis is a bone disorder, which results from a loss of bone mineral density (BMD) . For more information of these data, we refer the reader to Section . Within each dataset, this study explores novel genome, epigenome, transcriptome, and chemical mechanisms, as well as robustly and efficiently identifies corresponding factors. To validate the results, we performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, the gene ontology analysis of biological process category and gene-gene interaction networks analysis. We also draw causal connections, gene- gene associations, and gene-chemical associations for both of the datasets. This paper is highly novel in the following aspects:
To the best of our knowledge, we address a generalized kernel machine approach to identify higher-order composite effects.
For statistical testing of the marginal, the interaction, and the composite effects, we derive the test statistics.
To show that the proposed approach is highly efficient, we experiment with synthesized datasets and three-view of brain development in adolescence and five-view of osteoporosis datasets. As far as we know, this is the first five omics integration study.
Finally, we perform a comprehensive pathway and network-based analysis for the biological validation of the selected genes.
The rest of this paper is organized as follows. In Section 2, we discuss some contemporary related work. In Section , we explain the proposed approach, generalized kernel machine approach to identify higher-order composite effects in multi-view datasets and test statistics for statistical testing for higher order composite effects.. In Section , we describe the experiments conducted on both synthesized datasets and two real datasets: imaging and genetics for adolescence brain development and multi-omics datasets of osteoporosis studies. We conclude the paper with a discussion on major findings and future research in Section . Detailed derivation of the proposed method along with Satterthwaite approximation to the score test and the applications to real dataset can be found in the supplementary materials.
2 Related work
Kernel based multi-view data integration approaches allow the joint analysis of multiple data types to provide a global view of the biological functions and offer insights into the nature of the interactions between the different datasets. These methods offer useful ways to learn how an extensive collection of genetic variants are associated with complex phenotypes and can be used to explore the relationship between genetic markers and a disease state [10, 37, 24].
The marginal, interaction, and composite effect identifications are becoming a common challenge to multidimensional imaging and multi-omics data analysis. Preliminary works in kernel machine methods have been boldly pursued by , where a single modal dataset was used to test for a genetic pathway effect.  also proposed a kernel machine-based method for gene-gene interactions. They treated each gene as a testing unit for gene-gene interactions. A kernel machine method was then proposed for detecting multiple factor interactions where a smoothing spline-ANOVA decomposition method was adopted . However, these approaches only use single or pairwise datasets. A sequence kernel association tests (SKAT) for genome wide association studies is introduced in . While SKAAT is widely used in genomics studies, it is limited to only a single omics. In addition,  have proposed a kernel machine method for detecting the effects of interactions between multidimensional variable sets. This is an extended model of  to jointly model genetic and non-genetic features and their interactions. A microbiome regression based kernel association test (MiRKAT) has been proposed by , which directly regresses the outcome on a single omics dataset, microbiome profiles, through the semi-parametric kernel machine regression framework.
Recently,  has proposed a kernel machine method for detecting higher-order interactions in three views datasets which was applied to schizophrenia with a continuous trait . Although a generalized version of such methods for more than three datasets as well as both categorical and continuous traits remains elusive, such a method for multi-view biomedical data analysis is necessary to discover new information about biological systems and complex diseases.
Suppose we have independent identical distributed (IID) subjects with covariates and m-view datasets, . Assuming that follows a distribution in the exponential family with density
where and are the location and scale parameters, and are known functions, and
is know weight, respectively. The mean and variance ofsatisfy and . In the following generalized semiparametric model, we associate the output with covariates including intercept and -view datasets:
where is a known monotone link function, is a vector of covariates including the intercept for the th subject, is a vector of fixed effects, is an unknown function on the product domain, with .
|Error family||Link||Inverse of link||Use|
|Gaussian||Identity, g(y)=y||Normally distributed set of data,|
|Poisson||Log, g(y)=log(y)||Exponential, exp(y)||equidispersed count set of data|
|Binomial||Logit,||Binary set of data, 0/1|
|Gamma||Inverse gamma||Positive continuous set of data,|
According to the ANOVA decomposition, the function, can be extended as:
where, ’s are the main effects of the respective dataset (), ’s are pairwise interactions effects, ’s are the interactions effects of three datasets and so on. The functional space, RKHS, decomposes as:
equipped with an inner product, and a norm
This generalizes kernel regression by allowing the model to be related to the response variable via a link function including kernel regression, logistics kernel regression, and Poisson kernel regression. For the binary data, the link functionprovides the logistic kernel regression; for the count data, provides the Poisson kernel regression; for Gaussian data, gives classical kernel regression. The main aspect of this paper is applying this to five views: genome, epigenome, transcriptome, metabolome, and lipidome along with Low BMD and High BMD information of the subject. To do this, assume that we have IID subjects under investigation; is a binary (LBMD or HBMD) phenotype for the -th subject. We associate the clinical covariates (e.g., age, weight, height) with five views including genome, epigenome, transcriptome, metabolome, and lipidome.
Let denoted the covariates, where is a measure of the -th subject.
Let , , , , and , are a gene with SNP markers, a gene with methylation profiles, a gene with RNA-Seq, a chemical taxonomy class with metabolite profiles, and a chemical class with lipid profiles of the -th subject, respectively. Under this setting, Eq. (2), Eq. (3) and Eq. (4) becomes:
3.1 Model estimation
We can estimate the function
by minimizing the penalized loss function of Eq. (5) as:
where L is a loss function, is a roughness penalty with tuning parameter
. By considering the negative log-likelihood of the binomial distribution,
the problem becomes as a kernel logistic regression problem:
It is known that the complete function space of Eq. (3), , has the orthogonal decomposition. Hence the function can be decomposed accordingly (see the supplementary material for details).
It is convenient and mathematically friendly to minimize numerically. Let
be the probability of theobservation and
Following many in the literature, we have a liner mixed effects model for such that
where is a coefficient vector of fixed effects, , , , , , and are independent random effects with normal distributions. The relationship of Eq. (6) and Eq. (8) confirm that all of the effects obtained by optimizing the loss function in Eq. (6), are the equal to the best linear unbiased predictors (BLUPs) of the linear mixed effects model in Eq. (8). This relationship makes a accessibility for testing the variance component instead of testing the nonparametric function in hull hypothesis, which we can estimate using the restricted maximum likelihood (ReML) approach (see the supplementary material for details).
3.2 Statistical testing
In this section, we address the test statistic of the overall effect, marginal effects, different interaction effects, and different composite effects.
3.2.1 Overall hypothesis testing
According to our model, the testing overall effect is
is equivalent to test the variance components in Eq.(8),
It is known that kernel matrices are not block-diagonal and the parameter in variance component analysis are placed on the edge of the parameter space when the null hypothesis is true. While the asymptotic distribution of a likelihood ratio test (LRT) statistic is neither a chi-square distribution nor a mixture chi-square distribution under the null hypothesis, we can use a score test statistic on the restricted likelihood[3, 28]. We derive score test statistic for a generalized kernel machine approach in multi-view datasets, Eq. (8). Assuming that , where . We can write the restricted log-likelihood function of Eq. (8) as follows
Using the partial derivative of Eq. (9) with respect to each variance component. We derive the estimate of the variance components and the score test statistic which is defined as
where and is the maximum likelihood estimator (MLE) of the logistic regression coefficient under the null model , is the variance of . is quadratic function of the variable , which follows a weighted mixture of the chi-square distribution under the null hypothesis. By the Satterthwaite method, we can approximate the distribution of to a scaled chi-square distribution, (). To estimate the the scale parameter
and the degrees of freedom
we can we can use the method of moments on the mean () and variance () of the test statistic and we obtain and . Finally, to get the value of an experimental score statistic we use the scaled chi-square distribution .
3.2.2 Testing marginal effects
3.2.3 Testing interaction effect
By considering no marginal effects, i.e., , we can also test the interaction effects. We know that the testing interaction effect
is equivalent to test the variance components in Eq.(8),
The test statistic is
Also by considering no marginal effects, i.e., , and all second order interactions are zero we can also test the 3rd order interaction effects.
The 3rd order test statistic is
and so on.
3.2.4 Testing composite effects
If lower order effects are statistically significant, we want to test higher order effects is called composite hypothesis testing. To test the 3rd order interaction effect, we mansion that testing the null hypothesis is equivalent to testing the variance component: . Let , and all , and are model parameters under the null model.We formulate a test statistic
where , and is the projection matrix under the null hypothesis.
Similar to the overall effect test, we can use the Satterthwaite method to approximated the distribution of higher order intersection test statistic by a scaled chi-square distribution with scaled and degree of freedom i.e., . The scaled parameter and degree of freedom are estimated by MOM, and , receptively. In practices, the unknown model parameters are estimated by their respective ReML estimates under the null hypothesis. Lastly, the value of an observed higher order interaction effect test score statistic is obtained using the scaled chi-square distribution . and so on.
3.2.5 Kernel machine based model selection
The result of kernel-based machine learning approaches is highly influenced by the kernel and its parameters. The model selection with a suitable kernel is essential in machine learning. As a result, machine learning approaches suffer from weak selection of a suitable model. Supposeis a positive definite kernel. A linear positive definite kernel, , where , on can define the similarity measure in the Euclidean space. This liner kernel suffers from more complexity in the function space for high-dimensional datasets. Instead of linear kernel, we can use a polynomial kernel to make it possible to use higher order correlations among data points. A polynomial kernel, has two free parameters: c is a tradeoff between higher order and lower order in the polynomial and d is the degree of the polynomial. The dependency measure depends on the finite bounded degree. Another limitation of linear and polynomial kernel is boundedness, i.e., both kernels are unbounded.
From a finite dimensional space the Gaussian kernel can map the input space (data space) into a infinite dimensional space  and is defined as:
It has a free parameter to select but consist of numerous theoretical properties including boundedness, consistence, universality, robustness, and so on . To fix the free parameter, we can use the median of the pairwise distance[15, 44].
For a multi-omics study, to captures the pairwise similarity across a number of SNPs in each gene we can use a identity-by-state (IBS) kernel (nonparametric function of the genotypes) :
where is the number of SNP markers of the corresponding gene. It is assumption free on the type of genetic interactions. Hence, it can capture any genetic effects on the phenotype. In this paper, we used the IBS for the genetic dataset and for all other datasets, respectively.
We conducted experiments on both the Monte Carlo simulation and two real multi-view datasets: an imaging and genome datasets from the adolescence brain development and five omics datasets from osteoporosis studies, respectively. The goal is to select a feature based on significant composite effects among the datasets. We considered the Gaussian kernel (the median of the pairwise distance as the bandwidth) For all datasets but the genetic dataset. For the genetic dataset, we considered the IBS kernel [15, 3]. The optimization of the proposed approach is based on Fisher’s scoring algorithm (the ReML algorithm). To that end, we follow the parameters setting as in  for the ReML algorithm applicable. To overcome the potential of a local minima, we considered a set of initial points in (0, 1) and picked the point which maximized the ReML algorithm.
We compare our simulation and real data results with three other methods: principal component SKAT, partial principal component regression (pPCAR), and full principal component regression (fPCAR)) as logistic regression setting. The SKAT approach is a flexible and computational in GWAS [51, 21]
. To identify an interaction effect of two genes, a principal component analysis based approach has been proposed by.  has been considered each method as a simple regression setting. Here, we switched each method as a logistic regression setting.
4.1 Simulation studies
To evaluate the performance of the proposed method, we do simulation studies. To that end, we use the following model to generate synthesized qualitative phenotype:
where is a vector of covariates (height and weight) with an intercept of -th subject () and is a coefficient vector, the random error,
, follows the Gaussian distribution with zero mean and unit variance and
is the standard deviation of the error and is fixed to. Here ’s represent five different datasets. Each function ’s is defined as in different non-linear form.
In this setting, we simulated data under different values of parameters to evaluate the performance of the test. For example, means all effects have vanished and we can examine the false positive rate for the score test under the overall effect. Similarly, we take into account the main effects (2nd order interaction effects) but no higher order interaction effects to evaluate the power of the score test. To get consist results, we performed simulations for each setting.
The simulation results of different methods can be found in Table 2. In this table, we report the power of that higher-order composite score test in different parameter settings. Overall, we made two observations. First, we noticed that the false positive rate of the test for higher-order composite effects score test was controlled by fixing the nominal value threshold to . All four methods have the same observations for the false positive rate. Second, when considering the power analysis (), we observed that the proposed method performs better than other methods and its power exceeds (Table 2). On the other hand we observed that the state-of-the-art methods (pPCAR, fPCAR and SKAT) can significantly overstate the false positive rates and result in significant loss of statistical power.
We also plot several visualization of the ROC with all interrelated parameters () values are random but the linear parameter is fixed to . We allocate each number with probability
. A random number is uniformly distributed either in a range or at. In Figure 2, we look into the results from the receiver operating characteristic (ROC) with three sample sizes, . The sensitivity is plotted against (1- specificity) for each -value in the range of with a step size . The power gain of the proposed method relative to the alternative ones is evident in all situations.
|Parameters||Simulation - I|
|(, , , , )|
|(0.1, 0, 0, 0, 0)|
|(0, 0, 0, 0, 0.1)|
|(0, 0, 0, 0, 0.5)|
|(0, 0, 0, 0, 1)|
|(1, 1, 1, 1, 0.1)|
|(1, 1, 1, 1, 0.5)|
|(1, 1, 1, 1, 1)|
4.2 Real data analysis
We apply the proposed method to two real multi-view datasets including an fMRI imaging dataset and five omics datasets from the adolescence brain development and osteoporosis studies, respectively.
4.2.1 Application to adolescence brain development
The Philadelphia Neurodevelopmental Cohort (PNC) is a research initiative that focuses on characterizing brain and behavior interaction with genetics. Sex differences in human brain are essential to understand their anatomical foundations in the brain [40, 49]. Our goal here is to select features using the test composite effect on the sex differences in cognition in youth. To that end we consider three view datasets: genomic, non-genomic and fMRI imaging datasets. Table 3 presents the configuration of the three datasets along with phenotype. The gene consists of genetic features (SNPs); the non-genomic data consists of clinical measurements; and the ROI consists of imaging features (voxels). We consider each gene, all non-genomic data and each ROI as a unique testing unit. To apply the proposed method, first we selected most significant genes out of genes using the SKAT approach and ROIs out of (since ROIs contain missing values). In total, we get ( triplets to test. The overall test gives us significant triplets () out of . In Figure 3, we visualize the plot of for triplets. The vertical solid, doted and double doted lines correspond to the p-values of , and , respectively. At p-values of and we detected and significant genes, as well as and significant ROIs, respectively.
|421 male and 377 female|
Table 4 presents the ReML estimates of all parameters, , , , , , , , and the -values for both the proposed and SKAT methods for each of the triplets. By the proposed method, these triplets were identified to have significant interactions at a level of . At this -value, we observe unique genes (RPA3OS, ESR1, WWOX, RGS7, PALLD) and ROIs (CFL.L: Left cerebrum frontal lobe; CSL.R: Right cerebrum sub lobar; CLL.R: Right cerebrum limbic lobe; CSS.R: Right cerebrum cub-lobar; CFL.R: Right cerebrum frontal lobe), that may have significant effects of sex on adolescence brain development [5, 33].
To test whether genes interact with each other, we constructed a gene-gene interaction network analysis using STRING which imports gene/protein association knowledge from databases on both physical interactions and curated biological pathways . In STRING (e.g. STRING 11.0), the simple interaction unit is the functional relationship between two genes. This relationship can play a role to a common biological purpose. Figure 4 illustrates the gene-gene network based on the protein interactions with the selected genes ( including 7 individual selected genes). In this figure, the color saturation of the edges represents the confidence score of a functional association. Further network analysis shows that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, -values are , , , , for , respectively. This analysis confirmed that the selected genes had significantly more interactions than expected and indicates that these genes may function in a concerted manner.
For investigating classification accuracy of the proposed method, the genome, non-genome, fMRI imaging datasets and only selected features via the proposed method are used to classify the sex-related brain differences followed by the three classifiers: the K-nearest neighbors algorithm (KNN) and the support vector machine (SVM) (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant effects on the gender:genes and ROIs only (from Table 4). The classifier is used to distinguish the adolescence brain development between the female and the male. Table 5
presents the train classification error with different datasets and the selected features only. These results demonstrated that the proposed method is a better tool for feature selection.
|Dataset||K-NN algorithm||Linear kernel||Gaussian kernel|
|Only selected features||21.05||0.07|
4.2.2 Application to osteoporosis study
Osteoporosis is a bone disorder which is largely attributable to the increased bone resorption and/or decreased bone formation by osteoclasts and osteoblasts, respectively. We apply the proposed method to a multi-omics datasets from an osteoporosis study [54, 32, 4]. This dataset included genome, epigenome, transcriptome, metabolome, and the lipidome data from Caucasian females with high BMD and with low BMD. Table 6 presents the configuration of the five datasets along with phenotype. The original genomic data consists of SNPs which annotated to 25, 442 genes. The epigenome data consists of methylations which annotated to genes. This methylation analysis focus on gene level/sliding windows. The transcriptome data has genes expression profiles. The metabolome data contains of chemical taxonomies including metabolic profiles. The lipidome data covers chemical taxonomies including lipid profiles.
|Covariate||Age, Height and Weight|
|Phenotype||Bone mineral density (BMD)|
|Low BMD and High BMD|
|ID||ID||Score||Score||Score||(Total No.): PMID||Disease|
|ABL1||GH09J130833||: , ; ;||Bone marrow; Brain, Endometrium; Lymphy node|
|AP1G1||GH16J071807||; ; ; ;||Bone marrow; Kidney; Brain, Testis; Thyroid|
|GCNT7||GH20J056524||: ;||Bone marrow; thyroid|
|PANK1||GH20J056524||: ; ; ; ;||Liver; Kidney; Bone marrow; Thyroid|
|ZNF35||GH03J044648||: ; ; ; ;||Testis; Thyroid; Urinary bladder, Endometrium; Brain, Bone marrow|
|ZNF114||GH07J151024||: ; ; ; ;||Thyroid, Brain; Placenta|
To apply the proposed method, we considered each gene of the genome, epigenome, and transcriptome data and each chemical taxonomy of the metabolome and lipidome data as a single testing unit. To make it feasible, we reduced the dimensionality of the genome, epigenome and transriptome data using four approaches including the kernel based gene shaving approach . Figure 5
presents the Venn-diagram of the t-test, canonical correlation analysis based gene shaving (CCAOut), kernel canonical correlation analysis based gene shaving (KCCAOut) and Linear Models for Microarray based gene shaving (LIMMA) methods for three datasets: (a) genome data (b) epigenome data, and (c) transcriptome data. From this figure, we observed that the number of disjointedly selected genes from CCOut, LCCAOut, KCCAOut and LIMMA methods for the genome data are, , , ; for the epigenome data are , , , ; and for the transcriptome data are , , , , respectively. All methods selected , and common genes for the genome, epigenome and transcriptome data, respectively. In addition, we have and chemical taxonomies of metabolome and lipidome dataset, respectively. In total, we get ( quinlets to test. Figure 6, shows the plot of for quinlets. The overall test gives us significant quinlets (). The vertical solid, doted and double doted lines correspond to the p-values of , and , respectively. Table 7 presents the number of significant genes and chemical taxonomies using different adjusted p-values of the proposed method. A list of genes (genome, epigenome and transcriptome data) and list of chemical taxonomies (metabolome and lipidome data ) are tabulated in Table 8.
We also extracted the gene-gene interaction networks using STRING: functional protein association networks. Figure S1, (in the Supplementary material) shows the gene-gene networks based on the protein interactions among the selected () genes by the proposed method of the genome, epigenome, and transcriptome datasets. In this figure, the color saturation of the edges represents the confidence score of a functional association. In addition, network analysis demonstrates that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, PPI enrichment p-values are and , respectively. This network analysis confirms that the selected genes have significantly more interactions than expected. It also indicates that the selected genes may function collaboratively.
To confirm the biological roles of the most prominent identified features, we accessed the biomedical and genomic information using the national center for biotechnology information (NCBI, https://www.ncbi.nlm.nih.gov/) tools and the Human gene database (GeneCards: https://www.genecards.org/) . NCBI offers a wide variety of data analysis tools for manipulating, aligning, visualizing, and evaluating biological data. On the other hand, the primary goal of the GeneCards database is the unequivocal identification of enhancer elements and uncovering their connections to genes for understanding gene regulation and the molecular pathways. Recent studies have shown that the bone marrow, kidney, testis, and thyroid diseases have widespread systemic manifestations including their effects on developing osteoporosis. [38, 22, 48, 50]. So, we hypothesized that the selected genes, with the proposed method, would provide more ubiquitous expression (reads per kilo base per million mapped reads (RPKM)) of bone marrow, kidney, testis, and thyroid tissue samples. To that end, we used NCBI’s tools. Figure 7 illustrates the expression patterns of three selected genes (ABL1, GCNT7, ZNF144) across different tissues from 95 human individuals. These results show that these genes have remarkably high RPKM value for bone marrow, kidney, testis, and thyroid tissue samples. These tissue related diseases including their effects may play a role in the development of osteoporosis. To provide insight into the gene regulatory elements (promoters and enhancers) for 6 selected genes (ABL1, AP1G1, GCNT7, PANK1, ZNF35, ZNF144), we used the GeneCards database. Table 9 presents GeneHancer identifier, GeneHancer score, gene association score, total score, major-related diseases, and PubMed database. From this table, we observed that the selected genes have had a remarkable GeneHancer score, gene association score, total score, and literature review in the past studies. According to the disease annotation, the selected genes are highly associated with complex diseases, which are higher risk of developing osteoporosis (the results of all selected genes are in Supplementary material).
To further explore whether that the selected chemical taxonomies of metabolic and lipidomic features will have unique relationships in the low BMD and high BMD groups. We infer a causal relationship with a Hill-Climbing approach among the select chemical taxonomies in each group of both datasets (metabolome and lipidome). The causal relationship of the metabolome and the lipidome datasets: (a) low BMD and (b) high BMD is illustrated in Figure 8 and Figure 9, respectively. In both datasets, we observed that the causal relationships are different in both the low BMD and high BMD for both datasets. For the metabolic data TTRD has an important impact on the low BMD but has an negative impact on high BMD but DA has an important impact on the high BMD. In case of the lipidome data, we observed that the LA has an impact on the low BMD but does not have any impact in the high BMD. By this observation, we concluded that the selected chemical taxonomies may contribute on BMD status but not in general.
Finally, we investigated classification accuracy via only selected features of the proposed method. To classify the low BMD subjects from the high BMD subjects followed by the three classifiers: the KNN and the SVM (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant (from Table 8) effects on the BMD. The classifier is used to classify the low BMD subject from the high BMD subject. Table 10 presents the train classification error with different datasets and the selected features only. These results are also demonstrating that the proposed method is a better tool for feature selection.
|Dataset||K-NN algorithm||Linear kernel||Gaussian kernel|
|Only selected feature|
5 Concluding remarks
The technology of biomedical science has accelerated the cycle of discovery. In this we propose a novel generalized kernel machine approach to identify higher-order composite effects in multi-view biomedical datasets. This semi-parametric approach considers multi-view data as prediction variables to allow comprehensive modeling of complex disease trait.
The power of the proposed method is further demonstrated by its application to both synthesized and two real multi-view datasets, i.e., adolescence brain development and osteoporosis study. In simulation studies, we have shown that the false positive rate of the test for higher-order composite effect is mitigated by fixing the nominal p-value threshold along with other state-of- the-art-methods. From the power analysis, our proposed method not only performs better than other methods. The ROC curves also showed the gain of power by the proposed method in all scenarios.
In adolescence brain development experiments, we found five unique genes and 6 ROIs along with the non-genomic factor that may have significant gender effects on adolescence brain development. In addition, by the gene-gene interaction networks, we confirmed that the selected genes have significantly more interactions than expected. This result implies that the gens and their functional interactions form the backbone of the cellular machinery.
In the osteoporosis study, the network analysis confirmed that the selected genes of genome, epigenome, and transcriptome datasets also have significantly more interactions than expected. It also indicates that the selected genes may function in a collaborative effort. We found that they function in a concerted effort and have biological relevance to osteoporosis. Furthermore, using a causal relationship, we conclude that the selected chemical taxonomies (metabolome and lipidome) may contribute to development of BMD status.
The primary focus of our paper is to comprehensively characterize complex disease (e.g., adolescent brain development, or osteoporosis ). Collectively, we can use this approach to derive a statistic for testing the composite effect. These studies explore novel genomic, epigenomic, transcriptomic and chemical mechanisms to identify corresponding factors. The significant triplet, quartet, quintet of extracted features suggest that these effects might highlight biological targets for drug development. Extrapolating these findings enable an advanced understanding of normal cellular processes. Our proposed method can be applied to the study of any disease models, where multi-view data analysis is commonly used.
We acknowledge that the kernel machine is sensitive to contaminated data, even if bounded positive definite kernels are used. Further work on robust kernel methods including M-estimator based methods. They will allow us to uncover the rich, hidden connections in biological system, and may provide a more comprehensive modeling of complex traits.
This work was partially supported by grants from National Institutes of Health (NIH) [U19AG05537301, R01AR069055, P20GM109036, R01MH104680, R01AG061917], and the Edward G. Schlieder Endowment and the Drs. W. C. Tsai and P. T. Kung Endowment to Tulane University.
-  M. A. Alam. Kernel Choice for Unsupervised Kernel Methods. PhD. Dissertation, The Graduate University for Advanced Studies, Japan, 2014.
M. A. Alam, V. Calhoun, and Y. P. Wang.
Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data.Proceedings of 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB),Seattle, WA, USA, pages 210–2198, 2016.
-  M. A. Alam, Hui-Yi Lin, Hong-Wen Dengc, V. Calhoun, and Y. P. Wang. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. Journal of Neuroscience Methods, 309:161–174, 2018.
-  M. A. Alam, Mohammd Shahjaman, Md. Ferdush Rahman, Fokhrul Hossain, and Hong-Wen. Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data. PLOS One, 14(5):e0217027, 2019.
-  Peter F Bladin Michael M Saling Amee D Baird, Sarah J Wilson and David C Reutens. Neurological control of human sexual behaviour: insights from lesion studies. Journal of Neurology, Neurosurgery & Psychiatry, 74(10):1042–1049, 2007.
-  Ricard Argelaguet, Britta Velten, Damien Arnol, and et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Molecular System Biology, 14:e8124, 2018.
-  Matteo Bersanelli, Ettore Mosca, Daniel Remondini, and et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics, 17 (15):167–202, 2016.
-  K. M. Borgwardt and et al. Protein function prediction via graph kernels,. Bioinformatics, 21:i47–i56, 2005.
-  Joana Cabral, Morten L. Kringelbacha, and Gustavo Decoc. Functional connectivity dynamically evolves on multiple time-scales over a static structural connectome: Models and mechanisms. NeuroImage, 160:84–96, 2017.
-  G. Camps-Valls, J. L. Rojo-Alvarex, and M. Martinez-Romon. Kernel Methods in Bioengineering, Signal and Image. Idea Group publishing, London, 2007.
-  Sebastian Canzler, Jana Schor1, Wibke Busch, Kristin Schubert, and et al. Prospects and challenges of multi‑omics data integration in toxicology. Archives of Toxicology, 94:391–388, 2020.
-  Gustavo Deco, Giulio Tononi, Melanie Boly, and Morten L. Kringelbach. Rethinking segregation and integration: contributions of whole-brain modelling. Nature Reviews Neuroscience, 2015.
-  Simon Fishilevich, Ron Nudel, and et al. Genehancer: genome-wide integration of enhancers and target genes in genecards. Database, 2017:1–17, 2017.
-  T. Ge, T. E. Nichols, D. Ghoshd, E. C. Morminoe, am M. R. Sabuncu J. W.Smoller, and the Alzheimer’s Disease Neuroimaging Initiative. A kernel machine method for detecting effects of interaction between multidimensional variable sets: An imaging genetics application. NeuroImage, 109:505–514, 2015.
-  A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 20:585–592, 2008.
-  Enrique C.A. Hansen, Demian Battaglia, Andreas Spiegler, Gustavo Deco, and Viktor K. Jirsa. Functional connectivity dynamics: Modeling the switching behavior of the resting state. NeuroImage, 105:525–535, 2015.
-  Y. Hasin, M. Seldin, and A. Lusis. Multi-omics approaches to disease. Genome Biol, 18 (83):1:15, 2017.
-  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2009.
-  T. Hofmann, B. Schölkopf, and J. A. Smola. Kernel methods in machine learning. The Annals of Statistics, 36:1171–1220, 2008.
-  S. Huang, K. Chaudhary, and L. X. Garmire. More is better: Recent progress in multi-omics data integration methods. Frontiers in Genetics, 8:Article 8, 2017.
-  I. Ionita-Laza, S. Lee, V. Makarov, J. D. Buxbaum, and X. Lin. Sequence kernel association tests for the combined effect of rare and common variants. American Journal of Human Genetics, 92:841–853, 2013.
-  Pascale Khairallah and Thomas L. Nickolas. Management of osteoporosis in ckd. Clinical Journal of the American Society of Nephrology, 13(6):962–969, 2018.
-  Minseung Kim and Ilias Tagkopoulos. Data integration and predictive modeling methods for multi-omics datasets. Molecular Omics, 14:8–25, 2018.
-  S. Y. Kung. Kernel Methods and Machine Learning. Cambridge University Press, New York, 2014.
-  X. Lin D. Ghosh M. P. Epstein L. C. Kwee, D. Liu. A powerful and flexible multilocus association test for quantitative traits. Annals of Human Genetics, 82(2):386–397, 2008.
-  G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004.
-  S. Li and Y. Cui. Gene-centric gene-gene interaction: a model-based kernel machine method. The Annals of Applied Statistics, 6(3):1134–1161, 2012.
D. Liu, X. Lin, and D. Ghosh.
Semiparametric regression of multidimensional genetics pathway data: least squares kernel machines and linear mixed model,.Biometrics, 630(4):1079–1088, 2007.
-  Feiyang Ma, Brie K. Fuqua adn Yehudit Hasin, and et al. A comparison between whole transcript and 3’ rna sequencing methods using kapa and lexogen library preparation methods. BMC Genomics, 20 (9):1–12, 2019.
Tianle Ma and Aidong Zhang.
Integrate multi-omics data with biological interaction networks using multi-view factorization autoencoder (MAE).BMC Genomics, 20 (11): 994::1–11, 2019.
-  A. C. A. Nascimento, R. B. C. Prudêncio, and Ivan G. Costa. A multiple kernel learning algorithm for drug-target interaction prediction. BMC Genomics, 17:46:1–16, 2016.
-  Chuan Qiu, Fangtang Yu, Kuanjui Su, Lan Zhang, and et al. Multi-omics data integration for identifying osteoporosis biomarkers and their biological interaction and causal mechanisms. iScience, 23(2):100847, 2020.
-  W. B Bilker R. C. Gur, F. Gunning-Dixon and et al. Sex differences in temporo-limbic and frontal brain volumes of healthy adults. Cereb Cortex, 12(9):998–1003, 2002.
-  N. Rappoport and R. Shamir. Nulti-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46 (20):10546–10562, 2018.
-  Nicholas J. W. Rattray, Nicole C Deziel, Joshua D. Wallach, and et al. Beyond genomics: understanding exposotypes through metabolomics. Human Genomics, 12 (4):1–14, 2018.
-  Marylyn D. Ritchie, Ruowang Li Emily R. Holzinger and, Sarah A. Pendergrass1, and Dokyoon Kim. Methods of integrating data to uncover genotype–phenotype interactions. Nature Review: Genetics, 16:88– 97, 2015.
-  B. D. Moor S. Yu, L-C. Tranchevent and Y. Moreau. Kernel-based Data Fusion for Machine Learning. Springer, Verlag Berlin Heidelberg, 2011.
-  A. Sanghani-Kerai, L. Osagie-Clouard, G. Blunn, and M. Coathup. The influence of age and osteoporosis on bone marrow stem cells from rats. Bone & Joint Research, 7(4):289–297, 2018.
-  Anita Sathyanarayanan, Erik W Thompson Rohit Gupta, and et al. A comparative study of multi-omics integration tools for cancer driver gene identification and tumour subtyping. Briefings in Bioinformatics, bbz121:e8124, 2019.
-  F. E. Satterthwaite. An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6):110–114, 1946.
-  Andreas Schmidt, Ignasi Forne, and Axel Imhof. Bioinformatic analysis of proteomics data. BMC Systems Biology, 8 (53):1–7, 2014.
-  B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge MA, 2002.
-  T Sölzen and et al. An overview and management of osteoporosis. European Journal of Rheumatology, 4:46–56, 2017.
-  L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and
Kernel choice and classifiabilit forRKHS embeddings of probability distributions.Advances in Neural Information Processing Systems, 21:1750–1758, 2009.
-  Jing Sui, Rongtao Jiang, Juan Bustillo, and Vince Calhoun. Neuroimaging-based individualized prediction of cognition and behavior for mental disorders and health: Methods and promises. Biological Psychiatry, BPS 14136:Articles in Press, 2020.
-  D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, h A. Rot, A. Santos, M. Kuhn K. P. Tsafou, and, P. Bork, L. J. Jensen, and C. von Mering. STRING v11.0: Protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43:531–543, 2007.
-  Dominika Tuchendler and Marek Bolanowski. The influence of thyroid dysfunction on bone metabolism. Tuchendler and Bolanowski Thyroid Research, 7(12):1–5, 2014.
-  Birkan Tunc, Berkan Solmaz, and et al. Establishing a link between sex-related differences in the structural connectome and behaviour. Philosophical Transactions Society B, 371:20150111, 2015.
-  P. M. Willemse, N. A. T. Hamdy, and et al. Changes in bone mineral density in newly diagnosed testicular cancer patients after anticancer treatment. The Journal of Clinical Endocrinology & Metabolism, 99(11):4101–4108, 2014.
-  M. C. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). American Journal of Human Genetics, 89:82–93, 2011.
-  K. K. Yan, H. Zhao, and H. Pang. A comparison of graph- and kernel-based – omics data integration algorithms for classifying complex traits. BMC Bioinformatics, 18:539:1–13, 2017.
-  Kui Yang and Xianlin Han. Lipidomics: Techniques, applications, and outcomes related to biomedical sciences. Trends in Biochemical Sciences, 41 (11):954–969, 2016.
-  Fangtang Yu, Chuan Qiu, Chao Xu, , and et al. Mendelian randomization identifies cpg methylation sites with mediation effects for genetic influences on bmd in peripheral blood monocytes. frontiers in Genetics, 11(60):1–14, 2020.
-  Irene Sui Lan Zeng and Thomas Lumley. Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinformatics and Biology Insights, 12:1–16, 2020.
-  Ni Zhao, Jun Chen, Ian M. Carroll, and et al. Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. The American Society of Human Genetics, 96:979–807, 2015.
-  Wei Zheng, Hongfei Lin, and Zhehuan Zhao. A graph kernel based on context vectors for extracting drug–drug interactions. Journal of Biomedical Informatics, 61:34–43, 2016.