A generalized kernel machine approach to identify higher-order composite effects in multi-view datasets

04/29/2020
by   Md. Ashad Alam, et al.
0

In recent years, a comprehensive study of multi-view datasets (e.g., multi-omics and imaging scans) has been a focus and forefront in biomedical research. State-of-the-art biomedical technologies are enabling us to collect multi-view biomedical datasets for the study of complex diseases. While all the views of data tend to explore complementary information of a disease, multi-view data analysis with complex interactions is challenging for a deeper and holistic understanding of biological systems. In this paper, we propose a novel generalized kernel machine approach to identify higher-order composite effects in multi-view biomedical datasets. This generalized semi-parametric (a mixed-effect linear model) approach includes the marginal and joint Hadamard product of features from different views of data. The proposed kernel machine approach considers multi-view data as predictor variables to allow more thorough and comprehensive modeling of a complex trait. The proposed method can be applied to the study of any disease model, where multi-view datasets are available. We applied our approach to both synthesized datasets and real multi-view datasets from adolescence brain development and osteoporosis study, including an imaging scan dataset and five omics datasets. Our experiments demonstrate that the proposed method can effectively identify higher-order composite effects and suggest that corresponding features (genes, region of interests, and chemical taxonomies) function in a concerted effort. We show that the proposed method is more generalizable than existing ones.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 11

page 13

10/18/2018

A Self-Organizing Tensor Architecture for Multi-View Clustering

In many real-world applications, data are often unlabeled and comprised ...
07/14/2017

Kernel Method for Detecting Higher Order Interactions in multi-view Data: An Application to Imaging, Genetics, and Epigenetics

In this study, we tested the interaction effect of multimodal datasets u...
01/13/2022

A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery

Many statistical machine approaches could ultimately highlight novel fea...
06/14/2020

Multi-view Low-rank Preserving Embedding: A Novel Method for Multi-view Representation

In recent years, we have witnessed a surge of interest in multi-view rep...
02/15/2021

KNH: Multi-View Modeling with K-Nearest Hyperplanes Graph for Misinformation Detection

Graphs are one of the most efficacious structures for representing datap...
02/05/2021

A Variational Information Bottleneck Approach to Multi-Omics Data Integration

Integration of data from multiple omics techniques is becoming increasin...
08/12/2021

Analyzing hierarchical multi-view MRI data with StaPLR: An application to Alzheimer's disease classification

Multi-view data refers to a setting where features are divided into feat...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Modern technology allows us to overcome the challenges of assembling a multi-view dataset with a powerful combination of flexibility and low cost. Human complex diseases are typically expressed by the aberrant interplay of multiple biological sources and environmental factors. While single biomedical data view such as genome (whole genome sequencing), transcriptome (RNA sequencing), epigenome (e.g., DNA methylation), proteome, metabolome, lipidome, and medical images offers an essential framework for detecting biological variants contributing to complex diseases, an in-depth understanding of biological mechanisms by examining only single data may not suffice. Most biomedical data analysis methods are applicable to single or dual-views of datasets, without making full use of multi-view datasets. In addition, multi-view biomedical datasets are often complementary to each other and correspond to different feature spaces [11, 30, 23, 34, 17]. Thus, their comprehensive and systematic analysis has the potential to uncover unknown biological mechanisms of complex diseases (Figure 1). With the advent of recent technologies that make assembling of multi-view biomedical datasets (e.g., multi-omics and imaging scans) of a complex diseases possible, development of efficient analytic frameworks for such datasets is becoming an active area of scientific inquiry [29, 35, 20, 53, 41]

. For an effective diagnosis of a complex disease, different types of medical imaging techniques, including computer tomography (CT), ultrasound, functional magnetic resonance imaging (fMRI), structural MRI (sMRI), positron emission tomography (PET) scans, diffusion tensor imaging (DTI), and X-Ray are used

[46, 3, 9, 16, 12].

The high demand for characterizing complex diseases in biomedical science and the available technologies justify the need for more effective biomedical data integration approaches. Linear data integration approaches including supervised, unsupervised, semi-supervised learning, multi-omics factor analysis, and highlight matrix factorization methods are used to gain a deeper, more holistic understanding of the biological mechanism for complex diseases

[39, 30, 6, 36, 18]. These are widely used and validated approaches but they perform poorly when data structure is non-linear and data comes from a multi-modal distribution [19]. As a consequence, non-linear integrated approaches (e.g., kernel based machine) are an important feature in comprehensive analysis of multi-view biomedical datasets [31, 8, 26]. The positive definite kernel based machine approach can overcome the non-linearity problem as well as that of dimensionality of multi-view biomedical datasets [52, 57, 1, 2, 42]. This approach can be useful in effective integration of multi-view biomedical data.

In this study, we introduce a generalized kernel machine approach to identify higher-order composite effects (GKMAHCE) in multi-view datasets. While several kernel machine approaches have been employed to identify marginal, interaction effects of datasets with continuous data, the underlying challenge remains of how to make a generalized kernel machine based model to estimate marginal, interaction, and composite effects of multi-view dataset with categorical data, which is often the case in complex diseases

[55, 7, 56, 21, 10]. This proposed approach considers multi-view data as predictor variables to allow more thorough and comprehensive modelling of complex traits.

Figure 1: An illustration of different omics data, imaging scans, and environment factors along with their global, marginal, interaction and composite on a complex trait.

Here, we compared the performance of our proposed method with that of existing methods using synthesized datasets and real multi-view adolescence brain development and osteoporosis datasets, with brain imaging and five- omics data. To realize adolescence brain development, sex differences in human brain are essential to understand their anatomical foundations in the brain with three view datasets: genomic, non-genomic and fMRI datasets [40]. Osteoporosis is a bone disorder, which results from a loss of bone mineral density (BMD) [43]. For more information of these data, we refer the reader to Section . Within each dataset, this study explores novel genome, epigenome, transcriptome, and chemical mechanisms, as well as robustly and efficiently identifies corresponding factors. To validate the results, we performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, the gene ontology analysis of biological process category and gene-gene interaction networks analysis. We also draw causal connections, gene- gene associations, and gene-chemical associations for both of the datasets. This paper is highly novel in the following aspects:

  • To the best of our knowledge, we address a generalized kernel machine approach to identify higher-order composite effects.

  • For statistical testing of the marginal, the interaction, and the composite effects, we derive the test statistics.

  • To show that the proposed approach is highly efficient, we experiment with synthesized datasets and three-view of brain development in adolescence and five-view of osteoporosis datasets. As far as we know, this is the first five omics integration study.

  • Finally, we perform a comprehensive pathway and network-based analysis for the biological validation of the selected genes.

The rest of this paper is organized as follows. In Section 2, we discuss some contemporary related work. In Section , we explain the proposed approach, generalized kernel machine approach to identify higher-order composite effects in multi-view datasets and test statistics for statistical testing for higher order composite effects.. In Section , we describe the experiments conducted on both synthesized datasets and two real datasets: imaging and genetics for adolescence brain development and multi-omics datasets of osteoporosis studies. We conclude the paper with a discussion on major findings and future research in Section . Detailed derivation of the proposed method along with Satterthwaite approximation to the score test and the applications to real dataset can be found in the supplementary materials.

2 Related work

Kernel based multi-view data integration approaches allow the joint analysis of multiple data types to provide a global view of the biological functions and offer insights into the nature of the interactions between the different datasets. These methods offer useful ways to learn how an extensive collection of genetic variants are associated with complex phenotypes and can be used to explore the relationship between genetic markers and a disease state [10, 37, 24].

The marginal, interaction, and composite effect identifications are becoming a common challenge to multidimensional imaging and multi-omics data analysis. Preliminary works in kernel machine methods have been boldly pursued by [28], where a single modal dataset was used to test for a genetic pathway effect. [27] also proposed a kernel machine-based method for gene-gene interactions. They treated each gene as a testing unit for gene-gene interactions. A kernel machine method was then proposed for detecting multiple factor interactions where a smoothing spline-ANOVA decomposition method was adopted [27]. However, these approaches only use single or pairwise datasets. A sequence kernel association tests (SKAT) for genome wide association studies is introduced in [21]. While SKAAT is widely used in genomics studies, it is limited to only a single omics. In addition, [14] have proposed a kernel machine method for detecting the effects of interactions between multidimensional variable sets. This is an extended model of [27] to jointly model genetic and non-genetic features and their interactions. A microbiome regression based kernel association test (MiRKAT) has been proposed by [56], which directly regresses the outcome on a single omics dataset, microbiome profiles, through the semi-parametric kernel machine regression framework.

Recently, [3] has proposed a kernel machine method for detecting higher-order interactions in three views datasets which was applied to schizophrenia with a continuous trait . Although a generalized version of such methods for more than three datasets as well as both categorical and continuous traits remains elusive, such a method for multi-view biomedical data analysis is necessary to discover new information about biological systems and complex diseases.

3 Approach

Suppose we have independent identical distributed (IID) subjects with covariates and m-view datasets, . Assuming that follows a distribution in the exponential family with density

(1)

where and are the location and scale parameters, and are known functions, and

is know weight, respectively. The mean and variance of

satisfy and . In the following generalized semiparametric model, we associate the output with covariates including intercept and -view datasets:

(2)

where is a known monotone link function, is a vector of covariates including the intercept for the th subject, is a vector of fixed effects, is an unknown function on the product domain, with .

Error family Link Inverse of link Use
Gaussian Identity, g(y)=y Normally distributed set of data,
Poisson Log, g(y)=log(y) Exponential, exp(y) equidispersed count set of data
Binomial Logit, Binary set of data, 0/1
Gamma Inverse gamma Positive continuous set of data,
Table 1: Family and link functions of generalized liner model

According to the ANOVA decomposition, the function, can be extended as:

(3)

where, ’s are the main effects of the respective dataset (), ’s are pairwise interactions effects, ’s are the interactions effects of three datasets and so on. The functional space, RKHS, decomposes as:

(4)

equipped with an inner product, and a norm

This generalizes kernel regression by allowing the model to be related to the response variable via a link function including kernel regression, logistics kernel regression, and Poisson kernel regression. For the binary data, the link function

provides the logistic kernel regression; for the count data, provides the Poisson kernel regression; for Gaussian data, gives classical kernel regression. The main aspect of this paper is applying this to five views: genome, epigenome, transcriptome, metabolome, and lipidome along with Low BMD and High BMD information of the subject. To do this, assume that we have IID subjects under investigation; is a binary (LBMD or HBMD) phenotype for the -th subject. We associate the clinical covariates (e.g., age, weight, height) with five views including genome, epigenome, transcriptome, metabolome, and lipidome.

Let denoted the covariates, where is a measure of the -th subject.

Let , , , , and , are a gene with SNP markers, a gene with methylation profiles, a gene with RNA-Seq, a chemical taxonomy class with metabolite profiles, and a chemical class with lipid profiles of the -th subject, respectively. Under this setting, Eq. (2), Eq. (3) and Eq. (4) becomes:

(5)

where .

3.1 Model estimation

We can estimate the function

by minimizing the penalized loss function of Eq. (

5) as:

(6)

where L is a loss function, is a roughness penalty with tuning parameter

. By considering the negative log-likelihood of the binomial distribution,

the problem becomes as a kernel logistic regression problem:

(7)

It is known that the complete function space of Eq. (3), , has the orthogonal decomposition. Hence the function can be decomposed accordingly (see the supplementary material for details).

It is convenient and mathematically friendly to minimize numerically. Let

be the probability of the

observation and

Following many in the literature, we have a liner mixed effects model for such that

(8)

where is a coefficient vector of fixed effects, , , , , , and are independent random effects with normal distributions. The relationship of Eq. (6) and Eq. (8) confirm that all of the effects obtained by optimizing the loss function in Eq. (6), are the equal to the best linear unbiased predictors (BLUPs) of the linear mixed effects model in Eq. (8). This relationship makes a accessibility for testing the variance component instead of testing the nonparametric function in hull hypothesis, which we can estimate using the restricted maximum likelihood (ReML) approach (see the supplementary material for details).

3.2 Statistical testing

In this section, we address the test statistic of the overall effect, marginal effects, different interaction effects, and different composite effects.

3.2.1 Overall hypothesis testing

According to our model, the testing overall effect is

is equivalent to test the variance components in Eq.(8),

It is known that kernel matrices are not block-diagonal and the parameter in variance component analysis are placed on the edge of the parameter space when the null hypothesis is true. While the asymptotic distribution of a likelihood ratio test (LRT) statistic is neither a chi-square distribution nor a mixture chi-square distribution under the null hypothesis, we can use a score test statistic on the restricted likelihood

[3, 28]. We derive score test statistic for a generalized kernel machine approach in multi-view datasets, Eq. (8). Assuming that , where . We can write the restricted log-likelihood function of Eq. (8) as follows

(9)

Using the partial derivative of Eq. (9) with respect to each variance component. We derive the estimate of the variance components and the score test statistic which is defined as

(10)

where and is the maximum likelihood estimator (MLE) of the logistic regression coefficient under the null model , is the variance of . is quadratic function of the variable , which follows a weighted mixture of the chi-square distribution under the null hypothesis. By the Satterthwaite method, we can approximate the distribution of to a scaled chi-square distribution, (). To estimate the the scale parameter

and the degrees of freedom

we can we can use the method of moments on the mean (

) and variance () of the test statistic and we obtain and . Finally, to get the value of an experimental score statistic we use the scaled chi-square distribution .

3.2.2 Testing marginal effects

By using the respective kernel matrix, we can test the marginal effect for each view data. We known that the testing marginal effect

is equivalent to test the variance components in Eq.(8),

The test statistic is

This is same as the sequence kernel association test (SKAT) [51].

3.2.3 Testing interaction effect

By considering no marginal effects, i.e., , we can also test the interaction effects. We know that the testing interaction effect

is equivalent to test the variance components in Eq.(8),

The test statistic is

Also by considering no marginal effects, i.e., , and all second order interactions are zero we can also test the 3rd order interaction effects.

The 3rd order test statistic is

and so on.

3.2.4 Testing composite effects

If lower order effects are statistically significant, we want to test higher order effects is called composite hypothesis testing. To test the 3rd order interaction effect, we mansion that testing the null hypothesis is equivalent to testing the variance component: . Let , and all , and are model parameters under the null model.We formulate a test statistic

(11)

where , and is the projection matrix under the null hypothesis.

Similar to the overall effect test, we can use the Satterthwaite method to approximated the distribution of higher order intersection test statistic by a scaled chi-square distribution with scaled and degree of freedom i.e., . The scaled parameter and degree of freedom are estimated by MOM, and , receptively. In practices, the unknown model parameters are estimated by their respective ReML estimates under the null hypothesis. Lastly, the value of an observed higher order interaction effect test score statistic is obtained using the scaled chi-square distribution . and so on.

3.2.5 Kernel machine based model selection

The result of kernel-based machine learning approaches is highly influenced by the kernel and its parameters. The model selection with a suitable kernel is essential in machine learning. As a result, machine learning approaches suffer from weak selection of a suitable model. Suppose

is a positive definite kernel. A linear positive definite kernel, , where , on can define the similarity measure in the Euclidean space. This liner kernel suffers from more complexity in the function space for high-dimensional datasets. Instead of linear kernel, we can use a polynomial kernel to make it possible to use higher order correlations among data points. A polynomial kernel, has two free parameters: c is a tradeoff between higher order and lower order in the polynomial and d is the degree of the polynomial. The dependency measure depends on the finite bounded degree. Another limitation of linear and polynomial kernel is boundedness, i.e., both kernels are unbounded.

From a finite dimensional space the Gaussian kernel can map the input space (data space) into a infinite dimensional space [42] and is defined as:

It has a free parameter to select but consist of numerous theoretical properties including boundedness, consistence, universality, robustness, and so on [45]. To fix the free parameter, we can use the median of the pairwise distance[15, 44].

For a multi-omics study, to captures the pairwise similarity across a number of SNPs in each gene we can use a identity-by-state (IBS) kernel (nonparametric function of the genotypes) [25]:

where is the number of SNP markers of the corresponding gene. It is assumption free on the type of genetic interactions. Hence, it can capture any genetic effects on the phenotype. In this paper, we used the IBS for the genetic dataset and for all other datasets, respectively.

4 Experiments

We conducted experiments on both the Monte Carlo simulation and two real multi-view datasets: an imaging and genome datasets from the adolescence brain development and five omics datasets from osteoporosis studies, respectively. The goal is to select a feature based on significant composite effects among the datasets. We considered the Gaussian kernel (the median of the pairwise distance as the bandwidth) For all datasets but the genetic dataset. For the genetic dataset, we considered the IBS kernel [15, 3]. The optimization of the proposed approach is based on Fisher’s scoring algorithm (the ReML algorithm). To that end, we follow the parameters setting as in [3] for the ReML algorithm applicable. To overcome the potential of a local minima, we considered a set of initial points in (0, 1) and picked the point which maximized the ReML algorithm.

We compare our simulation and real data results with three other methods: principal component SKAT, partial principal component regression (pPCAR), and full principal component regression (fPCAR)) as logistic regression setting. The SKAT approach is a flexible and computational in GWAS [51, 21]

. To identify an interaction effect of two genes, a principal component analysis based approach has been proposed by

[27]. [3] has been considered each method as a simple regression setting. Here, we switched each method as a logistic regression setting.

4.1 Simulation studies

To evaluate the performance of the proposed method, we do simulation studies. To that end, we use the following model to generate synthesized qualitative phenotype:

(12)

where is a vector of covariates (height and weight) with an intercept of -th subject () and is a coefficient vector, the random error,

, follows the Gaussian distribution with zero mean and unit variance and

is the standard deviation of the error and is fixed to

. Here ’s represent five different datasets. Each function ’s is defined as in different non-linear form.

In this setting, we simulated data under different values of parameters to evaluate the performance of the test. For example, means all effects have vanished and we can examine the false positive rate for the score test under the overall effect. Similarly, we take into account the main effects (2nd order interaction effects) but no higher order interaction effects to evaluate the power of the score test. To get consist results, we performed simulations for each setting.

The simulation results of different methods can be found in Table 2. In this table, we report the power of that higher-order composite score test in different parameter settings. Overall, we made two observations. First, we noticed that the false positive rate of the test for higher-order composite effects score test was controlled by fixing the nominal value threshold to . All four methods have the same observations for the false positive rate. Second, when considering the power analysis (), we observed that the proposed method performs better than other methods and its power exceeds (Table 2). On the other hand we observed that the state-of-the-art methods (pPCAR, fPCAR and SKAT) can significantly overstate the false positive rates and result in significant loss of statistical power.

We also plot several visualization of the ROC with all interrelated parameters () values are random but the linear parameter is fixed to . We allocate each number with probability

. A random number is uniformly distributed either in a range or at

. In Figure 2, we look into the results from the receiver operating characteristic (ROC) with three sample sizes, . The sensitivity is plotted against (1- specificity) for each -value in the range of with a step size . The power gain of the proposed method relative to the alternative ones is evident in all situations.

Parameters Simulation - I
GKMAHCE State-of-the-art methods
pPCAR fPCAR SKAT
(, , , , )
(0.1, 0, 0, 0, 0)
(0, 0, 0, 0, 0.1)
(0, 0, 0, 0, 0.5)
(0, 0, 0, 0, 1)
(1, 1, 1, 1, 0.1)
(1, 1, 1, 1, 0.5)
(1, 1, 1, 1, 1)
Table 2: The power of higher-order composite score test of the proposed method (GJNAGCE), and state-of-the-art methods using dimension reduction regression (e.g., pPCAR, fPCAR) and sequence kernel association test (SKAT)
Figure 2: The Receiver Operating Characteristics of the kernel methods and relevant ones for higher-order composite effects detection with three sample sizes, for all interrelated parameters () values are random but the linear parameter is fixed to .

4.2 Real data analysis

We apply the proposed method to two real multi-view datasets including an fMRI imaging dataset and five omics datasets from the adolescence brain development and osteoporosis studies, respectively.

4.2.1 Application to adolescence brain development

The Philadelphia Neurodevelopmental Cohort (PNC) is a research initiative that focuses on characterizing brain and behavior interaction with genetics. Sex differences in human brain are essential to understand their anatomical foundations in the brain [40, 49]. Our goal here is to select features using the test composite effect on the sex differences in cognition in youth. To that end we consider three view datasets: genomic, non-genomic and fMRI imaging datasets. Table 3 presents the configuration of the three datasets along with phenotype. The gene consists of genetic features (SNPs); the non-genomic data consists of clinical measurements; and the ROI consists of imaging features (voxels). We consider each gene, all non-genomic data and each ROI as a unique testing unit. To apply the proposed method, first we selected most significant genes out of genes using the SKAT approach and ROIs out of (since ROIs contain missing values). In total, we get ( triplets to test. The overall test gives us significant triplets () out of . In Figure 3, we visualize the plot of for triplets. The vertical solid, doted and double doted lines correspond to the p-values of , and , respectively. At p-values of and we detected and significant genes, as well as and significant ROIs, respectively.

Genomic Non-genomic Imaging
SNP Clinical measurement Voxel
95639 116 921429
Feature Gene ROI
13150 264
Subject 863 8719 897
common 798
Covariate Age
Phenotype Gender
421 male and 377 female
Table 3: The configuration of the three data views of adolescence brain development

Table 4 presents the ReML estimates of all parameters, , , , , , , , and the -values for both the proposed and SKAT methods for each of the triplets. By the proposed method, these triplets were identified to have significant interactions at a level of . At this -value, we observe unique genes (RPA3OS, ESR1, WWOX, RGS7, PALLD) and ROIs (CFL.L: Left cerebrum frontal lobe; CSL.R: Right cerebrum sub lobar; CLL.R: Right cerebrum limbic lobe; CSS.R: Right cerebrum cub-lobar; CFL.R: Right cerebrum frontal lobe), that may have significant effects of sex on adolescence brain development [5, 33].

To test whether genes interact with each other, we constructed a gene-gene interaction network analysis using STRING which imports gene/protein association knowledge from databases on both physical interactions and curated biological pathways [47]. In STRING (e.g. STRING 11.0), the simple interaction unit is the functional relationship between two genes. This relationship can play a role to a common biological purpose. Figure 4 illustrates the gene-gene network based on the protein interactions with the selected genes ( including 7 individual selected genes). In this figure, the color saturation of the edges represents the confidence score of a functional association. Further network analysis shows that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, -values are , , , , for , respectively. This analysis confirmed that the selected genes had significantly more interactions than expected and indicates that these genes may function in a concerted manner.

Figure 3: The plot of with triplets (gene, clinical measurement and region of interest.
GKMAHCE SKAT
Genetics Imaging OVA HOI HOI
CM
Table 4: The selected significant genes and ROIs using the proposed method and the SKAT. The value was set to be (CM: clinical measurement)
Figure 4: Network analysis of 7 individual (for example) selected genes ((a), (b), (c), (d), (e), (f), and (g)) and all selected genes with the proposed methods: the network has significantly more interactions than expected. The color saturation of the edges represents the confidence score of a functional association. The colored nodes, white nodes, empty nodes, and filled nodes are noted for query proteins and the first shell of interactors, second shell of interactors, proteins of unknown 3D structure, and some 3D structure is known or predicted, respectively.

For investigating classification accuracy of the proposed method, the genome, non-genome, fMRI imaging datasets and only selected features via the proposed method are used to classify the sex-related brain differences followed by the three classifiers: the K-nearest neighbors algorithm (KNN) and the support vector machine (SVM) (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant effects on the gender:

genes and ROIs only (from Table 4). The classifier is used to distinguish the adolescence brain development between the female and the male. Table 5

presents the train classification error with different datasets and the selected features only. These results demonstrated that the proposed method is a better tool for feature selection.

Dataset K-NN algorithm Linear kernel Gaussian kernel
Genome 23.06
Imaging
Non-genome
Only selected features 21.05 0.07
Table 5: The training classification error of discriminating low BMD subjects from high BMD with the K-nearest neighbors (K-NN) algorithm and the support vector machine (linear kernel and Gaussian kernel)

4.2.2 Application to osteoporosis study

Osteoporosis is a bone disorder which is largely attributable to the increased bone resorption and/or decreased bone formation by osteoclasts and osteoblasts, respectively. We apply the proposed method to a multi-omics datasets from an osteoporosis study [54, 32, 4]. This dataset included genome, epigenome, transcriptome, metabolome, and the lipidome data from Caucasian females with high BMD and with low BMD. Table 6 presents the configuration of the five datasets along with phenotype. The original genomic data consists of SNPs which annotated to 25, 442 genes. The epigenome data consists of methylations which annotated to genes. This methylation analysis focus on gene level/sliding windows. The transcriptome data has genes expression profiles. The metabolome data contains of chemical taxonomies including metabolic profiles. The lipidome data covers chemical taxonomies including lipid profiles.

Genome Epigenome Transcriptome Metabolome Lipidome
SNP Methylation RNA-sequencing profiling profiling
3997535 46690 291 56
Feature Gene CT class
25442 4676 22682 27 7
Subject 129 128 128 136 136
common 108
Covariate Age,  Height  and  Weight
Phenotype Bone mineral density (BMD)
Low BMD and High BMD
Table 6: The configuration of the five data views of the osteoporosis study (CT: chemical taxonomy)
Figure 5: The Venn diagram of the number of selected genes for CCOut, LCCAOut, KCCAOut and LIMM methods with (a) the genome data; (b) the epigenome data and (c) the transcriptome data.
Figure 6: The plot of with quinlets (overall significance in the mode out of .
-values Gene Chemical taxonomy
Genome Epigenome Transcriptome Metabolome Lipidome
Table 7: The number of significant genes (genome, epigenome and transcriptome datasets) and chemical taxonomy (Mmetabolome and Lipidomoe datasets) of the proposed method using adjusted p-values)
Genome ABCB8 ACTBP13 AP003122 DOCK10 EXOSC3P2 FGF5 GPLD1 LINC00461 CHGB NIPA2P5 NIPAL1
PANK1 RPL13AP3 RPLP2 SCRN2 TRIM65 ZNF114 ZNF35
Epigenome A1CF ALLC AP1G1 ATP2B2 BACH2 CA13 CCDC111 DNER DYM FASTKD2 GCNT7
HCST INPP5F KIAA1305 KNDC1 NGTN LGTN LMNB1 LOC283871 MAPKAPK2 MAST4 MDC1
NOC2L NRL SLC44A3 TRAIP UBE2J2 VPS26A ZNF700
Transcriptome ABI1 ACER3 ADCY3 ALOX15B AUNIP BSND BTNL3 CDKN1A C10orf105 CROCC CSMD3
CTSLP2 GALT HEXDC HLA-DRA HM13 HMBOX1 IGHMBP2 KLRC1 LRRC8C MAGOH MCTS2P
NEURL4 NIN NPHS1 PRPF38B RALBP1 RNF17 SERPINB6 SLC38A9 SOX1 TECTA TGFBR3L
USP17L17 USP53
Metabolome Azoles BENZ BSD CAD CBS CXAD DIA FAE FAL HAD HDMC
IDA LAC PHE PLA PLL PRNS PYRD SPHG STSD TTRD
Lipidome AA DGLA DHA EA EPA ISTD LA
Table 8: The selected significant genes (genome, epigenome, and transcriptome datasets) and chemical taxonomies (metabolome and lipidome datasets) using the proposed at
Figure 7: The expression patterns of three selected genes (ABL1, GCNT7, ZNF114) across different tissues. RPKM, reads per kilo base per million mapped
Gene GeneHancer GeneHancer Association Total Source Related
ID ID Score Score Score (Total No.): PMID Disease
ABL1 GH09J130833 :  , ; ; Bone marrow; Brain, Endometrium; Lymphy node
AP1G1 GH16J071807 ; ; ; ; Bone marrow; Kidney; Brain, Testis; Thyroid
GCNT7 GH20J056524 : ; Bone marrow; thyroid
PANK1 GH20J056524 : ; ; ; ; Liver; Kidney; Bone marrow; Thyroid
ZNF35 GH03J044648 : ; ; ; ; Testis; Thyroid; Urinary bladder, Endometrium; Brain, Bone marrow
ZNF114 GH07J151024 : ; ; ; ; Thyroid, Brain; Placenta
Table 9: The GeneHancer identifier, GeneHancer score, gene association score, total score, major-related diseases, and PubMed database (PMID) of 6 selected genes
Figure 8: The causal relation of the metabolome dataset using Hill-Climbing approach: (a) low BMD and (b) high BMD.
Figure 9: The causal relation of the lipidomome dataset using Hill-Climbing approach: (a) low BMD and (b) high BMD.

To apply the proposed method, we considered each gene of the genome, epigenome, and transcriptome data and each chemical taxonomy of the metabolome and lipidome data as a single testing unit. To make it feasible, we reduced the dimensionality of the genome, epigenome and transriptome data using four approaches including the kernel based gene shaving approach [4]. Figure 5

presents the Venn-diagram of the t-test, canonical correlation analysis based gene shaving (CCAOut), kernel canonical correlation analysis based gene shaving (KCCAOut) and Linear Models for Microarray based gene shaving (LIMMA) methods for three datasets: (a) genome data (b) epigenome data, and (c) transcriptome data. From this figure, we observed that the number of disjointedly selected genes from CCOut, LCCAOut, KCCAOut and LIMMA methods for the genome data are

, , , ; for the epigenome data are , , , ; and for the transcriptome data are , , , , respectively. All methods selected , and common genes for the genome, epigenome and transcriptome data, respectively. In addition, we have and chemical taxonomies of metabolome and lipidome dataset, respectively. In total, we get ( quinlets to test. Figure 6, shows the plot of for quinlets. The overall test gives us significant quinlets (). The vertical solid, doted and double doted lines correspond to the p-values of , and , respectively. Table 7 presents the number of significant genes and chemical taxonomies using different adjusted p-values of the proposed method. A list of genes (genome, epigenome and transcriptome data) and list of chemical taxonomies (metabolome and lipidome data ) are tabulated in Table 8.

We also extracted the gene-gene interaction networks using STRING: functional protein association networks. Figure S1, (in the Supplementary material) shows the gene-gene networks based on the protein interactions among the selected () genes by the proposed method of the genome, epigenome, and transcriptome datasets. In this figure, the color saturation of the edges represents the confidence score of a functional association. In addition, network analysis demonstrates that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, PPI enrichment p-values are and , respectively. This network analysis confirms that the selected genes have significantly more interactions than expected. It also indicates that the selected genes may function collaboratively.

To confirm the biological roles of the most prominent identified features, we accessed the biomedical and genomic information using the national center for biotechnology information (NCBI, https://www.ncbi.nlm.nih.gov/) tools and the Human gene database (GeneCards: https://www.genecards.org/) [13]. NCBI offers a wide variety of data analysis tools for manipulating, aligning, visualizing, and evaluating biological data. On the other hand, the primary goal of the GeneCards database is the unequivocal identification of enhancer elements and uncovering their connections to genes for understanding gene regulation and the molecular pathways. Recent studies have shown that the bone marrow, kidney, testis, and thyroid diseases have widespread systemic manifestations including their effects on developing osteoporosis. [38, 22, 48, 50]. So, we hypothesized that the selected genes, with the proposed method, would provide more ubiquitous expression (reads per kilo base per million mapped reads (RPKM)) of bone marrow, kidney, testis, and thyroid tissue samples. To that end, we used NCBI’s tools. Figure 7 illustrates the expression patterns of three selected genes (ABL1, GCNT7, ZNF144) across different tissues from 95 human individuals. These results show that these genes have remarkably high RPKM value for bone marrow, kidney, testis, and thyroid tissue samples. These tissue related diseases including their effects may play a role in the development of osteoporosis. To provide insight into the gene regulatory elements (promoters and enhancers) for 6 selected genes (ABL1, AP1G1, GCNT7, PANK1, ZNF35, ZNF144), we used the GeneCards database. Table 9 presents GeneHancer identifier, GeneHancer score, gene association score, total score, major-related diseases, and PubMed database. From this table, we observed that the selected genes have had a remarkable GeneHancer score, gene association score, total score, and literature review in the past studies. According to the disease annotation, the selected genes are highly associated with complex diseases, which are higher risk of developing osteoporosis (the results of all selected genes are in Supplementary material).

To further explore whether that the selected chemical taxonomies of metabolic and lipidomic features will have unique relationships in the low BMD and high BMD groups. We infer a causal relationship with a Hill-Climbing approach among the select chemical taxonomies in each group of both datasets (metabolome and lipidome). The causal relationship of the metabolome and the lipidome datasets: (a) low BMD and (b) high BMD is illustrated in Figure 8 and Figure 9, respectively. In both datasets, we observed that the causal relationships are different in both the low BMD and high BMD for both datasets. For the metabolic data TTRD has an important impact on the low BMD but has an negative impact on high BMD but DA has an important impact on the high BMD. In case of the lipidome data, we observed that the LA has an impact on the low BMD but does not have any impact in the high BMD. By this observation, we concluded that the selected chemical taxonomies may contribute on BMD status but not in general.

Finally, we investigated classification accuracy via only selected features of the proposed method. To classify the low BMD subjects from the high BMD subjects followed by the three classifiers: the KNN and the SVM (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant (from Table 8) effects on the BMD. The classifier is used to classify the low BMD subject from the high BMD subject. Table 10 presents the train classification error with different datasets and the selected features only. These results are also demonstrating that the proposed method is a better tool for feature selection.

Dataset K-NN algorithm Linear kernel Gaussian kernel
Genome
Epigenome
Transcriptome
Metabolic
Lipidomic
Only selected feature
Table 10: The training classification error of discriminating low BMD subjects from high BMD with the K-nearest neighbors (K-NN) algorithm and the support vector machine (linear kernel and Gaussian kernel)

5 Concluding remarks

The technology of biomedical science has accelerated the cycle of discovery. In this we propose a novel generalized kernel machine approach to identify higher-order composite effects in multi-view biomedical datasets. This semi-parametric approach considers multi-view data as prediction variables to allow comprehensive modeling of complex disease trait.

The power of the proposed method is further demonstrated by its application to both synthesized and two real multi-view datasets, i.e., adolescence brain development and osteoporosis study. In simulation studies, we have shown that the false positive rate of the test for higher-order composite effect is mitigated by fixing the nominal p-value threshold along with other state-of- the-art-methods. From the power analysis, our proposed method not only performs better than other methods. The ROC curves also showed the gain of power by the proposed method in all scenarios.

In adolescence brain development experiments, we found five unique genes and 6 ROIs along with the non-genomic factor that may have significant gender effects on adolescence brain development. In addition, by the gene-gene interaction networks, we confirmed that the selected genes have significantly more interactions than expected. This result implies that the gens and their functional interactions form the backbone of the cellular machinery.

In the osteoporosis study, the network analysis confirmed that the selected genes of genome, epigenome, and transcriptome datasets also have significantly more interactions than expected. It also indicates that the selected genes may function in a collaborative effort. We found that they function in a concerted effort and have biological relevance to osteoporosis. Furthermore, using a causal relationship, we conclude that the selected chemical taxonomies (metabolome and lipidome) may contribute to development of BMD status.

The primary focus of our paper is to comprehensively characterize complex disease (e.g., adolescent brain development, or osteoporosis ). Collectively, we can use this approach to derive a statistic for testing the composite effect. These studies explore novel genomic, epigenomic, transcriptomic and chemical mechanisms to identify corresponding factors. The significant triplet, quartet, quintet of extracted features suggest that these effects might highlight biological targets for drug development. Extrapolating these findings enable an advanced understanding of normal cellular processes. Our proposed method can be applied to the study of any disease models, where multi-view data analysis is commonly used.

We acknowledge that the kernel machine is sensitive to contaminated data, even if bounded positive definite kernels are used. Further work on robust kernel methods including M-estimator based methods. They will allow us to uncover the rich, hidden connections in biological system, and may provide a more comprehensive modeling of complex traits.

Acknowledgments

This work was partially supported by grants from National Institutes of Health (NIH) [U19AG05537301, R01AR069055, P20GM109036, R01MH104680, R01AG061917], and the Edward G. Schlieder Endowment and the Drs. W. C. Tsai and P. T. Kung Endowment to Tulane University.

References

  • [1] M. A. Alam. Kernel Choice for Unsupervised Kernel Methods. PhD. Dissertation, The Graduate University for Advanced Studies, Japan, 2014.
  • [2] M. A. Alam, V. Calhoun, and Y. P. Wang.

    Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data.

    Proceedings of 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB),Seattle, WA, USA, pages 210–2198, 2016.
  • [3] M. A. Alam, Hui-Yi Lin, Hong-Wen Dengc, V. Calhoun, and Y. P. Wang. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. Journal of Neuroscience Methods, 309:161–174, 2018.
  • [4] M. A. Alam, Mohammd Shahjaman, Md. Ferdush Rahman, Fokhrul Hossain, and Hong-Wen. Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data. PLOS One, 14(5):e0217027, 2019.
  • [5] Peter F Bladin Michael M Saling Amee D Baird, Sarah J Wilson and David C Reutens. Neurological control of human sexual behaviour: insights from lesion studies. Journal of Neurology, Neurosurgery & Psychiatry, 74(10):1042–1049, 2007.
  • [6] Ricard Argelaguet, Britta Velten, Damien Arnol, and et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Molecular System Biology, 14:e8124, 2018.
  • [7] Matteo Bersanelli, Ettore Mosca, Daniel Remondini, and et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics, 17 (15):167–202, 2016.
  • [8] K. M. Borgwardt and et al. Protein function prediction via graph kernels,. Bioinformatics, 21:i47–i56, 2005.
  • [9] Joana Cabral, Morten L. Kringelbacha, and Gustavo Decoc. Functional connectivity dynamically evolves on multiple time-scales over a static structural connectome: Models and mechanisms. NeuroImage, 160:84–96, 2017.
  • [10] G. Camps-Valls, J. L. Rojo-Alvarex, and M. Martinez-Romon. Kernel Methods in Bioengineering, Signal and Image. Idea Group publishing, London, 2007.
  • [11] Sebastian Canzler, Jana Schor1, Wibke Busch, Kristin Schubert, and et al. Prospects and challenges of multi‑omics data integration in toxicology. Archives of Toxicology, 94:391–388, 2020.
  • [12] Gustavo Deco, Giulio Tononi, Melanie Boly, and Morten L. Kringelbach. Rethinking segregation and integration: contributions of whole-brain modelling. Nature Reviews Neuroscience, 2015.
  • [13] Simon Fishilevich, Ron Nudel, and et al. Genehancer: genome-wide integration of enhancers and target genes in genecards. Database, 2017:1–17, 2017.
  • [14] T. Ge, T. E. Nichols, D. Ghoshd, E. C. Morminoe, am M. R. Sabuncu J. W.Smoller, and the Alzheimer’s Disease Neuroimaging Initiative. A kernel machine method for detecting effects of interaction between multidimensional variable sets: An imaging genetics application. NeuroImage, 109:505–514, 2015.
  • [15] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 20:585–592, 2008.
  • [16] Enrique C.A. Hansen, Demian Battaglia, Andreas Spiegler, Gustavo Deco, and Viktor K. Jirsa. Functional connectivity dynamics: Modeling the switching behavior of the resting state. NeuroImage, 105:525–535, 2015.
  • [17] Y. Hasin, M. Seldin, and A. Lusis. Multi-omics approaches to disease. Genome Biol, 18 (83):1:15, 2017.
  • [18] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2009.
  • [19] T. Hofmann, B. Schölkopf, and J. A. Smola. Kernel methods in machine learning. The Annals of Statistics, 36:1171–1220, 2008.
  • [20] S. Huang, K. Chaudhary, and L. X. Garmire. More is better: Recent progress in multi-omics data integration methods. Frontiers in Genetics, 8:Article 8, 2017.
  • [21] I. Ionita-Laza, S. Lee, V. Makarov, J. D. Buxbaum, and X. Lin. Sequence kernel association tests for the combined effect of rare and common variants. American Journal of Human Genetics, 92:841–853, 2013.
  • [22] Pascale Khairallah and Thomas L. Nickolas. Management of osteoporosis in ckd. Clinical Journal of the American Society of Nephrology, 13(6):962–969, 2018.
  • [23] Minseung Kim and Ilias Tagkopoulos. Data integration and predictive modeling methods for multi-omics datasets. Molecular Omics, 14:8–25, 2018.
  • [24] S. Y. Kung. Kernel Methods and Machine Learning. Cambridge University Press, New York, 2014.
  • [25] X. Lin D. Ghosh M. P. Epstein L. C. Kwee, D. Liu. A powerful and flexible multilocus association test for quantitative traits. Annals of Human Genetics, 82(2):386–397, 2008.
  • [26] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004.
  • [27] S. Li and Y. Cui. Gene-centric gene-gene interaction: a model-based kernel machine method. The Annals of Applied Statistics, 6(3):1134–1161, 2012.
  • [28] D. Liu, X. Lin, and D. Ghosh.

    Semiparametric regression of multidimensional genetics pathway data: least squares kernel machines and linear mixed model,.

    Biometrics, 630(4):1079–1088, 2007.
  • [29] Feiyang Ma, Brie K. Fuqua adn Yehudit Hasin, and et al. A comparison between whole transcript and 3’ rna sequencing methods using kapa and lexogen library preparation methods. BMC Genomics, 20 (9):1–12, 2019.
  • [30] Tianle Ma and Aidong Zhang.

    Integrate multi-omics data with biological interaction networks using multi-view factorization autoencoder (MAE).

    BMC Genomics, 20 (11): 994::1–11, 2019.
  • [31] A. C. A. Nascimento, R. B. C. Prudêncio, and Ivan G. Costa. A multiple kernel learning algorithm for drug-target interaction prediction. BMC Genomics, 17:46:1–16, 2016.
  • [32] Chuan Qiu, Fangtang Yu, Kuanjui Su, Lan Zhang, and et al. Multi-omics data integration for identifying osteoporosis biomarkers and their biological interaction and causal mechanisms. iScience, 23(2):100847, 2020.
  • [33] W. B Bilker R. C. Gur, F. Gunning-Dixon and et al. Sex differences in temporo-limbic and frontal brain volumes of healthy adults. Cereb Cortex, 12(9):998–1003, 2002.
  • [34] N. Rappoport and R. Shamir. Nulti-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46 (20):10546–10562, 2018.
  • [35] Nicholas J. W. Rattray, Nicole C Deziel, Joshua D. Wallach, and et al. Beyond genomics: understanding exposotypes through metabolomics. Human Genomics, 12 (4):1–14, 2018.
  • [36] Marylyn D. Ritchie, Ruowang Li Emily R. Holzinger and, Sarah A. Pendergrass1, and Dokyoon Kim. Methods of integrating data to uncover genotype–phenotype interactions. Nature Review: Genetics, 16:88– 97, 2015.
  • [37] B. D. Moor S. Yu, L-C. Tranchevent and Y. Moreau. Kernel-based Data Fusion for Machine Learning. Springer, Verlag Berlin Heidelberg, 2011.
  • [38] A. Sanghani-Kerai, L. Osagie-Clouard, G. Blunn, and M. Coathup. The influence of age and osteoporosis on bone marrow stem cells from rats. Bone & Joint Research, 7(4):289–297, 2018.
  • [39] Anita Sathyanarayanan, Erik W Thompson Rohit Gupta, and et al. A comparative study of multi-omics integration tools for cancer driver gene identification and tumour subtyping. Briefings in Bioinformatics, bbz121:e8124, 2019.
  • [40] F. E. Satterthwaite. An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6):110–114, 1946.
  • [41] Andreas Schmidt, Ignasi Forne, and Axel Imhof. Bioinformatic analysis of proteomics data. BMC Systems Biology, 8 (53):1–7, 2014.
  • [42] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge MA, 2002.
  • [43] T Sölzen and et al. An overview and management of osteoporosis. European Journal of Rheumatology, 4:46–56, 2017.
  • [44] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012.
  • [45] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf.

    Kernel choice and classifiabilit forRKHS embeddings of probability distributions.

    Advances in Neural Information Processing Systems, 21:1750–1758, 2009.
  • [46] Jing Sui, Rongtao Jiang, Juan Bustillo, and Vince Calhoun. Neuroimaging-based individualized prediction of cognition and behavior for mental disorders and health: Methods and promises. Biological Psychiatry, BPS 14136:Articles in Press, 2020.
  • [47] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, h A. Rot, A. Santos, M. Kuhn K. P. Tsafou, and, P. Bork, L. J. Jensen, and C. von Mering. STRING v11.0: Protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43:531–543, 2007.
  • [48] Dominika Tuchendler and Marek Bolanowski. The influence of thyroid dysfunction on bone metabolism. Tuchendler and Bolanowski Thyroid Research, 7(12):1–5, 2014.
  • [49] Birkan Tunc, Berkan Solmaz, and et al. Establishing a link between sex-related differences in the structural connectome and behaviour. Philosophical Transactions Society B, 371:20150111, 2015.
  • [50] P. M. Willemse, N. A. T. Hamdy, and et al. Changes in bone mineral density in newly diagnosed testicular cancer patients after anticancer treatment. The Journal of Clinical Endocrinology & Metabolism, 99(11):4101–4108, 2014.
  • [51] M. C. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). American Journal of Human Genetics, 89:82–93, 2011.
  • [52] K. K. Yan, H. Zhao, and H. Pang. A comparison of graph- and kernel-based – omics data integration algorithms for classifying complex traits. BMC Bioinformatics, 18:539:1–13, 2017.
  • [53] Kui Yang and Xianlin Han. Lipidomics: Techniques, applications, and outcomes related to biomedical sciences. Trends in Biochemical Sciences, 41 (11):954–969, 2016.
  • [54] Fangtang Yu, Chuan Qiu, Chao Xu, , and et al. Mendelian randomization identifies cpg methylation sites with mediation effects for genetic influences on bmd in peripheral blood monocytes. frontiers in Genetics, 11(60):1–14, 2020.
  • [55] Irene Sui Lan Zeng and Thomas Lumley. Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinformatics and Biology Insights, 12:1–16, 2020.
  • [56] Ni Zhao, Jun Chen, Ian M. Carroll, and et al. Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. The American Society of Human Genetics, 96:979–807, 2015.
  • [57] Wei Zheng, Hongfei Lin, and Zhehuan Zhao. A graph kernel based on context vectors for extracting drug–drug interactions. Journal of Biomedical Informatics, 61:34–43, 2016.