1 Introduction
Modern technology allows us to overcome the challenges of assembling a multiview dataset with a powerful combination of flexibility and low cost. Human complex diseases are typically expressed by the aberrant interplay of multiple biological sources and environmental factors. While single biomedical data view such as genome (whole genome sequencing), transcriptome (RNA sequencing), epigenome (e.g., DNA methylation), proteome, metabolome, lipidome, and medical images offers an essential framework for detecting biological variants contributing to complex diseases, an indepth understanding of biological mechanisms by examining only single data may not suffice. Most biomedical data analysis methods are applicable to single or dualviews of datasets, without making full use of multiview datasets. In addition, multiview biomedical datasets are often complementary to each other and correspond to different feature spaces [11, 30, 23, 34, 17]. Thus, their comprehensive and systematic analysis has the potential to uncover unknown biological mechanisms of complex diseases (Figure 1). With the advent of recent technologies that make assembling of multiview biomedical datasets (e.g., multiomics and imaging scans) of a complex diseases possible, development of efficient analytic frameworks for such datasets is becoming an active area of scientific inquiry [29, 35, 20, 53, 41]
. For an effective diagnosis of a complex disease, different types of medical imaging techniques, including computer tomography (CT), ultrasound, functional magnetic resonance imaging (fMRI), structural MRI (sMRI), positron emission tomography (PET) scans, diffusion tensor imaging (DTI), and XRay are used
[46, 3, 9, 16, 12].The high demand for characterizing complex diseases in biomedical science and the available technologies justify the need for more effective biomedical data integration approaches. Linear data integration approaches including supervised, unsupervised, semisupervised learning, multiomics factor analysis, and highlight matrix factorization methods are used to gain a deeper, more holistic understanding of the biological mechanism for complex diseases
[39, 30, 6, 36, 18]. These are widely used and validated approaches but they perform poorly when data structure is nonlinear and data comes from a multimodal distribution [19]. As a consequence, nonlinear integrated approaches (e.g., kernel based machine) are an important feature in comprehensive analysis of multiview biomedical datasets [31, 8, 26]. The positive definite kernel based machine approach can overcome the nonlinearity problem as well as that of dimensionality of multiview biomedical datasets [52, 57, 1, 2, 42]. This approach can be useful in effective integration of multiview biomedical data.In this study, we introduce a generalized kernel machine approach to identify higherorder composite effects (GKMAHCE) in multiview datasets. While several kernel machine approaches have been employed to identify marginal, interaction effects of datasets with continuous data, the underlying challenge remains of how to make a generalized kernel machine based model to estimate marginal, interaction, and composite effects of multiview dataset with categorical data, which is often the case in complex diseases
[55, 7, 56, 21, 10]. This proposed approach considers multiview data as predictor variables to allow more thorough and comprehensive modelling of complex traits.Here, we compared the performance of our proposed method with that of existing methods using synthesized datasets and real multiview adolescence brain development and osteoporosis datasets, with brain imaging and five omics data. To realize adolescence brain development, sex differences in human brain are essential to understand their anatomical foundations in the brain with three view datasets: genomic, nongenomic and fMRI datasets [40]. Osteoporosis is a bone disorder, which results from a loss of bone mineral density (BMD) [43]. For more information of these data, we refer the reader to Section . Within each dataset, this study explores novel genome, epigenome, transcriptome, and chemical mechanisms, as well as robustly and efficiently identifies corresponding factors. To validate the results, we performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, the gene ontology analysis of biological process category and genegene interaction networks analysis. We also draw causal connections, gene gene associations, and genechemical associations for both of the datasets. This paper is highly novel in the following aspects:

To the best of our knowledge, we address a generalized kernel machine approach to identify higherorder composite effects.

For statistical testing of the marginal, the interaction, and the composite effects, we derive the test statistics.

To show that the proposed approach is highly efficient, we experiment with synthesized datasets and threeview of brain development in adolescence and fiveview of osteoporosis datasets. As far as we know, this is the first five omics integration study.

Finally, we perform a comprehensive pathway and networkbased analysis for the biological validation of the selected genes.
The rest of this paper is organized as follows. In Section 2, we discuss some contemporary related work. In Section , we explain the proposed approach, generalized kernel machine approach to identify higherorder composite effects in multiview datasets and test statistics for statistical testing for higher order composite effects.. In Section , we describe the experiments conducted on both synthesized datasets and two real datasets: imaging and genetics for adolescence brain development and multiomics datasets of osteoporosis studies. We conclude the paper with a discussion on major findings and future research in Section . Detailed derivation of the proposed method along with Satterthwaite approximation to the score test and the applications to real dataset can be found in the supplementary materials.
2 Related work
Kernel based multiview data integration approaches allow the joint analysis of multiple data types to provide a global view of the biological functions and offer insights into the nature of the interactions between the different datasets. These methods offer useful ways to learn how an extensive collection of genetic variants are associated with complex phenotypes and can be used to explore the relationship between genetic markers and a disease state [10, 37, 24].
The marginal, interaction, and composite effect identifications are becoming a common challenge to multidimensional imaging and multiomics data analysis. Preliminary works in kernel machine methods have been boldly pursued by [28], where a single modal dataset was used to test for a genetic pathway effect. [27] also proposed a kernel machinebased method for genegene interactions. They treated each gene as a testing unit for genegene interactions. A kernel machine method was then proposed for detecting multiple factor interactions where a smoothing splineANOVA decomposition method was adopted [27]. However, these approaches only use single or pairwise datasets. A sequence kernel association tests (SKAT) for genome wide association studies is introduced in [21]. While SKAAT is widely used in genomics studies, it is limited to only a single omics. In addition, [14] have proposed a kernel machine method for detecting the effects of interactions between multidimensional variable sets. This is an extended model of [27] to jointly model genetic and nongenetic features and their interactions. A microbiome regression based kernel association test (MiRKAT) has been proposed by [56], which directly regresses the outcome on a single omics dataset, microbiome profiles, through the semiparametric kernel machine regression framework.
Recently, [3] has proposed a kernel machine method for detecting higherorder interactions in three views datasets which was applied to schizophrenia with a continuous trait . Although a generalized version of such methods for more than three datasets as well as both categorical and continuous traits remains elusive, such a method for multiview biomedical data analysis is necessary to discover new information about biological systems and complex diseases.
3 Approach
Suppose we have independent identical distributed (IID) subjects with covariates and mview datasets, . Assuming that follows a distribution in the exponential family with density
(1) 
where and are the location and scale parameters, and are known functions, and
is know weight, respectively. The mean and variance of
satisfy and . In the following generalized semiparametric model, we associate the output with covariates including intercept and view datasets:(2) 
where is a known monotone link function, is a vector of covariates including the intercept for the th subject, is a vector of fixed effects, is an unknown function on the product domain, with .
Error family  Link  Inverse of link  Use 

Gaussian  Identity, g(y)=y  Normally distributed set of data,  
Poisson  Log, g(y)=log(y)  Exponential, exp(y)  equidispersed count set of data 
Binomial  Logit,  Binary set of data, 0/1  
Gamma  Inverse gamma  Positive continuous set of data, 
According to the ANOVA decomposition, the function, can be extended as:
(3) 
where, ’s are the main effects of the respective dataset (), ’s are pairwise interactions effects, ’s are the interactions effects of three datasets and so on. The functional space, RKHS, decomposes as:
(4) 
equipped with an inner product, and a norm
This generalizes kernel regression by allowing the model to be related to the response variable via a link function including kernel regression, logistics kernel regression, and Poisson kernel regression. For the binary data, the link function
provides the logistic kernel regression; for the count data, provides the Poisson kernel regression; for Gaussian data, gives classical kernel regression. The main aspect of this paper is applying this to five views: genome, epigenome, transcriptome, metabolome, and lipidome along with Low BMD and High BMD information of the subject. To do this, assume that we have IID subjects under investigation; is a binary (LBMD or HBMD) phenotype for the th subject. We associate the clinical covariates (e.g., age, weight, height) with five views including genome, epigenome, transcriptome, metabolome, and lipidome.Let denoted the covariates, where is a measure of the th subject.
Let , , , , and , are a gene with SNP markers, a gene with methylation profiles, a gene with RNASeq, a chemical taxonomy class with metabolite profiles, and a chemical class with lipid profiles of the th subject, respectively. Under this setting, Eq. (2), Eq. (3) and Eq. (4) becomes:
(5) 
where .
3.1 Model estimation
We can estimate the function
by minimizing the penalized loss function of Eq. (
5) as:(6) 
where L is a loss function, is a roughness penalty with tuning parameter
. By considering the negative loglikelihood of the binomial distribution,
the problem becomes as a kernel logistic regression problem:
(7) 
It is known that the complete function space of Eq. (3), , has the orthogonal decomposition. Hence the function can be decomposed accordingly (see the supplementary material for details).
It is convenient and mathematically friendly to minimize numerically. Let
be the probability of the
observation andFollowing many in the literature, we have a liner mixed effects model for such that
(8)  
where is a coefficient vector of fixed effects, , , , , , and are independent random effects with normal distributions. The relationship of Eq. (6) and Eq. (8) confirm that all of the effects obtained by optimizing the loss function in Eq. (6), are the equal to the best linear unbiased predictors (BLUPs) of the linear mixed effects model in Eq. (8). This relationship makes a accessibility for testing the variance component instead of testing the nonparametric function in hull hypothesis, which we can estimate using the restricted maximum likelihood (ReML) approach (see the supplementary material for details).
3.2 Statistical testing
In this section, we address the test statistic of the overall effect, marginal effects, different interaction effects, and different composite effects.
3.2.1 Overall hypothesis testing
According to our model, the testing overall effect is
is equivalent to test the variance components in Eq.(8),
It is known that kernel matrices are not blockdiagonal and the parameter in variance component analysis are placed on the edge of the parameter space when the null hypothesis is true. While the asymptotic distribution of a likelihood ratio test (LRT) statistic is neither a chisquare distribution nor a mixture chisquare distribution under the null hypothesis, we can use a score test statistic on the restricted likelihood
[3, 28]. We derive score test statistic for a generalized kernel machine approach in multiview datasets, Eq. (8). Assuming that , where . We can write the restricted loglikelihood function of Eq. (8) as follows(9)  
Using the partial derivative of Eq. (9) with respect to each variance component. We derive the estimate of the variance components and the score test statistic which is defined as
(10) 
where and is the maximum likelihood estimator (MLE) of the logistic regression coefficient under the null model , is the variance of . is quadratic function of the variable , which follows a weighted mixture of the chisquare distribution under the null hypothesis. By the Satterthwaite method, we can approximate the distribution of to a scaled chisquare distribution, (). To estimate the the scale parameter
and the degrees of freedom
we can we can use the method of moments on the mean (
) and variance () of the test statistic and we obtain and . Finally, to get the value of an experimental score statistic we use the scaled chisquare distribution .3.2.2 Testing marginal effects
3.2.3 Testing interaction effect
By considering no marginal effects, i.e., , we can also test the interaction effects. We know that the testing interaction effect
is equivalent to test the variance components in Eq.(8),
The test statistic is
Also by considering no marginal effects, i.e., , and all second order interactions are zero we can also test the 3rd order interaction effects.
The 3rd order test statistic is
and so on.
3.2.4 Testing composite effects
If lower order effects are statistically significant, we want to test higher order effects is called composite hypothesis testing. To test the 3rd order interaction effect, we mansion that testing the null hypothesis is equivalent to testing the variance component: . Let , and all , and are model parameters under the null model.We formulate a test statistic
(11) 
where , and is the projection matrix under the null hypothesis.
Similar to the overall effect test, we can use the Satterthwaite method to approximated the distribution of higher order intersection test statistic by a scaled chisquare distribution with scaled and degree of freedom i.e., . The scaled parameter and degree of freedom are estimated by MOM, and , receptively. In practices, the unknown model parameters are estimated by their respective ReML estimates under the null hypothesis. Lastly, the value of an observed higher order interaction effect test score statistic is obtained using the scaled chisquare distribution . and so on.
3.2.5 Kernel machine based model selection
The result of kernelbased machine learning approaches is highly influenced by the kernel and its parameters. The model selection with a suitable kernel is essential in machine learning. As a result, machine learning approaches suffer from weak selection of a suitable model. Suppose
is a positive definite kernel. A linear positive definite kernel, , where , on can define the similarity measure in the Euclidean space. This liner kernel suffers from more complexity in the function space for highdimensional datasets. Instead of linear kernel, we can use a polynomial kernel to make it possible to use higher order correlations among data points. A polynomial kernel, has two free parameters: c is a tradeoff between higher order and lower order in the polynomial and d is the degree of the polynomial. The dependency measure depends on the finite bounded degree. Another limitation of linear and polynomial kernel is boundedness, i.e., both kernels are unbounded.From a finite dimensional space the Gaussian kernel can map the input space (data space) into a infinite dimensional space [42] and is defined as:
It has a free parameter to select but consist of numerous theoretical properties including boundedness, consistence, universality, robustness, and so on [45]. To fix the free parameter, we can use the median of the pairwise distance[15, 44].
For a multiomics study, to captures the pairwise similarity across a number of SNPs in each gene we can use a identitybystate (IBS) kernel (nonparametric function of the genotypes) [25]:
where is the number of SNP markers of the corresponding gene. It is assumption free on the type of genetic interactions. Hence, it can capture any genetic effects on the phenotype. In this paper, we used the IBS for the genetic dataset and for all other datasets, respectively.
4 Experiments
We conducted experiments on both the Monte Carlo simulation and two real multiview datasets: an imaging and genome datasets from the adolescence brain development and five omics datasets from osteoporosis studies, respectively. The goal is to select a feature based on significant composite effects among the datasets. We considered the Gaussian kernel (the median of the pairwise distance as the bandwidth) For all datasets but the genetic dataset. For the genetic dataset, we considered the IBS kernel [15, 3]. The optimization of the proposed approach is based on Fisher’s scoring algorithm (the ReML algorithm). To that end, we follow the parameters setting as in [3] for the ReML algorithm applicable. To overcome the potential of a local minima, we considered a set of initial points in (0, 1) and picked the point which maximized the ReML algorithm.
We compare our simulation and real data results with three other methods: principal component SKAT, partial principal component regression (pPCAR), and full principal component regression (fPCAR)) as logistic regression setting. The SKAT approach is a flexible and computational in GWAS [51, 21]
. To identify an interaction effect of two genes, a principal component analysis based approach has been proposed by
[27]. [3] has been considered each method as a simple regression setting. Here, we switched each method as a logistic regression setting.4.1 Simulation studies
To evaluate the performance of the proposed method, we do simulation studies. To that end, we use the following model to generate synthesized qualitative phenotype:
(12) 
where is a vector of covariates (height and weight) with an intercept of th subject () and is a coefficient vector, the random error,
, follows the Gaussian distribution with zero mean and unit variance and
is the standard deviation of the error and is fixed to
. Here ’s represent five different datasets. Each function ’s is defined as in different nonlinear form.In this setting, we simulated data under different values of parameters to evaluate the performance of the test. For example, means all effects have vanished and we can examine the false positive rate for the score test under the overall effect. Similarly, we take into account the main effects (2nd order interaction effects) but no higher order interaction effects to evaluate the power of the score test. To get consist results, we performed simulations for each setting.
The simulation results of different methods can be found in Table 2. In this table, we report the power of that higherorder composite score test in different parameter settings. Overall, we made two observations. First, we noticed that the false positive rate of the test for higherorder composite effects score test was controlled by fixing the nominal value threshold to . All four methods have the same observations for the false positive rate. Second, when considering the power analysis (), we observed that the proposed method performs better than other methods and its power exceeds (Table 2). On the other hand we observed that the stateoftheart methods (pPCAR, fPCAR and SKAT) can significantly overstate the false positive rates and result in significant loss of statistical power.
We also plot several visualization of the ROC with all interrelated parameters () values are random but the linear parameter is fixed to . We allocate each number with probability
. A random number is uniformly distributed either in a range or at
. In Figure 2, we look into the results from the receiver operating characteristic (ROC) with three sample sizes, . The sensitivity is plotted against (1 specificity) for each value in the range of with a step size . The power gain of the proposed method relative to the alternative ones is evident in all situations.Parameters  Simulation  I  

GKMAHCE  Stateoftheart methods  
pPCAR  fPCAR  SKAT  
(, , , , )  
(0.1, 0, 0, 0, 0)  
(0, 0, 0, 0, 0.1)  
(0, 0, 0, 0, 0.5)  
(0, 0, 0, 0, 1)  
(1, 1, 1, 1, 0.1)  
(1, 1, 1, 1, 0.5)  
(1, 1, 1, 1, 1) 
4.2 Real data analysis
We apply the proposed method to two real multiview datasets including an fMRI imaging dataset and five omics datasets from the adolescence brain development and osteoporosis studies, respectively.
4.2.1 Application to adolescence brain development
The Philadelphia Neurodevelopmental Cohort (PNC) is a research initiative that focuses on characterizing brain and behavior interaction with genetics. Sex differences in human brain are essential to understand their anatomical foundations in the brain [40, 49]. Our goal here is to select features using the test composite effect on the sex differences in cognition in youth. To that end we consider three view datasets: genomic, nongenomic and fMRI imaging datasets. Table 3 presents the configuration of the three datasets along with phenotype. The gene consists of genetic features (SNPs); the nongenomic data consists of clinical measurements; and the ROI consists of imaging features (voxels). We consider each gene, all nongenomic data and each ROI as a unique testing unit. To apply the proposed method, first we selected most significant genes out of genes using the SKAT approach and ROIs out of (since ROIs contain missing values). In total, we get ( triplets to test. The overall test gives us significant triplets () out of . In Figure 3, we visualize the plot of for triplets. The vertical solid, doted and double doted lines correspond to the pvalues of , and , respectively. At pvalues of and we detected and significant genes, as well as and significant ROIs, respectively.
Genomic  Nongenomic  Imaging  
SNP  Clinical measurement  Voxel  
95639  116  921429  
Feature  Gene  ROI  
13150  264  
Subject  863  8719  897 
common  798  
Covariate  Age  
Phenotype  Gender  
421 male and 377 female 
Table 4 presents the ReML estimates of all parameters, , , , , , , , and the values for both the proposed and SKAT methods for each of the triplets. By the proposed method, these triplets were identified to have significant interactions at a level of . At this value, we observe unique genes (RPA3OS, ESR1, WWOX, RGS7, PALLD) and ROIs (CFL.L: Left cerebrum frontal lobe; CSL.R: Right cerebrum sub lobar; CLL.R: Right cerebrum limbic lobe; CSS.R: Right cerebrum cublobar; CFL.R: Right cerebrum frontal lobe), that may have significant effects of sex on adolescence brain development [5, 33].
To test whether genes interact with each other, we constructed a genegene interaction network analysis using STRING which imports gene/protein association knowledge from databases on both physical interactions and curated biological pathways [47]. In STRING (e.g. STRING 11.0), the simple interaction unit is the functional relationship between two genes. This relationship can play a role to a common biological purpose. Figure 4 illustrates the genegene network based on the protein interactions with the selected genes ( including 7 individual selected genes). In this figure, the color saturation of the edges represents the confidence score of a functional association. Further network analysis shows that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, values are , , , , for , respectively. This analysis confirmed that the selected genes had significantly more interactions than expected and indicates that these genes may function in a concerted manner.
GKMAHCE  SKAT  

Genetics  Imaging  OVA  HOI  HOI  
CM  
For investigating classification accuracy of the proposed method, the genome, nongenome, fMRI imaging datasets and only selected features via the proposed method are used to classify the sexrelated brain differences followed by the three classifiers: the Knearest neighbors algorithm (KNN) and the support vector machine (SVM) (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant effects on the gender:
genes and ROIs only (from Table 4). The classifier is used to distinguish the adolescence brain development between the female and the male. Table 5presents the train classification error with different datasets and the selected features only. These results demonstrated that the proposed method is a better tool for feature selection.
Dataset  KNN algorithm  Linear kernel  Gaussian kernel 

Genome  23.06  
Imaging  
Nongenome  
Only selected features  21.05  0.07 
4.2.2 Application to osteoporosis study
Osteoporosis is a bone disorder which is largely attributable to the increased bone resorption and/or decreased bone formation by osteoclasts and osteoblasts, respectively. We apply the proposed method to a multiomics datasets from an osteoporosis study [54, 32, 4]. This dataset included genome, epigenome, transcriptome, metabolome, and the lipidome data from Caucasian females with high BMD and with low BMD. Table 6 presents the configuration of the five datasets along with phenotype. The original genomic data consists of SNPs which annotated to 25, 442 genes. The epigenome data consists of methylations which annotated to genes. This methylation analysis focus on gene level/sliding windows. The transcriptome data has genes expression profiles. The metabolome data contains of chemical taxonomies including metabolic profiles. The lipidome data covers chemical taxonomies including lipid profiles.
Genome  Epigenome  Transcriptome  Metabolome  Lipidome  
SNP  Methylation  RNAsequencing  profiling  profiling  
3997535  46690  291  56  
Feature  Gene  CT class  
25442  4676  22682  27  7  
Subject  129  128  128  136  136 
common  108  
Covariate  Age, Height and Weight  
Phenotype  Bone mineral density (BMD)  
Low BMD and High BMD 
values  Gene  Chemical taxonomy  
Genome  Epigenome  Transcriptome  Metabolome  Lipidome  
Genome  ABCB8  ACTBP13  AP003122  DOCK10  EXOSC3P2  FGF5  GPLD1  LINC00461  CHGB  NIPA2P5  NIPAL1 

PANK1  RPL13AP3  RPLP2  SCRN2  TRIM65  ZNF114  ZNF35  
Epigenome  A1CF  ALLC  AP1G1  ATP2B2  BACH2  CA13  CCDC111  DNER  DYM  FASTKD2  GCNT7 
HCST  INPP5F  KIAA1305  KNDC1  NGTN  LGTN  LMNB1  LOC283871  MAPKAPK2  MAST4  MDC1  
NOC2L  NRL  SLC44A3  TRAIP  UBE2J2  VPS26A  ZNF700  
Transcriptome  ABI1  ACER3  ADCY3  ALOX15B  AUNIP  BSND  BTNL3  CDKN1A  C10orf105  CROCC  CSMD3 
CTSLP2  GALT  HEXDC  HLADRA  HM13  HMBOX1  IGHMBP2  KLRC1  LRRC8C  MAGOH  MCTS2P  
NEURL4  NIN  NPHS1  PRPF38B  RALBP1  RNF17  SERPINB6  SLC38A9  SOX1  TECTA  TGFBR3L  
USP17L17  USP53  
Metabolome  Azoles  BENZ  BSD  CAD  CBS  CXAD  DIA  FAE  FAL  HAD  HDMC 
IDA  LAC  PHE  PLA  PLL  PRNS  PYRD  SPHG  STSD  TTRD  
Lipidome  AA  DGLA  DHA  EA  EPA  ISTD  LA 
Gene  GeneHancer  GeneHancer  Association  Total  Source  Related 
ID  ID  Score  Score  Score  (Total No.): PMID  Disease 
ABL1  GH09J130833  : , ; ;  Bone marrow; Brain, Endometrium; Lymphy node  
AP1G1  GH16J071807  ; ; ; ;  Bone marrow; Kidney; Brain, Testis; Thyroid  
GCNT7  GH20J056524  : ;  Bone marrow; thyroid  
PANK1  GH20J056524  : ; ; ; ;  Liver; Kidney; Bone marrow; Thyroid  
ZNF35  GH03J044648  : ; ; ; ;  Testis; Thyroid; Urinary bladder, Endometrium; Brain, Bone marrow  
ZNF114  GH07J151024  : ; ; ; ;  Thyroid, Brain; Placenta 
To apply the proposed method, we considered each gene of the genome, epigenome, and transcriptome data and each chemical taxonomy of the metabolome and lipidome data as a single testing unit. To make it feasible, we reduced the dimensionality of the genome, epigenome and transriptome data using four approaches including the kernel based gene shaving approach [4]. Figure 5
presents the Venndiagram of the ttest, canonical correlation analysis based gene shaving (CCAOut), kernel canonical correlation analysis based gene shaving (KCCAOut) and Linear Models for Microarray based gene shaving (LIMMA) methods for three datasets: (a) genome data (b) epigenome data, and (c) transcriptome data. From this figure, we observed that the number of disjointedly selected genes from CCOut, LCCAOut, KCCAOut and LIMMA methods for the genome data are
, , , ; for the epigenome data are , , , ; and for the transcriptome data are , , , , respectively. All methods selected , and common genes for the genome, epigenome and transcriptome data, respectively. In addition, we have and chemical taxonomies of metabolome and lipidome dataset, respectively. In total, we get ( quinlets to test. Figure 6, shows the plot of for quinlets. The overall test gives us significant quinlets (). The vertical solid, doted and double doted lines correspond to the pvalues of , and , respectively. Table 7 presents the number of significant genes and chemical taxonomies using different adjusted pvalues of the proposed method. A list of genes (genome, epigenome and transcriptome data) and list of chemical taxonomies (metabolome and lipidome data ) are tabulated in Table 8.We also extracted the genegene interaction networks using STRING: functional protein association networks. Figure S1, (in the Supplementary material) shows the genegene networks based on the protein interactions among the selected () genes by the proposed method of the genome, epigenome, and transcriptome datasets. In this figure, the color saturation of the edges represents the confidence score of a functional association. In addition, network analysis demonstrates that the number of nodes, number of edges, expected number of edges, average node degree, clustering coefficient, PPI enrichment pvalues are and , respectively. This network analysis confirms that the selected genes have significantly more interactions than expected. It also indicates that the selected genes may function collaboratively.
To confirm the biological roles of the most prominent identified features, we accessed the biomedical and genomic information using the national center for biotechnology information (NCBI, https://www.ncbi.nlm.nih.gov/) tools and the Human gene database (GeneCards: https://www.genecards.org/) [13]. NCBI offers a wide variety of data analysis tools for manipulating, aligning, visualizing, and evaluating biological data. On the other hand, the primary goal of the GeneCards database is the unequivocal identification of enhancer elements and uncovering their connections to genes for understanding gene regulation and the molecular pathways. Recent studies have shown that the bone marrow, kidney, testis, and thyroid diseases have widespread systemic manifestations including their effects on developing osteoporosis. [38, 22, 48, 50]. So, we hypothesized that the selected genes, with the proposed method, would provide more ubiquitous expression (reads per kilo base per million mapped reads (RPKM)) of bone marrow, kidney, testis, and thyroid tissue samples. To that end, we used NCBI’s tools. Figure 7 illustrates the expression patterns of three selected genes (ABL1, GCNT7, ZNF144) across different tissues from 95 human individuals. These results show that these genes have remarkably high RPKM value for bone marrow, kidney, testis, and thyroid tissue samples. These tissue related diseases including their effects may play a role in the development of osteoporosis. To provide insight into the gene regulatory elements (promoters and enhancers) for 6 selected genes (ABL1, AP1G1, GCNT7, PANK1, ZNF35, ZNF144), we used the GeneCards database. Table 9 presents GeneHancer identifier, GeneHancer score, gene association score, total score, majorrelated diseases, and PubMed database. From this table, we observed that the selected genes have had a remarkable GeneHancer score, gene association score, total score, and literature review in the past studies. According to the disease annotation, the selected genes are highly associated with complex diseases, which are higher risk of developing osteoporosis (the results of all selected genes are in Supplementary material).
To further explore whether that the selected chemical taxonomies of metabolic and lipidomic features will have unique relationships in the low BMD and high BMD groups. We infer a causal relationship with a HillClimbing approach among the select chemical taxonomies in each group of both datasets (metabolome and lipidome). The causal relationship of the metabolome and the lipidome datasets: (a) low BMD and (b) high BMD is illustrated in Figure 8 and Figure 9, respectively. In both datasets, we observed that the causal relationships are different in both the low BMD and high BMD for both datasets. For the metabolic data TTRD has an important impact on the low BMD but has an negative impact on high BMD but DA has an important impact on the high BMD. In case of the lipidome data, we observed that the LA has an impact on the low BMD but does not have any impact in the high BMD. By this observation, we concluded that the selected chemical taxonomies may contribute on BMD status but not in general.
Finally, we investigated classification accuracy via only selected features of the proposed method. To classify the low BMD subjects from the high BMD subjects followed by the three classifiers: the KNN and the SVM (the linear kernel and the Gaussian kernel). For the proposed approach, we considered triplets that have significant (from Table 8) effects on the BMD. The classifier is used to classify the low BMD subject from the high BMD subject. Table 10 presents the train classification error with different datasets and the selected features only. These results are also demonstrating that the proposed method is a better tool for feature selection.
Dataset  KNN algorithm  Linear kernel  Gaussian kernel 

Genome  
Epigenome  
Transcriptome  
Metabolic  
Lipidomic  
Only selected feature 
5 Concluding remarks
The technology of biomedical science has accelerated the cycle of discovery. In this we propose a novel generalized kernel machine approach to identify higherorder composite effects in multiview biomedical datasets. This semiparametric approach considers multiview data as prediction variables to allow comprehensive modeling of complex disease trait.
The power of the proposed method is further demonstrated by its application to both synthesized and two real multiview datasets, i.e., adolescence brain development and osteoporosis study. In simulation studies, we have shown that the false positive rate of the test for higherorder composite effect is mitigated by fixing the nominal pvalue threshold along with other stateof theartmethods. From the power analysis, our proposed method not only performs better than other methods. The ROC curves also showed the gain of power by the proposed method in all scenarios.
In adolescence brain development experiments, we found five unique genes and 6 ROIs along with the nongenomic factor that may have significant gender effects on adolescence brain development. In addition, by the genegene interaction networks, we confirmed that the selected genes have significantly more interactions than expected. This result implies that the gens and their functional interactions form the backbone of the cellular machinery.
In the osteoporosis study, the network analysis confirmed that the selected genes of genome, epigenome, and transcriptome datasets also have significantly more interactions than expected. It also indicates that the selected genes may function in a collaborative effort. We found that they function in a concerted effort and have biological relevance to osteoporosis. Furthermore, using a causal relationship, we conclude that the selected chemical taxonomies (metabolome and lipidome) may contribute to development of BMD status.
The primary focus of our paper is to comprehensively characterize complex disease (e.g., adolescent brain development, or osteoporosis ). Collectively, we can use this approach to derive a statistic for testing the composite effect. These studies explore novel genomic, epigenomic, transcriptomic and chemical mechanisms to identify corresponding factors. The significant triplet, quartet, quintet of extracted features suggest that these effects might highlight biological targets for drug development. Extrapolating these findings enable an advanced understanding of normal cellular processes. Our proposed method can be applied to the study of any disease models, where multiview data analysis is commonly used.
We acknowledge that the kernel machine is sensitive to contaminated data, even if bounded positive definite kernels are used. Further work on robust kernel methods including Mestimator based methods. They will allow us to uncover the rich, hidden connections in biological system, and may provide a more comprehensive modeling of complex traits.
Acknowledgments
This work was partially supported by grants from National Institutes of Health (NIH) [U19AG05537301, R01AR069055, P20GM109036, R01MH104680, R01AG061917], and the Edward G. Schlieder Endowment and the Drs. W. C. Tsai and P. T. Kung Endowment to Tulane University.
References
 [1] M. A. Alam. Kernel Choice for Unsupervised Kernel Methods. PhD. Dissertation, The Graduate University for Advanced Studies, Japan, 2014.

[2]
M. A. Alam, V. Calhoun, and Y. P. Wang.
Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data.
Proceedings of 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB),Seattle, WA, USA, pages 210–2198, 2016.  [3] M. A. Alam, HuiYi Lin, HongWen Dengc, V. Calhoun, and Y. P. Wang. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. Journal of Neuroscience Methods, 309:161–174, 2018.
 [4] M. A. Alam, Mohammd Shahjaman, Md. Ferdush Rahman, Fokhrul Hossain, and HongWen. Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data. PLOS One, 14(5):e0217027, 2019.
 [5] Peter F Bladin Michael M Saling Amee D Baird, Sarah J Wilson and David C Reutens. Neurological control of human sexual behaviour: insights from lesion studies. Journal of Neurology, Neurosurgery & Psychiatry, 74(10):1042–1049, 2007.
 [6] Ricard Argelaguet, Britta Velten, Damien Arnol, and et al. Multiomics factor analysis—a framework for unsupervised integration of multiomics data sets. Molecular System Biology, 14:e8124, 2018.
 [7] Matteo Bersanelli, Ettore Mosca, Daniel Remondini, and et al. Methods for the integration of multiomics data: mathematical aspects. BMC Bioinformatics, 17 (15):167–202, 2016.
 [8] K. M. Borgwardt and et al. Protein function prediction via graph kernels,. Bioinformatics, 21:i47–i56, 2005.
 [9] Joana Cabral, Morten L. Kringelbacha, and Gustavo Decoc. Functional connectivity dynamically evolves on multiple timescales over a static structural connectome: Models and mechanisms. NeuroImage, 160:84–96, 2017.
 [10] G. CampsValls, J. L. RojoAlvarex, and M. MartinezRomon. Kernel Methods in Bioengineering, Signal and Image. Idea Group publishing, London, 2007.
 [11] Sebastian Canzler, Jana Schor1, Wibke Busch, Kristin Schubert, and et al. Prospects and challenges of multi‑omics data integration in toxicology. Archives of Toxicology, 94:391–388, 2020.
 [12] Gustavo Deco, Giulio Tononi, Melanie Boly, and Morten L. Kringelbach. Rethinking segregation and integration: contributions of wholebrain modelling. Nature Reviews Neuroscience, 2015.
 [13] Simon Fishilevich, Ron Nudel, and et al. Genehancer: genomewide integration of enhancers and target genes in genecards. Database, 2017:1–17, 2017.
 [14] T. Ge, T. E. Nichols, D. Ghoshd, E. C. Morminoe, am M. R. Sabuncu J. W.Smoller, and the Alzheimer’s Disease Neuroimaging Initiative. A kernel machine method for detecting effects of interaction between multidimensional variable sets: An imaging genetics application. NeuroImage, 109:505–514, 2015.
 [15] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 20:585–592, 2008.
 [16] Enrique C.A. Hansen, Demian Battaglia, Andreas Spiegler, Gustavo Deco, and Viktor K. Jirsa. Functional connectivity dynamics: Modeling the switching behavior of the resting state. NeuroImage, 105:525–535, 2015.
 [17] Y. Hasin, M. Seldin, and A. Lusis. Multiomics approaches to disease. Genome Biol, 18 (83):1:15, 2017.
 [18] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2009.
 [19] T. Hofmann, B. Schölkopf, and J. A. Smola. Kernel methods in machine learning. The Annals of Statistics, 36:1171–1220, 2008.
 [20] S. Huang, K. Chaudhary, and L. X. Garmire. More is better: Recent progress in multiomics data integration methods. Frontiers in Genetics, 8:Article 8, 2017.
 [21] I. IonitaLaza, S. Lee, V. Makarov, J. D. Buxbaum, and X. Lin. Sequence kernel association tests for the combined effect of rare and common variants. American Journal of Human Genetics, 92:841–853, 2013.
 [22] Pascale Khairallah and Thomas L. Nickolas. Management of osteoporosis in ckd. Clinical Journal of the American Society of Nephrology, 13(6):962–969, 2018.
 [23] Minseung Kim and Ilias Tagkopoulos. Data integration and predictive modeling methods for multiomics datasets. Molecular Omics, 14:8–25, 2018.
 [24] S. Y. Kung. Kernel Methods and Machine Learning. Cambridge University Press, New York, 2014.
 [25] X. Lin D. Ghosh M. P. Epstein L. C. Kwee, D. Liu. A powerful and flexible multilocus association test for quantitative traits. Annals of Human Genetics, 82(2):386–397, 2008.
 [26] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004.
 [27] S. Li and Y. Cui. Genecentric genegene interaction: a modelbased kernel machine method. The Annals of Applied Statistics, 6(3):1134–1161, 2012.

[28]
D. Liu, X. Lin, and D. Ghosh.
Semiparametric regression of multidimensional genetics pathway data: least squares kernel machines and linear mixed model,.
Biometrics, 630(4):1079–1088, 2007.  [29] Feiyang Ma, Brie K. Fuqua adn Yehudit Hasin, and et al. A comparison between whole transcript and 3’ rna sequencing methods using kapa and lexogen library preparation methods. BMC Genomics, 20 (9):1–12, 2019.

[30]
Tianle Ma and Aidong Zhang.
Integrate multiomics data with biological interaction networks using multiview factorization autoencoder (MAE).
BMC Genomics, 20 (11): 994::1–11, 2019.  [31] A. C. A. Nascimento, R. B. C. Prudêncio, and Ivan G. Costa. A multiple kernel learning algorithm for drugtarget interaction prediction. BMC Genomics, 17:46:1–16, 2016.
 [32] Chuan Qiu, Fangtang Yu, Kuanjui Su, Lan Zhang, and et al. Multiomics data integration for identifying osteoporosis biomarkers and their biological interaction and causal mechanisms. iScience, 23(2):100847, 2020.
 [33] W. B Bilker R. C. Gur, F. GunningDixon and et al. Sex differences in temporolimbic and frontal brain volumes of healthy adults. Cereb Cortex, 12(9):998–1003, 2002.
 [34] N. Rappoport and R. Shamir. Nultiomic and multiview clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46 (20):10546–10562, 2018.
 [35] Nicholas J. W. Rattray, Nicole C Deziel, Joshua D. Wallach, and et al. Beyond genomics: understanding exposotypes through metabolomics. Human Genomics, 12 (4):1–14, 2018.
 [36] Marylyn D. Ritchie, Ruowang Li Emily R. Holzinger and, Sarah A. Pendergrass1, and Dokyoon Kim. Methods of integrating data to uncover genotype–phenotype interactions. Nature Review: Genetics, 16:88– 97, 2015.
 [37] B. D. Moor S. Yu, LC. Tranchevent and Y. Moreau. Kernelbased Data Fusion for Machine Learning. Springer, Verlag Berlin Heidelberg, 2011.
 [38] A. SanghaniKerai, L. OsagieClouard, G. Blunn, and M. Coathup. The influence of age and osteoporosis on bone marrow stem cells from rats. Bone & Joint Research, 7(4):289–297, 2018.
 [39] Anita Sathyanarayanan, Erik W Thompson Rohit Gupta, and et al. A comparative study of multiomics integration tools for cancer driver gene identification and tumour subtyping. Briefings in Bioinformatics, bbz121:e8124, 2019.
 [40] F. E. Satterthwaite. An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6):110–114, 1946.
 [41] Andreas Schmidt, Ignasi Forne, and Axel Imhof. Bioinformatic analysis of proteomics data. BMC Systems Biology, 8 (53):1–7, 2014.
 [42] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge MA, 2002.
 [43] T Sölzen and et al. An overview and management of osteoporosis. European Journal of Rheumatology, 4:46–56, 2017.
 [44] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012.

[45]
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and
B. Schölkopf.
Kernel choice and classifiabilit forRKHS embeddings of probability distributions.
Advances in Neural Information Processing Systems, 21:1750–1758, 2009.  [46] Jing Sui, Rongtao Jiang, Juan Bustillo, and Vince Calhoun. Neuroimagingbased individualized prediction of cognition and behavior for mental disorders and health: Methods and promises. Biological Psychiatry, BPS 14136:Articles in Press, 2020.
 [47] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. HuertaCepas, M. Simonovic, h A. Rot, A. Santos, M. Kuhn K. P. Tsafou, and, P. Bork, L. J. Jensen, and C. von Mering. STRING v11.0: Proteinprotein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43:531–543, 2007.
 [48] Dominika Tuchendler and Marek Bolanowski. The influence of thyroid dysfunction on bone metabolism. Tuchendler and Bolanowski Thyroid Research, 7(12):1–5, 2014.
 [49] Birkan Tunc, Berkan Solmaz, and et al. Establishing a link between sexrelated differences in the structural connectome and behaviour. Philosophical Transactions Society B, 371:20150111, 2015.
 [50] P. M. Willemse, N. A. T. Hamdy, and et al. Changes in bone mineral density in newly diagnosed testicular cancer patients after anticancer treatment. The Journal of Clinical Endocrinology & Metabolism, 99(11):4101–4108, 2014.
 [51] M. C. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). American Journal of Human Genetics, 89:82–93, 2011.
 [52] K. K. Yan, H. Zhao, and H. Pang. A comparison of graph and kernelbased – omics data integration algorithms for classifying complex traits. BMC Bioinformatics, 18:539:1–13, 2017.
 [53] Kui Yang and Xianlin Han. Lipidomics: Techniques, applications, and outcomes related to biomedical sciences. Trends in Biochemical Sciences, 41 (11):954–969, 2016.
 [54] Fangtang Yu, Chuan Qiu, Chao Xu, , and et al. Mendelian randomization identifies cpg methylation sites with mediation effects for genetic influences on bmd in peripheral blood monocytes. frontiers in Genetics, 11(60):1–14, 2020.
 [55] Irene Sui Lan Zeng and Thomas Lumley. Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinformatics and Biology Insights, 12:1–16, 2020.
 [56] Ni Zhao, Jun Chen, Ian M. Carroll, and et al. Testing in microbiomeprofiling studies with mirkat, the microbiome regressionbased kernel association test. The American Society of Human Genetics, 96:979–807, 2015.
 [57] Wei Zheng, Hongfei Lin, and Zhehuan Zhao. A graph kernel based on context vectors for extracting drug–drug interactions. Journal of Biomedical Informatics, 61:34–43, 2016.
Comments
There are no comments yet.