The advancements in data science technology over the last decade has rapidly evolved to collect multi-view data, which has emerged to provide a comprehensive way to explore statistical structures and information embedded in the relationship between datasets. The integration of imaging and genetic information into a format capable of predicting disease phenotypes, however, continues to be challenging problem.
One of the goals of imaging genetics is the modeling and understanding of how genetic variations influence the structure and function of brain disease. This goal can be achieved by collating multimodal data including functional magnetic resonance imaging (fMRI), structural MRI (sMRI), and positron emission tomography (PET) scans with single nucleotide polymorphisms (SNPs), deoxyribonucleic acid (DNA) methylations, gene expression (GE), transcriptomics, epigenomics, and proteomics factors. Numerous studies have suggested that these different factors do not act in isolation, but rather they interact at multiple levels and depend on one another in an intertwined manner Calhoun Sui (2016); Pearlson . (2015). Extracting the interaction effects from within and among data sets, however, remains a challenge for multi-view data analysis J. Li . (2015); Chekouo . (2016); Zheng . (2015); Zhao . (2016); M. Liu . (2016). Figure 1 illustrates how the interaction effects of different data sets can be used to model and predict human illness.
To date, both genetic techniques and brain imaging have played a substantial role in detecting disease phenotypes. For example, by correlating imaging and genetic data, it has been shown that certain genes affect specific brain functions, connectivity, and serve as risk predictors for certain diseases. Jahanshad . (2012); Lin . (2014); Bis . (2012); Jahanshad X. Hua (2013). Additionally, Bis . (2012) have identified genetic variants affecting the volume of the hippocampus, which could be used as predictors of cognitive decline and dementia Jahanshad X. Hua (2013). As shown in Wen . (2017), accurate identification of Tourette’s syndrome in children has notably improved using multi-view features as compared to relying solely on one view. Accumulating evidence also shows that the inherent genetic variations for complex traits can sometimes be explained by the joint analysis of multiple genetic features with environmental factors.
Schizophrenia (SZ) is a complex brain disorder that affects how a person thinks, feels and acts, which is thought to be caused through an interplay of genetic effects, brain region, and DNA methylation abnormalities Richfield . (2017). Studies using neurological tests and brain imaging technologies (fMRI and PET) have been used to examine functional differences in brain activity that seem to arise within the frontal lobes, hippocampus and temporal lobes Van Kapur (2009); Kircher Renate (2005). Many researchers have shown that genetic alterations at the mRNA and SNP level, however, also play a significant role in SZ Chang . (2013); Lencz . (2007). Thus, only focusing on brain imaging data is not sufficient in the identification of the related risk factors for SZ Potkin . (2015). To address this, Chekouo . (2016) have developed the ROI-SNP network for the selection of discriminatory markers using brain imaging and genetics information.
A number of studies suggest that epigenetics also has a role in SZ disease susceptibility. Genome-wide DNA methylation analysis of human brain tissue from SZ patients shows a heritable epigenetic modification, which can regulate gene expression. The cell specific differences in chromatin structure that influence cell development, including DNA methylation, have emerged as a potential explanation for the non-Mendelian inheritance of SZ Wockner . (2014). There is also evidence on epigenetic alterations in the blood and central nervous system of patients with SZ, and it has been shown that methylation status in brain tissue from SZ patients varies significantly from controls Aberg . (2014); Montano . (2016). In this paper, we consider the interaction effects among the genetics, brain imaging, and epigenetics data on hippocampal volume measurements between SZ patients and healthy controls using a novel kernel method for detecting these higher order interactions.
Many advancements in multimodal fusion methods have utilized such approaches as co-training, multi-view learning, subspace learning, multi-view embedding, and kernel multiple learning, to analyze multi-view data of biological relevance Xu . (2013). However, due to the large number of genes, SNPs, DNA methylations and different types of imaging, positive definite kernel based methods have become a popular and effective tool for conducting genome-wide association studies (GWASs) and imaging genetics, especially for identifying genes associated with diseases S. Li Cui (2012); Ge . (2015); Alam, Calhoun Wang (2016); Alam, Komori . (2016). Kernel methods are emerging as innovative techniques that map data from high dimension input spaces to a kernel feature space using a nonlinear function. The main advantage of these methods is to combine statistics and geometry in an effective way Hofmann . (2008). Kernel methods offer useful algorithms to learn how a large number of genetic variants are associated with complex phenotypes, to help explore the relationship between the genetic markers and the outcome of interest Camps-Valls . (2007); S. Yu Moreau (2011); Alam (2014); Alam Fukumizu (2015); Schölkopf . (1998); Kung (2014).
In genetics, the detection of gene-gene interactions or co-associations in most methods are divided into two types: SNP based and gene-based methods in GWASs. In the last decade, a number of statistical methods have been used to detect gene-gene interactions (GGIs). Logistic regression, multifactor dimensionality reduction, linkage disequilibrium and entropy based statistics are examples of such methodsHieke . (2014); Wan . (2010). While most of these methods are based on the unit association of the SNPs, testing the associations between the phenotype and SNPs has limitations and is not sufficient for interpretation of GGIs Yuan . (2012). In GWASs, gene-based methods are always more effective than the ones based only on a SNP, and powerful tools for multivariate gene-based genome-wide associations have been proposed Sluis . (2015).
In recent years, linear, kernel, and robust canonical correlation based U statistic have been utilized to identify gene-gene co-associations Peng . (2010); Alam, Komori . (2016). S. Li Cui (2012) have proposed a model-based kernel machine method for GGIs. In addition, Ge . (2015) have also proposed a kernel machine method for detecting effects of interactions between multi-variable sets. This is an extended model of S. Li Cui (2012) to jointly model the genetics and non-genetic features, and their interactions. While these methods could ultimately shed light on novel features of the etiology of complex diseases, they cannot be reliable used in multi-view data sets. Thus, there exists a need to extend kernel machine based methods.
The contribution of this paper, therefore, is threefold. By examining the three-way interaction effects between triplet data sets combining genetics, imaging, and epigenetics, we hope to shed light on the phenotype features associated with disease mechanisms. This is done iteratively. First, we propose a novel semiparametric method on a reproducing kernel Hilbert space (RKHS) to study the interaction effects among the multiple-view datasets. We name a kernel method for detecting higher order interactions (KMDHOI) and include the pairwise and higher order Hadamard product of the features from different views. Second, we formulate the problem as a standard mixed-effect linear model to derive a score-based variance component test for the higher order interactions. The proposed method offers a flexible framework to account for the main (single), pairwise, triplet, other higher order effects and test for the overall higher order effects. Finally, we validate the proposed method on both simulation and the Mind Clinical Imaging Consortium (MCIC) data J. Chen . (2012); Gollub . (2013).
The remainder of this paper is organized as follows. In Section 2, we propose a standard mixed-effects linear model to derive score-based variance component test for higher order interaction. In Section 3, we propose statistical testing for higher order interaction effects. The relevant methods are discussed in Section 4. In Section 5, we describe the experiments conducted on both synthesized and the imaging genetics data sets. We conclude the paper with a discussion of major findings and future research in Section 6. Details of the theoretical analysis for the proposed method, Satterthwaite approximation to the score test, and supplementary tables and figures on application to imaging genetics and epigenetics can be found in the appendix.
In kernel methods, the nonlinear feature map is given by a positive definite kernel, which provides nonlinear methods for data analysis. It is known Aronszajn (1950) that a positive definite kernel is associated with a Hilbert space , called reproducing kernel Hilbert space (RKHS), consisting of functions on so that the function value is reproduced by the kernel; namely, for any function and a point , the function value is where in the inner product of is called the reproducing property. Replacing with yields for any . A symmetric kernel defined on a space is called positive definite, if for an arbitrary number of points the Gram matrix is positive semi-definite. To transform data for extracting nonlinear features, the mapping is defined as which is a function of the first argument. This map is called the feature map
, and the vectorin is called the feature vector. The inner product of two feature vectors is then This is known as the kernel trick. By this trick the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of .
2.1 Model setting
Assuming that we have independent identical distributed (IID) subjects with covariates and m-view datasets, . In the following semiparametric model, we associate the output with covariates including intercept and -view datasets:
where is a vector of covariates including intercept for the th subject, is a vector of fixed effects, is an unknown function on the product domain, with and the error ’s are IID as normal with mean zero and variance , . According to the ANOVA decomposition, the function, can be extended as:
where ’s () are the main effects for the respective dataset, are pairwise interactions effects, are the interactions effects of the three dataset and so on. The functional space, RKHS, is decomposes as:
equipped with an inner product, and a norm If , Eq. (1) becomes simple semiparametric regression model as shown in D. Liu . (2007). S. Li Cui (2012) and Ge . (2015) have proposed similar models (special case of Eq. (1), ) for detecting interaction effects among multidimensional variable sets.
Specifically, in our case we have three data sets. To do this, we assume that we have IID subjects under investigation; is a quantitative phenotype for the -th subject (say, hippocampal volume derived from structural MRI scan). We associate the clinical covariates (e.g., age, weight, height) with three views: genetics, imaging, and epigentics (gene-derived SNP, ROIs, and gene-derived DNA methylation). Let denote the covariates, where is a measure of the -th subject. Let , and be a genes-derived SNP with SNP markers, a ROI with voxels of the fMRI scan, and a gene-derived DNA methylation with methylation profiles of the -th subject, respectively. Under this setting, Eq. (1), Eq. (2) and Eq. (3) become:
respectively. Here , and , and , and , and are RKHSs functions on , and , and , and and , respectively. The notation is a direct sum of RKHS.
2.2 Model estimation
We can estimate the function
by minimizing the penalized squared error loss function of Eq. (4) as:
where is a roughness penalty with tuning parameter . It is known that the complete function space of Eq. (6), , has the orthogonal decomposition. Hence the function can be decomposed accordingly. Eq. (7) then becomes:
where , ,
, , , , , , , and are the positive tuning parameters that trade-off between the model fits and its complexity.
By the representer theorem Kimeldorf Wahhba (1971); Schölkopf Smola (2002) and the fact that the reproduction kernel of a product of an RKHS is the product of the reproducing kernels Aronszajn (1950), the expanded functions of in Eq.(2.2) for arbitrary , and can be written as:
For each data view, we can define the kernel matrices: , , , , , and , where is denoted as the element-wise product of two matrices. Now we have
where , , , , , and .
Substituting , , , , , and into Eq. (2.2), and applying the reproducing kernel properties, we get
where and .
The gradients of with respect to the parametric coefficients and nonparametric coefficients are
By setting the gradients to zero, this first-order condition is given by the linear system as follows:
where , , , , , . Following many derivations in the literature (e.g., D. Liu . (2007); S. Li Cui (2012); Ge . (2015)), we can show that a first-order linear system is equivalent to the normal equation of the linear mixed effects model:
where is a coefficient vector of fixed effects, , , , , , and are independent random effects with distribution as , , , , , , .
is also an independent random variable with the distribution, where
is an identity matrix. This relationship insures that all of the effects extracted by minimizing the loss function in Eq. (7), are the same as the best linear unbiased predictors (BLUPs) of the linear mixed effects model in Eq. (13). It is possible to estimate the variance components using the restricted maximum likelihood (ReML) approach (see in the appendix for details). The solution of the linear system in Eq. (2.2) gives the coefficients of the fixed effect, , and coefficients for the random effect, . By inserting into Eq. (2.2), we can estimate the random effects , , , , , and , respectively.
3 Statistical testing
Using positive definite kernels, we treat each gene-derived SNP, ROI, and gene-derived DNA methylation as a testing unit. In the following subsections, we study the test statistic of the overall effect and higher order interaction effects.
3.1 Testing overall effect
We known that the overall testing effect is equivalent to test the variance components in Eq.(13), .
Unfortunately, under the null hypothesis, the asymptotic distribution of a likelihood ratio test (LRT) statistic does not follow a chi-square distribution or a mixture chi-square distribution. Because the parameters in the variance components analysis are laid on the boundary of the parameter space when the null hypothesis is true and kernel matrices are not block-diagonal, S. Li and Cui (2012) have proposed a score test statistic based on the restricted likelihood. In this paper, we have constructed a score test statistic for the multi-view data model, Eq. (13
). Assuming that the linear mixed model in Eq. (13
) has multivariate normal distribution with meanand variance-covariance matrix