1 Introduction
Schizophrenia is a complex neurological disorder whose manifestation can be attributed to numerous genetic, epigenetic and environmental factors. Genomic data have been used to identify risk genes for schizophrenia, while brain imaging techniques such as fMRI allow researchers to locate brain regions whose abnormal behaviors correlate with symptoms of the disorder. In the interest of leveraging data to produce more powerful conclusions on the causes of schizophrenia and to improve diagnosis of this and other complex mental disorders, the integration of these datasets via canonical correlation analysis (CCA) has been employed [35].
Linear or classical CCA seeks to obtain coefficients whose projections maximize the linear correlations between two or more (in the case of multiple CCA) sets of variables. Due to its versatility in integrating data, this method has been utilized by numerous scientists and researchers in the fields of statistics, biometrics, economics, and social science. Additionally, more robust methods of performing CCA have been developed since its inception [4].
Linear CCA assumes linearity of the data being studied; however, in many cases it cannot be assumed that the two or more datasets are linearly correlated. Through the use of a Reproducing Kernel Hilbert Space (RKHS), kernel methods better maximize correlation between nonlinearly correlated datasets and thus can be more attractive. The principle of kernel and multiple kernel CCA, like classical linear CCA, is to find basis vectors by which the projections of two or more (in the case of multiple kernel CCA) datasets maximize the correlation between these projections [4]. In the case of kernel and multiple kernel CCA, the datasets are projected into a Reproducing Kernel Hilbert Space, which can be of far higher dimension than the linear space into which classical CCA projects the data. In this work, the reproducing kernel is used in kernel and multiple kernel CCA to maximize nonlinear correlations between imaging genomics datasets. Kernel and multiple kernel CCA are employed in an algorithm to classify patients as "healthy" or "schizophrenic" based on their individual SNP, DNA methylation and fMRI voxel information.
The rest of this work is organized as follows: Section 2 reviews linear CCA and multiple CCA; Section 3 covers the Reproducing Kernel Hilbert Space, kernel CCA and multiple kernel CCA; Section 4 details the classification algorithm; Section 5 describes the experiments, including the data used; and Section 6 presents conclusions and future directions for the research.
2 Linear CCA and Multiple CCA
In this section, we review the method of regularized linear canonical correlation analysis (CCA) for two or more (in the case of multiple CCA) datasets. The goal of linear canonical correlation analysis is to determine the linear correlation between two or more datasets, and to develop coefficients that maximize this correlation. In mathematical terms, given sets of variables $X_1, X_2, \ldots, X_m$, linear CCA seeks to find vectors $w_1, w_2, \ldots, w_m$ such that the correlation between the projections $X_i w_i$ and $X_j w_j$ is maximized. This formulation is represented as follows:

$$\rho = \max_{w_1, \ldots, w_m} \sum_{i < j} w_i^\top C_{ij} w_j \quad \text{subject to} \quad w_i^\top C_{ii} w_i = 1, \; i = 1, \ldots, m, \tag{1}$$

where $C_{ij}$ is the covariance matrix of sets $X_i$ and $X_j$. Note that in the case that $m = 2$, Equation (1) becomes

$$\rho = \max_{w_1, w_2} \frac{w_1^\top C_{12} w_2}{\sqrt{w_1^\top C_{11} w_1}\,\sqrt{w_2^\top C_{22} w_2}}, \tag{2}$$

the classical maximization problem solved by linear canonical correlation analysis. As such, multiple CCA can be viewed as a generalization of CCA that accepts more than two datasets.
To search for a maximum correlation, we solve the following generalized eigenvalue problem:

$$\begin{pmatrix} 0 & C_{12} & \cdots & C_{1m} \\ C_{21} & 0 & \cdots & C_{2m} \\ \vdots & & \ddots & \vdots \\ C_{m1} & C_{m2} & \cdots & 0 \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix} = \rho \begin{pmatrix} C_{11} & & \\ & \ddots & \\ & & C_{mm} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix}. \tag{3}$$
Due to the possible singularity of the block-diagonal matrix on the right-hand side of this eigenvalue problem [4], a regularization term is added as follows:

$$\begin{pmatrix} 0 & C_{12} & \cdots & C_{1m} \\ C_{21} & 0 & \cdots & C_{2m} \\ \vdots & & \ddots & \vdots \\ C_{m1} & C_{m2} & \cdots & 0 \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix} = \rho \begin{pmatrix} C_{11} + \kappa I & & \\ & \ddots & \\ & & C_{mm} + \kappa I \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix} \tag{4}$$
for a small regularization parameter $\kappa > 0$. With only two datasets ($m = 2$), multiple CCA reduces to regular linear CCA and the eigenvalue problem becomes

$$\begin{pmatrix} 0 & C_{12} \\ C_{21} & 0 \end{pmatrix}\begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \rho \begin{pmatrix} C_{11} + \kappa I & 0 \\ 0 & C_{22} + \kappa I \end{pmatrix}\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}. \tag{5}$$
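As an illustration, the regularized two-dataset problem (5) can be solved directly as a symmetric generalized eigenvalue problem. The following minimal Python sketch is our own (not the authors' code); the function name and the use of `scipy.linalg.eigh` are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def regularized_cca(X, Y, kappa=1e-3):
    """Regularized linear CCA for two datasets, as in Eq. (5).

    X: (n, p) array, Y: (n, q) array; rows are samples.
    Returns the leading canonical correlation and direction vectors.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each variable
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n              # within-set covariances
    Cyy = Yc.T @ Yc / n
    Cxy = Xc.T @ Yc / n              # between-set covariance
    p, q = Cxx.shape[0], Cyy.shape[0]
    # Block matrices of the generalized eigenvalue problem (5)
    A = np.block([[np.zeros((p, p)), Cxy],
                  [Cxy.T, np.zeros((q, q))]])
    B = np.block([[Cxx + kappa * np.eye(p), np.zeros((p, q))],
                  [np.zeros((q, p)), Cyy + kappa * np.eye(q)]])
    vals, vecs = eigh(A, B)          # eigenvalues in ascending order
    rho = vals[-1]                   # leading canonical correlation
    w1, w2 = vecs[:p, -1], vecs[p:, -1]
    return rho, w1, w2
```

Because the regularized block-diagonal matrix is positive definite, a standard symmetric generalized eigensolver applies without explicitly inverting the covariance blocks.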
3 Kernel CCA and multiple kernel CCA
In this section, we review single kernel and multiple kernel canonical correlation analysis.
3.1 Kernel CCA
The aim of kernel CCA is to seek two functions in the RKHS for which the correlation (Corr) of the transformed random variables is maximized. Given two random variables $X$ and $Y$ with two functions in the RKHS, $f_X \in \mathcal{H}_X$ and $f_Y \in \mathcal{H}_Y$, the optimization problem of the random variables $f_X(X)$ and $f_Y(Y)$ is

$$\rho = \max_{f_X \in \mathcal{H}_X,\, f_Y \in \mathcal{H}_Y} \mathrm{Corr}\big(f_X(X), f_Y(Y)\big). \tag{6}$$

The optimizing functions $f_X$ and $f_Y$ are determined up to scale.
Using a finite sample, we are able to estimate the desired functions. Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution $F_{XY}$, by taking the inner products with elements or "parameters" in the RKHS, we have features $f_X(X_i) = \langle f_X, k_X(\cdot, X_i)\rangle_{\mathcal{H}_X}$ and $f_Y(Y_i) = \langle f_Y, k_Y(\cdot, Y_i)\rangle_{\mathcal{H}_Y}$, where $k_X$ and $k_Y$ are the associated kernel functions for $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $\mathbf{K}_X := \big(k_X(X_i, X_j)\big)_{i,j=1}^n$ and $\mathbf{K}_Y := \big(k_Y(Y_i, Y_j)\big)_{i,j=1}^n$. We need the centered kernel Gram matrices $\mathbf{M}_X = \mathbf{H}\mathbf{K}_X\mathbf{H}$ and $\mathbf{M}_Y = \mathbf{H}\mathbf{K}_Y\mathbf{H}$, where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ with $\mathbf{I}_n$ the identity matrix and $\mathbf{1}_n$ the vector with $n$ ones. The empirical estimate of Eq. (6) is then given by

$$\hat{\rho} = \max_{\alpha, \beta \in \mathbb{R}^n} \frac{\alpha^\top \mathbf{M}_X \mathbf{M}_Y \beta}{\sqrt{\alpha^\top \mathbf{M}_X^2 \alpha}\,\sqrt{\beta^\top \mathbf{M}_Y^2 \beta}},$$

where $\alpha$ and $\beta$ are the directions of $f_X$ and $f_Y$, respectively. Solving the above maximization problem is then analogous to solving the eigenvalue problem:
$$\begin{pmatrix} 0 & \mathbf{M}_X\mathbf{M}_Y \\ \mathbf{M}_Y\mathbf{M}_X & 0 \end{pmatrix}\begin{pmatrix}\alpha\\\beta\end{pmatrix} = \rho \begin{pmatrix} \mathbf{M}_X^2 & 0 \\ 0 & \mathbf{M}_Y^2\end{pmatrix}\begin{pmatrix}\alpha\\\beta\end{pmatrix}. \tag{7}$$
Unfortunately, the naive kernelization (7) of CCA is trivial: all nonzero solutions of the generalized eigenvalue problem achieve $\hat{\rho} = \pm 1$ [11, 21]. To overcome this problem, we introduce small regularization terms in the denominator of the empirical estimate:

$$\hat{\rho} = \max_{\alpha, \beta \in \mathbb{R}^n} \frac{\alpha^\top \mathbf{M}_X \mathbf{M}_Y \beta}{\sqrt{\alpha^\top (\mathbf{M}_X + \kappa \mathbf{I}_n)^2 \alpha}\,\sqrt{\beta^\top (\mathbf{M}_Y + \kappa \mathbf{I}_n)^2 \beta}}, \tag{8}$$

where $\kappa > 0$ is the small regularization coefficient.
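A minimal sketch of regularized kernel CCA as in Eq. (8). This is our own illustration, not the authors' implementation; the Gaussian kernel choice, the bandwidth `sigma`, and the function names are assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(X, sigma=1.0):
    """Gaussian (RBF) kernel Gram matrix."""
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))

def kernel_cca(X, Y, sigma=1.0, kappa=0.1):
    """Regularized kernel CCA (Eq. 8) for two datasets."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Mx = H @ rbf_gram(X, sigma) @ H           # centered Gram matrices
    My = H @ rbf_gram(Y, sigma) @ H
    Rx = Mx + kappa * np.eye(n)               # regularized blocks
    Ry = My + kappa * np.eye(n)
    # Generalized eigenvalue problem with regularized denominator
    A = np.block([[np.zeros((n, n)), Mx @ My],
                  [My @ Mx, np.zeros((n, n))]])
    B = np.block([[Rx @ Rx, np.zeros((n, n))],
                  [np.zeros((n, n)), Ry @ Ry]])
    vals, vecs = eigh(A, B)
    rho = vals[-1]                            # leading correlation estimate
    alpha, beta = vecs[:n, -1], vecs[n:, -1]
    return rho, alpha, beta
```

The regularization makes the right-hand blocks positive definite, which is what prevents the trivial $\pm 1$ solutions of the naive kernelization.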
3.2 Multiple kernel CCA
Multiple kernel CCA seeks more than two functions in the respective RKHSs for which the correlation (Corr) of the transformed random variables is maximized. Given $m$ sets of random variables $X^1, \ldots, X^m$ and functions in the RKHSs, $f_1 \in \mathcal{H}_1, \ldots, f_m \in \mathcal{H}_m$, the optimization problem of the random variables $f_1(X^1), \ldots, f_m(X^m)$ is

$$\rho = \max_{f_1 \in \mathcal{H}_1, \ldots, f_m \in \mathcal{H}_m} \sum_{i < j} \mathrm{Corr}\big(f_i(X^i), f_j(X^j)\big). \tag{9}$$
Given an i.i.d. sample $(X^1_k, \ldots, X^m_k)_{k=1}^n$ from a joint distribution, by taking the inner products with elements or "parameters" in the RKHSs, we have features

$$f_i(X^i_k) = \langle f_i, k_i(\cdot, X^i_k)\rangle_{\mathcal{H}_i}, \qquad i = 1, \ldots, m, \; k = 1, \ldots, n, \tag{10}$$

where $k_1, \ldots, k_m$ are the associated kernel functions for $\mathcal{H}_1, \ldots, \mathcal{H}_m$, respectively. The kernel Gram matrices are defined as $\mathbf{K}_i := \big(k_i(X^i_k, X^i_l)\big)_{k,l=1}^n$, $i = 1, \ldots, m$. Similar to Section 3.1, using these kernel Gram matrices, the centered kernel Gram matrices are defined as $\mathbf{M}_i = \mathbf{H}\mathbf{K}_i\mathbf{H}$, $i = 1, \ldots, m$, where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ and $\mathbf{1}_n$ is the vector with $n$ ones. As in the two-dataset case, the empirical estimate of Eq. (9) is obtained from the following generalized eigenvalue problem:

$$\begin{pmatrix} 0 & \mathbf{M}_1\mathbf{M}_2 & \cdots & \mathbf{M}_1\mathbf{M}_m \\ \mathbf{M}_2\mathbf{M}_1 & 0 & \cdots & \mathbf{M}_2\mathbf{M}_m \\ \vdots & & \ddots & \vdots \\ \mathbf{M}_m\mathbf{M}_1 & \cdots & \mathbf{M}_m\mathbf{M}_{m-1} & 0 \end{pmatrix} \begin{pmatrix}\alpha_1\\\vdots\\\alpha_m\end{pmatrix} = \rho \begin{pmatrix} (\mathbf{M}_1 + \kappa\mathbf{I}_n)^2 & & \\ & \ddots & \\ & & (\mathbf{M}_m + \kappa\mathbf{I}_n)^2 \end{pmatrix}\begin{pmatrix}\alpha_1\\\vdots\\\alpha_m\end{pmatrix}. \tag{11}$$
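The block structure of Eq. (11) can be assembled mechanically from the centered Gram matrices. The sketch below is our own illustration (not the authors' implementation); it takes precomputed Gram matrices as input:

```python
import numpy as np
from scipy.linalg import eigh

def multiple_kernel_cca(gram_list, kappa=0.1):
    """Regularized multiple kernel CCA, as in Eq. (11).

    gram_list: list of m (n, n) kernel Gram matrices, one per dataset.
    Returns the leading eigenvalue and the stacked coefficient vector.
    """
    n = gram_list[0].shape[0]
    m = len(gram_list)
    H = np.eye(n) - np.ones((n, n)) / n
    M = [H @ K @ H for K in gram_list]            # centered Gram matrices
    A = np.zeros((m * n, m * n))
    B = np.zeros((m * n, m * n))
    for i in range(m):
        Ri = M[i] + kappa * np.eye(n)
        B[i*n:(i+1)*n, i*n:(i+1)*n] = Ri @ Ri     # regularized diagonal blocks
        for j in range(m):
            if i != j:                            # off-diagonal cross blocks
                A[i*n:(i+1)*n, j*n:(j+1)*n] = M[i] @ M[j]
    vals, vecs = eigh(A, B)
    return vals[-1], vecs[:, -1]
```

With $m = 2$ this reduces to the kernel CCA problem of Section 3.1; for $m$ datasets the leading eigenvalue is bounded by $m - 1$, since it aggregates all pairwise correlations.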
4 Classification via Kernel CCA and Multiple Kernel CCA
To classify patients according to the binary phenotype, combinations of the datasets selected from SNPs, fMRI, and DNA methylation are integrated via the Kernel CCA and Multiple Kernel CCA algorithms. In all cases a dummy set containing only the classification labels is used. The classification procedure relies on 10-fold cross-validation, in which 9 folds are used to train the classifier and the remaining fold is used for testing. The classification algorithm is as follows:

1. Define the datasets to use for the classification algorithm. Remove the test sets and obtain the training sets.

2. Use Kernel CCA or Multiple Kernel CCA on the training sets to obtain the functional coefficient vectors, and construct the centered Gram matrices for the training and testing sets.

3. Classify via k-means trained on the projected training features, and obtain the classification error by checking predictions against the actual phenotype labels.

4. Perform the algorithm in ten-fold cross-validation, averaging the errors over the ten folds.
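The cross-validation and clustering steps above might be sketched as follows. This is our own illustration, not the authors' code: it assumes the Kernel/Multiple Kernel CCA projections have already been computed into a feature matrix, and the k-means step uses `scipy.cluster.vq.kmeans2`; all names are illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cv_error(features, labels, n_folds=10, seed=0):
    """10-fold cross-validated k-means classification error.

    features: (n, d) projections from the Kernel/Multiple Kernel CCA step
    (assumed precomputed); labels: (n,) binary phenotype (0/1).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    errors = []
    for fold in np.array_split(idx, n_folds):
        test_mask = np.zeros(len(labels), bool)
        test_mask[fold] = True
        # cluster the training portion into two groups
        centroids, assign = kmeans2(features[~test_mask], 2, minit='++', seed=1)
        # map each cluster to the majority phenotype label of its members
        mapping = {c: np.round(labels[~test_mask][assign == c].mean())
                   for c in (0, 1)}
        # assign each test point to the nearest centroid, then to its label
        d = np.linalg.norm(features[test_mask, None, :] - centroids[None],
                           axis=2)
        pred = np.array([mapping[c] for c in d.argmin(axis=1)])
        errors.append(np.mean(pred != labels[test_mask]))
    return float(np.mean(errors))
```

In the full pipeline the CCA step would be re-run inside each fold on the training data only, so that the test fold never influences the learned projections.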
Table 1: Classification error (%) for Kernel/Multiple Kernel CCA and linear/multiple CCA.

Dataset Combination            Kernel CCA % error    CCA % error
SNP                            45.3552               49.7268
fMRI                           39.8907               42.6230
DNA Methylation                30.0546               46.9945
SNP, fMRI                      35.5191               37.7049
DNA Methylation, fMRI          27.3224               37.7049
SNP, DNA Methylation           31.694                45.3552
SNP, DNA Methylation, fMRI     30.6011               43.1694
5 Experiments
5.1 Imaging Genomics Data
The Mind Clinical Imaging Consortium (MCIC) has collected three types of data (SNPs, fMRI and DNA methylation) from 208 subjects, including schizophrenic patients and healthy controls. After excluding subjects with missing data, the remaining sample comprises the schizophrenia (SZ) patients and healthy controls used in this study [35].
SNPs: For each subject (SZ patients and healthy controls) a blood sample was taken and DNA was extracted. Genotyping of all subjects was performed at the Mind Research Network using the Illumina Infinium HumanOmni1-Quad assay. BeadStudio was used to form the final genotype calls, and the PLINK software package was applied to perform a series of standard quality control procedures. Additionally, SNPs that could not be mapped cleanly to genes using the SCAN database gene mapping resource were discarded. The final dataset spans loci mapped to genes, based on the subjects without missing data. Genotypes "aa" (non-minor allele), "Aa" (one minor allele) and "AA" (two minor alleles) were coded as 0, 1 and 2 for each SNP, respectively [35][34].
fMRI: Participants' fMRI data were collected during a block-design motor response to auditory stimulation. Images were acquired with parameters TR = 2000 ms, TE = 30 ms, field of view = 22 cm, slice thickness = 4 mm with 1 mm skip, and 27 slices, on a Siemens 3T Trio scanner and a 1.5 T Sonata with echo-planar imaging (EPI). Data were preprocessed with the SPM5 software: realigned, spatially normalized, and resliced. The data were smoothed with a Gaussian kernel and analyzed by multiple regression, with the stimulus and its temporal derivatives plus an intercept term as regressors. Finally, the stimulus-on versus stimulus-off contrast images were extracted, excluding voxels with missing measurements. Voxels were extracted from ROIs based on the AAL brain atlas for analysis [35].
DNA methylation: DNA methylation is one of the main epigenetic mechanisms regulating gene expression, and it appears to be involved in the development of schizophrenia. In this paper, we investigated DNA methylation markers in blood from schizophrenia patients and healthy controls. Participants come from the MCIC, a collaborative effort of four research sites. Site information and enrollment details for schizophrenia patients and healthy controls are given in [36]. All participants' symptoms were evaluated with the Scale for the Assessment of Positive Symptoms and the Scale for the Assessment of Negative Symptoms [15]. DNA from blood samples was measured with the Illumina Infinium Methylation27 assay. The methylation value is calculated as the ratio of the methylated probe intensity to the total probe intensity.
5.2 Results
All combinations of the three datasets (including each dataset alone) are used for classification via the algorithm detailed in Section 4. Table 1 reports the % error of Kernel CCA classification. Linear CCA is also used to classify the patients, for comparison of the two methods.
From these results it is evident that the Kernel and Multiple Kernel CCA classification algorithms are substantially more accurate than linear and multiple CCA on this dataset. When using only one of the three datasets, both algorithms achieve their lowest accuracy with SNPs, while Kernel CCA attains its best single-dataset accuracy with DNA methylation and linear CCA attains its best with fMRI. Both Multiple Kernel CCA and multiple CCA achieve their overall maximum accuracy using DNA methylation and fMRI together, while adding SNPs to the mix increases error. This is unsurprising given that both methods attain their minimum accuracy using SNPs alone.
For the Kernel and Multiple Kernel CCA classification algorithms, using two datasets generally achieves higher classification accuracy than using a single dataset, except when combining SNPs with DNA methylation, where the SNP data appears to degrade the classification. Again, this loss of accuracy from incorporating the SNP data comes as little surprise given the classifier's low accuracy when using SNPs alone. Overall, the results indicate that classification via Kernel and Multiple Kernel CCA achieves the highest accuracy with fMRI and DNA methylation combined, and that the best single dataset is DNA methylation; adding SNPs is likely to decrease accuracy. The highest accuracy achieved by the classifier is approximately 72.7% (a 27.32% error), with Kernel CCA using the combination of fMRI and DNA methylation data.
6 Conclusion and Future Work
It is apparent that for classification of patients in a binary phenotype of "healthy" versus "schizophrenic," Kernel and Multiple Kernel CCA far surpass linear CCA in accuracy. Both methods achieve maximal accuracy utilizing the combination of DNA methylation and fMRI, and minimal accuracy classifying on SNPs alone. From these results it is possible to conclude that in some cases the integration of multiple data modalities may yield higher accuracy in the classification of complex neurological diseases such as schizophrenia. In this case, the combination of imaging and epigenetic factors produces better results than the incorporation of SNP variation.
It also appears that in this case the nonlinear correlations between datasets produce more easily separable, and thus more classifiable, components than linear CCA, as evidenced by the superior accuracy of Kernel and Multiple Kernel CCA. Both CCA and kernel CCA serve as feature extraction tools, on top of which the classifier separates patients from healthy controls. Kernel and Multiple Kernel CCA appear to better reveal the relationships among the three datasets, and the best combination is fMRI and DNA methylation. This work also demonstrates that the projection coefficients on the variants can serve as distinct features for classification.
For future work, parameter optimization must be employed to improve the results of the classification algorithm. Such parameters include the kernel type, the value of k for k-means clustering, and the number of components used in the Kernel and Multiple Kernel CCA step. Changes may also be made to the classification algorithm itself, such as using an SVM instead of k-means. Given the complex nature of schizophrenia as a neurological disorder, the phenotype is complex and multidimensional; in future work, patients will therefore be classified using a higher-dimensional phenotype space.
Acknowledgments
The authors wish to thank the NIH (R01 GM109068, R01 MH104680, R01 MH107354) and NSF (1539067) for support.
References
[1] MathWorks Inc. MATLAB 7.0 Release 14 Help: Statistics Toolbox. 2005.
[2] J. G. Adrover and S. M. Donato. A robust predictive approach for canonical correlation analysis. Journal of Multivariate Analysis, 133:356–376, 2015.
[3] S. Akaho. A kernel method for canonical correlation analysis. International Meeting of the Psychometric Society, 35:321–377, 2001.
[4] M. A. Alam. Kernel Choice for Unsupervised Kernel Methods. PhD dissertation, The Graduate University for Advanced Studies, Japan, 2014.
[5] M. A. Alam and K. Fukumizu. Kernel and feature search in kernel PCA. IEICE Technical Report, IBISML2011-49, 111(275):47–56, 2011.
[6] M. A. Alam and K. Fukumizu. Higher-order regularized kernel CCA. 12th International Conference on Machine Learning and Applications, pages 374–377, 2013.
[7] M. A. Alam and K. Fukumizu. Hyperparameter selection in kernel principal component analysis. Journal of Computer Science, 10(7):1139–1150, 2014.
[8] M. A. Alam and K. Fukumizu. Higher-order regularized kernel canonical correlation analysis. International Journal of Pattern Recognition and Artificial Intelligence, 29(4):1551005(1–24), 2015.
[9] M. A. Alam, K. Fukumizu, and Y.-P. Wang. Robust kernel (cross-) covariance operators in reproducing kernel Hilbert space toward kernel methods. ArXiv e-prints, February 2016.
[10] M. A. Alam, M. Nasser, and K. Fukumizu. Sensitivity analysis in robust and kernel canonical correlation analysis. 11th International Conference on Computer and Information Technology, Bangladesh, IEEE:399–404, 2008.
[11] M. A. Alam, M. Nasser, and K. Fukumizu. A comparative study of kernel and robust canonical correlation analysis. Journal of Multimedia, 5:3–11, 2010.
[12] M. A. Alam and Y.-P. Wang. Influence function of multiple kernel canonical analysis to identify outlier in imaging genetics data. ArXiv e-prints.
[13] C. Alzate and J. A. K. Suykens. A regularized kernel CCA contrast function for ICA. Neural Networks, 21:170–181, 2008.
[14] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, third edition, 2003.
[15] N. C. Andreasen. Scale for the Assessment of Positive Symptoms (SAPS). University of Iowa, Iowa City, 1984.
[16] N. Andréasson, A. Evgrafov, and M. Patriksson. An Introduction to Continuous Optimization: Foundations and Fundamental Algorithms. Studentlitteratur, 2007.
[17] P. Arias, G. Randall, and G. Sapiro. Connecting the out-of-sample and pre-image problems in kernel methods. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[18] S. Arlot. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
[19] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[20] F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[21] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
[22] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
[23] G. Bakir, J. Weston, and B. Schölkopf. Learning to find pre-images. Advances in Neural Information Processing Systems, 16:449–456, 2004.
[24] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
[25] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, London, 2004.
[26] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, 1992. ACM Press.
[28] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[29] A. J. Branco, P. Filzmoser, C. Croux, and M. R. Oliveira. Robust canonical correlations: A comparative study. Computational Statistics, 20:203–229, 2005.
[30] L. Breiman and J. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.
[31] J. Bruin. newtest: command to compute new test. Online, February 2011.
[32] J. Cardoso. Dependence, correlation and Gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177–1203, 2003.
[33] B. Chang, U. Kruger, R. Kustra, and J. Zhang. Canonical correlation analysis based on Hilbert-Schmidt independence criterion and centered kernel target alignment. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013.
[34] X. Chen and H. Liu. An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping. Statistics in Biosciences, 2012.
[35] D. Lin, V. D. Calhoun, and Y.-P. Wang. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Medical Image Analysis, 18:891–902, 2014.
[36] J. Liu, J. Chen, S. Ehrlich, E. Walton, T. White, N. Perrone-Bizzozero, J. Bustillo, J. A. Turner, and V. D. Calhoun. Methylation patterns in whole blood correlate with symptoms in schizophrenia patients. Schizophrenia Bulletin, 40(4):769–776, 2014.
 [36] J. Liu, J. Chen, S. Ehrlich, E.Walton, T. White N. P.Bizzozero, J. Bustillo, J. A. Turner, and V. D. Calhoun. Methylation patterns in whole blood correlate with symptoms in schizophrenia patients. Schizophrenia Bulletin, 40(4):769–776, 2014.