
Integrating omics and MRI data with kernel-based tests and CNNs to identify rare genetic markers for Alzheimer's disease

by   Stefan Konigorski, et al.
Hasso Plattner Institute
Max Delbrück Center for Molecular Medicine

For precision medicine and personalized treatment, we need to identify predictive markers of disease. We focus on Alzheimer's disease (AD), where magnetic resonance imaging scans provide information about the disease status. By combining imaging with genome sequencing, we aim at identifying rare genetic markers associated with quantitative traits predicted from convolutional neural networks (CNNs), which traditionally have been derived manually by experts. Kernel-based tests are a powerful tool for associating sets of genetic variants, but how to optimally model rare genetic variants is still an open research question. We propose a generalized set of kernels that incorporate prior information from various annotations and multi-omics data. In the analysis of data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), we evaluate whether (i) CNNs yield precise and reliable brain traits, and (ii) the novel kernel-based tests can help to identify loci associated with AD. The results indicate that CNNs provide a fast, scalable and precise tool to derive quantitative AD traits and that new kernels integrating domain knowledge can yield higher power in association tests of very rare variants.



1 Introduction

In this study, we focus on Alzheimer’s disease (AD) as the outcome of interest. AD is a progressive neurodegenerative disease that is late-onset and sporadic in most cases, and it is the main cause of dementia in the elderly. As the cognitive symptoms emerge years after the appearance of brain atrophy and correlate closely with the structural changes, brain magnetic resonance imaging (MRI) scans provide a direct way to obtain informative quantitative traits, and fast automated approaches are necessary for large-scale studies. AD has a high estimated heritability of 74% gatz1997 and a prevalence of 4.4% in Europe lobo2000. However, the biological pathways underlying AD are not well understood and there is as yet no known cure. Hence, the identification of AD markers for early detection and as targets for treatment is important.

For the detection of causal genetic loci, recent sequencing efforts allow in-depth analyses of rare variants in large cohorts, and kernel-based gene-level tests have been proposed for the analysis wu2011rare; lee2012optimal; listgarten2013powerful; lippert2014greater; urrutia2015. These tests derive similarity scores between samples in the form of a kernel matrix, which is computed on a particular genomic locus or functional unit in the genome. Kernel-based variance-component test statistics are then derived that yield robust and powerful tests. Kernel functions provide a highly flexible way to model genetic variation. However, their full capabilities have not been leveraged, and existing approaches still provide suboptimal performance for the analysis of sequencing data konigorski2017comparison, where the overwhelming majority of genetic variation is extremely rare. Hence, extensions to the existing methods are warranted that leverage the full power of kernels to aggregate the signal of very rare single nucleotide variants (SNVs).

Our contributions in this paper are in two areas. First, we use a convolutional neural network (CNN) to derive quantitative traits from MRI scans, and evaluate if precise traits are obtained on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), where also traits obtained by the popular yet computationally expensive FreeSurfer software fischl2002 are available. Second, we propose novel kernels for association tests of rare genetic variants that incorporate prior biological knowledge from annotations and multi-omics measures. We perform association analyses between these novel kernels computed on sequencing data and CNN-derived traits to identify genetic loci associated with AD.

1.1 Related work

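The association of a set of $m$ genetic markers with a quantitative trait $Y$ with $n$ observations can be tested in a linear mixed model of the form

$$Y = X\beta + G_1 u_1 + G_2 u_2 + \varepsilon, \tag{1}$$

where $X$ is a covariate design matrix with fixed effects $\beta$, $u_1 \sim N(0, \sigma_1^2 I)$ are random effects of the SNVs in design matrix $G_1$ accounting for population stratification, $I$ is the identity matrix, $u_2 \sim N(0, \sigma_2^2 I)$ are random effects of the SNVs of interest in the design matrix $G_2$, and $\varepsilon \sim N(0, \sigma_e^2 I)$ are error terms. Hence

$$Y \sim N\left(X\beta,\; \sigma_1^2 G_1 G_1^T + \sigma_2^2 K + \sigma_e^2 I\right), \tag{2}$$

where the kernel matrix $K = G_2 G_2^T$ describes the similarity between individuals based on the SNVs of interest. The association of the SNVs (i.e. $H_0: u_2 = 0$ or $\sigma_2^2 = 0$ vs. $H_1: \sigma_2^2 > 0$) can be tested using score or likelihood ratio tests. Binary or count traits can be analyzed similarly.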
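As a minimal sketch of the variance-component score statistic used by SKAT-type tests, the snippet below regresses the trait on the covariates by ordinary least squares and computes Q = r' K r from the residuals r, using a linear kernel K = G G'. The function name `score_statistic` and the simulated data are hypothetical, and the sketch omits both the population-stratification term and the p-value computation, so it illustrates the idea rather than the implementation used in this study.

```python
import numpy as np

def score_statistic(y, X, G):
    """SKAT-style score statistic Q = r' K r with linear kernel K = G G'."""
    # Residuals of the null model y = X beta + error, fitted by least squares.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    K = G @ G.T                      # n x n similarity between individuals
    return float(r @ K @ r)

rng = np.random.default_rng(0)
n, m = 50, 8                         # hypothetical sample size and SNV count
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)    # minor-allele counts
y = rng.normal(size=n)
Q = score_statistic(y, X, G)
```

Because K is positive semi-definite, Q is non-negative and equals the sum of squared per-SNV score contributions; large values relative to its null distribution indicate association.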

Popular kernel-based tests include FaST-LMM-Set listgarten2013powerful; lippert2014greater, the sequence kernel association test (SKAT, wu2011rare) and optimal SKAT (SKAT-O, lee2012optimal), which are based on weighted linear kernels wu2011rare; listgarten2013powerful; lippert2014greater, or a linear combination of weighted linear and collapsing kernels lee2012optimal. Newer approaches urrutia2015 derive further data-adaptive combinations of linear, quadratic, IBS, and collapsing kernels. However, all these kernels provide suboptimal performance for the analysis of very rare genetic variants. Here, linear and quadratic kernels yield uninformative similarity measures (i.e., diagonal kernel matrices for singletons, which are variants with only one observed copy of the minor allele) and collapsing kernels often yield unspecific signals and aggregate noise.
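The limitation for singletons can be seen in a toy example. The sketch below uses a hypothetical genotype matrix in which every SNV is carried by exactly one individual; the linear kernel is then diagonal, while a collapsing (burden) kernel treats all carriers as equally similar regardless of which variant they carry.

```python
import numpy as np

# Hypothetical genotype matrix: 4 individuals x 4 SNVs, each SNV a singleton
# carried by a different individual (one copy of the minor allele).
G = np.eye(4)

K_linear = G @ G.T                    # linear kernel
off_diagonal = K_linear - np.diag(np.diag(K_linear))
is_diagonal = not off_diagonal.any()  # True: no similarity between individuals

# Collapsing kernel: individuals are compared via their total minor-allele burden.
burden = G.sum(axis=1, keepdims=True)
K_collapsing = burden @ burden.T      # every pair of carriers is equally similar
```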

2 New kernel-based tests for very rare genetic variants

To leverage the full power of kernels computing similarities in a high-dimensional Hilbert space, to which genetic variants are mapped through a potentially infinite-dimensional basis function $\phi$, we consider the more general linear mixed model

$$Y = X\beta + \phi_1(G_1) u_1 + \phi_2(G_2) u_2 + \varepsilon. \tag{3}$$

Here, $u_1 \sim N(0, \sigma_1^2 I)$ and $u_2 \sim N(0, \sigma_2^2 I)$ are normally distributed random effects. After integrating out $u_1$, $u_2$ and $\varepsilon$, it follows that $Y$ is normally distributed with covariance $\sigma_1^2 K_1 + \sigma_2^2 K_2 + \sigma_e^2 I$, where we have defined the kernel matrix $K_i = \phi_i(G_i)\phi_i(G_i)^T$. In this model, established tests lee2012optimal; listgarten2013powerful; lippert2014greater can be used to test the association between sets of SNVs and the phenotype; see Figure 1 for an illustration.

Figure 1: Illustration of association tests using kernel maps.

2.1 Examples of new kernels

Let $G$ be the $n \times m$ matrix of the SNVs of interest. We define a class of kernel matrices as

$$K = G W S W G^T, \tag{4}$$

where different instances are obtained by setting the weight and similarity matrices $W$, $S$ to the identity, to the matrices outlined below, or any combination of these. See Appendix A for details.

Incorporate annotations

Set the weight matrix $W$ from a numeric matrix encoding characteristics of the SNVs, such as the minor allele frequency (MAF), genomic position, or functional annotations from PolyPhen2 adzhubei2010, RegulomeDB boyle2012, or others. Set the elements $S_{jk}$ of the similarity matrix $S$ to (i) describe the similarity of SNVs $j$ and $k$ in terms of genomic closeness, or (ii) indicate whether SNVs $j$ and $k$ both have a (or the same) functional annotation.

Incorporate information from available omics data

Set the weight matrix $W$ from a matrix (i) containing $-\log$ p-values of association tests of the SNVs with omics data, e.g. gene expression levels of genes, or (ii) indicating for each of these p-values whether it is below $c$, where $c$ is a pre-specified constant. Set the elements $S_{jk}$ of the similarity matrix $S$ to be indicators of whether SNVs $j$ and $k$ both have a p-value below $c$.
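The following sketch shows how a kernel of this class could be assembled, assuming the form K = G W S W G' with a diagonal weight matrix W and an SNV-by-SNV similarity matrix S. The function name `generalized_kernel` and the toy inputs are hypothetical. With W and S left at the identity, the kernel reduces to the linear kernel; a non-trivial S lets carriers of different but related singletons become similar.

```python
import numpy as np

def generalized_kernel(G, w=None, S=None):
    """Return K = G W S W G' for genotypes G (n x m), weights w (m,), similarity S (m x m)."""
    m = G.shape[1]
    W = np.eye(m) if w is None else np.diag(np.asarray(w, dtype=float))
    S = np.eye(m) if S is None else np.asarray(S, dtype=float)
    return G @ W @ S @ W @ G.T

# Toy example: 3 individuals, 3 singleton SNVs.
G = np.eye(3)
# Hypothetical similarity: SNVs 0 and 1 share an annotation, SNV 2 does not.
S = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

K_linear = generalized_kernel(G)       # identity W and S: carriers look unrelated
K_annot = generalized_kernel(G, S=S)   # carriers of SNVs 0 and 1 become similar
```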

3 Application: analysis of ADNI study

Figure 2: Overview of 3D convolutional neural network.

In the application, we analyzed whole-genome-sequencing data, gene expression measures, MRI data as well as AD biomarkers in 556 participants from ADNI, which is a longitudinal study to detect biomarkers and risk factors for AD weiner2010; weiner2012.

In a first step, we designed a 3-dimensional CNN comprising seven convolutional layers followed by a max pooling layer and a final fully-connected layer to predict the volume of the third ventricle from the MRI scans (see Figure 2 for an illustration and see Appendix B, Figure C for details). To evaluate the approach, we chose the third ventricle, as we found that the ventricular regions were displayed with higher contrast and were presumably easier to identify. The CNN-predicted volume was then used as a quantitative trait in the following genetic association analyses, and evaluated against the predictions by the FreeSurfer software. Both models were trained on a dual Intel Xeon 6148 workstation equipped with an NVidia Titan-V graphics card.

In the main genetic association analysis we analyzed 17,013 SNVs (quality-controlled, biallelic, missingness < 5%, of any MAF) in 125 genes in the 1 Mbp region around the APOE gene on chromosome 19, similar to the study in nho2017, to investigate rare variants in a genomic region where several common variants have been associated with AD. We performed cross-sectional association tests of these 125 genes with 9 different AD traits (the peptides CSF Aβ, t-tau, and p-tau; the provided brain volumes of entorhinal cortex, hippocampus, medial temporal lobe, and ventricles; and the third ventricle volume predicted by FreeSurfer and by the CNN), adjusting for the covariates age, gender, education, ethnicity, and APOE4 allele. The association tests were performed based on different combinations of the newly proposed kernels (in Appendix A) and using standard SKAT and SKAT-O.

3.1 Results

The 125 genes contained on average 220 SNVs (, ). Of the 17,013 SNVs, 7,575 were singletons, 1,740 doubletons, and 12,337 SNVs had MAF . 24 participants had dementia, 338 had mild cognitive impairment, and 194 were cognitively normal (see Table C for descriptive statistics).

In an evaluation of the predicted volume of the third ventricle, CNN and FreeSurfer predictions showed a high correlation (Pearson , see Figure C). Relative to FreeSurfer, the CNN slightly overestimated small volumes and slightly underestimated large volumes, which we expect to disappear with larger training data. On the other hand, the CNN was much faster (1 second versus 16 hours per scan).
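For illustration, the agreement between two sets of volume predictions can be quantified as follows. This is a minimal sketch on made-up numbers, not ADNI data; the helper names `zscore` and `pearson_r` are hypothetical. It standardizes both prediction vectors and computes the Pearson correlation as the mean product of the standardized values.

```python
import numpy as np

def zscore(x):
    """Standardize to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def pearson_r(x, y):
    """Pearson correlation as the mean product of z-scores."""
    return float(np.mean(zscore(x) * zscore(y)))

# Hypothetical volumes (arbitrary units) for five scans.
freesurfer = [1.0, 1.4, 2.1, 2.9, 3.5]
cnn = [1.2, 1.5, 2.0, 2.7, 3.2]
r = pearson_r(freesurfer, cnn)
```

In this toy data the second method over-predicts the smallest volume and under-predicts the largest one, which lowers the slope of the fitted line but leaves the correlation high.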

In the main genetic association analyses, a first comparison showed that analyses using the CNN-predicted trait as outcome generally yielded similar and often smaller p-values compared to the FreeSurfer-predicted trait (Figures C-C). Preliminary comparisons of all new kernels indicated that the three kernels reported in Table 1 often yielded the smallest p-values in gene-based tests; hence they are reported here. Tests based on the new kernel 1 yielded consistently smaller or similar p-values for the top genes compared to SKAT and SKAT-O for 8 out of 9 traits (Table 1). More detailed comparisons (Figure C) indicated that, while the same genes were often identified with the smallest p-value by tests based on the new kernel 1 and by SKAT or SKAT-O, the new kernel 1 also yielded different candidate genes that would not have been identified by SKAT or SKAT-O (and vice versa). The new kernels 2 and 3 sometimes yielded larger but also sometimes much smaller p-values.

Using a Bonferroni correction (for the 125 tests) of the p-values of the new kernel-based tests, we identified 3 candidate genes for AD with adjusted p-values 0.007, 0.05, 0.07: PVR for CSF t-tau, SIX5 for entorhinal cortex and PVRL2 for hippocampus.
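The Bonferroni adjustment multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below applies it for 125 tests; the raw p-values are hypothetical and chosen only to show the arithmetic (for instance, a raw p-value of about 5.6e-5 corresponds to an adjusted value of 0.007).

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    n_tests = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw p-values; 125 gene-level tests per trait.
raw = [5.6e-5, 4.0e-4, 0.02]
adjusted = bonferroni(raw, n_tests=125)
```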

Trait                        SKAT-O  SKAT  New Kernel 1  New Kernel 2  New Kernel 3
CSF t-tau                    9.1     5.8   5.2           6.4           4.0
CSF p-tau                    1.5     9.3   8.5           1.2           1.9
CSF Aβ                       4.9     9.9   4.9           2.1           1.7
Entorhinal cortex            7.0     3.3   4.2           1.6           3.4
Hippocampus                  6.6     3.7   3.8           5.6           2.5
Med-temporal lobe            1.1     3.7   1.4           2.3           4.6
Ventricles                   1.5     9.3   1.0           9.6           5.5
FreeSurfer third ventricle   6.8     6.5   8.0           3.4           5.1
CNN third ventricle          1.9     5.9   5.9           1.7           2.0
Table 1: Minimum p-values from the 125 association tests (of the 125 genes) for each respective trait and test. Tests are based on SKAT, SKAT-O and the new kernels 1 (identity similarity matrix, MAF + omics weight matrix), 2 (genomic distance + omics similarity matrix, omics weight matrix), and 3 (PolyPhen2 + omics similarity matrix, PolyPhen2 + omics + MAF weight matrix), testing each trait and gene separately. For each trait (row), the smallest p-value is indicated in red.

4 Discussion

The empirical analyses indicated that (i) CNNs provide a precise, fast and scalable tool to derive quantitative traits from MRI scans and that (ii) new kernels integrating domain knowledge and omics data constitute a promising approach for the analysis of very rare variants. There is previous evidence for the association of the identified genes with AD Marioni2018; kwok2018; Hao2018; Beecham2014 to support our findings, and of note, the p-values are much smaller using the new kernels here compared to regular kernels nho2017. Limitations of the current analyses are that only few functional annotations are available for rare SNVs, and that only a basic control for population stratification was used. In the interpretation of the results regarding their biological relevance, it can be noted that the analyses were adjusted for the risk factor APOE4, so that the identified genes and SNVs represent markers with independent effects on AD. Future research can investigate kernels measuring the similarity between the bivariate allelic sequences directly, and data-adaptive optimal combinations of different kernels.



Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: Data collection and sharing of ADNI was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; BioClinica Inc; Biogen Idec Inc; Bristol-Myers Squibb Company; Eisai Inc; Elan Pharmaceuticals Inc; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech Inc; GE Healthcare; Innogenetics N.V.; IXICO Ltd; Janssen Alzheimer Immunotherapy Research & Development LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Medpace Inc; Merck & Co Inc; Meso Scale Diagnostics LLC; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc; Piramal Imaging; Servier; Synarc Inc; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Samples from the National Cell Repository for AD (NCRAD), which receives government support under a cooperative agreement grant (U24 AG21886) awarded by the National Institute on Aging (NIA), were used in this study. Funding for the WGS was provided by the Alzheimer’s Association and the Brin Wojcicki Foundation.

Supplementary material

A Details on new kernels

Let $n$ be the number of observations and $m$ the number of SNVs of interest. Define the kernel $K$ as in equation (4) by setting the weight matrix $W$ and the similarity matrix $S$ to the identity matrix $I$ or to one of the following.

Consider the weight matrices $W = \mathrm{diag}(w)$, where $w$ is

  1. a vector with entries $w_j = f(\mathrm{MAF}_j)$, where $f$ is the probability density function of the beta distribution with parameters 1 and 25, and $\mathrm{MAF}_j$ is the minor allele frequency of SNV $j$.

  2. a vector where each entry is an indicator of whether SNV $j$ has a functional annotation in, for example, the PolyPhen2 database.

  3. a vector where each entry is a numeric encoding of the functional annotation of SNV $j$, for example, in the PolyPhen2 database.

  4. a vector where each entry is the $-\log$ p-value from a hypothesis test of the association between SNV $j$ and a variable $Z$ providing relevant information about its biological function, e.g. where $Z$ is the gene expression of the gene in which the SNV lies.

  5. a vector where each entry is the sum of 1 and an indicator variable of whether SNV $j$ is associated with a variable $Z$ providing relevant information about its biological function as in the bullet point above, e.g. evaluating whether the p-value from a hypothesis test of the association between SNV $j$ and $Z$ is smaller than 0.05.

  6. the Hadamard product of the vectors in bullet points (1 and 4) or (1 and 5).

  7. the sum of the vectors in bullet points (2 or 3) and (4 or 5).

  8. the sum of the vectors in bullet points (2 or 3) and 6.
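A few of the weight options listed above can be sketched as follows, using made-up allele frequencies and p-values. The Beta(1, 25) density has the closed form f(x) = 25 (1 - x)^24, so no statistics library is needed. The variable names, the use of the base-10 logarithm, and the input values are assumptions for illustration.

```python
import numpy as np

def beta_1_25_pdf(maf):
    """Density of the Beta(1, 25) distribution: f(x) = 25 * (1 - x)**24."""
    maf = np.asarray(maf, dtype=float)
    return 25.0 * (1.0 - maf) ** 24

maf = np.array([0.001, 0.01, 0.2])       # hypothetical minor allele frequencies
p_omics = np.array([0.001, 0.5, 0.03])   # hypothetical SNV-expression p-values

w_maf = beta_1_25_pdf(maf)               # option 1: up-weights rare variants
w_logp = -np.log10(p_omics)              # option 4 (base-10 log is an assumption)
w_ind = 1.0 + (p_omics < 0.05)           # option 5: 1 + indicator(p < 0.05)
w_6a = w_maf * w_logp                    # option 6: Hadamard product of 1 and 4
w_6b = w_maf * w_ind                     # option 6: Hadamard product of 1 and 5
```

Adding 1 to the indicator in option 5 keeps the weight of unassociated SNVs at 1 rather than 0, so they still contribute to the kernel.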

Consider the matrices $S$ describing the similarity of SNVs, where $S$ is one of the following:

  1. the similarity of SNVs $j$ and $k$ in terms of genomic closeness, with entries $S_{jk}$ computed from the genomic distance $d_{jk}$ between SNVs $j$ and $k$ in base pairs.

  2. the indicator of whether SNVs $j$ and $k$ both have a functional annotation.

  3. the indicator of whether SNVs $j$ and $k$ have the same functional annotation.

  4. the indicator of whether SNVs $j$ and $k$ both have a p-value below a specified cutoff value $c$, where the p-value of SNV $j$ is from an association test with a variable $Z$ that provides relevant information about its biological function, e.g. where $Z$ is the gene expression of the gene in which the SNV lies.

  5. the product of the matrices in bullet points (1 and (2 or 3)), (1 and 4), or (4 and (2 or 3)).

  6. the product of the matrices in bullet points 1 and (2 or 3) and 4.
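The indicator-based similarity matrices above can be sketched with a few lines of NumPy. The helper names, annotation labels, p-values and cutoff are hypothetical, and the combination in the last line takes the product element-wise, which is an assumption about how the product of matrices is meant. The distance-based similarity is omitted here because its functional form is not restated in this sketch.

```python
import numpy as np

def both_indicator(flag):
    """S[j, k] = 1 if SNVs j and k both satisfy the flag, else 0."""
    flag = np.asarray(flag, dtype=float)
    return np.outer(flag, flag)

def same_label(labels):
    """S[j, k] = 1 if SNVs j and k carry the same annotation label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

annotated = [True, True, False]              # hypothetical: SNV has an annotation
labels = ["damaging", "benign", "none"]      # hypothetical annotation labels
p_omics = np.array([0.001, 0.03, 0.5])       # hypothetical p-values
c = 0.05                                     # cutoff, chosen for illustration

S_annot = both_indicator(annotated)          # option 2
S_same = same_label(labels)                  # option 3
S_omics = both_indicator(p_omics < c)        # option 4
S_combined = S_annot * S_omics               # option 5, element-wise (assumption)
```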

B Details on convolutional neural networks

Model architecture

The model architecture is illustrated in Figure 2, and in more detail in Figure C. We designed a CNN made of a sequence of seven convolutional layers followed by a max pooling layer and a fully-connected layer. We used two types of convolutional layers: regular and down-convolution. Regular convolutional layers comprised a 3 x 3 x 3 convolutional operation with 1 x 1 x 1 strides. The down-convolutional layers comprised a 2 x 2 x 2 convolutional operation with 2 x 2 x 2 strides. Each convolutional layer was followed by a Rectified Linear Unit non-linearity. After the last convolutional layer, we used a max pooling layer with a filter size of 2 x 2 x 2. Subsequently, this layer was converted into a fully-connected layer, followed by the output layer containing a single node with a linear activation function.

Model implementation

The MRI scans were standardized to the spatial resolution of 1 x 1 x 1 millimeters and the size of 256 x 256 x 256 voxels. Additionally, for computational efficiency, they were cropped and down-sampled to 96 x 109 x 96 voxels.
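To give a feel for how the strided layers shrink the input volume, the sketch below tracks the spatial size of a 96 x 109 x 96 input through seven convolutional layers and the final pooling layer. The ordering of regular and down-convolution layers and the use of 'same' padding are assumptions made for this illustration; the actual arrangement is the one shown in the architecture figure.

```python
import math

def conv_same(shape, stride):
    """Output size of a 'same'-padded convolution: ceil(n / stride) per axis."""
    return tuple(math.ceil(n / stride) for n in shape)

def pool_valid(shape, size):
    """Output size of an unpadded pooling layer with stride equal to its size."""
    return tuple((n - size) // size + 1 for n in shape)

# Hypothetical ordering of the seven layers: "regular" (stride 1), "down" (stride 2).
layers = ["regular", "down", "regular", "down", "regular", "down", "regular"]

shape = (96, 109, 96)                    # cropped, down-sampled input volume
for kind in layers:
    shape = conv_same(shape, stride=1 if kind == "regular" else 2)
shape = pool_valid(shape, size=2)        # final 2 x 2 x 2 max pooling
```

Under these assumptions, each down-convolution halves every axis (rounding up), so three of them followed by the pooling layer reduce each dimension by roughly a factor of 16 before the fully-connected layer.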

The model was trained on 2100 MRI scans (from 411 subjects) for 200 epochs with the loss function set to the mean absolute error, using the Adaptive Moment Estimation (Adam) optimizer [kingma2014], a learning rate of and a 3D spatial dropout regularization of 0.9.
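The mean absolute error used as the loss averages the absolute differences between predicted and reference volumes. A minimal sketch with made-up values (the function name and numbers are illustrative only):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical reference and predicted volumes.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Compared with the mean squared error, this loss grows linearly with the size of an error, so single scans with large deviations influence training less strongly.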

Hyperparameter tuning was carried out on a validation dataset comprising 550 scans of 129 subjects that all had MRI data but did not have genetic data available so that they could not be included in the main analysis. The final evaluation was done on the test set including the 556 subjects of the main analysis that had all MRI, genetic, and gene expression data available. The model performance on the test set is visualized in Figure C.

Computational comparison with FreeSurfer

Both models were trained on a dual 20-core Intel Xeon 6148 workstation with 768 GB RAM equipped with an NVidia Titan-V graphics card. CNN computations made use of GPU optimization, taking 1 second for the prediction of the volume of the third ventricle per MRI scan. FreeSurfer, which did not utilize the GPU, took 16 hours per MRI scan.

C Supplementary tables and figures

Descriptive statistics of the 556 individuals in the analyzed sample from the ADNI study. Shown are absolute frequencies for categorical variables, and mean (standard deviation) for quantitative measures.

Measures                       Descriptive Statistics
Sample size                    556
Age, years                     72.9 (7.0)
Gender
  female                       250
  male                         306
Ethnic group
  hispanic/latino              10
  not hispanic/latino          545
  unknown                      1
Education length, years        16.1 (2.8)
APOE4 allele
  homozygous minor allele      331
  heterozygous                 186
  homozygous major allele      39
Cognitive status
  cognitively normal           194
  mild cognitive impairment    338
  dementia                     24

Graphical visualization of the 3D convolutional neural network model in Keras. Shown are input and output of the different layers, and the respective voxels and channels. For example, the input volume had 96 x 109 x 96 voxels and 1 channel. As all computations were done in one batch, the batch size was not specified (noted as "None" in the graph).

Scatterplot of the volume of the third ventricle prediction by FreeSurfer (x axis) and the CNN (y axis). All predictions are represented as scores. In addition, the diagonal is printed for a comparison of both predictions.

Scatterplot of the -log p-values from association tests of the 125 genes with the volume of the third ventricle predicted by FreeSurfer (x axis) and CNN (y axis) as outcome, using SKAT-O for testing. In addition, the diagonal is printed for a comparison of both tests.

Scatterplot of the -log p-values from association tests of the 125 genes with the volume of the third ventricle predicted by FreeSurfer (x axis) and CNN (y axis), using the new kernel-based test 1. In addition, the diagonal is printed for a comparison of both tests.

Scatterplot of the -log p-values from association tests of the 125 genes using SKAT (x axis) and the new kernel-based test 1 (y axis), for each of the 9 traits in separate panels. In addition, the diagonal is printed for a comparison of both tests.