In this study, we focus on Alzheimer’s disease (AD) as outcome of interest, which is a progressive neurodegenerative disease, appears late-onset and sporadic in most cases, and is the main cause of dementia in the elderly. As the cognitive symptoms emerge years after the appearance of brain atrophy and exhibit close correlation with the structural changes, brain magnetic resonance imaging (MRI) scans provide a direct way to obtain informative quantitative traits, and fast automated approaches are necessary for large-scale studies. AD has a high estimated heritability of 74%gatz1997 and a prevalence of 4.4% in Europe lobo2000. However, the biological pathways underlying AD have not been well-understood and there is yet no known cure. Hence, the identification of AD markers for early detection and as targets for treatment is important.
For the detection of causal genetic loci, recent sequencing efforts allow in-depth analyses of rare variants in large cohorts, and kernel-based gene-level tests have been proposed for the analysis wu2011rare; lee2012optimal; listgarten2013powerful; lippert2014greater; urrutia2015
. They derive similarity scores between samples in the form of a kernel matrix which is computed on a particular genomic locus or functional unit in the genome. Then, kernel-based variance-component test statistics are derived that yield robust and powerful tests. Kernel functions provide a highly flexible way to model genetic variation. However, their full capabilities have not been leveraged and existing approaches still provide suboptimal performance for the analysis of sequencing datakonigorski2017comparison, where the overwhelming majority of genetic variation is extremely rare. Hence extensions to the existing methods are warranted that leverage the full power of kernels to aggregate the signal of very rare single nucleotide variants (SNVs).
Our contributions in this paper are in two areas. First, we use a convolutional neural network (CNN) to derive quantitative traits from MRI scans, and evaluate if precise traits are obtained on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), where also traits obtained by the popular yet computationally expensive FreeSurfer software fischl2002 are available. Second, we propose novel kernels for association tests of rare genetic variants that incorporate prior biological knowledge from annotations and multi-omics measures. We perform association analyses between these novel kernels computed on sequencing data and CNN-derived traits to identify genetic loci associated with AD.
1.1 Related work
The association of a set of genetic markers with a quantitative trait with
observations can be tested in a linear mixed model of the form
where is a covariate design matrix with fixed effects , are random effects of the SNVs in design matrix accounting for population stratification,
is the identity matrix,are random effects of the SNVs of interest in the design matrix , and are error terms. Hence
where the kernel matrix describes the similarity between individuals based on the SNVs of interest. The association of the SNVs (i.e. or vs. ) can be tested using score or likelihood ratio tests. Binary or count traits can be analyzed similarly.
Popular kernel-based tests include FaST-LMM-Set listgarten2013powerful; lippert2014greater, the sequence kernel association test (SKAT, wu2011rare) and optimal SKAT (SKAT-O, lee2012optimal), which are based on weighted linear kernels wu2011rare; listgarten2013powerful; lippert2014greater, or a linear combination of weighted linear and collapsing kernels lee2012optimal. Newer approaches urrutia2015 derive further data-adaptive combinations of linear, quadratic, IBS, and collapsing kernels. However, all these kernels provide suboptimal performance for the analysis of very rare genetic variants. Here, linear and quadratic kernels yield uninformative similarity measures (i.e., diagonal kernel matrices for singletons, which are variants with only one observed copy of the minor allele) and collapsing kernels often yield unspecific signals and aggregate noise.
2 New kernel-based tests for very rare genetic variants
To leverage the full power of kernels computing similarities in high-dimensional Hilbert space, whereto genetic variants are mapped through a potentially infinite-dimensional basis function , we consider the more general linear mixed model
are normally distributed random effects. After integrating out, and , it follows that is normally distributed with covariance , where we have defined the kernel matrix . In this model, established tests lee2012optimal; listgarten2013powerful; lippert2014greater can be used to test the association between sets of SNVs and the phenotype, see Figure 1 for an illustration.
2.1 Examples of new kernels
Let be the matrix of the SNVs of interest. We define a class of kernel matrices as
where different instances are obtained by setting the weight and similarity matrices , to the identity, to the matrices outlined below, or any combination of these. See Appendix A for details.
Set where is the numeric matrix encoding characteristics of the SNVs, such as the minor allele frequency (MAF), genomic position, or functional annotations from PolyPhen2 adzhubei2010, RegulomeDB boyle2012, or others. Set the elements of to (i) describe the similarity of SNVs and in terms of genomic closeness, or (ii) indicate whether SNVs and have a (or the same) functional annotation.
Incorporate information from available omics data
Set where is the matrix (i) containing -log p-values of association tests of the SNVs with omics data e.g. gene expression levels of genes or (ii) indicating for each of these p-values if they are , where is pre-specified constant. Set the elements of to be indicators whether SNVs and both have p-value .
3 Application: analysis of ADNI study
In the application, we analyzed whole-genome-sequencing data, gene expression measures, MRI data as well as AD biomarkers in
participants from ADNI, which is a longitudinal study to detect biomarkers and risk factors for ADweiner2010; weiner2012.
In a first step, we designed a 3-dimensional CNN comprising seven convolutional layers followed by a max pooling layer and a final fully-connected layer to predict the volume of the 3ventricle from the MRI scans (see Figure 2 for an illustration and see Appendix B, Figure C for details). To evaluate the approach, we chose the 3 ventricle, as we found that the ventricular regions were displayed with a higher contrast and presumably easier to identify. The CNN predicted volume was then used as a quantitative trait in the following genetic association analyses, and evaluated against the predictions by the FreeSurfer software. Both models where trained on a dual Intel Xeon 6148 workstation equipped with an NVidia Titan-V graphics card.
In the main genetic association analysis we analyzed 17,013 (quality-controlled, biallelic, missingness 5%, of any MAF) SNVs in 125 genes in the 1Mbp region around the APOE gene on chromosome 19, similar to the study in nho2017, to investigate rare variants in a genomic region where several common variants have been associated with AD. We performed cross-sectional association tests of these 125 genes with 9 different AD traits (peptides CSF A, t-tau, p-tau, and the provided brain volumes of entorhinal cortex, hippocampus, medial temporal lobe, ventricles, 3 ventricle predicted by FreeSurfer, and 3 ventricle predicted from the CNN) adjusting for the covariates age, gender, education, ethnicity, and APOE4 allele. The association tests were performed based on different combinations of the new proposed kernels (in Appendix A) and using standard SKAT and SKAT-O.
The 125 genes contained on average 220 SNVs (, ). Of the 17,013 SNVs, 7575 were singletons, 1740 doubletons, and 12,337 SNVs had MAF . 24 participants had dementia, 338 mild cognitive impairment, 194 were cognitive normal (see Table C
for descriptive statistics).
In an evaluation of the predicted volume of the 3 ventricle, CNN and FreeSurfer predictions showed a high correlation (Pearson , see Figure C). For small/large volumes, compared to FreeSurfer, CNN slightly over-/underestimated the volume, which we expect to disappear with larger training data. On the other hand, CNN was much faster (1 second versus 16 hours per scan).
In the main genetic association analyses, a first comparison showed that analyses using the CNN-predicted trait as outcome generally yielded similar and often smaller p-values compared to the FreeSurfer-predicted trait (Figures C-C). Preliminary comparisons of all new kernels indicated that the three kernels reported in Table 1 yielded often the smallest p-values in gene-based tests, hence they are reported here. Tests based on the new kernel 1 yielded consistently smaller or similar p-values for the top genes compared to SKAT and SKAT-O for 8 out of 9 traits (Table 1). More detailed comparisons (Figure C) indicated that while often the same genes were identified with smallest p-value by tests based on the new kernel 1 and by SKAT or SKAT-O, the new kernel 1 also yielded different candidate genes that would not have been identified by SKAT or SKAT-O (and vice versa). The new kernels 2 and 3 yielded sometimes larger but also sometimes much smaller p-values.
Using a Bonferroni correction (for the 125 tests) of the p-values of the new kernel-based tests, we identified 3 candidate genes for AD with adjusted p-values 0.007, 0.05, 0.07: PVR for CSF t-tau, SIX5 for entorhinal cortex and PVRL2 for hippocampus.
|Trait||SKAT-O||SKAT||New Kernel 1||New Kernel 2||New Kernel 3|
|FreeS 3 Ventricle||6.8||6.5||8.0||3.4||5.1|
|CNN 3 Ventricle||1.9||5.9||5.9||1.7||2.0|
The empirical analyses indicated that (i) CNNs provide a precise, fast and scalable tool to derive quantitative traits from MRI scans and that (ii) new kernels integrating domain knowledge and omics data constitute a promising approach for the analysis of very rare variants. There is previous evidence for the association of the identified genes with AD Marioni2018; kwok2018; Hao2018; Beecham2014 to support our findings, and of note, the p-values are much smaller using the new kernels here compared to regular kernels nho2017. Limitations of the current analyses are that only few functional annotations are available for rare SNVs, and that only a basic control for population stratification was used. In the interpretation of the results regarding their biological relevance, it can be noted that the analyses were adjusted for the risk factor APOE4, so that the identified genes and SNVs represent markers with independent effects on AD. Future research can investigate kernels measuring the similarity between the bivariate allelic sequences directly, and data-adaptive optimal combinations of different kernels.
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database adni.loni.usc.edu. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. Data collection and sharing of ADNI was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; BioClinica Inc; Biogen Idec Inc; Bristol-Myers Squibb Company; Eisai Inc; Elan Pharmaceuticals Inc; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech Inc; GE Healthcare; Innogenetics N.V.; IXICO Ltd; Janssen Alzheimer Immunotherapy Research & Development LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Medpace Inc; Merck & Co Inc; Meso Scale Diagnostics LLC; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc; Piramal Imaging; Servier; Synarc Inc; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (http://www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. Samples from the National Cell Repository for AD (NCRAD), which receives government support under a cooperative agreement grant (U24 AG21886) awarded by the National Institute on Aging (AIG), were used in this study. Funding for the WGS was provided by the Alzheimer’s Association and the Brin Wojcicki Foundation.
A Details on new kernels
Let be the number of observations and the number of SNVs of interest. Define the kernel as in equation (4) by setting the matrices and to or the following.
Consider the weight matrices where is
a vector where each entry is an indicator whether SNV has a functional annotation in, for example, the PolyPhen2 database.
a vector where each entry is a numeric encoding of the functional annotation of SNV , for example, in the PolyPhen2 database:
a vector where each entry is the -log p-value from a hypothesis test of the association between SNV and a variable providing relevant information about its biological function, e.g. where is the gene expression of the gene in which the SNV lies.
a vector where each entry is the sum of 1 and an indicator variable whether SNV is associated with a variable providing relevant information about its biological function as in the bullet point above, e.g. evaluating whether the p-value from a hypothesis test of the association between SNV with a variable is smaller than 0.05.
the Hadamard product of the vectors in bullet points (1 and 4) or (1 and 5).
the sum of the vectors in bullet points (2 or 3) and (4 or 5).
the sum of the vectors in bullet points (2 or 3) and 6.
Consider the matrices describing the similarity of SNVs where
similarity of SNVs and in terms of genomic closeness:
where is the genomic distance between SNVs and in base pairs.
indicator whether SNVs and both have a functional annotation:
indicator whether SNVs and have the same functional annotation.
indicator whether SNVs and have p-value specified cutoff value .
where the p-value of SNV is from an association test with a variable that provides relevant information about its biological function, e.g. where is the gene expression of the gene in which the SNV lies.
is the product of the matrices in bullet points (1 and (2 or 3)), (1 and 4), or (4 and (2 or 3))
is the product of the matrices in bullet points 1 and (2 or 3) and 4.
B Details on convolutional neural networks
. We designed a CNN made of a sequence of seven convolutional layers followed by a max pooling layer and a fully-connected layer. We used two types of convolutional layers: Regular and Down-Convolution. Regular convolutional layers comprised a 3 x 3 x 3 convolutional operation with 1 x 1 x 1 strides. The down-convolutional layers comprised a 2 x 2 x 2 convolutional operation with 2 x 2 x 2 strides. Each convolutional layer was followed by a Rectified Linear Unit non-linearity[nair2010]
. After the last convolutional layer, we used a max pooling layer with a filter size of 2 x 2 x 2. Subsequently, this layer was converted into a fully-connected layer, followed by the output layer containing a single node with a linear activation function.
The MRI scans were standardized to the spatial resolution of 1 x 1 x 1 millimeters and the size of 256 x 256 x 256 voxels. Additionally, for computational efficiency, they were cropped and down-sampled to 96 x 109 x 96 voxels.
Hyperparameter tuning was carried out on a validation dataset comprising 550 scans of 129 subjects that all had MRI data but did not have genetic data available so that they could not be included in the main analysis. The final evaluation was done on the test set including the 556 subjects of the main analysis that had all MRI, genetic, and gene expression data available. The model performance on the test set is visualized in Figure C.
Computational comparison with FreeSurfer
Both models where trained on a dual 20 core Intel Xeon 6148 workstation with 768GB RAM equipped with an NVidia Titan-V graphics card. CNN computations made use of GPU optimization, taking 1 second for the prediction of the volume of the third ventricle per MRI scan. FreeSurfer, which did not utilize the GPU, took 16 hours per MRI scan.
C Supplementary tables and figures
[!h] Measures Descriptive Statistics Sample size 556 Age, years 72.9 (7.0) Gender female 250 male 306 Ethnic group hispanic/latino 10 not hispanic/latino 545 unknown 1 Education length, years 16.1 (2.8) APOE 4 allele homozygot minor allele 331 heterozygot 186 homozygot major allele 39 Cognitive status cognitive normal 194 mild cognitive impairment 338 dementia 24