1 Introduction
Alzheimer’s Disease (AD) is a severe and growing worldwide health problem. Many techniques have been developed to investigate AD, such as magnetic resonance imaging (MRI), a powerful neuroimaging modality for identifying preclinical and clinical AD patients, and genome-wide association studies (GWAS). GWAS [4] are achieving great success in finding single nucleotide polymorphisms (SNPs) associated with AD. For example, APOE is a highly prevalent AD risk gene, and each copy of the adverse variant is associated with a 3-fold increase in AD risk. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) collects neuroimaging and genomic data from elderly individuals across North America. However, processing and integrating genetic data across different institutions is challenging. Each institution may wish to collaborate with others, but legal or ethical regulations often restrict access to individual data in order to protect data privacy.
Some studies, such as ADNI, share genomic data publicly under certain conditions, but more commonly, each participating institution may be required to keep its genomic data private, so collecting all data together may not be feasible. To deal with this challenge, we propose a novel distributed framework, termed Local Query Model (LQM), to perform the Lasso regression analysis in a distributed manner, learning genetic risk factors without accessing others’ data. However, applying LQM for model selection, such as stability selection, can be very time-consuming on a large-scale data set. To speed up the learning process, we propose a family of distributed safe screening rules (DSAFE and DEDPP) to identify irrelevant features and remove them from the optimization without sacrificing accuracy. Next, LQM is employed on the reduced data matrix to train the model so that each institution obtains top risk genes for AD by stability selection on the learnt model without revealing its own data set. We evaluate our method on the ADNI GWAS data, which contains 809 subjects with 5,906,152 SNP features, involving an 80 GB data matrix with approximately 42 billion nonzero elements, distributed across three research institutions. Empirical evaluations demonstrate a 66-fold speedup gained by DEDPP, compared to LQM without DEDPP. Stability selection results show that the proposed framework ranks APOE as the top risk SNP among all features.
2 Data processing
2.1 ADNI GWAS data
The ADNI GWAS data contains genotype information for each of the 809 ADNI participants, who consist of 128 patients with AD, 415 with mild cognitive impairment (MCI), and 266 cognitively normal (CN) subjects. SNPs at approximately 5.9 million specific loci are recorded for each participant. We encode SNPs with the coding scheme in [7] and apply Minor Allele Frequency (MAF) and Genotype Quality (GQ) as two quality control criteria to select high-quality SNP features; see [11] for details.
2.2 Data partition
Lasso [9] is a widely used regression technique for finding sparse representations of data, or predictive models. Standard Lasso takes the form of
$$\min_{x} \; \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1 \qquad (1)$$
where $A$ is the genomic data set distributed across different institutions, $y$ is the response vector (e.g., hippocampus volume or disease status), $x$ is the sparse representation shared across all institutions, and $\lambda$ is a positive regularization parameter.
Suppose that we have $m$ participating institutions. For the $i$th institution, we denote its data set by $(A_i, y_i)$, where $A_i \in \mathbb{R}^{n_i \times p}$, $n_i$ is the number of subjects in this institution, $p$ is the number of features, $y_i \in \mathbb{R}^{n_i}$ is the corresponding response vector, and $n = \sum_{i=1}^{m} n_i$. We assume $p$ is the same across all institutions. Our goal is to apply Lasso to rank risk SNPs of AD based on the distributed data sets $(A_i, y_i)$, $i = 1, \dots, m$.
3 Methods
Fig. 1 illustrates the general idea of our distributed framework. Suppose that each institution maintains the ADNI genome-wide data for a few subjects. We first apply the distributed Lasso screening rule to pre-identify inactive features and remove them from the training phase. Next, we employ the LQM on the reduced data matrices to perform collaborative analyses across different institutions. Finally, each institution obtains the learnt model and performs stability selection to rank the SNPs that may collectively affect AD. The process of stability selection is to count the frequency of nonzero entries in the solution vectors and select the most frequent ones as the top risk genes for AD. The whole learning procedure results in the same model for all institutions, and preserves data privacy at each of them.
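The counting step of stability selection described above can be sketched as follows. This is a minimal illustration, assuming the solution vectors from the repeated Lasso fits have already been collected as rows of an array; the function names `selection_frequency` and `top_features` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def selection_frequency(solutions):
    """Fraction of solution vectors in which each feature is nonzero."""
    solutions = np.asarray(solutions)
    return np.mean(solutions != 0, axis=0)

def top_features(solutions, k):
    """Indices and frequencies of the k most frequently selected features."""
    freq = selection_frequency(solutions)
    order = np.argsort(-freq, kind="stable")  # most frequent first, ties by index
    return order[:k], freq[order[:k]]

# Toy example: three solution vectors over four features (illustrative values).
solutions = [
    [0.5, 0.0, 0.0, -0.1],
    [0.3, 0.0, 0.2, 0.0],
    [0.7, 0.0, 0.0, -0.4],
]
idx, freq = top_features(solutions, k=2)
```

In the toy data, feature 0 is nonzero in every solution and feature 3 in two of three, so they are ranked first and second.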
3.1 Local Query Model
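We apply a proximal gradient descent algorithm, the Iterative Shrinkage/Thresholding Algorithm (ISTA) [2], to solve problem (1). We define $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ as the least squares loss function. The general updating rule of ISTA is:
$$x^{k+1} = \Gamma_{\lambda \eta_k}\left(x^k - \eta_k \nabla f(x^k)\right) \qquad (2)$$
where $k$ is the iteration number, $\eta_k$ is an appropriate step size, and $\Gamma$ is the soft thresholding operator [8] defined by $[\Gamma_{\alpha}(x)]_j = \mathrm{sign}(x_j)\,(|x_j| - \alpha)_+$.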
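A minimal single-machine sketch of the ISTA update may help make the rule concrete. It assumes the standard Lasso objective $\frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$ and a constant step size equal to the reciprocal of the largest eigenvalue of $A^T A$; the step-size choice, the function names, and the synthetic data are assumptions made for illustration, not details given in the paper.

```python
import numpy as np

def soft_threshold(v, alpha):
    """Elementwise soft thresholding: sign(v) * max(|v| - alpha, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient steps."""
    # Assumption: constant step 1 / L, with L the squared spectral norm of A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)  # gradient of the least squares loss
        x = soft_threshold(x - step * grad, lam * step)
    return x

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
x_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = A @ x_true
x_hat = ista(A, y, lam=0.1)
```

When the regularization parameter is at least $\|A^T y\|_\infty$, the iterates never leave zero, which is the property that the screening rules later exploit.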
In view of (2), to solve (1), we need to compute the gradient of the loss function $\nabla f(x)$, which equals $A^T(Ax - y)$. However, because the data set is distributed across different institutions, we cannot compute the gradient directly. To address this challenge, we propose a Local Query Model to learn the model across multiple institutions without compromising data privacy.
In our study, each institution maintains its own data set to preserve its privacy. To avoid collecting all data matrices together, we can rewrite problem (1) as the following equivalent formulation: $\min_{x} \sum_{i=1}^{m} f_i(x; A_i, y_i) + \lambda\|x\|_1$, where $f_i(x; A_i, y_i) = \frac{1}{2}\|A_i x - y_i\|_2^2$ is the least squares loss.
The key of LQM lies in the following decomposition: $\nabla f = A^T(Ax - y) = \sum_{i=1}^{m} A_i^T(A_i x - y_i) = \sum_{i=1}^{m} \nabla f_i$. We use “local institution” to denote each participating institution and “global center” to denote the place where intermediate results are aggregated. The $i$th local institution computes $\nabla f_i = A_i^T(A_i x - y_i)$. Then, each local institution sends its partial gradient of the loss function to the global center. After gathering all the gradient information, the global center computes the exact gradient with respect to x by adding all $\nabla f_i$ together and sends the updated gradient $\nabla f$ back to all the local institutions to compute $x$.
The master (global center) only serves as the computation center and does not store any data sets. Although the master receives $\nabla f_i$, it cannot reconstruct $A_i$ and $y_i$. Let $\nabla f_i^k$ denote the $k$th iteration of $\nabla f_i$. Suppose $x$ is initialized to zero, so that $\nabla f_i^1 = -A_i^T y_i$. We get $A_i^T A_i x^k$ by $\nabla f_i^k - \nabla f_i^1$, but $A_i$ cannot be reconstructed since updating and storing $x$ only happens in the workers (local institutions). As a result, LQM can properly maintain data privacy for all the institutions.
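The gradient decomposition behind LQM can be checked numerically with a short sketch. It assumes three institutions holding row blocks of a common feature space and simulates the exchange in a single process; the function names and random data are illustrative assumptions, and the pooled matrix is built only to verify the result.

```python
import numpy as np

def local_gradient(A_i, y_i, x):
    """Partial gradient computed inside one institution; A_i and y_i stay local."""
    return A_i.T @ (A_i @ x - y_i)

def global_gradient(partial_gradients):
    """The global center only sums the p-dimensional partial gradients."""
    return np.sum(partial_gradients, axis=0)

rng = np.random.default_rng(1)
p = 6
sizes = [5, 4, 3]  # number of subjects held by each institution (illustrative)
blocks = [(rng.standard_normal((n_i, p)), rng.standard_normal(n_i)) for n_i in sizes]
x = rng.standard_normal(p)

grad_lqm = global_gradient([local_gradient(A_i, y_i, x) for A_i, y_i in blocks])

# Reference on pooled data, which a real deployment would never assemble.
A = np.vstack([A_i for A_i, _ in blocks])
y = np.concatenate([y_i for _, y_i in blocks])
grad_pooled = A.T @ (A @ x - y)
```

Each message sent to the center has length $p$ regardless of how many subjects an institution holds, so the communication cost does not grow with the local sample size.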
3.2 Safe Screening Rules for Lasso
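The dual problem of Lasso (1) can be formulated as:
$$\sup_{\theta} \left\{ \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \;:\; \left|[A]_j^T \theta\right| \le 1,\ j = 1, \dots, p \right\} \qquad (3)$$
where $\theta$ is the dual variable and $[A]_j$ denotes the $j$th column of $A$. Let $\theta^*(\lambda)$ be the optimal solution of problem (3) and let $x^*(\lambda)$ denote the optimal solution of problem (1). The Karush–Kuhn–Tucker (KKT) conditions are given by:
$$y = A x^*(\lambda) + \lambda \theta^*(\lambda) \qquad (4)$$
$$[A]_j^T \theta^*(\lambda) \in \begin{cases} \{\mathrm{sign}([x^*(\lambda)]_j)\}, & \text{if } [x^*(\lambda)]_j \neq 0 \\ [-1, 1], & \text{if } [x^*(\lambda)]_j = 0 \end{cases} \qquad (5)$$
where $[x^*(\lambda)]_j$ denotes the $j$th component of $x^*(\lambda)$. In view of the KKT condition in equation (5), the following rule holds: $\left|[A]_j^T \theta^*(\lambda)\right| < 1 \Rightarrow [x^*(\lambda)]_j = 0$.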
The inactive features have zero components in the optimal solution vector, so we can remove them from the optimization without sacrificing the accuracy of the optimal value of the objective function (1). We call screening methods of this kind Safe Screening Rules. SAFE [3] is a highly efficient safe screening method. In SAFE, the $j$th entry of $x^*(\lambda)$ is discarded when
$$\left|[A]_j^T y\right| < \lambda - \|[A]_j\|_2 \, \|y\|_2 \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}} \qquad (6)$$
where $\lambda_{\max} = \max_j \left|[A]_j^T y\right|$. As a result, the optimization can be performed on the reduced data matrix $\tilde{A}$ and the original problem (1) can be reformulated as:
$$\min_{\tilde{x}} \; \frac{1}{2}\|\tilde{A}\tilde{x} - y\|_2^2 + \lambda\|\tilde{x}\|_1 \qquad (7)$$
where $\tilde{p}$ is the number of remaining features after employing safe screening rules, so that $\tilde{A}$ has $\tilde{p}$ columns. The optimization is performed on a reduced feature matrix, accelerating the whole learning process significantly.
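The screening step can be sketched on a single machine as follows. The sketch assumes the basic SAFE test from [3], in which feature $j$ is discarded when $|[A]_j^T y| < \lambda - \|[A]_j\|_2 \|y\|_2 (\lambda_{\max} - \lambda)/\lambda_{\max}$; the function name `safe_screen` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def safe_screen(A, y, lam):
    """Boolean mask of the features that the basic SAFE test does not discard."""
    corr = np.abs(A.T @ y)                 # |[A]_j^T y| for every column j
    lam_max = corr.max()                   # smallest lam with an all-zero solution
    col_norms = np.linalg.norm(A, axis=0)  # ||[A]_j||_2
    bound = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr >= bound                   # a feature is discarded when corr < bound

# Synthetic data for illustration only: three informative features out of twenty.
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 20))
y = A[:, :3] @ np.array([1.5, -2.0, 1.0]) + 0.1 * rng.standard_normal(50)

lam_max = np.max(np.abs(A.T @ y))
keep = safe_screen(A, y, lam=0.8 * lam_max)
A_reduced = A[:, keep]                     # reduced matrix passed to the solver
```

The test becomes weaker as $\lambda$ moves away from $\lambda_{\max}$, which is one reason sequential rules such as EDPP are preferred along a parameter path.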
3.3 Distributed Safe Screening Rules for Lasso
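As data are distributed across different institutions, we develop a family of distributed Lasso screening rules to identify and discard inactive features in a distributed environment. Suppose the $i$th institution holds the data set $(A_i, y_i)$. We summarize a distributed version of the SAFE screening rule (DSAFE) as follows:
Step 1: $Q_i = A_i^T y_i$, update $Q = \sum_{i=1}^{m} Q_i$ by LQM.
Step 2: $\lambda_{\max} = \|Q\|_\infty$.
Step 3: If $\left|[A]_j^T y\right| < \lambda - \|[A]_j\|_2 \, \|y\|_2 \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}}$, discard the $j$th feature.
To compute $\|[A]_j\|_2$ in Step 3, we first compute $H_i = \|[A_i]_j\|_2^2$ and perform LQM to compute $H$ by $H = \sum_{i=1}^{m} H_i$. Then, we have $\|[A]_j\|_2 = \sqrt{H}$. Similarly, we can compute $\|y\|_2$ in Step 3. As the data communication only requires intermediate results, DSAFE preserves the data privacy at each institution.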
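The three DSAFE steps can be sketched in a single process. The sketch assumes that each institution sends only its local products $A_i^T y_i$, its squared column norms, and its squared response norm, and that the center sums them; the names `dsafe_summaries` and `dsafe_keep_mask` and the random data are hypothetical.

```python
import numpy as np

def dsafe_summaries(blocks):
    """Aggregate the per-institution summaries needed by the SAFE test."""
    Q = sum(A_i.T @ y_i for A_i, y_i in blocks)                  # Step 1: A^T y
    col_sq = sum(np.sum(A_i ** 2, axis=0) for A_i, _ in blocks)  # squared column norms
    y_sq = sum(float(y_i @ y_i) for _, y_i in blocks)            # squared norm of y
    return Q, np.sqrt(col_sq), np.sqrt(y_sq)

def dsafe_keep_mask(blocks, lam):
    """Boolean mask of features kept by the distributed SAFE test."""
    Q, col_norms, y_norm = dsafe_summaries(blocks)
    lam_max = np.max(np.abs(Q))                                  # Step 2
    bound = lam - col_norms * y_norm * (lam_max - lam) / lam_max
    return np.abs(Q) >= bound                                    # Step 3

# Three institutions with illustrative sizes and random data.
rng = np.random.default_rng(3)
p = 12
blocks = [(rng.standard_normal((n_i, p)), rng.standard_normal(n_i)) for n_i in (6, 5, 4)]
Q, col_norms, y_norm = dsafe_summaries(blocks)
keep = dsafe_keep_mask(blocks, lam=0.6 * np.max(np.abs(Q)))
```

Because every aggregated quantity is a sum over institutions, the mask agrees with what the pooled SAFE test would produce while only vectors of length $p$ and scalars are exchanged.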
To tune the value of $\lambda$, commonly used methods such as cross validation need to solve the Lasso problem along a sequence of parameters $\lambda_0 > \lambda_1 > \dots > \lambda_\kappa$, which can be very time-consuming. Enhanced Dual Polytope Projection (EDPP) [10] is a highly efficient safe screening rule. Implementation details of EDPP are available on GitHub: http://dpcscreening.github.io/lasso.html.
To address the problem of data privacy, we propose a distributed Lasso screening rule, termed Distributed Enhanced Dual Polytope Projection (DEDPP), to identify and discard inactive features along a sequence of parameter values in a distributed manner. The idea of DEDPP is similar to LQM. Specifically, to update the global variables, we apply LQM to query each local center for intermediate results, which are computed locally, and we aggregate them at the global center. After obtaining the reduced matrix for each institution, we apply LQM to solve the Lasso problem on the reduced data set $(\tilde{A}_i, y_i)$, $i = 1, \dots, m$. We assume that $[A_i]_j$ indicates the $j$th column in $A_i$, $j = 1, \dots, p$, where $p$ is the number of features. We summarize the proposed DEDPP in Algorithm 1.
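To calculate a global quantity in Algorithm 1, we apply LQM: the global center aggregates the local contributions from all institutions by summing them and sends the result back to every institution. The same approach is used to calculate the other global quantities in DEDPP. The calculation of $\|[A]_j\|_2$ and $\|y\|_2$ follows the same way as in DSAFE. The screening result at $\lambda_k$ relies on the previous optimal solution $x^*(\lambda_{k-1})$. In particular, $\lambda_k$ equals $\lambda_{\max}$ when $k$ is zero, so all elements of $x^*(\lambda_0)$ are identified as zero. When $k$ is $1$, we can perform screening based on $x^*(\lambda_0)$.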
3.4 Local Query Model for Lasso
To further accelerate the learning process, we apply FISTA [1] to solve the Lasso problem in a distributed manner. The convergence rate of FISTA is $O(1/k^2)$, compared to $O(1/k)$ for ISTA, where $k$ is the iteration number. We integrate FISTA with LQM (FLQM) to solve the Lasso problem on the reduced matrix $\tilde{A}_i$. We summarize the updating rule of FLQM in the $k$th iteration as follows:
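Step 1: $\nabla f_i^k = \tilde{A}_i^T(\tilde{A}_i x^k - y_i)$, update $\nabla f^k = \sum_{i=1}^{m} \nabla f_i^k$ by LQM.
Step 2: $z^k = \Gamma_{\lambda \eta}\left(x^k - \eta \nabla f^k\right)$ and $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$.
Step 3: $x^{k+1} = z^k + \frac{t_k - 1}{t_{k+1}}\left(z^k - z^{k-1}\right)$.
The matrix $\tilde{A}_i$ denotes the reduced matrix for the $i$th institution obtained by the DEDPP rule, $\eta$ is the step size, and $t_1 = 1$. We repeat this procedure until a satisfactory global model is obtained. Step 1 calculates $\nabla f_i^k$ from local data $(\tilde{A}_i, y_i)$. Then, each institution performs LQM to get the gradient $\nabla f^k$ based on (5). Step 2 updates the auxiliary variable $z^k$ and the momentum parameter $t_{k+1}$. Step 3 updates the model $x$. Similar to LQM, the data privacy of the institutions is well preserved by FLQM.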
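The FLQM iteration can be sketched by combining the standard FISTA update from [1] with a summed gradient. This is a single-process simulation under stated assumptions: the step size is taken as the reciprocal of the sum of the squared spectral norms of the local matrices (an upper bound on the Lipschitz constant that needs only one scalar per institution), and the function names and random data are hypothetical rather than the authors' implementation.

```python
import numpy as np

def soft_threshold(v, alpha):
    """Elementwise soft thresholding: sign(v) * max(|v| - alpha, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def flqm(blocks, lam, step, n_iter=300):
    """FISTA where the gradient is the sum of per-institution partial gradients."""
    p = blocks[0][0].shape[1]
    x = np.zeros(p)        # extrapolated point at which gradients are queried
    z_prev = np.zeros(p)   # previous proximal iterate
    t = 1.0
    for _ in range(n_iter):
        # Step 1: each institution computes its partial gradient; the center sums.
        grad = sum(A_i.T @ (A_i @ x - y_i) for A_i, y_i in blocks)
        # Step 2: proximal step and momentum parameter update.
        z = soft_threshold(x - step * grad, lam * step)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Step 3: extrapolation to the next query point.
        x = z + ((t - 1.0) / t_next) * (z - z_prev)
        z_prev, t = z, t_next
    return z_prev

# Three institutions with illustrative sizes and random data.
rng = np.random.default_rng(4)
p = 10
blocks = [(rng.standard_normal((n_i, p)), rng.standard_normal(n_i)) for n_i in (20, 15, 10)]

# Assumption: sum of local squared spectral norms bounds the global Lipschitz constant.
step = 1.0 / sum(np.linalg.norm(A_i, 2) ** 2 for A_i, _ in blocks)
lam_max = np.max(np.abs(sum(A_i.T @ y_i for A_i, y_i in blocks)))
x_hat = flqm(blocks, lam=0.1 * lam_max, step=step)
```

The step-size bound is conservative, so a deployment might instead estimate the Lipschitz constant with a distributed power iteration; that choice is outside what the text specifies.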
4 Experiment
We implement the proposed framework across three institutions on Apache Spark, a state-of-the-art, fast and efficient distributed platform for large-scale data computing. Experiments show the efficiency and effectiveness of the proposed models.
4.1 Comparison of Lasso with and without DEDPP rule
We choose the volume of the lateral ventricle as the response variable in trials containing 717 subjects, obtained by removing subjects without labels. The volumes of brain regions were extracted from each subject’s T1 MRI scan using Freesurfer: http://freesurfer.net. We evaluate the efficiency of DEDPP across three research institutions that maintain 326, 215, and 176 subjects, respectively. The subjects are stored as HDFS files. We solve the Lasso problem along a sequence of 100 parameter values equally spaced on the linear scale of $\lambda/\lambda_{\max}$ from 1.00 to 0.05. We randomly select 0.1 million to 1 million features and apply FLQM as the solver, since [1] proved that FISTA converges faster than ISTA. We report the result in Fig. 2; DEDPP achieves a speedup of about 66-fold compared to FLQM without DEDPP.
4.2 Stability selection for top risk genetic factors
We employ stability selection [6, 11] with DEDPP+FLQM to select top risk SNPs from the entire GWAS with 5,906,152 features. We conduct four groups of trials, reported in Table 1. In each trial, DEDPP+FLQM is carried out along a sequence of 100 parameter values equally spaced on the linear scale of $\lambda/\lambda_{\max}$ from 1 to 0.05. We repeat this 200 times, using 500 subjects in each round. Table 1 shows the top 5 selected SNPs. APOE, one of the top genetic risk factors for AD [5], is ranked #1 in three of the four groups.
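Table 1. Top 5 SNPs selected by stability selection for each response variable.

| Response | No. | Chr | Position | RS_ID | Gene |
| Diagnosis at baseline | 1 | 19 | 45411941 | rs429358 | APOE |
| Diagnosis at baseline | 2 | 19 | 45410002 | rs769449 | APOE |
| Diagnosis at baseline | 3 | 12 | 9911736 | rs3136564 | CD69 |
| Diagnosis at baseline | 4 | 1 | 172879023 | rs2227203 | unknown |
| Diagnosis at baseline | 5 | 20 | 58267891 | rs6100558 | PHACTR3 |
| Hippocampus at baseline | 1 | 19 | 45411941 | rs429358 | APOE |
| Hippocampus at baseline | 2 | 8 | 145158607 | rs34173062 | SHARPIN |
| Hippocampus at baseline | 3 | 11 | 11317240 | rs10831576 | GALNT18 |
| Hippocampus at baseline | 4 | 10 | 71969989 | rs12412466 | PPA1 |
| Hippocampus at baseline | 5 | 6 | 168107162 | rs71573413 | unknown |
| Entorhinal cortex at baseline | 1 | 19 | 45411941 | rs429358 | APOE |
| Entorhinal cortex at baseline | 2 | 15 | 89688115 | rs8025377 | ABHD2 |
| Entorhinal cortex at baseline | 3 | Y | 10070927 | rs79584829 | unknown |
| Entorhinal cortex at baseline | 4 | 14 | 47506875 | rs41354245 | MDGA2 |
| Entorhinal cortex at baseline | 5 | 3 | 30106956 | rs55904134 | unknown |
| Lateral ventricle at baseline | 1 | Y | 3164319 | rs2261174 | unknown |
| Lateral ventricle at baseline | 2 | 10 | 62162053 | rs10994327 | ANK3 |
| Lateral ventricle at baseline | 3 | Y | 13395084 | rs62610496 | unknown |
| Lateral ventricle at baseline | 4 | 1 | 77895410 | rs2647521 | AK5 |
| Lateral ventricle at baseline | 5 | 1 | 114663751 | rs2629810 | SYT6 |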
Acknowledgments
This work was supported in part by NIH Big Data to Knowledge (BD2K) Center of Excellence grant U54 EB020403, funded by a cross-NIH consortium including NIBIB and NCI.
References
[1] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
[2] Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57(11), 1413–1457 (2004)
[3] Ghaoui, L.E., Viallon, V., Rabbani, T.: Safe feature elimination for the lasso and sparse supervised learning problems. arXiv preprint arXiv:1009.4219 (2010)
[4] Harold, D., et al.: Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer’s disease. Nature Genetics 41(10), 1088–1093 (2009)
[5] Liu, C.C., Kanekiyo, T., Xu, H., Bu, G.: Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nature Reviews Neurology 9(2), 106–118 (2013)
[6] Meinshausen, N., Bühlmann, P.: Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473 (2010)
[7] Sasieni, P.D.: From genotypes to genes: doubling the sample size. Biometrics pp. 1253–1261 (1997)
[8] Shalev-Shwartz, S., Tewari, A.: Stochastic methods for l1-regularized loss minimization. The Journal of Machine Learning Research 12, 1865–1892 (2011)
[9] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) pp. 267–288 (1996)
[10] Wang, J., Zhou, J., Wonka, P., Ye, J.: Lasso screening rules via dual polytope projection. In: Advances in Neural Information Processing Systems (2013)
[11] Yang, T., et al.: Detecting genetic risk factors for Alzheimer’s disease in whole genome sequence data via lasso screening. In: IEEE International Symposium on Biomedical Imaging. pp. 985–989 (2015)