1 Preface
I originally conceived, wrote, and shared the following note the weekend of May 5–7, 2017. While the core ideas are simple, their broad utility in combination for privacy-preserving multiparty linear regression appears to still be novel.
I was personally motivated by the application to genome-wide association studies (GWAS), in which several centers have sets of genomes and corresponding phenotypes that cannot be shared. At the time, there was still no way to run principal components analysis securely at scale in order to control for confounding by ancestry. So I was very excited to discover that recently Hyunghoon Cho and colleagues dramatically improved the scalability of secure multiparty PCA, with application to secure GWAS in the model in which each individual secret-shares their genome.^1 With secure PCA in hand, the ideas below enable secure multiparty GWAS at the other extreme of collaboration between, say, a dozen large biobanks, with the regression step itself done scalably and with essentially the same efficiency as plaintext computation.

^1 Hyunghoon Cho, David J Wu, Bonnie Berger. Secure genome-wide association analysis using multiparty computation. Nature Biotechnology 36, 547–551 (May 2018).
One can imagine a future in which secure multiparty GWAS runs continuously on a public cloud as new batches of samples come online. Those regressions that suggest promising hits might motivate more intensive open collaboration on select data in order to bring to bear more sophisticated quality control and statistical models en route to a joint search for biological mechanism and therapeutic target.
2 Association scan
We will call the following variation on linear regression an association scan. Suppose we have positive integers N, M, and K with K < N and data for N samples:

- y, an N-dimensional response vector.

- X, an N × M matrix of transient covariate vectors.

- C, an N × K matrix of linearly independent permanent covariate vectors.

Let x_m denote the m-th column of X, i.e., the m-th transient covariate vector. We now think of y as a single draw from an N-dimensional normal distribution with mean determined by parameters β_m, a real number, and α_m, a K-vector, and variance parameter σ_m²:

(1) y ∼ N(β_m x_m + C α_m, σ_m² I)
Let β̂_m be the maximum likelihood estimate of the transient coefficient β_m and let se(β̂_m) be the standard error of this estimate. Then under the null hypothesis β_m = 0, the statistic t_m = β̂_m / se(β̂_m) is drawn from a t-distribution with N − K − 1 degrees of freedom.

Association scan problem: determine the vectors β̂ = (β̂_m) and se(β̂) = (se(β̂_m)) efficiently and scalably; the vectors of t-statistics and p-values then follow.
Example: In genome-wide association studies, which scan the genome for correlation of genetic and phenotypic variation, we have N samples (individuals), M common variants to test one by one, and K sample-level covariates like intercept, age, sex, batch, and principal component coordinates. Typically N ranges from thousands to millions, M is in the millions, and K is 1 to a few dozen. In gene burden tests, M is about the number of genes.
Let Q be an N × K matrix whose columns form an orthonormal basis for the column space of C. Let Xᵀy denote the M-vector with values x_m · y. Let X · X denote the M-vector with values x_m · x_m. Let v² denote the coordinatewise square of a vector v.
Lemma 2.1.
A closed-form solution to the association scan problem, written coordinatewise in m with D = N − K − 1:

(2) β̂_m = ( x_m · y − (Qᵀx_m) · (Qᵀy) ) / ( x_m · x_m − (Qᵀx_m) · (Qᵀx_m) )

(3) se(β̂_m) = sqrt( ( ( y · y − (Qᵀy) · (Qᵀy) ) / ( x_m · x_m − (Qᵀx_m) · (Qᵀx_m) ) − β̂_m² ) / D )
Proof.
Plimpton 322 tablet. ∎
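As a sanity check, the closed form in Lemma 2.1 can be verified numerically against per-covariate least-squares fits. The following Python/NumPy sketch (names and small dimensions are mine, chosen for illustration) vectorizes equations (2) and (3) over the M transient covariates and compares against np.linalg.lstsq:

```python
# Numeric check of Lemma 2.1: closed form vs. per-covariate least squares.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 4, 3
y = rng.standard_normal(N)
X = rng.standard_normal((N, M))
C = rng.standard_normal((N, K))

# Q: orthonormal basis for the column space of C.
Q, _ = np.linalg.qr(C)

# Equations (2) and (3), vectorized over the M transient covariates.
Qty, QtX = Q.T @ y, Q.T @ X
yyq = y @ y - Qty @ Qty                              # squared norm of projected-out y
Xyq = X.T @ y - QtX.T @ Qty                          # numerator of (2), per m
XXq = (X * X).sum(axis=0) - (QtX * QtX).sum(axis=0)  # denominator of (2), per m
D = N - K - 1
beta = Xyq / XXq
se = np.sqrt((yyq / XXq - beta ** 2) / D)

# Reference: fit y ~ x_m + C separately for each m.
for m in range(M):
    Z = np.column_stack([X[:, m], C])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ coef
    s2 = (resid @ resid) / D
    se_ref = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])
    assert np.isclose(coef[0], beta[m])
    assert np.isclose(se_ref, se[m])
```

The vectorized form mirrors the quantities yyq, Xyq, and XXq in the R demo of Section 4.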
Algorithm: We assume the M columns of X are distributed across multiple machines with c total cores.

1. Compute and broadcast Q using the QR decomposition C = QR.

2. Compute and broadcast Qᵀy, (Qᵀy) · (Qᵀy), and y · y.

3. In parallel, compute QᵀX, Xᵀy, and X · X.

4. In parallel, compute β̂ and se(β̂) as in Lemma 2.1.

Computing QᵀX and Xᵀy dominates the computational complexity at

(4) O(MNK / c)

In practice we consider K a small constant, so the complexity is

(5) O(MN / c)
i.e. that of reading the data, and therefore best possible with no further assumptions on the entropy of X. For further gains, the QR decomposition can also be parallelized^2 and the columns of X can be packed sparsely so that the flop count for QᵀX is reduced in proportion to the sparsity of X.

^2 Tall and skinny QR factorizations in MapReduce architectures, https://pdfs.semanticscholar.org/747c/a08cbf258da8d2b89ba31f24bdb17d7132bb.pdf
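The per-column independence behind the parallel steps above is easy to see in a few lines. The Python/NumPy sketch below (the three-way split is a hypothetical stand-in for machines or cores) shows that any partition of the columns of X yields the same QᵀX, Xᵀy, and X · X as the serial computation, with no communication between blocks:

```python
# Column-partitioned computation of Q^T X, X^T y, and X.X; the split
# stands in for distributing column blocks of X across machines.
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 40, 12, 3
y = rng.standard_normal(N)
X = rng.standard_normal((N, M))
Q, _ = np.linalg.qr(rng.standard_normal((N, K)))

# Serial reference.
QtX = Q.T @ X
Xy = X.T @ y
XX = (X * X).sum(axis=0)

# Each block of columns is processed independently, then concatenated.
blocks = np.array_split(np.arange(M), 3)
QtX_par = np.concatenate([Q.T @ X[:, b] for b in blocks], axis=1)
Xy_par = np.concatenate([X[:, b].T @ y for b in blocks])
XX_par = np.concatenate([(X[:, b] ** 2).sum(axis=0) for b in blocks])

assert np.allclose(QtX, QtX_par)
assert np.allclose(Xy, Xy_par)
assert np.allclose(XX, XX_par)
```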
3 Secure multiparty association scan
Now suppose the N samples are divided among several parties who are not willing or able to share their data. For simplicity of notation, we will suppose there are three parties, with Alice, Bob, and Carla holding N₁, N₂, and N₃ samples, respectively.
We also assume C₁, C₂, and C₃, the per-party blocks of rows of C, have full column rank.
In such situations, analysts typically have no recourse but to meta-analyze within-party estimates, with loss of power due to noisy standard errors as well as between-group heterogeneity (cf. Simpson’s paradox). Being power hungry, we instead solve the:
Secure multiparty association scan problem: securely determine the vectors β̂ and se(β̂) efficiently and scalably while communicating only O(M) bits inter-party.
Note that O(M) is best possible since all parties must receive the M results. In fact, our secure algorithm has the same distributed computational complexity as before.
QR algorithm: The first aim is to securely provide Alice, Bob, and Carla with their respective rows of Q in the QR decomposition C = QR, where C, y, and X decompose by party as

C = [C₁; C₂; C₃], y = [y₁; y₂; y₃], X = [X₁; X₂; X₃].

First Alice, Bob, and Carla simultaneously compute R₁, R₂, and R₃ in the QR decompositions of C₁, C₂, and C₃, respectively. The resulting R matrices depend only on the orbit of each Cᵢ under the inner-product-preserving isometries of ℝ^{Nᵢ}.

Furthermore, each upper triangular matrix Rᵢ contains only K(K + 1)/2 real numbers, independent of Nᵢ; these effectively describe the angles between pairs of permanent covariates.

So we assume that N₁, N₂, and N₃ are sufficiently large relative to K that Alice, Bob, and Carla are perfectly happy^3 to disclose R₁, R₂, and R₃ in order to compute R in the QR decomposition of the (tiny) 3K × K matrix [R₁; R₂; R₃].

^3 For greater security, one could employ a binary tree such that parties only share their R matrix directly in pairs (see first footnote). With K so small, it’s also feasible to use SMC to compute R without leaking any additional information.

The R for [R₁; R₂; R₃] coincides with that for C, so now the parties can privately compute their respective rows of Q:

Q₁ = C₁R⁻¹, Q₂ = C₂R⁻¹, Q₃ = C₃R⁻¹.
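This coincidence of R factors is the tall-and-skinny QR trick, and it is quick to check numerically. In the Python/NumPy sketch below (party sizes are arbitrary), R factors are sign-normalized to resolve the usual QR sign ambiguity before comparison:

```python
# The R factor of the stacked per-party R factors equals the R factor
# of the full covariate matrix, up to the usual QR sign ambiguity.
import numpy as np

def pos_diag(R):
    # Flip row signs so that diag(R) >= 0, fixing the QR sign ambiguity.
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return s[:, None] * R

rng = np.random.default_rng(2)
K = 3
C1 = rng.standard_normal((100, K))   # Alice
C2 = rng.standard_normal((200, K))   # Bob
C3 = rng.standard_normal((150, K))   # Carla

R1 = np.linalg.qr(C1, mode='r')
R2 = np.linalg.qr(C2, mode='r')
R3 = np.linalg.qr(C3, mode='r')

# R from the tiny 3K x K stack vs. R from the full 450 x K matrix C.
R_stack = pos_diag(np.linalg.qr(np.vstack([R1, R2, R3]), mode='r'))
R_full = pos_diag(np.linalg.qr(np.vstack([C1, C2, C3]), mode='r'))
assert np.allclose(R_stack, R_full)

# Each party forms its rows of Q as Ci @ inv(R); stacked, Q is orthonormal.
invR = np.linalg.inv(R_stack)
Q = np.vstack([C1 @ invR, C2 @ invR, C3 @ invR])
assert np.allclose(Q.T @ Q, np.eye(K))
```

This mirrors the `invR = solve(qr.R(qr(rbind(R1, R2, R3))))` line in the R demo of Section 4.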
By Lemma 2.1, it now suffices to compute the following six quantities (those in the first row are numbers, the rest are M-vectors):

y · y        (Qᵀy) · (Qᵀy)
Xᵀy          (QᵀX)ᵀ(Qᵀy)
X · X        (QᵀX) · (QᵀX)

Since ℝᴺ = ℝ^{N₁} ⊕ ℝ^{N₂} ⊕ ℝ^{N₃} is an orthogonal decomposition of ℝᴺ, Alice, Bob, and Carla can compute the three left-hand quantities by computing their internal summands and then either sharing them to sum or applying an SMC sum protocol which only reveals the overall sum:

y · y = y₁ · y₁ + y₂ · y₂ + y₃ · y₃
Xᵀy = X₁ᵀy₁ + X₂ᵀy₂ + X₃ᵀy₃
X · X = X₁ · X₁ + X₂ · X₂ + X₃ · X₃

The three right-hand quantities are trickier because the orthogonal projection QQᵀ does not preserve orthogonality between vectors. Hence the vector decompositions

Qᵀy = Q₁ᵀy₁ + Q₂ᵀy₂ + Q₃ᵀy₃
QᵀX = Q₁ᵀX₁ + Q₂ᵀX₂ + Q₃ᵀX₃

are not orthogonal decompositions. So instead the parties can compute the K-vector Qᵀy and the K × M matrix QᵀX by computing their internal summands and either sharing them to sum or by applying an SMC sum protocol which only reveals the overall sum (for even greater security, they can use a more sophisticated SMC algorithm to only share the three right-hand quantities, i.e. two dot products of K-vectors for each m). In all cases, these SMC protocols (if needed at all!) are fast because they require only simple secret sharing on tiny data, parallelize over m, and are independent of N.
Note that adding an intercept covariate is equivalent to translating y and each column of X to have zero mean. Adding an intercept for each party (i.e., indicator covariates to control for batch effects) is equivalent to mean-centering y₁, y₂, y₃ and each column of X₁, X₂, and X₃ independently.
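This equivalence is quick to verify. In the Python/NumPy sketch below, the helper scan_beta is mine, a minimal transcription of equation (2) of Lemma 2.1; it compares an explicit intercept covariate against mean-centering with no covariates:

```python
# Mean-centering y and the columns of X is equivalent to including an
# intercept among the permanent covariates (helper names are mine).
import numpy as np

def scan_beta(y, X, C=None):
    # Equation (2) of Lemma 2.1; C=None means no permanent covariates.
    Xy = X.T @ y
    XX = (X * X).sum(axis=0)
    if C is None:
        return Xy / XX
    Q, _ = np.linalg.qr(C)
    Qty, QtX = Q.T @ y, Q.T @ X
    return (Xy - QtX.T @ Qty) / (XX - (QtX * QtX).sum(axis=0))

rng = np.random.default_rng(3)
N, M = 60, 5
y = rng.standard_normal(N)
X = rng.standard_normal((N, M))

beta_intercept = scan_beta(y, X, np.ones((N, 1)))            # explicit intercept
beta_centered = scan_beta(y - y.mean(), X - X.mean(axis=0))  # centered, no C
assert np.allclose(beta_intercept, beta_centered)
```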
4 R demo
The following R code demonstrates our scheme, which we call the Distributed Association Scan Hammer (DASH). This code is also available at
https://github.com/jbloom22/DASH/
set.seed(0)
dot = function(x) { return(sum(x * x)) }  # squared Euclidean norm

# Public
N1 = 1000
N2 = 2000
N3 = 1500
M = 10000
K = 3
D = N1 + N2 + N3 - K - 1

# Alice
y1 = rnorm(N1)
X1 = matrix(rnorm(N1 * M), N1, M)
C1 = matrix(rnorm(N1 * K), N1, K)
R1 = qr.R(qr(C1))

# Bob
y2 = rnorm(N2)
X2 = matrix(rnorm(N2 * M), N2, M)
C2 = matrix(rnorm(N2 * K), N2, K)
R2 = qr.R(qr(C2))

# Carla
y3 = rnorm(N3)
X3 = matrix(rnorm(N3 * M), N3, M)
C3 = matrix(rnorm(N3 * K), N3, K)
R3 = qr.R(qr(C3))

# Public or tree or SMC
invR = solve(qr.R(qr(rbind(R1, R2, R3))))

# Alice
Q1 = C1 %*% invR
Qty1 = t(Q1) %*% y1
QtX1 = t(Q1) %*% X1
yy1 = dot(y1)
Xy1 = t(X1) %*% y1
XX1 = apply(X1, 2, dot)

# Bob
Q2 = C2 %*% invR
Qty2 = t(Q2) %*% y2
QtX2 = t(Q2) %*% X2
yy2 = dot(y2)
Xy2 = t(X2) %*% y2
XX2 = apply(X2, 2, dot)

# Carla
Q3 = C3 %*% invR
Qty3 = t(Q3) %*% y3
QtX3 = t(Q3) %*% X3
yy3 = dot(y3)
Xy3 = t(X3) %*% y3
XX3 = apply(X3, 2, dot)

# Public or SMC
yy = yy1 + yy2 + yy3
Xy = Xy1 + Xy2 + Xy3
XX = XX1 + XX2 + XX3
Qty = Qty1 + Qty2 + Qty3
QtX = QtX1 + QtX2 + QtX3
QtyQty = dot(Qty)
QtXQty = t(QtX) %*% Qty
QtXQtX = apply(QtX, 2, dot)
yyq = yy - QtyQty
Xyq = Xy - QtXQty
XXq = XX - QtXQtX

# Public
beta = Xyq / XXq
sigma = sqrt((yyq / XXq - beta^2) / D)
tstat = beta / sigma
pval = 2 * pt(-abs(tstat), D)
df = data.frame(beta = beta, sigma = sigma, tstat = tstat, pval = pval)

# Compare to primary analysis for first M0 columns of X
M0 = 5
y = c(y1, y2, y3)
X = rbind(X1, X2, X3)
C = rbind(C1, C2, C3)
res = matrix(nrow = 0, ncol = 4)
for (m in 1:M0) {
  fit = lm(y ~ X[,m] + C - 1)
  res = rbind(res, as.vector(summary(fit)$coefficients[1,]))
}
df2 = data.frame(beta = res[,1], sigma = res[,2], tstat = res[,3], pval = res[,4])
all.equal(df[1:M0,], df2)  # Returns TRUE
5 Generalizations
This approach efficiently generalizes to the case of multiple transient covariates (such as interaction terms) or multiple phenotypes (such as with biobanks or eQTL studies). If (an eigendecomposition of) the kinship kernel can be shared, then the approach extends to linear mixed models as well. Gene burden tests (where linear combinations of genotypes become gene scores) also play well with this approach, since they involve linear projection on the variant axis rather than the sample axis. Thankfully, matrix multiplication is associative.
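The associativity point can be made concrete: compressed statistics over variants convert to compressed statistics over gene scores by a single right-multiplication. In the Python/NumPy sketch below, B is a hypothetical variant-to-gene weight matrix (not part of the note's notation):

```python
# Gene burden tests project on the variant axis: by associativity,
# compressed variant statistics right-multiplied by a (hypothetical)
# variant-to-gene weight matrix B give compressed gene-score statistics.
import numpy as np

rng = np.random.default_rng(4)
N, M, G, K = 30, 20, 4, 2
y = rng.standard_normal(N)
X = rng.standard_normal((N, M))      # variant matrix
B = rng.random((M, G))               # hypothetical weights; gene scores = X B
Q, _ = np.linalg.qr(rng.standard_normal((N, K)))

# Compress first, then form gene scores -- same as scoring first.
assert np.allclose((Q.T @ X) @ B, Q.T @ (X @ B))
assert np.allclose(B.T @ (X.T @ y), (X @ B).T @ y)
```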
Note also that one can alternatively compress using Cᵀ rather than Qᵀ to preserve the ability to select phenotypes and covariates post-compression.
6 Acknowledgements
I am grateful to Alex Bloemendal, who helped me derive Lemma 2.1 (a classic result) as we sought to optimize linear regression for GWAS in the open-source, distributed system Hail (www.hail.is). Without our intensive linear algebra discussions, I would not have recognized the relevance of Lemma 2.1 combined with TSQR for defining a “doubly-distributed” linear regression algorithm that plays well with privacy preservation.