Secure multi-party linear regression at plaintext speed

by Jonathan M. Bloom, et al.
Broad Institute

We detail a scheme for scalable, distributed, secure multiparty linear regression at essentially the same speed as plaintext regression. While the core ideas are simple, the recognition of their broad utility when combined is novel. By leveraging a recent advance in secure multiparty principal component analysis, our scheme opens the door to efficient and secure genome-wide association studies across multiple biobanks.








1 Preface

I originally conceived, wrote, and shared the following note the weekend of May 5-7, 2017. While the core ideas are simple, their broad utility in combination for privacy-preserving multi-party linear regression appears to still be novel.

I was personally motivated by the application to genome-wide association studies (GWAS), in which several centers have sets of genomes and corresponding phenotypes that cannot be shared. At the time, there was still no way to run principal component analysis securely at scale in order to control for confounding by ancestry. So I was very excited to discover that Hyunghoon Cho and colleagues recently improved the scalability of secure multi-party PCA dramatically, with application to secure GWAS in the model in which each individual secret-shares their genome (Hyunghoon Cho, David J. Wu, and Bonnie Berger. Secure genome-wide association analysis using multiparty computation. Nature Biotechnology 36, 547–551, May 2018). With secure PCA in hand, the ideas below enable secure multi-party GWAS at the other extreme of collaboration between, say, a dozen large biobanks, with the regression step itself done scalably and with essentially the same efficiency as plaintext computation.

One can imagine a future in which secure multi-party GWAS is done on a public cloud in online fashion as new batches of samples come online. Those regressions that suggest promising hits might motivate more intensive open collaboration on select data in order to bring to bear more sophisticated quality control and statistical models en route to a joint search for biological mechanism and therapeutic target.

2 Association scan

We will call the following variation on linear regression an association scan. Suppose we have positive integers $N$, $M$, and $K$ with $K < N$, and data for $N$ samples:

  • $y$, an $N$-dimensional response vector.

  • $X$, an $N \times M$ matrix of transient covariate vectors.

  • $C$, an $N \times K$ matrix of linearly independent permanent covariate vectors.

Let $x_m$ denote the $m$th column of $X$, i.e., the $m$th transient covariate vector. We now think of $y$ as a single draw from an $N$-dimensional normal distribution with mean parameters a real number $\beta$ and a $K$-vector $\gamma$, and variance parameter $\sigma^2$:

$$y \sim \mathcal{N}\left(\beta x_m + C\gamma,\ \sigma^2 I_N\right).$$

Let $\hat\beta_m$ be the maximum likelihood estimate for the transient coefficient and let $\hat\sigma_{\hat\beta_m}$ be the standard error of this estimate. Then under the null hypothesis $\beta = 0$, the statistic $t_m = \hat\beta_m / \hat\sigma_{\hat\beta_m}$ is drawn from a $t$-distribution with $N - K - 1$ degrees of freedom.

Association scan problem: determine the vectors $\hat\beta = (\hat\beta_m)_{m=1}^M$ and $\hat\sigma_{\hat\beta} = (\hat\sigma_{\hat\beta_m})_{m=1}^M$ efficiently and scalably; the vectors of $t$-statistics and $p$-values then follow.

Example: In genome-wide association studies, which scan the genome for correlation of genetic and phenotypic variation, we have $N$ samples (individuals), $M$ common variants to test one by one, and $K$ sample-level covariates like intercept, age, sex, batch, and principal component coordinates. Typically $N$ is $10^3$ to $10^6$, $M$ is $10^5$ to $10^8$, and $K$ is 1 to $10^2$. In gene burden tests, $M$ is about $2 \times 10^4$.

Let $Q$ be an $N \times K$ matrix whose columns form an orthonormal basis for the column space of $C$. Let $X^T y$ denote the $M$-vector with values $x_m^T y$. Let $X^T X$ denote the $M$-vector with values $x_m^T x_m$. Let $(Q^T X)^{\odot 2}$ denote the coordinate-wise squaring of $Q^T X$.

Lemma 2.1.

A closed-form solution to the association scan problem, with division taken coordinate-wise over the $M$ transient covariates:

$$\hat\beta = \frac{X^T y - (Q^T X)^T (Q^T y)}{X^T X - \mathrm{colsums}\big((Q^T X)^{\odot 2}\big)}, \qquad \hat\sigma_{\hat\beta}^{\odot 2} = \frac{1}{N - K - 1}\left(\frac{y^T y - \|Q^T y\|^2}{X^T X - \mathrm{colsums}\big((Q^T X)^{\odot 2}\big)} - \hat\beta^{\odot 2}\right).$$

Proof. Plimpton 322 tablet. ∎
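As a sanity check, the closed form of Lemma 2.1 can be compared numerically against a direct least-squares fit. The following NumPy sketch is illustrative only (the note's own demo is the R code in Section 4); the variable names mirror that demo:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 8, 3
y = rng.normal(size=N)
X = rng.normal(size=(N, M))
C = rng.normal(size=(N, K))

# Orthonormal basis Q for the column space of the permanent covariates C.
Q, _ = np.linalg.qr(C)

# Closed form of Lemma 2.1, vectorized over the M transient covariates.
Qty = Q.T @ y                   # K-vector Q^T y
QtX = Q.T @ X                   # K x M matrix Q^T X
Xy = X.T @ y                    # M-vector of x_m^T y
XX = np.sum(X * X, axis=0)      # M-vector of x_m^T x_m
D = N - K - 1                   # residual degrees of freedom

XXq = XX - np.sum(QtX * QtX, axis=0)
beta = (Xy - QtX.T @ Qty) / XXq
sigma = np.sqrt(((y @ y - Qty @ Qty) / XXq - beta**2) / D)

# Direct check for one transient covariate: regress y on [x_m, C].
m = 0
A = np.column_stack([X[:, m], C])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
resid = y - A @ coef
se = np.sqrt(np.linalg.inv(A.T @ A)[0, 0] * (resid @ resid) / D)

assert np.isclose(beta[m], coef[0]) and np.isclose(sigma[m], se)
```

The check exercises the Frisch–Waugh–Lovell identity underlying the lemma: projecting out $C$ and then fitting the single transient covariate reproduces the coefficient and standard error of the full regression.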

Algorithm: We assume the $M$ columns of $X$ are distributed across machines with $c$ total cores.

  1. Compute and broadcast $Q$ using QR decomposition of $C$.

  2. Compute and broadcast $Q^T y$, $y^T y$, and $\|Q^T y\|^2$.

  3. In parallel over the columns of $X$, compute $Q^T X$, $X^T y$, and $X^T X$.

  4. In parallel over the columns of $X$, compute $\hat\beta$ and $\hat\sigma_{\hat\beta}$ as in Lemma 2.1.

Computing $Q^T X$ and $X^T y$ dominates the computational complexity at

$$\mathcal{O}\left(\frac{NMK}{c}\right).$$

In practice we consider $K$ a small constant, so the complexity is

$$\mathcal{O}\left(\frac{NM}{c}\right),$$

i.e. that of reading the data, and therefore best possible with no further assumptions on the entropy of $X$. For further gains, the QR decomposition can also be parallelized (see Tall and skinny QR factorizations in MapReduce architectures), and the columns of $X$ can be packed sparsely so that the flop count for $Q^T X$ is reduced in proportion to the sparsity of $X$.

3 Secure multi-party association scan

Now suppose the $N$ samples are divided among $P$ parties who are not willing or able to share their data. For simplicity of notation, we will suppose $P = 3$, with Alice, Bob, and Carla holding $N_1$, $N_2$, and $N_3$ samples, respectively.

We also assume $C_1$, $C_2$, and $C_3$, the parties' respective blocks of permanent covariates, have full column rank.

In such situations, analysts typically have no recourse but to meta-analyze within-party estimates, with loss of power due to noisy standard errors as well as between-group heterogeneity (cf. Simpson's paradox). Being power hungry, we instead solve the:

Secure multi-party association scan problem: securely determine the vectors $\hat\beta$ and $\hat\sigma_{\hat\beta}$ efficiently and scalably while communicating only $\mathcal{O}(M)$ bits inter-party.

Note that $\mathcal{O}(M)$ is best possible since all parties must receive the $M$ results. In fact, our secure algorithm has the same distributed computational complexity as before.

QR algorithm: The first aim is to securely provide Alice, Bob, and Carla with their respective rows of $Q$ in the QR decomposition

$$\begin{bmatrix} C_1 \\ C_2 \\ C_3 \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \\ Q_3 \end{bmatrix} R.$$

First Alice, Bob, and Carla simultaneously compute $R_1$, $R_2$, and $R_3$, the upper-triangular factors in the QR decompositions of $C_1$, $C_2$, and $C_3$, respectively. The resulting matrices depend only on the orbits of $C_1$, $C_2$, and $C_3$ under the inner-product-preserving isometries of $\mathbb{R}^{N_1}$, $\mathbb{R}^{N_2}$, and $\mathbb{R}^{N_3}$; that is, $R_i$ reveals only the Gram matrix $R_i^T R_i = C_i^T C_i$. Furthermore, each upper-triangular matrix contains only $K(K+1)/2$ real numbers, independent of $N_i$; these effectively describe the angles between pairs of permanent covariates.

So we assume that $N_1$, $N_2$, and $N_3$ are sufficiently large relative to $K$ that Alice, Bob, and Carla are perfectly happy (for greater security, one could employ a binary tree with $\log_2 P$ levels such that parties only share their $R$ matrix directly in pairs; with $K$ so small, it is also feasible to use SMC to compute $R$ without leaking any additional information) to disclose $R_1$, $R_2$, and $R_3$ in order to compute $R$ in the QR decomposition of the (tiny) $3K \times K$ matrix

$$\begin{bmatrix} R_1 \\ R_2 \\ R_3 \end{bmatrix}.$$

The $R$ for this stacked matrix coincides with that for $C$, so now the parties can privately compute:

$$Q_1 = C_1 R^{-1}, \qquad Q_2 = C_2 R^{-1}, \qquad Q_3 = C_3 R^{-1}.$$
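The linear algebra behind this step is the tall-and-skinny QR (TSQR) trick: the $R$ factor of the stacked $R_i$ equals the $R$ factor of the stacked $C_i$, so each party recovers its rows of $Q$ as $C_i R^{-1}$. A NumPy sketch of just this identity (illustrative; no security machinery):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
C1, C2, C3 = (rng.normal(size=(n, K)) for n in (50, 70, 60))

# Each party locally computes the R factor of its own covariate block.
R1 = np.linalg.qr(C1, mode='r')
R2 = np.linalg.qr(C2, mode='r')
R3 = np.linalg.qr(C3, mode='r')

# QR of the tiny stacked 3K x K matrix yields the global R factor,
# since R^T R = R1^T R1 + R2^T R2 + R3^T R3 = C^T C.
R = np.linalg.qr(np.vstack([R1, R2, R3]), mode='r')

# Each party privately recovers its rows of Q as C_i R^{-1}.
invR = np.linalg.inv(R)
Q = np.vstack([C1 @ invR, C2 @ invR, C3 @ invR])

# Q has orthonormal columns and Q R reproduces the stacked C exactly.
assert np.allclose(Q.T @ Q, np.eye(K))
assert np.allclose(Q @ R, np.vstack([C1, C2, C3]))
```

Only the $K \times K$ triangles ever cross party lines; the $N_i \times K$ blocks stay local.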
By Lemma 2.1, it now suffices to compute the following six quantities (those in the first row are numbers, the rest are $M$-vectors):

$$\begin{array}{ll} y^T y & \|Q^T y\|^2 \\ X^T y & (Q^T X)^T (Q^T y) \\ X^T X & \mathrm{colsums}\big((Q^T X)^{\odot 2}\big) \end{array}$$
Since the partition of samples across parties is an orthogonal decomposition of $\mathbb{R}^N$, Alice, Bob, and Carla can compute the three left-hand quantities by computing their internal summands and then either sharing them to sum or applying an SMC sum protocol which only reveals the overall sum:

$$y^T y = \sum_{i=1}^{3} y_i^T y_i, \qquad X^T y = \sum_{i=1}^{3} X_i^T y_i, \qquad X^T X = \sum_{i=1}^{3} X_i^T X_i.$$
The three right-hand quantities are trickier because the orthogonal projection $QQ^T$ does not preserve orthogonality between vectors. Hence the $K$-vector decompositions

$$Q^T y = Q_1^T y_1 + Q_2^T y_2 + Q_3^T y_3, \qquad Q^T x_m = Q_1^T x_{1m} + Q_2^T x_{2m} + Q_3^T x_{3m}$$

are not orthogonal decompositions. So instead the parties can compute the $K$-vector $Q^T y$ and the $K \times M$ matrix $Q^T X$ by computing their internal summands and either sharing them to sum or by applying an SMC sum protocol which only reveals the overall sum (for even greater security, they can use a more sophisticated SMC algorithm to share only the three right-hand quantities themselves, i.e. two dot products of $K$-vectors for each $m$). In all cases, these SMC protocols (if needed at all!) are fast because they require only simple secret sharing on tiny data, parallelize over $M$, and are independent of $N$.
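The SMC sum protocol invoked above can be as simple as additive secret sharing: each party splits its local summand into random shares modulo a large public modulus, so no individual summand is revealed but the shares sum to the overall total. A toy Python sketch under illustrative assumptions (three honest-but-curious parties; the modulus and fixed-point scale are arbitrary choices, not from the note):

```python
import numpy as np

rng = np.random.default_rng(2)
P = 3                  # number of parties
Q_MOD = 2**62          # public modulus for the additive secret-sharing ring
SCALE = 2**20          # fixed-point scale for encoding real summands

def share(value):
    """Split an integer-encoded value into P additive shares mod Q_MOD."""
    shares = [int(s) for s in rng.integers(0, Q_MOD, size=P - 1)]
    shares.append((value - sum(shares)) % Q_MOD)
    return shares

# Each party's private local summand, e.g. its piece of y^T y.
local = [1.25, -0.5, 3.0]
encoded = [int(round(v * SCALE)) % Q_MOD for v in local]

# Every party secret-shares its summand; no single share reveals anything,
# but the sum of all shares mod Q_MOD reveals exactly the overall total.
all_shares = [share(e) for e in encoded]
total = sum(sum(row) for row in all_shares) % Q_MOD

# Decode fixed point, mapping the upper half of the ring to negatives.
if total > Q_MOD // 2:
    total -= Q_MOD
print(total / SCALE)  # prints 3.75
```

In the scheme above, this protocol would run coordinate-wise on the tiny $K$-vector and $K \times M$ summands, independently of $N$.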

Note that adding an intercept covariate is equivalent to translating $y$ and each column of $X$ to have zero mean. Adding an intercept for each party (i.e., indicator covariates to control for batch effects) is equivalent to mean centering $y_1$, $y_2$, $y_3$ and each column of $X_1$, $X_2$, and $X_3$ independently.
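Both equivalences follow from projecting out the indicator covariates and are easy to confirm numerically; a short NumPy check (illustrative, with a made-up two-party split):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
y = rng.normal(size=N)
x = rng.normal(size=N)

# (a) Simple regression with an explicit intercept covariate.
A = np.column_stack([x, np.ones(N)])
beta_a = np.linalg.lstsq(A, y, rcond=None)[0][0]

# (b) No intercept, but y and x mean-centered first.
yc = y - y.mean()
xc = x - x.mean()
beta_b = (xc @ yc) / (xc @ xc)
assert np.isclose(beta_a, beta_b)

# (c) Per-party intercepts (batch indicators) vs within-party centering.
n1 = 80
g = np.zeros(N)
g[:n1] = 1                              # party membership indicator
B = np.column_stack([x, g, 1 - g])      # transient covariate + two intercepts
beta_c = np.linalg.lstsq(B, y, rcond=None)[0][0]

yc2, xc2 = y.copy(), x.copy()
for idx in (slice(0, n1), slice(n1, N)):
    yc2[idx] -= yc2[idx].mean()
    xc2[idx] -= xc2[idx].mean()
beta_d = (xc2 @ yc2) / (xc2 @ xc2)
assert np.isclose(beta_c, beta_d)
```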

4 R demo

The following R code demonstrates our scheme, which we call the Distributed Association Scan Hammer (DASH). This code is also available at


dot <- function(x) {  # squared Euclidean norm: sum of x_i^2
  return(sum(x * x))
}

# Public
N1 = 1000
N2 = 2000
N3 = 1500
M = 10000
K = 3

D = N1 + N2 + N3 - K - 1

# Alice
y1 = rnorm(N1)
X1 = matrix(rnorm(N1 * M), N1, M)
C1 = matrix(rnorm(N1 * K), N1, K)
R1 = qr.R(qr(C1))

# Bob
y2 = rnorm(N2)
X2 = matrix(rnorm(N2 * M), N2, M)
C2 = matrix(rnorm(N2 * K), N2, K)
R2 = qr.R(qr(C2))

# Carla
y3 = rnorm(N3)
X3 = matrix(rnorm(N3 * M), N3, M)
C3 = matrix(rnorm(N3 * K), N3, K)
R3 = qr.R(qr(C3))

# Public or tree or SMC
invR = solve(qr.R(qr(rbind(R1, R2, R3))))

# Alice
Q1 = C1 %*% invR
Qty1 = t(Q1) %*% y1
QtX1 = t(Q1) %*% X1

yy1 = dot(y1)
Xy1 = t(X1) %*% y1
XX1 = apply(X1,2,dot)

# Bob
Q2 = C2 %*% invR
Qty2 = t(Q2) %*% y2
QtX2 = t(Q2) %*% X2

yy2 = dot(y2)
Xy2 = t(X2) %*% y2
XX2 = apply(X2, 2, dot)

# Carla
Q3 = C3 %*% invR
Qty3 = t(Q3) %*% y3
QtX3 = t(Q3) %*% X3

yy3 = dot(y3)
Xy3 = t(X3) %*% y3
XX3 = apply(X3, 2, dot)

# Public or SMC
yy = yy1 + yy2 + yy3
Xy = Xy1 + Xy2 + Xy3
XX = XX1 + XX2 + XX3

Qty = Qty1 + Qty2 + Qty3
QtX = QtX1 + QtX2 + QtX3

QtyQty = dot(Qty)
QtXQty = t(QtX) %*% Qty
QtXQtX = apply(QtX, 2, dot)

yyq = yy - QtyQty
Xyq = Xy - QtXQty
XXq = XX - QtXQtX

# Public
beta = Xyq / XXq
sigma = sqrt((yyq / XXq - beta^2) / D)
tstat = beta / sigma
pval = 2 * pt(-abs(tstat), D)

df = data.frame(beta=beta, sigma=sigma, tstat=tstat, pval=pval)

# Compare to primary analysis for first M0 columns of X
M0 = 5

y = c(y1 ,y2, y3)
X = rbind(X1, X2, X3)
C = rbind(C1, C2, C3)

res = matrix(nrow=0, ncol=4)
for (m in 1:M0) {
  fit = lm(y ~ X[,m] + C - 1)
  res = rbind(res, as.vector(summary(fit)$coefficients[1,]))
}

df2 = data.frame(beta=res[,1], sigma=res[,2], tstat=res[,3], pval=res[,4])

all.equal(df[1:M0,],df2) # Returns TRUE

5 Generalizations

This approach efficiently generalizes to the case of multiple transient covariates (such as interaction terms) or multiple phenotypes (as with biobanks or eQTL studies). If the kinship kernel (or rather its eigendecomposition) can be shared, then the approach extends to linear mixed models as well. Gene burden tests (where linear combinations of genotypes become gene scores) also play well with this approach, since they involve linear projection on the variant axis rather than the sample axis. Thankfully, matrix multiplication is associative.

Note also that one can alternatively compress using $C^T$ rather than $Q^T$ to preserve the ability to select phenotypes and covariates post-compression.
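One way to see why compressing against the raw covariates retains flexibility (an illustrative NumPy check, not from the note): since $Q = C R^{-1}$, the $Q$-compressed data satisfies $Q^T X = R^{-T}(C^T X)$, and the same recovery works for any covariate subset chosen after compression:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 100, 20, 4
X = rng.normal(size=(N, M))
C = rng.normal(size=(N, K))

Q, R = np.linalg.qr(C)

# Compressing with C^T keeps the covariate choice open: since Q = C R^{-1},
# the Q-compressed data is recoverable as R^{-T} (C^T X).
CtX = C.T @ X
assert np.allclose(Q.T @ X, np.linalg.solve(R.T, CtX))

# Restricting to a covariate subset post-compression: reuse rows of C^T X.
sub = [0, 2]
Qs, Rs = np.linalg.qr(C[:, sub])
assert np.allclose(Qs.T @ X, np.linalg.solve(Rs.T, CtX[sub]))
```

Nothing about $X$ needs to be revisited when the covariate set changes; only tiny triangular solves are redone.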

6 Acknowledgements

I am grateful to Alex Bloemendal, who helped me derive Lemma 2.1 (a classic result) as we sought to optimize linear regression for GWAS in the open-source, distributed system Hail. Without our intensive linear algebra discussions, I would not have recognized the relevance of Lemma 2.1 combined with TSQR for defining a “doubly-distributed” linear regression algorithm that plays well with privacy preservation.