Helen: Maliciously Secure Coopetitive Learning for Linear Models

by Wenting Zheng, et al.

Many organizations wish to collaboratively train machine learning models on their combined datasets for a common benefit (e.g., better medical research, or fraud detection). However, they often cannot share their plaintext datasets due to privacy concerns and/or business competition. In this paper, we design and build Helen, a system that allows multiple parties to train a linear model without revealing their data, a setting we call coopetitive learning. Compared to prior secure training systems, Helen protects against a much stronger adversary who is malicious and can compromise m-1 out of m parties. Our evaluation shows that Helen can achieve up to five orders of magnitude of performance improvement when compared to training using an existing state-of-the-art secure multi-party computation framework.








1 Introduction

Today, many organizations are interested in training machine learning models over their aggregate sensitive data. The parties also agree to release the model to every participant so that everyone can benefit from the training process. In many existing applications, collaboration is advantageous because training on more data tends to yield higher quality models [42]. Even more exciting is the potential of enabling new applications that are not possible to compute using a single party’s data because they require training on complementary data from multiple parties (e.g., geographically diverse data). However, the challenge is that these organizations cannot share their sensitive data in plaintext due to privacy policies and regulations [3] or due to business competition [69]. We denote this setting using the term coopetitive learning, where the word “coopetition” [31] is a portmanteau of “cooperative” and “competitive.” (We note that Google uses the term federated learning [69] for a different but related setting: a semi-trusted cloud trains a model over the data of millions of intermittently online user devices, and sees sensitive intermediate data.) To illustrate coopetitive learning’s potential impact as well as its challenges, we summarize two concrete use cases.

A banking use case. The first use case was shared with us by two large banks in North America. Many banks want to use machine learning to detect money laundering more effectively. Since criminals often hide their traces by moving assets across different financial institutions, an accurate model would require training on data from different banks. Even though such a model would benefit all participating banks, these banks cannot share their customers’ data in plaintext because of privacy regulations and business competition.

A medical use case. The second use case was shared with us by a major healthcare provider who needs to distribute vaccines during the annual flu cycle. In order to launch an effective vaccination campaign (i.e., sending vans to vaccinate people in remote areas), this organization would like to identify areas that have high probabilities of flu outbreaks using machine learning. More specifically, this organization wants to train a linear model over data from seven geographically diverse medical organizations. Unfortunately, such training is impossible at the moment because the seven organizations cannot share their patient data with each other due to privacy regulations.

Figure 1: The setting of coopetitive learning.

The general setup of coopetitive learning fits within the cryptographic framework of secure multi-party computation (MPC) [8, 39, 72]. Unfortunately, implementing training using generic MPC frameworks is extremely inefficient, so recent training systems [58, 43, 56, 36, 21, 37, 5] opt for tailored protocols instead. However, many of these systems rely on outsourcing to non-colluding servers, and all assume a passive attacker who never deviates from the protocol. In practice, these assumptions are often not realistic because they essentially require an organization to base the confidentiality of its data on the correct behavior of other organizations. In fact, the banks from the aforementioned use case informed us that they are not comfortable with trusting the behavior of their competitors when it comes to sensitive business data.

Hence, we need a much stronger security guarantee: each organization should only trust itself. This goal calls for maliciously secure MPC in the setting where $m-1$ out of $m$ parties can fully misbehave.

In this paper, we design and build Helen, a platform for maliciously secure coopetitive learning. Helen supports a significant slice of machine learning and statistics problems: regularized linear models. This family of models includes ordinary least squares regression, ridge regression, and LASSO. Because these models are statistically robust and easily interpretable, they are widely used in cancer research [50], genomics [29, 61], and financial risk analysis [65, 18], and are the foundation of basis pursuit techniques in signal processing.

The setup we envision for Helen is similar to the use cases above: a small number of organizations have large amounts of data (on the order of hundreds of thousands or millions of records) with a smaller number of features (on the order of tens or hundreds).

While it is possible to build such a system by implementing a standard training algorithm like Stochastic Gradient Descent (SGD) [63] using a generic maliciously secure MPC protocol, the result is very inefficient. To evaluate the practical performance difference, we implemented SGD using SPDZ, a maliciously secure MPC library [1]. For a configuration at the scale described above (a few parties, hundreds of thousands of data points per party, and tens of features), such a baseline takes an estimated 3 months to train a linear regression model. Using a series of techniques explained in the next section, Helen can train the same model in less than 3 hours.

1.1 Overview of techniques

To solve such a challenging problem, Helen combines insights from cryptography, systems, and machine learning. This synergy enables an efficient and scalable solution under a strong threat model. One recurring theme in our techniques is that, while the overall training process needs to scale linearly with the total number of training samples, the more expensive cryptographic computation can be reformulated to be independent of the number of samples.

Our first insight is to leverage a classic but under-utilized technique in distributed convex optimization called Alternating Direction Method of Multipliers (ADMM) [15]. The standard algorithm for training models today is SGD, which optimizes an objective function by iterating over the input dataset. With SGD, the number of iterations scales at least linearly with the number of data samples. Therefore, naïvely implementing SGD using a generic MPC framework would require an expensive MPC synchronization protocol for every iteration. Even though ADMM is less popular for training on plaintext data, we show that it is much more efficient for cryptographic training than SGD. One advantage of ADMM is that it converges in very few iterations (e.g., a few tens) because each party repeatedly solves local optimization problems. Therefore, utilizing ADMM allows us to dramatically reduce the number of MPC synchronization operations. Moreover, ADMM is very efficient in the context of linear models because the local optimization problems can be solved by closed form solutions. These solutions are also easily expressible in cryptographic computation and are especially efficient because they operate on small summaries of the input data that only scale with the dimension of the dataset.

However, merely expressing ADMM in MPC does not solve an inherent scalability problem. As mentioned before, Helen addresses a strong threat model in which an attacker can deviate from the protocol. This malicious setting requires the protocol to ensure that the users’ behavior is correct. To do so, the parties need to commit to their input datasets and prove that they are consistently using the same datasets throughout the computation. A naïve way of solving this problem is to have each party commit to the entire input dataset and calculate the summaries using MPC. This is problematic because (1) the cryptographic computation will scale linearly in the number of samples, and (2) calculating the summaries would also require Helen to calculate complex matrix inversions within MPC (similar to [59]). Instead, we make a second observation: each party can use singular value decomposition (SVD) [40] to decompose its input summaries into small matrices that scale only in the number of features. Each party commits to these decomposed matrices and proves their properties using matrix multiplication, avoiding explicit matrix inversions.

Finally, one important aspect of ADMM is that it enables decentralized computation. Each optimization iteration consists of two phases: local optimization and coordination. The local optimization phase requires each party to solve a local sub-problem. The coordination phase requires all parties to synchronize their local results into a single set of global weights. Expressing both phases in MPC would encode local optimization into a computation that is done by every party, thus losing the decentralization aspect of the original protocol. Instead, we observe that the local operations are all linear matrix operations between the committed summaries and the global weights. Each party knows the encrypted global weights, as well as its own committed summaries in plaintext. Therefore, Helen uses partially homomorphic encryption to encrypt the global weights so that each party can solve the local problems in a decentralized manner, and enables each party to efficiently prove in zero-knowledge that it computed the local optimization problem correctly.

2 Background

2.1 Preliminaries

In this section, we describe the notation we use for the rest of the paper. Let $P_1, \dots, P_m$ denote the $m$ parties. Let $\mathbb{Z}_N$ denote the set of integers modulo $N$, and $\mathbb{Z}_p$ the set of integers modulo a prime $p$. Similarly, we use $\mathbb{Z}_N^*$ to denote the multiplicative group of integers modulo $N$.

We use a lowercase letter $x$ to denote a scalar, bold lowercase $\mathbf{x}$ to denote a vector, and bold uppercase $\mathbf{X}$ to denote a matrix. We use $\mathrm{Enc}_{PK}(x)$ to denote an encryption of $x$ under a public key PK. Similarly, $\mathrm{Dec}_{SK}(c)$ denotes a decryption of a ciphertext $c$ under the secret key SK.

Each party $P_i$ has a feature matrix $\mathbf{X}_i \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples per party and $d$ is the feature dimension, and a labels vector $\mathbf{y}_i \in \mathbb{R}^n$. The machine learning datasets use floating point representation, while our cryptographic primitives operate over groups and fields. Therefore, we represent the dataset using a fixed-point integer representation.
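To make the fixed-point representation concrete, the following sketch encodes reals as integers in a prime field; the scale factor and modulus are illustrative choices, not Helen's actual parameters.

```python
# Hypothetical sketch of fixed-point encoding of floats into Z_p, as needed
# to feed real-valued data into group/field-based cryptographic primitives.
# SCALE and PRIME are illustrative, not Helen's actual parameters.

PRIME = 2**61 - 1   # field modulus (illustrative)
SCALE = 2**20       # fractional precision (~6 decimal digits)

def encode(x: float) -> int:
    """Map a real number into Z_p, representing negatives as p - |v|."""
    return round(x * SCALE) % PRIME

def decode(v: int) -> float:
    """Invert encode(), interpreting the upper half of Z_p as negative."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE

# Addition works directly on encodings; multiplication doubles the scale,
# so one factor of SCALE must be divided out afterwards.
a, b = encode(1.5), encode(-2.25)
assert abs(decode((a + b) % PRIME) - (-0.75)) < 1e-6
```

Note that each multiplication accumulates an extra factor of SCALE, which is why fixed-point MPC systems must periodically truncate.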

2.2 Cryptographic building blocks

In this section, we provide a brief overview of the cryptographic primitives used in Helen.

2.2.1 Threshold partially homomorphic encryption

A partially homomorphic encryption scheme is a public key encryption scheme that allows limited computation over the ciphertexts. For example, Paillier [60] is an additive homomorphic encryption scheme: multiplying two ciphertexts together (in a certain group) generates a new ciphertext such that its decryption yields the sum of the two original plaintexts. Anyone with the public key can encrypt and manipulate the ciphertexts based on their homomorphic property. This encryption scheme also acts as a perfectly binding and computationally hiding homomorphic commitment scheme [41], another property we use in Helen.
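The additive property can be demonstrated with a toy Paillier instance; the primes below are far too small to be secure, and key generation is simplified (taking $g = n + 1$), so this is purely illustrative.

```python
# Toy Paillier illustrating additive homomorphism: multiplying ciphertexts
# yields an encryption of the sum. Parameters are insecure and illustrative.
from math import gcd

p, q = 1117, 1123                      # toy primes (insecure)
n = p * q
n2 = n * n
g = n + 1                              # simplified generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                   # valid because g = n + 1

def encrypt(m, r):
    # r must be coprime to n; it provides the randomness of the ciphertext.
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    l = (pow(c, lam, n2) - 1) // n     # L(x) = (x - 1) / n
    return (l * mu) % n

c1, c2 = encrypt(41, 17), encrypt(1, 29)
assert decrypt(c1) == 41
# Multiplying ciphertexts adds the underlying plaintexts:
assert decrypt((c1 * c2) % n2) == 42
```

Scalar multiplication by a known constant $k$ is likewise possible by raising a ciphertext to the power $k$, which is what lets a party apply its plaintext matrices to encrypted vectors later in the protocol.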

A threshold variant of such a scheme has some additional properties. While the public key is known to everyone, the secret key is split across a set of parties such that a subset of them must participate together to decrypt a ciphertext. If not enough members participate, the ciphertext cannot be decrypted. The threshold structure can be altered based on the adversarial assumption. In Helen, we use a threshold structure where all parties must participate in order to decrypt a ciphertext.

2.2.2 Zero knowledge proofs

Informally, zero knowledge proofs are proofs that show a certain statement is true without revealing the prover’s secret for this statement. For example, a prover can prove that there is a solution to a Sudoku puzzle without revealing the actual solution. Zero knowledge proofs of knowledge additionally prove that the prover indeed knows the secret. Helen uses modified Σ-protocols [26] to prove properties of a party’s local computation. The main building blocks we use are the ciphertext proof of plaintext knowledge, the plaintext-ciphertext multiplication proof, and the ciphertext interval proof of plaintext knowledge [24, 14], as we further explain in Section 4. Note that Σ-protocols are honest-verifier zero knowledge, but they can be transformed into full zero-knowledge using existing techniques [25, 33, 35]. In this paper, we present our protocol using the Σ-protocol notation.

2.2.3 Malicious MPC

We utilize SPDZ [28], a state-of-the-art malicious MPC protocol, for both Helen and the secure baseline we evaluate against. Another recent malicious MPC protocol is authenticated garbled circuits [71], which supports boolean circuits. We decided to use SPDZ for our baseline because the majority of the computation in SGD is spent on matrix operations, which are not efficiently represented as boolean circuits. In the rest of this section, we give an overview of the properties of SPDZ.

An input $a$ to SPDZ is represented as $\langle a \rangle = (\delta, (a_1, \dots, a_m), (\gamma(a)_1, \dots, \gamma(a)_m))$, where $a_i$ is a share of $a$ and $\gamma(a)_i$ is a share of the MAC authenticating $a$ under the SPDZ global key $\alpha$. Party $P_i$ holds $a_i$ and $\gamma(a)_i$, and $\delta$ is public. During a correct SPDZ execution, the following properties must hold: $a = \sum_i a_i$ and $\alpha(a + \delta) = \sum_i \gamma(a)_i$. The global key $\alpha$ is not revealed until the end of the protocol; otherwise, the malicious parties could use $\alpha$ to construct new MACs.
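The share-plus-MAC idea can be sketched in a few lines; this is a simplified illustration of SPDZ-style authenticated secret sharing (no public $\delta$ offset, illustrative field size), not the actual protocol.

```python
# Simplified sketch of SPDZ-style authenticated additive shares: each value x
# is split into shares x_i with MAC shares m_i such that sum(x_i) = x and
# sum(m_i) = alpha * x for a global key alpha. Illustrative, not real SPDZ.
import random

P = 2**61 - 1          # illustrative prime field

def share(x, alpha, n_parties):
    xs = [random.randrange(P) for _ in range(n_parties - 1)]
    xs.append((x - sum(xs)) % P)
    ms = [random.randrange(P) for _ in range(n_parties - 1)]
    ms.append((alpha * x - sum(ms)) % P)
    return list(zip(xs, ms))           # party i holds (x_i, m_i)

def open_and_check(shares, alpha):
    x = sum(s for s, _ in shares) % P
    m = sum(t for _, t in shares) % P
    return m == (alpha * x) % P, x

alpha = random.randrange(1, P)
shares = share(123, alpha, n_parties=4)
ok, x = open_and_check(shares, alpha)
assert ok and x == 123

# Tampering with a share without knowing alpha breaks the MAC check (w.h.p.):
s0, m0 = shares[0]
shares[0] = ((s0 + 1) % P, m0)
assert not open_and_check(shares, alpha)[0]
```

In real SPDZ the key $\alpha$ is itself secret-shared and the MAC check happens without ever revealing $\alpha$ mid-protocol; the sketch only shows why forging an opened value requires knowing $\alpha$.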

SPDZ has two phases: an offline phase and an online phase. The offline phase is independent of the function and generates precomputed values that can be used during the online phase, while the online phase executes the designated function.

2.3 Learning and Convex Optimization

Much of contemporary machine learning can be framed in the context of minimizing the cumulative error (or loss) of a model over the training data. While there is considerable excitement around deep neural networks, the vast majority of real-world machine learning applications still rely on robust linear models because they are well understood and can be efficiently and reliably learned using established convex optimization procedures.

In this work, we focus on linear models with squared error and various forms of regularization, resulting in the following set of multi-party optimization problems:

$$\min_{\mathbf{w}} \; \sum_{i=1}^{m} \frac{1}{2}\|\mathbf{X}_i \mathbf{w} - \mathbf{y}_i\|_2^2 + \lambda R(\mathbf{w}), \qquad (1)$$

where $\mathbf{X}_i$ and $\mathbf{y}_i$ are the training data (features and labels) from party $P_i$. The regularization function $R(\mathbf{w})$ and regularization tuning parameter $\lambda$ are used to improve prediction accuracy on high-dimensional data. Typically, the regularization function takes one of the following forms:

$$R(\mathbf{w}) = \|\mathbf{w}\|_1 \quad \text{or} \quad R(\mathbf{w}) = \|\mathbf{w}\|_2^2,$$

corresponding to LASSO ($L^1$) and ridge ($L^2$) regression, respectively. The estimated model $\mathbf{w}$ can then be used to render a new prediction $\hat{y} = \mathbf{w}^T \mathbf{x}$ at a query point $\mathbf{x}$. It is worth noting that in some applications of LASSO (e.g., genomics [29]) the dimension $d$ can be larger than $n$. However, in this work we focus on settings where $d$ is smaller than $n$, and the real datasets and scenarios we use in our evaluation satisfy this property.

ADMM. Alternating Direction Method of Multipliers (ADMM) [15] is an established technique for distributed convex optimization. To use ADMM, we first reformulate Eq. (1) by introducing additional variables and constraints:

$$\min_{\mathbf{w}_1, \dots, \mathbf{w}_m, \mathbf{z}} \; \sum_{i=1}^{m} \frac{1}{2}\|\mathbf{X}_i \mathbf{w}_i - \mathbf{y}_i\|_2^2 + \lambda R(\mathbf{z}) \quad \text{such that } \mathbf{w}_i = \mathbf{z} \text{ for all } i. \qquad (2)$$

This equivalent formulation splits $\mathbf{w}$ into a local model $\mathbf{w}_i$ for each party $P_i$, but still requires that each $\mathbf{w}_i$ be equal to a global model $\mathbf{z}$. To solve this constrained formulation, we construct an augmented Lagrangian:

$$\mathcal{L}(\{\mathbf{w}_i\}, \mathbf{z}, \{\mathbf{u}_i\}) = \sum_{i=1}^{m} \frac{1}{2}\|\mathbf{X}_i \mathbf{w}_i - \mathbf{y}_i\|_2^2 + \lambda R(\mathbf{z}) + \rho \sum_{i=1}^{m} \mathbf{u}_i^T (\mathbf{w}_i - \mathbf{z}) + \frac{\rho}{2} \sum_{i=1}^{m} \|\mathbf{w}_i - \mathbf{z}\|_2^2, \qquad (3)$$

where the dual variables $\mathbf{u}_i$ capture the mismatch between the model $\mathbf{w}_i$ estimated by party $P_i$ and the global model $\mathbf{z}$, and the augmenting term adds an additional penalty (scaled by the constant $\rho$) for deviating from $\mathbf{z}$.

The ADMM algorithm is a simple iterative dual ascent on the augmented Lagrangian of Eq. (3). On the $k$-th iteration, each party locally solves this closed-form expression:

$$\mathbf{w}_i^{k+1} = (\mathbf{X}_i^T \mathbf{X}_i + \rho \mathbf{I})^{-1} \left( \mathbf{X}_i^T \mathbf{y}_i + \rho (\mathbf{z}^k - \mathbf{u}_i^k) \right) \qquad (4)$$

and then shares its local model $\mathbf{w}_i^{k+1}$ and Lagrange multipliers $\mathbf{u}_i^k$ to solve for the new global weights:

$$\mathbf{z}^{k+1} = \arg\min_{\mathbf{z}} \; \lambda R(\mathbf{z}) + \frac{\rho}{2} \sum_{i=1}^{m} \|\mathbf{w}_i^{k+1} - \mathbf{z} + \mathbf{u}_i^k\|_2^2. \qquad (5)$$

Finally, each party uses the new global weights to update its local Lagrange multipliers:

$$\mathbf{u}_i^{k+1} = \mathbf{u}_i^k + \mathbf{w}_i^{k+1} - \mathbf{z}^{k+1}. \qquad (6)$$

The update equations (4), (5), and (6) are executed iteratively until all updates reach a fixed point. In practice, a fixed number of iterations may be used as a stopping condition, and that is what we do in Helen.

LASSO. We use LASSO as a running example for the rest of the paper in order to illustrate how our secure training protocol works. LASSO is a popular regularized linear regression model that uses the $L^1$ norm as the regularization function. The LASSO formulation is given by the optimization objective $\min_{\mathbf{w}} \sum_{i=1}^{m} \frac{1}{2}\|\mathbf{X}_i \mathbf{w} - \mathbf{y}_i\|_2^2 + \lambda \|\mathbf{w}\|_1$. The boxed section below shows the ADMM training procedure for LASSO. There, the intermediate quantities $\mathbf{z}^k$, $\mathbf{w}_i^k$, and $\mathbf{u}_i^k$ arise during the computation and need to be protected from every party, whereas the input summaries $\mathbf{A}_i$ and $\mathbf{b}_i$ are private values known to one party.

The coopetitive learning task for LASSO. Input of party $P_i$: $\mathbf{A}_i = (\mathbf{X}_i^T \mathbf{X}_i + \rho \mathbf{I})^{-1}$ and $\mathbf{b}_i = \mathbf{X}_i^T \mathbf{y}_i$. For $k = 0, \dots, \text{ADMMIterations}-1$:

1. $\mathbf{w}_i^{k+1} = \mathbf{A}_i (\mathbf{b}_i + \rho (\mathbf{z}^k - \mathbf{u}_i^k))$
2. $\mathbf{z}^{k+1} = S_{\lambda/(m\rho)}(\bar{\mathbf{w}}^{k+1} + \bar{\mathbf{u}}^k)$, where $\bar{\mathbf{w}}^{k+1}$ and $\bar{\mathbf{u}}^k$ are the averages of the parties' $\mathbf{w}_i^{k+1}$ and $\mathbf{u}_i^k$
3. $\mathbf{u}_i^{k+1} = \mathbf{u}_i^k + \mathbf{w}_i^{k+1} - \mathbf{z}^{k+1}$

$S_\kappa$ is the soft thresholding operator, where

$$S_\kappa(a) = \begin{cases} a - \kappa & \text{if } a > \kappa \\ 0 & \text{if } |a| \le \kappa \\ a + \kappa & \text{if } a < -\kappa. \end{cases}$$

The parameters $\rho$ and $\lambda$ are public and fixed.
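The boxed ADMM procedure can be sketched in plaintext (no cryptography) to see the per-party summaries and the three update steps at work; the data sizes, parameters, and synthetic dataset below are illustrative.

```python
# Plaintext sketch of the boxed consensus-ADMM procedure for multi-party
# LASSO (no cryptography; sizes, parameters, and data are illustrative).
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 3, 200, 5                    # parties, samples per party, features
rho, lam, iters = 1.0, 0.1, 50
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
Xs = [rng.standard_normal((n, d)) for _ in range(m)]
ys = [X @ w_true + 0.01 * rng.standard_normal(n) for X in Xs]

# Per-party precomputed summaries A_i and b_i, as in the input preparation:
As = [np.linalg.inv(X.T @ X + rho * np.eye(d)) for X in Xs]
bs = [X.T @ y for X, y in zip(Xs, ys)]

def soft_threshold(a, k):
    return np.sign(a) * np.maximum(np.abs(a) - k, 0.0)

z = np.zeros(d)
us = [np.zeros(d) for _ in range(m)]
for _ in range(iters):
    ws = [A @ (b + rho * (z - u)) for A, b, u in zip(As, bs, us)]  # local solve
    w_bar, u_bar = np.mean(ws, axis=0), np.mean(us, axis=0)
    z = soft_threshold(w_bar + u_bar, lam / (m * rho))             # coordination
    us = [u + w - z for u, w in zip(us, ws)]                       # dual update

assert np.allclose(z, w_true, atol=0.05)
```

Note how the per-iteration work involves only the $d \times d$ summaries $\mathbf{A}_i$ and $d$-vectors, never the raw $n \times d$ data; this is the property Helen's cryptographic protocol exploits.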

3 System overview

Figure 2: Architecture overview of Helen. Every red shape indicates secret information known only to the indicated party, and black indicates public information visible to everyone (which could be private information in encrypted form). For each participant $P_i$, we annotate the meaning of each quantity.

Figure 2 shows the system setup in Helen. A group of $m$ participants (also called parties) wants to jointly train a model on their data without sharing the plaintext data. As mentioned in Section 1, the use cases we envision for our system consist of a few large organizations, where each organization has a lot of data ($n$, the number of records per organization, is on the order of hundreds of thousands or millions). The number of features/columns in the dataset, $d$, is on the order of tens or hundreds. Hence, $d \ll n$.

We assume that the parties have agreed to publicly release the final model. As part of Helen, they will engage in an interactive protocol during which they share encrypted data, and only at the end will they obtain the model in decrypted form. Helen supports regularized linear models including least squares linear regression, ridge regression, LASSO, and elastic net. In the rest of the paper, we focus on explaining Helen via LASSO, but we also provide update equations for ridge regression in Section 7.

3.1 Threat model

We assume that all parties have agreed upon a single functionality to compute and have also consented to releasing the final result of the function to every party.

We consider a strong threat model in which all but one party can be compromised by a malicious attacker. This means that the compromised parties can deviate arbitrarily from the protocol, such as supplying inconsistent inputs, substituting their input with another party’s input, or executing a different computation than expected. In the flu prediction example, six of the seven medical organizations could collude together to learn information about the seventh. However, as long as the victim organization follows our protocol correctly, the other organizations will not be able to learn anything about the victim other than the final result of the function. We now state the security theorem.

Theorem 6. Helen securely evaluates an ideal functionality $\mathcal{F}$ in the $(\mathcal{F}_{\mathrm{CRS}}, \mathcal{F}_{\mathrm{SPDZ}})$-hybrid model under standard cryptographic assumptions, against a malicious adversary who can statically corrupt up to $m-1$ out of $m$ parties.

We formalize the security of Helen in the standalone MPC model. $\mathcal{F}_{\mathrm{CRS}}$ and $\mathcal{F}_{\mathrm{SPDZ}}$ are ideal functionalities that we use in our proofs: $\mathcal{F}_{\mathrm{CRS}}$ represents the creation of a common reference string, and $\mathcal{F}_{\mathrm{SPDZ}}$ makes a call to SPDZ. We present the formal definitions as well as proofs in Section B.2.

Out of scope attacks/complementary directions. Helen does not prevent a malicious party from choosing a bad dataset for the coopetitive computation (e.g., in an attempt to alter the computation result). In particular, Helen does not prevent poisoning attacks [46, 19]. MPC protocols generally do not protect against bad inputs because there is no way to ensure that a party provides true data. Nevertheless, Helen will ensure that once a party supplies its input into the computation, the party is bound to using the same input consistently throughout the entire computation; in particular, this prevents a party from providing different inputs at different stages of the computation, or mix-and-matching inputs from other parties. Further, some additional constraints can also be placed in pre-processing, training, and post-processing to mitigate such attacks, as we elaborate in Section 9.2.

Helen also does not protect against attacks launched on the public model, for example, attacks that attempt to recover the training data from the model itself [67, 17]. The parties are responsible for deciding whether they are willing to share the model with each other. Our goal is only to conduct this computation securely: to ensure that the parties do not share their raw plaintext datasets with each other, that they do not learn more information than the resulting model, and that only the specified computation is executed. Investigating techniques for ensuring that the model does not leak too much about the data is a complementary direction to Helen, and we expect that many of these techniques could be plugged into a system like Helen. For example, Helen can be easily combined with some differential privacy tools that add noise before model release to ensure that the model does not leak too much about an individual record in the training data. We further discuss possible approaches in Section 9.3.

Finally, Helen does not protect against denial of service – all parties must participate in order to produce a model.

3.2 Protocol phases

We now explain the protocol phases at a high level. The first phase requires all parties to agree to perform the coopetitive computation, which happens before initializing Helen. The other phases are run using Helen.

Agreement phase. In this phase, the parties come together and agree that they are willing to run a certain learning algorithm (in Helen’s case, ADMM for linear models) over their joint data. The parties should also agree to release the computed model among themselves.

The following discrete phases are run by Helen. We summarize their purposes here and provide the technical design for each in the following sections.

Initialization phase. During initialization, the parties compute the threshold encryption parameters [34] using a generic maliciously secure MPC protocol like SPDZ [28]. The public output of this protocol is a public key PK that is known to everyone. Each party also receives a piece (called a share) of the corresponding secret key SK: party $P_i$ receives the $i$-th share of the key, denoted $SK_i$. A value encrypted under PK can only be decrypted via all shares of the SK, so every party needs to agree to decrypt this value. Fig. 2 shows these keys. This phase only needs to run once for the entire training process, and does not need to be re-run as long as the parties’ configuration does not change.

Input preparation phase. In this phase, each party prepares its data for the coopetitive computation. Each party precomputes summaries of its data and commits to them by broadcasting encrypted summaries to all other parties. The parties also need to prove that they know the values inside these encryptions using zero-knowledge proofs of knowledge. From this moment on, a party will not be able to use different inputs for the rest of the computation.

By default, each party stores the encrypted summaries from the other parties. This is a viable solution since these summaries are much smaller than the data itself. It is also possible to store all summaries in a public cloud by having each party produce an integrity MAC of the summary it received from each other party and check the MAC upon retrieval, which protects against a compromised cloud.
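The cloud-storage integrity check can be sketched as follows; HMAC-SHA256 is an illustrative choice of MAC, and the key and blobs are placeholders.

```python
# Sketch of the integrity check for cloud-stored summaries: a party MACs each
# (encrypted) summary it received and verifies the tag on retrieval.
# HMAC-SHA256 and the key/blob values are illustrative.
import hmac, hashlib

key = b"party-1-local-mac-key"                   # stays with the verifying party
summary = b"encrypted summary bytes from party 2"

tag = hmac.new(key, summary, hashlib.sha256).digest()
# (summary, tag) are uploaded to the cloud; later, after fetching them back:
fetched = summary
assert hmac.compare_digest(tag, hmac.new(key, fetched, hashlib.sha256).digest())

# A blob tampered with by a compromised cloud fails the check:
tampered = b"corrupted " + summary
assert not hmac.compare_digest(tag, hmac.new(key, tampered, hashlib.sha256).digest())
```

Because the MAC key never leaves the verifying party, a compromised cloud cannot forge a valid tag for altered summaries.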

Model compute phase. This phase follows the iterative ADMM algorithm, in which parties successively compute locally on encrypted data, followed by a coordination step with other parties using a generic MPC protocol.

Throughout this protocol, each party receives only encrypted intermediate data. No party learns the intermediate data because, by definition, an MPC protocol should not reveal any data beyond the final result. Moreover, each party proves in zero knowledge to the other parties that it performed the local computation correctly using data that is consistent with the private data that was committed in the input preparation phase. If any one party misbehaves, the other parties will be able to detect the cheating with overwhelming probability.

Model release phase. At the end of the model compute phase, all parties obtain an encrypted model. All parties jointly decrypt the weights and release the final model. However, it is possible for a set of parties to not receive the final model at the end of training if other parties misbehave (it has been proven that it is impossible to achieve fairness for generic MPC in the malicious majority setting [20]). Nevertheless, this kind of malicious behavior is easily detectable in Helen and can be enforced using legal methods.

4 Cryptographic Gadgets

Helen’s design combines several different cryptographic primitives. In order to explain the design clearly, we split Helen into modular gadgets. In this section and the following sections, we discuss (1) how Helen implements these gadgets, and (2) how Helen composes them in the overall protocol.

For simplicity, we present our zero knowledge proofs as Σ-protocols, which require the verifier to generate random challenges. These protocols can be transformed into full zero knowledge with non-malleability guarantees using existing techniques [35, 33]. We explain one such transformation in Section B.2.

4.1 Plaintext-ciphertext matrix multiplication proof


Gadget 1.

A zero-knowledge proof for the statement: “Given public parameters: public key $PK$ and encryptions $\mathrm{Enc}_{PK}(\mathbf{X})$, $\mathrm{Enc}_{PK}(\mathbf{Y})$, and $\mathrm{Enc}_{PK}(\mathbf{Z})$; private parameters: $\mathbf{X}$,

  • $\mathbf{X}\mathbf{Y} = \mathbf{Z}$, and

  • I know $\mathbf{X}$ such that $\mathbf{X}\mathbf{Y} = \mathbf{Z}$.”

Gadget usage. We first explain how Gadget 1 is used in Helen. A party $P_i$ in Helen knows a plaintext matrix $\mathbf{X}$ and commits to $\mathbf{X}$ by publishing its encryption, denoted by $\mathrm{Enc}_{PK}(\mathbf{X})$. $P_i$ also receives an encrypted matrix $\mathrm{Enc}_{PK}(\mathbf{Y})$ and needs to compute $\mathrm{Enc}_{PK}(\mathbf{Z}) = \mathrm{Enc}_{PK}(\mathbf{X}\mathbf{Y})$ by leveraging the homomorphic properties of the encryption scheme. Since parties in Helen may be malicious, the other parties cannot trust $P_i$ to compute and output $\mathrm{Enc}_{PK}(\mathbf{Z})$ correctly. Gadget 1 helps $P_i$ prove in zero-knowledge that it executed the computation correctly. The proof needs to be zero-knowledge so that nothing is leaked about the value of $\mathbf{X}$. It also needs to be a proof of knowledge so that $P_i$ proves that it knows the plaintext matrix $\mathbf{X}$.

Protocol. Using the Paillier ciphertext multiplication proofs [24], we can construct a naïve algorithm for proving matrix multiplication. For input matrices that are $d \times d$, the naïve algorithm incurs a cost of $d^3$ such proofs, since one has to prove each individual product. One way to reduce this cost is to have the prover instead prove that $\mathbf{X}(\mathbf{Y}\mathbf{c}) = \mathbf{Z}\mathbf{c}$ for a randomly chosen vector $\mathbf{c}$ (where $\mathbf{c}$ is a challenge from the verifier). For such a randomly chosen $\mathbf{c}$, the chance that a cheating prover with $\mathbf{X}\mathbf{Y} \neq \mathbf{Z}$ can pass this check is exponentially small (see Theorem 3 for an analysis).
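This randomized reduction is essentially a Freivalds-style check: verifying a matrix product via a random vector costs two matrix-vector products instead of a full matrix multiplication. Plain integer arithmetic below stands in for the encrypted, zero-knowledge version.

```python
# Freivalds-style randomized check of X @ Y == Z: verify X (Y c) == Z c for a
# random challenge c. Plain integers stand in for the encrypted/ZK setting.
import random

d = 4
X = [[random.randrange(10) for _ in range(d)] for _ in range(d)]
Y = [[random.randrange(10) for _ in range(d)] for _ in range(d)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

Z = matmul(X, Y)
c = [random.randrange(1, 2**32) for _ in range(d)]   # verifier's challenge

# Two matrix-vector products instead of one matrix-matrix product:
assert matvec(X, matvec(Y, c)) == matvec(Z, c)

# A wrong product is caught: perturbing Z[0][0] shifts (Z c)[0] by c[0] != 0.
Z[0][0] += 1
assert matvec(X, matvec(Y, c)) != matvec(Z, c)
```

In Helen the same reduction is applied to ciphertexts, so only $O(d^2)$ individual ciphertext multiplication proofs are needed rather than $O(d^3)$.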

As the first step, both the prover and the verifier apply the reduction to get the new statement $\mathbf{X}(\mathbf{Y}\mathbf{c}) = \mathbf{Z}\mathbf{c}$. To prove this reduced form, we apply the Paillier ciphertext multiplication proof in a straightforward way. This proof takes as input three ciphertexts $\mathrm{Enc}_{PK}(a)$, $\mathrm{Enc}_{PK}(b)$, and $\mathrm{Enc}_{PK}(c')$; the prover proves that it knows the plaintext $a$, and that $c' = ab$. We apply this proof to every multiplication for each dot product in $\mathbf{X}(\mathbf{Y}\mathbf{c})$. The prover then releases the individual encrypted products along with the corresponding ciphertext multiplication proofs. The verifier needs to verify that $\mathbf{X}(\mathbf{Y}\mathbf{c}) = \mathbf{Z}\mathbf{c}$. Since the ciphertexts from the previous step are encrypted using Paillier, the verifier can homomorphically add them appropriately to obtain the encrypted vector $\mathrm{Enc}_{PK}(\mathbf{X}(\mathbf{Y}\mathbf{c}))$; from a dot product perspective, this step sums up the individual products computed in the previous step. Finally, the prover needs to prove that each element of $\mathbf{X}(\mathbf{Y}\mathbf{c})$ is equal to the corresponding element of $\mathbf{Z}\mathbf{c}$. We can prove this using the same ciphertext multiplication proof by setting one of the multiplicands to $1$.

4.2 Plaintext-plaintext matrix multiplication proof


Gadget 2.

A zero-knowledge proof for the statement: “Given public parameters: public key $PK$ and encryptions $\mathrm{Enc}_{PK}(\mathbf{X})$, $\mathrm{Enc}_{PK}(\mathbf{Y})$, $\mathrm{Enc}_{PK}(\mathbf{Z})$; private parameters: $\mathbf{X}$ and $\mathbf{Y}$,

  • $\mathbf{X}\mathbf{Y} = \mathbf{Z}$, and

  • I know $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ such that $\mathbf{X}\mathbf{Y} = \mathbf{Z}$.”

Gadget usage

This proof is used to prove matrix multiplication when the prover knows both input matrices (and thus the output matrix as well). The protocol is similar to the plaintext-ciphertext proof, except that we additionally have to prove knowledge of $\mathbf{Y}$.

Protocol. The prover wishes to prove to a verifier that $\mathbf{X}\mathbf{Y} = \mathbf{Z}$ without revealing $\mathbf{X}$, $\mathbf{Y}$, or $\mathbf{Z}$. We follow the same protocol as Gadget 1. Additionally, we utilize a variant of the ciphertext multiplication proof that only contains the proof of knowledge component to show that the prover also knows $\mathbf{Y}$. The proof of knowledge for the matrix $\mathbf{Y}$ is simply a list of element-wise proofs for $\mathbf{Y}$. We do not explicitly prove knowledge of $\mathbf{Z}$ because the matrix multiplication proof and the proofs of knowledge for $\mathbf{X}$ and $\mathbf{Y}$ imply that the prover knows $\mathbf{Z}$ as well.

5 Input preparation phase

5.1 Overview

In this phase, each party prepares data for coopetitive training. In the beginning of the ADMM procedure, every party precomputes some summaries of its data and commits to them by broadcasting encrypted summaries to all the other parties. These summaries are then reused throughout the model compute phase. Some form of commitment is necessary in the malicious setting because an adversary can deviate from the protocol by altering its inputs. Therefore, we need a new gadget that allows us to efficiently commit to these summaries.

More specifically, the ADMM computation reuses two matrices during training: $\mathbf{A}_i = (\mathbf{X}_i^T \mathbf{X}_i + \rho \mathbf{I})^{-1}$ and $\mathbf{b}_i = \mathbf{X}_i^T \mathbf{y}_i$ from party $P_i$ (see Section 2.3 for more details). These two matrices are of sizes $d \times d$ and $d \times 1$, respectively. In a semihonest setting, we would trust parties to compute $\mathbf{A}_i$ and $\mathbf{b}_i$ correctly. In a malicious setting, however, the parties can deviate from the protocol and choose $\mathbf{A}_i$ and $\mathbf{b}_i$ that are inconsistent with each other (e.g., they do not conform to the above formulations).

Helen does not have any control over what data each party contributes because the parties must be free to choose their own $\mathbf{X}_i$ and $\mathbf{y}_i$. However, Helen ensures that each party consistently uses the same $\mathbf{X}_i$ and $\mathbf{y}_i$ during the entire protocol. Otherwise, malicious parties could try to use different/inconsistent $\mathbf{X}_i$ and $\mathbf{y}_i$ at different stages of the protocol, and thus manipulate the final outcome of the computation to contain the data of another party.

One possibility to address this problem is for each party to commit to its X_i in Enc(X_i) and its y_i in Enc(y_i). To calculate A_i, the party can calculate and prove X_i^T X_i using Gadget 2, followed by a matrix inversion computed within SPDZ. The result can be repeatedly used in the iterations. This is clearly inefficient because (1) the protocol scales linearly in n, which could be very large, and (2) the matrix inversion requires heavy compute.

Our idea is to prove A_i and b_i using an alternate formulation via singular value decomposition (SVD) [40], which can be much more succinct: A_i and b_i can be decomposed using SVD into matrices whose sizes scale only in d. Proving the properties of A_i and b_i using the decomposed matrices is equivalent to proving them using X_i and y_i.

5.2 Protocol

5.2.1 Decomposition of reused matrices

We first derive an alternate formulation for X_i (denoted as X for the rest of this section). From fundamental linear algebra we know that every matrix has a corresponding singular value decomposition [40]. More specifically, there exist unitary matrices U and V, and a diagonal matrix Σ, such that X = UΣV^T, where U ∈ R^{n×n}, Σ ∈ R^{n×d}, and V ∈ R^{d×d}. Since X is a real matrix, the decomposition also guarantees that U and V are orthogonal, meaning that U^T U = I and V^T V = I. If X is not a square matrix, then the top part of Σ is a diagonal matrix, which we will call Σ̂ ∈ R^{d×d}. Σ̂'s diagonal is a list of singular values σ_1, …, σ_d; the rest of Σ consists of 0's. If X is a square matrix, then Σ is simply Σ̂. Finally, the matrices U and V are orthogonal matrices. Given an orthogonal matrix Q, we have that QQ^T = Q^T Q = I.

It turns out that A = (X^T X + ρI)^{-1} has some interesting properties. Since X^T X = VΣ^T Σ V^T = VΣ̂²V^T and V is orthogonal, we have A = (VΣ̂²V^T + ρVV^T)^{-1} = VΘV^T, where Θ is the diagonal matrix with diagonal values 1/(σ_1² + ρ), …, 1/(σ_d² + ρ). Using a similar reasoning, we can also derive that b = X^T y = VΣ^T U^T y = VΣ̂y*, where y* is the vector consisting of the first d elements of U^T y.
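The two identities above can be checked numerically. The following sketch (using numpy, with illustrative sizes) verifies that (X^T X + ρI)^{-1} = V diag(1/(σ_i² + ρ)) V^T and that X^T y = V Σ̂ y*, where y* holds the first d elements of U^T y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 8, 3, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Full SVD: X = U @ Sigma @ Vt, with U (n x n) and V (d x d) orthogonal.
U, sigma, Vt = np.linalg.svd(X, full_matrices=True)
V = Vt.T

# A = (X^T X + rho I)^{-1} via the decomposition V diag(1/(sigma_i^2 + rho)) V^T.
A_direct = np.linalg.inv(X.T @ X + rho * np.eye(d))
A_svd = V @ np.diag(1.0 / (sigma**2 + rho)) @ V.T

# b = X^T y via the decomposition V Sigma_hat y*, where y* is the first d
# entries of U^T y (the remaining rows of Sigma are zero).
b_direct = X.T @ y
y_star = (U.T @ y)[:d]
b_svd = V @ np.diag(sigma) @ y_star

assert np.allclose(A_direct, A_svd)
assert np.allclose(b_direct, b_svd)
```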

5.2.2 Properties after decomposition

The SVD decomposition sets up an alternative way to commit to the matrices A_i and b_i. For the rest of this section, we describe the zero knowledge proofs that every party has to execute. For simplicity, we focus on one party and use X and y to represent its data, and A and b to represent its summaries.

During the ADMM computation, the matrices A and b are repeatedly used to calculate the intermediate weights. Therefore, each party needs to commit to A and b. With the alternative formulation, it is no longer necessary to commit to A and b individually. Instead, it suffices to prove that a party knows V, Σ̂, Θ (all in R^{d×d}) and a vector y* such that:

  1. A = VΘV^T,

  2. b = VΣ̂y*,

  3. V is an orthogonal matrix, namely, V^T V = I, and

  4. Θ is a diagonal matrix whose diagonal entries are 1/(σ_i² + ρ), where the σ_i are the values on the diagonal of Σ̂ and ρ is a public value.

Note that Σ can be readily derived from Σ̂ by adding rows of zeros. Moreover, both Σ̂ and Θ are diagonal matrices. Therefore, we only commit to the diagonal entries of Σ̂ and Θ since the rest of the entries are zeros.

The above four statements are sufficient to prove the properties of A and b in the new formulation. The first two statements simply prove that A and b are indeed decomposed into some matrices V, Θ, Σ̂, and vector y*. Statement 3) shows that V is an orthogonal matrix, since by definition an orthogonal matrix has to satisfy the equation V^T V = I. However, we allow the prover to choose V. As stated before, the prover would have been free to choose X and y anyway, so this freedom does not give more power to the prover.

Statement 4) proves that the matrix Θ is a diagonal matrix whose diagonal values satisfy the form 1/(σ_i² + ρ). This is sufficient to show that A is correct according to some Σ̂. Again, the prover is free to choose Σ̂, which is the same as freely choosing its input X.

Finally, we chose to commit to y* instead of committing to U and y separately. Following our logic above, it seems that we would also need to commit to U and prove that it is an orthogonal matrix, similar to what we did with V. This is not necessary because of an important property of orthogonal matrices: U's columns span the vector space R^n. The product U^T y is a linear combination of y against the columns of U, and since we also allow the prover to pick its y, this product can essentially be any vector in R^n. Thus, we only have to allow the prover to commit to the product of U^T and y. As we can see from the derivation, b = VΣ^T U^T y, but since Σ is simply Σ̂ with rows of zeros, the computation only needs the first d elements of U^T y. Hence, this allows us to commit to y*, which is the first d elements of U^T y.

Using our techniques, Helen commits only to matrices of sizes d × d or d × 1, thus removing any scaling in n (the number of rows in the dataset) in the input preparation phase.

5.2.3 Proving the initial data summaries

First, each party broadcasts the encryptions of V, Σ̂, Θ, y*, A, and b. To encrypt a matrix, the party simply encrypts each entry individually. The encryption scheme itself also acts as a commitment scheme [41], so we do not need an extra commitment scheme.
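For intuition, the additively homomorphic operations Helen performs on these encrypted summaries can be sketched with a toy Paillier implementation. The key sizes below are deliberately tiny and insecure, chosen only to illustrate the homomorphisms Enc(a)·Enc(b) = Enc(a + b) and Enc(a)^k = Enc(k·a):

```python
import random
from math import gcd

# Toy Paillier parameters (insecure; real keys are 2048+ bits).
p, q = 1000003, 1000033
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1)
mu = pow(lam, -1, n)  # works because g = n + 1

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def dec(c):
    # L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n.
    return (pow(c, lam, n2) - 1) // n * mu % n

a, b, k = 1234, 5678, 3
assert dec(enc(a) * enc(b) % n2) == a + b   # homomorphic addition
assert dec(pow(enc(a), k, n2)) == k * a     # plaintext scaling
```

These two operations are exactly what the later phases use for homomorphic matrix addition and plaintext-ciphertext matrix multiplication.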

To prove these statements, we need another primitive called an interval proof. Moreover, since these matrices act as inputs to the model compute phase, we also need to prove that A and b are within a certain range (this will be used by Gadget 4, described in Section 6.5). The interval proof we use is from [14], which is an efficient way of proving that a committed number lies within a certain interval. However, what we want to prove is that an encrypted number lies within a certain interval. This can be solved by using techniques from [27], which append the range proof with a commitment-ciphertext equality proof. This extra proof shows that, given a commitment and a Paillier ciphertext, both hide the same plaintext value.

To prove the first two statements, we invoke Gadget 1 and Gadget 2. This allows us to prove that the party knows all of the matrices in question and that they satisfy the relations laid out in those statements.

There are two steps to proving statement 3. First, the prover computes the encryption of V^T V and proves, using Gadget 1, that this computation was done correctly. The result should be equal to the encryption of the identity matrix. However, since we are using a fixed point representation for our data, the resulting matrix could be off from the expected values by some small error: V^T V will only be close to I, but not equal to I. Therefore, we also utilize interval proofs to make sure that V^T V is close to I, without explicitly revealing its value.
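The closeness check can be illustrated in plaintext: encode an orthogonal V in fixed point, compute V^T V, and verify each entry lies in a small interval around the (scaled) identity rather than demanding exact equality. A numpy sketch, with an illustrative scaling factor and error bound:

```python
import numpy as np

rng = np.random.default_rng(1)
d, scale = 4, 2**20  # fixed-point scaling factor (illustrative)

# Random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Encode V in fixed point, as the ciphertexts do.
V_fp = np.round(Q * scale).astype(np.int64)

# V^T V in fixed-point arithmetic carries a factor of scale^2.
prod = V_fp.T @ V_fp
target = np.eye(d, dtype=np.int64) * scale * scale

# The verifier only checks that each entry lies within a small interval
# around the scaled identity; rounding makes exact equality impossible.
err_bound = d * (scale + 1)  # loose bound on accumulated rounding error
assert np.all(np.abs(prod - target) <= err_bound)
```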

Finally, to prove statement 4, the prover does the following:

  1. The prover computes and releases the encryption of Σ̂^T Σ̂ (which it can do because it knows Σ̂) and proves using Gadget 1 that this computation is done correctly.

  2. The prover computes the encryption of ρI, which anyone can compute because ρ and I are public. The two ciphertexts can be multiplied together to obtain an encryption of the sum of the plaintext matrices, Σ̂^T Σ̂ + ρI.

  3. The prover now computes the encryption of Θ(Σ̂^T Σ̂ + ρI) and proves this encryption was computed correctly using Gadget 1.

  4. Similar to the proof of statement 3, the prover ends this step by using interval proofs to prove that this encryption is close to the encryption of the identity matrix.

6 Model compute phase

6.1 Overview

In the model compute phase, all parties use the summaries computed in the input preparation phase and execute the iterative ADMM training protocol. An encrypted weight vector is generated at the end of this phase and distributed to all participants. The participants can jointly decrypt this weight vector to get the plaintext model parameters. This phase executes in three steps: initialization, training (local optimization and coordination), and model release.

6.2 Initialization

We initialize the weight vectors. There are two popular ways of initializing the weights: setting every entry to a random number, or setting every entry to zero. In Helen, we use the second method because it is simple and works well in practice.

6.3 Local optimization

During ADMM’s local optimization phase, each party takes the current weight vector and iteratively optimizes the weights based on its own dataset. For LASSO, the update equation is simply w_i^{k+1} = A(b + ρ(z^k − u_i^k)), where A is the matrix (X_i^T X_i + ρI)^{-1} and b is X_i^T y_i. As we saw from the input preparation phase description, each party holds encryptions of A and b. Furthermore, given z^k and u_i^k (either initialized or received as results calculated from the previous round), each party can independently calculate w_i^{k+1} by doing plaintext scaling and plaintext-ciphertext matrix multiplication. Since this is done locally, each party also needs to generate a proof that it calculated w_i^{k+1} correctly. We compute the proof for this step by invoking Gadget 1.
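The local step is a single matrix-vector product against the precomputed summaries. A plaintext sketch (in Helen the same arithmetic is carried out homomorphically; the sizes here are illustrative):

```python
import numpy as np

def lasso_local_update(A, b, z, u, rho):
    """ADMM local step: w = A (b + rho (z - u)), where
    A = (X^T X + rho I)^{-1} and b = X^T y are the party's
    precomputed, committed summaries."""
    return A @ (b + rho * (z - u))

rng = np.random.default_rng(2)
n, d, rho = 50, 5, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = np.linalg.inv(X.T @ X + rho * np.eye(d))
b = X.T @ y
z = np.zeros(d)
u = np.zeros(d)

w = lasso_local_update(A, b, z, u, rho)
# With z = u = 0, w solves (X^T X + rho I) w = X^T y.
assert np.allclose((X.T @ X + rho * np.eye(d)) @ w, b)
```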

6.4 Coordination using MPC

After the local optimization step, each party holds encrypted weights Enc(w_i^{k+1}). The next step in the ADMM iterative optimization is the coordination phase. Since this step contains non-linear functions, we evaluate it using generic MPC.

6.4.1 Conversion to MPC

First, the encrypted weights need to be converted into an MPC-compatible input. To do so, we formulate a gadget that converts ciphertext to arithmetic shares. The general idea behind the protocol is inspired by arithmetic sharing protocols [24, 28].


Gadget 3.

For m parties, each party having the public key PK and a share of the secret key SK, given a public ciphertext Enc(a), convert a into shares a_i ∈ Z_p such that a ≡ Σ_i a_i mod p. Each party i receives the secret share a_i and does not learn the original secret value a.

Gadget usage Each party uses this gadget to convert the encrypted weights w_i^{k+1} and u_i^k into input shares and compute the soft threshold function using MPC (in our case, SPDZ). We denote the public modulus used by SPDZ as p. Note that all of the computation under the ciphertexts is done modulo p.

Protocol The protocol proceeds as follows:

  1. Each party i generates a random value r_i ∈ [0, 2^{|p|+s}] and encrypts it, where s is a statistical security parameter. Each party should also generate an interval proof of plaintext knowledge of r_i, then publish Enc(r_i) along with the proof.

  2. Each party takes as input the published Enc(r_i) and computes their product with Enc(a). The result is Enc(a + Σ_i r_i).

  3. All parties jointly decrypt this ciphertext to get the plaintext c = a + Σ_i r_i.

  4. Party 1 sets a_1 = c − r_1 mod p. Every other party i sets a_i = −r_i mod p.

  5. Each party publishes Enc(a_i) as well as an interval proof of plaintext knowledge.
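Stripping away the Paillier layer and the zero-knowledge proofs, the share arithmetic of the steps above can be sketched as follows (the modulus and parameters are illustrative stand-ins):

```python
import secrets

p = 2**61 - 1          # stand-in for the SPDZ modulus
s = 40                 # statistical security parameter
m = 3                  # number of parties
a = 123456789          # the secret held inside the ciphertext (a < p)

# Step 1: each party picks a large random mask (encryption and the
# accompanying interval proofs are elided in this sketch).
masks = [secrets.randbelow(2**(p.bit_length() + s)) for _ in range(m)]

# Steps 2-3: homomorphic addition followed by joint decryption reveals
# only the masked value a + sum(masks), which statistically hides a.
revealed = a + sum(masks)

# Step 4: party 1 subtracts the revealed value; every other party
# negates its own mask. All shares are taken mod p.
shares = [(revealed - masks[0]) % p] + [(-r) % p for r in masks[1:]]

# The shares reconstruct a modulo p, yet no single share reveals it.
assert sum(shares) % p == a
```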

6.4.2 Coordination

The ADMM coordination step takes in the shares of w_i^{k+1} and u_i^k, and outputs z^{k+1}. The update requires computing the soft threshold function (a non-linear function), so we express it in MPC. Additionally, since we are doing fixed point integer arithmetic as well as using a relatively small prime modulus for MPC (256 bits in our implementation), we need to reduce the scaling factors accumulated on the weights during plaintext-ciphertext matrix multiplication. We currently perform this operation inside MPC as well.
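The non-linearity that forces this step into MPC is the standard soft threshold (shrinkage) operator from the LASSO z-update, sketched here in plaintext:

```python
def soft_threshold(x, kappa):
    """S_kappa(x) = sign(x) * max(|x| - kappa, 0), the shrinkage
    operator evaluated inside MPC during coordination."""
    if x > kappa:
        return x - kappa
    if x < -kappa:
        return x + kappa
    return 0.0

assert soft_threshold(3.0, 1.0) == 2.0
assert soft_threshold(-3.0, 1.0) == -2.0
assert soft_threshold(0.5, 1.0) == 0.0
```

Because the operator branches on the sign and magnitude of its input, it cannot be expressed with the additive homomorphism alone, which is why Helen switches to SPDZ for this step.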

6.4.3 Conversion from MPC

After the MPC computation, each party receives shares of z^{k+1} and its MAC shares, as well as shares of u_i^{k+1} and its MAC shares. It is easy to convert these shares back into encrypted form simply by encrypting the shares, publishing them, and summing up the encrypted shares; this also yields an encryption of the coordination output. Each party also publishes an interval proof of plaintext knowledge for each published ciphertext. Finally, in order to verify that these are indeed valid SPDZ shares (the specific protocol is explained in the next section), each party also publishes encryptions and interval proofs of all the MACs.

6.5 Model release

6.5.1 MPC conversion verification

Since we are combining two protocols (homomorphic encryption and MPC), an attacker can attempt to alter the inputs to either protocol by using different or inconsistent attacker-chosen inputs. Therefore, before releasing the model, the parties must prove that they correctly executed the ciphertext to MPC conversion (and vice versa). We use another gadget to achieve this.


Gadget 4.

Given public parameters: an encrypted value Enc(a), encrypted SPDZ input shares Enc(a_i), encrypted SPDZ MAC shares Enc(γ_i), and interval proofs of plaintext knowledge, verify that

  1. a ≡ Σ_i a_i mod p, and

  2. the a_i are valid SPDZ shares and the γ_i are valid MACs on a.

Gadget usage We apply Gadget 4 to all data that needs to be converted from Paillier ciphertexts to SPDZ or vice versa. More specifically, we need to prove that (1) the SPDZ input shares are consistent with the encrypted values published by each party, and (2) the SPDZ shares for z^{k+1} and u_i^{k+1} are authenticated by the MACs.

Protocol The gadget construction proceeds as follows:

  1. Each party verifies that Enc(a), the Enc(a_i), and the Enc(γ_i) pass the interval proofs of knowledge; for example, the shares and MACs need to lie within the expected range.

  2. Each party homomorphically computes Enc(Σ_i a_i), as well as Enc(Σ_i γ_i).

  3. Each party i randomly chooses r_i ∈ [0, 2^s], where s is again a statistical security parameter, and publishes Enc(r_i · p) as well as an interval proof of plaintext knowledge.

  4. Each party calculates Enc(d) = Enc(a − Σ_i a_i + Σ_i r_i · p). Here we assume that the plaintext inside does not overflow the Paillier plaintext space.

  5. All parties participate in a joint decryption protocol to decrypt Enc(d), obtaining d.

  6. Every party individually checks that d is a multiple of p. If this is not the case, abort the protocol.

  7. The parties release the SPDZ global MAC key α.

  8. Each party calculates Enc(α · a) and Enc(Σ_i γ_i).

  9. Use the same method as in steps 2–6 to prove that α · a ≡ Σ_i γ_i mod p.

The above protocol is a way for parties to verify two things. First, that the SPDZ shares indeed match a previously published encrypted value (i.e., Gadget 3 was executed correctly). Second, that the shares are valid SPDZ shares. The second step simply verifies the original SPDZ relation among the value shares, the MAC shares, and the global key.
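The SPDZ relation being verified in the second step is that the MAC shares of a value reconstruct to the global key times the value. A plaintext sketch (the sharing helper and parameters are illustrative, not Helen's implementation):

```python
import secrets

p = 2**61 - 1   # stand-in for the SPDZ prime modulus
m = 3           # number of parties

def share(x):
    """Additively share x mod p among m parties."""
    parts = [secrets.randbelow(p) for _ in range(m - 1)]
    parts.append((x - sum(parts)) % p)
    return parts

alpha = secrets.randbelow(p - 1) + 1   # nonzero global MAC key
a = 42
a_shares = share(a)
mac_shares = share(alpha * a % p)      # shares of the MAC alpha * a

# Validity check once alpha is released: the MAC shares must
# reconstruct to alpha times the reconstructed value, mod p.
assert sum(mac_shares) % p == alpha * (sum(a_shares) % p) % p

# A tampered share breaks the relation (since alpha != 0 and p is prime).
a_shares[0] = (a_shares[0] + 1) % p
assert sum(mac_shares) % p != alpha * (sum(a_shares) % p) % p
```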

Note that we cannot verify these relations by simply releasing the plaintext data shares and their MACs since the data shares correspond to the intermediate weights. Furthermore, the shares need to be equivalent modulo p, which is different from the Paillier plaintext modulus N. Therefore, we use an alternative protocol to test modulo equality between two ciphertexts, which is the procedure described above in steps 2 to 6.

Since the encrypted values come with interval proofs of plaintext knowledge, we can assume that they lie within a bounded range. If two ciphertexts encrypt plaintexts that are equivalent modulo p, then their difference must be a multiple of p. We can then run the decryption protocol to test that the difference is indeed a multiple of p.

If the two plaintexts are indeed congruent, simply releasing their difference could still reveal extra information about the shares. Therefore, all parties must each add a random multiple of p to mask the difference. In step 3, the r_i's are generated independently by all parties, which means that there must be at least one honest party who is indeed generating a random number within the range. The resulting plaintext thus statistically hides the true difference with statistical parameter s. If the two plaintexts are not congruent modulo p, then the protocol reveals their masked difference. This is safe because this difference is only revealed when an adversary misbehaves and alters its inputs, and the result is independent of the honest parties' data.
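Steps 2–6 therefore amount to a masked divisibility test. Stripping away the encryption, the idea can be sketched as (values and parameters illustrative):

```python
import secrets

p = 2**61 - 1   # stand-in for the SPDZ modulus
s = 40          # statistical security parameter
m = 3           # number of parties

a = 10**9                 # value behind the first ciphertext
b = a + 7 * p             # congruent value behind the second ciphertext

# Steps 2-4: form the difference and add, from each party, a random
# multiple of p that statistically hides the difference itself.
masks = [secrets.randbelow(2**s) * p for _ in range(m)]
d_masked = (a - b) + sum(masks)

# Steps 5-6: joint decryption reveals only the masked difference;
# every party checks divisibility by p.
assert d_masked % p == 0

# A party that switched to an inconsistent value is caught.
b_bad = b + 1
assert ((a - b_bad) + sum(masks)) % p != 0
```

Because every mask is a multiple of p, masking preserves divisibility by p while hiding the magnitude of the underlying difference.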

6.5.2 Weight vector decryption

Once all SPDZ values are verified, all parties jointly decrypt the final weights. This can be done by first aggregating the encrypted shares of the weight vector into a single ciphertext. After this is done, the parties run the joint decryption protocol from [34] (without any party releasing its share of the private key). The decrypted final weights are released in plaintext to everyone.

7 Extensions to Other Models

Though we used LASSO as a running example, our techniques can be applied to other linear models like ordinary least squares regression, ridge regression, and elastic net. Here we describe the update rules for ridge regression, and leave the derivation to the reader.

Ridge regression solves a similar problem as LASSO, except with ℓ2 regularization. Given a dataset (X, y), where X is the feature matrix and y is the prediction vector, ridge regression optimizes arg min_w (1/2)‖Xw − y‖₂² + λ‖w‖₂². The local update takes the same form as in LASSO, while the coordination update is a linear operation (a scaling of the averaged weights) instead of the soft threshold function. Elastic net, which combines ℓ1 and ℓ2 regularization, can therefore be implemented by combining the regularization terms from LASSO and ridge regression.
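To see why a linear coordination update suffices, the following single-party plaintext sketch runs the ADMM loop under the objective (1/2)‖Xw − y‖² + λ‖w‖² (a scaling convention chosen for this illustration) and converges to the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4
rho, lam = 1.0, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = np.linalg.inv(X.T @ X + rho * np.eye(d))   # reused summary
b = X.T @ y                                    # reused summary

w = z = u = np.zeros(d)
for _ in range(300):
    w = A @ (b + rho * (z - u))          # local update, same form as LASSO
    z = rho * (w + u) / (2 * lam + rho)  # linear coordination update
    u = u + w - z                        # dual update

# Closed-form ridge solution for this objective: (X^T X + 2 lam I)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)
assert np.allclose(z, w_closed, atol=1e-6)
```

The z-update here is a fixed public scaling, so unlike the LASSO soft threshold it involves no branching on secret values.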

8 Evaluation

We implemented Helen in C++. We utilize the SPDZ library [1], a mature library for maliciously secure multi-party computation, for both the baseline and Helen. In our implementation, we apply the Fiat-Shamir heuristic to our zero-knowledge proofs [33]. This technique is commonly used in implementations because it makes the protocols non-interactive and thus more efficient, but it assumes the random oracle model.

We compare Helen’s performance to a maliciously secure baseline that trains using stochastic gradient descent, similar to SecureML [56]. Since SecureML only supports two parties in the semihonest setting, we implemented a similar baseline using SPDZ [28]. SecureML had a number of optimizations, but they were designed for the two-party setting; we did not extend them to the multi-party setting. We will refer to SGD implemented in SPDZ as the “secure baseline” (we explain more about the SGD training process in Section 8.1). Finally, we do not benchmark Helen’s Paillier key setup phase. This phase can be computed using SPDZ itself, and it is run only once (as long as the party configuration does not change).

8.1 Experiment setup

We ran our experiments on EC2 using r4.8xlarge instances. Each machine has 32 cores and 244 GiB of memory. In order to simulate a wide area network setting, we created EC2 instances in Oregon and Northern Virginia, split equally across the two regions. To evaluate Helen’s scalability, we used synthetic datasets constructed by drawing samples from a noisy normal distribution. For these datasets, we varied both the dimension and the number of parties. To evaluate Helen’s performance against the secure baseline, we benchmarked both systems on two real world datasets from UCI.


Training assumptions. We do not tackle hyperparameter tuning in our work, and we assume that the data has been normalized before training. We also use a fixed number of rounds for ADMM training, which we found experimentally using the real world datasets: a small number of rounds is often enough for the training process to converge to a reasonable error rate. Recall that ADMM converges in a small number of rounds because it iterates on a summary of the entire dataset. In contrast, SGD iteratively scans data from all parties at least once in order to get an accurate representation of the underlying distributions. This is especially important when certain features occur rarely in a dataset. Since the dataset is very large, even one pass already results in many rounds.

MPC configuration. As mentioned earlier, SPDZ has two phases of computation: an offline phase and an online phase. The offline phase can run independently of the secure function, but the precomputed values cannot be reused across multiple online phases. The SPDZ library provides several ways of benchmarking different offline phases, including MASCOT [48] and Overdrive [49]. We tested both schemes and found Overdrive to perform better over the wide area network. Since these are for benchmarking purposes only, we decided to estimate the SPDZ offline phase by dividing the number of triplets needed for a circuit by the benchmarked throughput. The rest of the evaluation section will use the estimated numbers for all SPDZ offline computation. Since Helen uses parallelism, we also utilized parallelism in the SPDZ offline generation by matching the number of threads on each machine to the number of cores available.

On the other hand, the SPDZ online implementation is not parallelized because the API was insufficient to effectively express parallelism. We note two points. First, while parallelizing the SPDZ library will result in a faster baseline, Helen also utilizes SPDZ, so any improvement to SPDZ also carries over to Helen. Second, as shown below, our evaluation shows that Helen still achieves significant performance gains over the baseline even if the online phase in the secure baseline is infinitely fast.

Finally, the parameters we use for Helen are: 128 bits for the secure baseline’s SPDZ configuration, 256 bits for the Helen SPDZ configuration, and 4096 bits for Helen’s Paillier ciphertext.

8.2 Theoretic performance

Baseline Secure SGD
Helen SVD decomposition
SVD proofs
MPC offline
Model compute
Table 1: Theoretical scaling (complexity analysis) for the SGD baseline and Helen. m is the number of parties, n is the number of samples per party, and d is the dimension.

Table 1 shows the theoretical scaling behavior for SGD and Helen, where m is the number of parties, n is the number of samples per party, and d is the dimension; the remaining factors are constants that are not necessarily the same across the different rows of the table. We split Helen’s input preparation phase into three sub-components: SVD (calculated in plaintext), SVD proofs, and MPC offline (since Helen uses SPDZ during the model compute phase, we also need to run the SPDZ offline phase).

SGD scales linearly in n and d: if the number of samples per party is doubled, the number of iterations is also doubled, and a similar argument applies to d. SGD scales quadratically in m because it first scales linearly in m due to the behavior of the MPC protocol; additionally, if we add more parties to the computation, the number of samples also increases, which in turn increases the number of iterations needed to scan the entire dataset.

Helen, on the other hand, scales linearly in n only for the SVD computation. We emphasize that the SVD is very fast because it is executed on plaintext data. One term of the SVD proofs cost scales linearly in m because each party needs to verify the proofs from every other party, and it also grows with d because of the per-proof verification work. Another term scales cubically in d because our matrices are d × d, and calculating a resulting encrypted matrix requires a matrix multiplication of two d × d matrices.

The coordination step of Helen’s model compute phase, as well as the corresponding MPC offline compute phase, scales quadratically in m because we need to use MPC to re-scale the weight vectors from each party. The model compute phase additionally has a term reflecting the matrix multiplications and the associated proofs, while the remaining MPC conversion proofs scale linearly in m and d.

(a) Helen’s scaling as we increase the number of dimensions. The number of parties is fixed to 4, and the number of samples per party is fixed.
(b) Helen’s two phases as we increase the number of parties. The dimension and the number of samples per party are fixed.
Figure 3: Helen scalability measurements.

8.3 Synthetic datasets

Samples per party 2000 4000 6000 8000 10K 40K 100K 200K 400K 800K 1M
sklearn L2 error 8937.01 8928.32 8933.64 8932.97 8929.10 8974.15 8981.24 8984.64 8982.88 8981.11 8980.35
Helen L2 error 8841.33 8839.96 8828.18 8839.56 8837.59 8844.31 8876.00 8901.84 8907.38 8904.11 8900.37
sklearn MAE 57.89 58.07 58.04 58.10 58.05 58.34 58.48 58.55 58.58 58.56 58.57
Helen MAE 57.23 57.44 57.46 57.44 57.47 57.63 58.25 58.38 58.36 58.37 58.40
Table 2: Select errors for the gas sensor data (due to space), comparing Helen with a baseline that uses sklearn to train on all plaintext data. L2 error is the squared ℓ2 norm; MAE is the mean absolute error. Errors are calculated after post-processing.
Samples per party 1000 2000 4000 6000 8000 10K 20K 40K 60K 80K 100K
sklearn L2 error 92.43 91.67 90.98 90.9 90.76 90.72 90.63 90.57 90.55 90.56 90.55
Helen L2 error 93.68 91.8 91.01 90.91 90.72 90.73 90.67 90.57 90.54 90.57 90.55
sklearn MAE 6.86 6.81 6.77 6.78 6.79 6.81 6.80 6.79 6.79 6.80 6.80
Helen MAE 6.92 6.82 6.77 6.78 6.79 6.81 6.80 6.79 6.80 6.80 6.80
Table 3: Errors for song prediction, comparing Helen with a baseline that uses sklearn to train on all plaintext data. L2 error is the squared ℓ2 norm; MAE is the mean absolute error. Errors are calculated after post-processing.

We want to answer two questions about Helen’s scalability using synthetic datasets: how does Helen scale as we vary the number of features and how does it scale as we vary the number of parties? Note that we are not varying the number of input samples because that will be explored in Section 8.4 in comparison to the secure SGD baseline.

Fig. 3(a) shows a breakdown of Helen’s cryptographic computation as we scale the number of dimensions. The plaintext SVD computation is not included in the graph. The SVD proofs phase is dominated by the matrix multiplication proofs, whose cost grows fastest in d. The MPC offline phase and the model compute phase are both dominated by the linear scaling in d, which corresponds to the MPC conversion proofs.

Fig. 3(b) shows the same three phases as we increase the number of parties. The SVD proofs phase scales linearly in the number of parties m. The MPC offline phase scales quadratically in m, but its effects are not very visible for a small number of parties. The model compute phase is dominated by the linear scaling in d because the quadratic scaling factor in m is likewise not very visible for a small number of parties.

Finally, we also ran a microbenchmark to understand Helen’s network and compute costs. The experiment used 4 servers and a synthetic dataset with 50 features and 100K samples per party. We found that the network costs account for approximately 2% of the input preparation phase and 22% of Helen’s model compute phase.

8.4 Real world datasets

We evaluate on two different real world datasets: gas sensor data [30] and the million song dataset [9, 30]. The gas sensor dataset records 16 sensor readings when mixing two types of gases. Since the two gases are mixed with random concentration levels, the two regression variables are independent and we can simply run two different regression problems (one for each gas type). For the purpose of benchmarking, we ran an experiment using the ethylene data in the first dataset. The million song dataset is used for predicting a song’s published year using 90 features. Since regression problems produce real values, the year can be calculated by rounding the regressed value.

For SGD, we set the batch size to be the same size as the dimension of the dataset. The number of iterations is equal to the total number of sample points divided by the batch size. Unfortunately, we had to extrapolate the runtimes for a majority of the baseline online phases because the circuits were too big to compile on our EC2 instances.

Fig. 4 and Fig. 5 compare Helen to the baseline on the two datasets. Note that Helen’s input preparation graph combines the three phases that are run during this stage: plaintext SVD computation, SVD proofs, and MPC offline generation. We can see that Helen’s input preparation phase scales very slowly with the number of samples. The scaling comes from the plaintext SVD calculation, since neither the SVD proofs nor the MPC offline generation scales with the number of samples. Helen’s model compute phase also stays constant because we fixed the number of iterations to a conservative estimate. SGD, on the other hand, scales linearly with the number of samples in both the offline and the online phases.

Figure 4: Helen and baseline performance on the gas sensor data. The gas sensor data contained over 4 million data points; we partitioned it into 4 partitions with a varying number of sample points per partition to simulate the varying number of samples per party. The number of parties is 4, and the number of dimensions is 16.
Figure 5: Helen and baseline performance on the song prediction data, as we vary the number of samples per party. The number of parties is 4, and the number of dimensions is 90.
Figure 6: Helen comparison with SGD

For the gas sensor dataset, Helen’s total runtime (input preparation plus model compute) is able to achieve a 21.5x performance gain over the baseline’s total runtime (offline plus online) when the number of samples is 1000. When the number of samples per party reaches 1 million, Helen is able to improve over the baseline by 20689x. For the song prediction dataset, Helen is able to have a 9.1x performance gain over the baseline when the number of samples is 1000. When the number of samples per party reaches 100K, Helen improves over the baseline by 911x. Even if we compare Helen to the baseline’s offline phase only, we find that Helen still has close to constant scaling while the baseline scales linearly with the number of samples. The performance improvement compared to the baseline offline phase is up to 1540x for the gas sensor dataset and up to 98x for the song prediction dataset.

In Table 2 and Table 3, we evaluate Helen’s test errors on the two datasets. We compare the L2 and mean average error for Helen to the errors obtained from a model trained using sklearn (a standard Python library for machine learning) on the plaintext data. We did not directly use the SGD baseline because its online phase does not compile for larger instances, and using sklearn on the plaintext data is a conservative estimate. We can see that Helen achieves similar errors compared to the sklearn baseline.

Work Functionality n-party? Malicious security? Practical?
Nikolaenko et al. [58] ridge regression no no
Hall et al. [43] linear regression yes no
Gascon et al. [36] linear regression no no
Cock et al. [21] linear regression no no
Giacomelli et al. [37] ridge regression no no
Alexandru et al. [5] quadratic opt. no no
SecureML [56]

linear, logistic, deep learning

no no
Shokri&Shmatikov [68] deep learning not MPC (heuristic) no
Semi-honest MPC [7] any function yes no
Malicious MPC [28, 39, 11, 2] any function yes yes no
Our proposal, Helen: regularized linear models yes yes yes
Figure 7: Insufficiency of existing cryptographic approaches. “n-party” refers to whether the n ≥ 2 organizations can perform the computation with equal trust (thus excluding the two non-colluding servers model). We answer the practicality question only for maliciously-secure systems. We note that a few works that we marked as neither coopetitive nor maliciously secure discuss at a high level how one might extend their work to such a setting, but they did not flesh out designs or evaluate their proposals.

9 Related work

We organize the related work section into related coopetitive systems and attacks.

9.1 Coopetitive systems

Coopetitive training systems

In Fig. 7, we compare Helen to prior coopetitive training systems [58, 43, 36, 21, 37, 5, 56, 68]. The main takeaway is that, excluding generic maliciously secure MPC, prior training systems do not provide malicious security. Furthermore, most of them also assume that the training process requires outsourcing to two non-colluding servers. At the same time, and as a result of choosing a weaker security model, some of these systems provide richer functionality than Helen, such as support for neural networks. As part of our future work, we are exploring how to apply Helen’s techniques to logistic regression and neural networks.

Other coopetitive systems Other than coopetitive training systems, there are prior works on building coopetitive systems for applications like machine learning prediction and SQL analytics. Coopetitive prediction systems [13, 64, 62, 53, 38, 47] typically consist of two parties, where one party holds a model and the other party holds an input. The two parties jointly compute a prediction without revealing the input or the model to the other party. Coopetitive analytics systems [6, 57, 12, 22, 10] allow multiple parties to run SQL queries over all parties’ data. These computation frameworks do not directly translate to Helen’s training workloads. Most of these works also do not address the malicious setting.

Trusted hardware based systems The related work presented in the previous two sections all utilize purely software based solutions. Another possible approach is to use trusted hardware [55, 23], and there are various secure distributed systems that could be extended to the coopetitive setting [66, 44, 73]. However, these hardware mechanisms require additional trust and are prone to side-channel leakages [51, 70, 52].

9.2 Attacks on machine learning

Machine learning attacks can be categorized into data poisoning, model leakage, parameter stealing, and adversarial learning. As mentioned in §3.1, Helen tackles the problem of cryptographically running the training algorithm without sharing datasets amongst the parties involved, while defenses against these attacks are orthogonal and complementary to our goal in this paper. Often, these machine learning attacks can be separately addressed outside of Helen. We briefly discuss two relevant attacks related to the training stage and some methods for mitigating them.

Poisoning Data poisoning allows an attacker to inject poisoned inputs into a dataset before training [46, 19]. Generally, malicious MPC does not prevent an attacker from supplying incorrect initial inputs because there is no way to enforce this requirement. Nevertheless, there are some ways of mitigating arbitrary poisoning of data that would complement Helen’s training approach. Before training, one can check that the inputs are confined within certain intervals. The training process itself can also execute cross-validation, a process that can identify parties that do not contribute useful data. After training, it is possible to further post-process the model via techniques like fine-tuning and parameter pruning [54].
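As an illustration of the interval check and the cross-validation diagnostic described above, the following plaintext NumPy sketch trains a ridge model while leaving out each party in turn and scores the result on a held-out set. The helper names and the leave-one-party-out scheme are our own illustrative choices, not part of Helen’s protocol, which would run such checks under MPC.

```python
import numpy as np

def check_bounds(X, lo, hi):
    # Pre-training sanity check: every feature value lies in the agreed interval.
    return bool(np.all((X >= lo) & (X <= hi)))

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def leave_one_party_out(parties, X_val, y_val, lam=1.0):
    # Retrain without each party and report held-out MSE; a party whose
    # removal markedly *improves* the score contributes little useful
    # (or possibly poisoned) data.
    errors = {}
    for name in parties:
        X = np.vstack([Xp for n, (Xp, _) in parties.items() if n != name])
        y = np.concatenate([yp for n, (_, yp) in parties.items() if n != name])
        w = ridge_fit(X, y, lam)
        errors[name] = float(np.mean((X_val @ w - y_val) ** 2))
    return errors

# Synthetic example: party A has clean data, party B has flipped labels.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
X1 = rng.normal(size=(60, 3)); y1 = X1 @ w_true
X2 = rng.normal(size=(60, 3)); y2 = -(X2 @ w_true)   # poisoned labels
X_val = rng.normal(size=(40, 3)); y_val = X_val @ w_true
errs = leave_one_party_out({"A": (X1, y1), "B": (X2, y2)}, X_val, y_val, lam=0.1)
```

In this example, excluding the poisoned party B yields a much lower held-out error than excluding A, flagging B’s data as suspect.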

Model leakage Model leakage [67, 17] is an attack launched by an adversary who tries to infer information about the training data from the model itself. Again, malicious MPC does not prevent an attacker from learning the final result. In our coopetitive model, we also assume that all parties want to cooperate and have agreed to release the final model to everyone.

9.3 Differential privacy

One way to alleviate model leakage is through the use of differential privacy [45, 4, 32]. For example, one can add carefully chosen noise directly to the output model [45]. Each party’s noise can be chosen using MPC, and the aggregate noise can then be added to the final model before release. In Helen, differential privacy would be applied after the model is computed, but before the model release phase. However, there are more complex differential privacy techniques that modify the training algorithm itself, and integrating them into Helen is an interesting direction for future work.
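A minimal plaintext sketch of this output-perturbation idea is shown below. It assumes a separately derived L1-sensitivity bound for the model; in Helen, the noise shares would instead be generated inside MPC before the release phase. The helper name and parameters are illustrative.

```python
import numpy as np

def release_with_dp(weights, epsilon, sensitivity, seed=0):
    # Output perturbation: add Laplace noise with scale sensitivity/epsilon
    # to each coefficient before releasing the model. `sensitivity` is an
    # assumed, separately derived bound on the model's L1 sensitivity to
    # any single training record.
    rng = np.random.default_rng(seed)
    noise = rng.laplace(0.0, sensitivity / epsilon, size=weights.shape)
    return weights + noise

w = np.array([0.8, -1.5, 2.1])          # trained model coefficients
w_priv = release_with_dp(w, epsilon=1.0, sensitivity=0.05)
```

Smaller epsilon (stronger privacy) or larger sensitivity increases the noise scale, trading model accuracy for privacy.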

10 Conclusion

In this paper, we propose Helen, a coopetitive system for training linear models. Compared to prior work, Helen assumes a stronger threat model by defending against malicious participants. This means that each party only needs to trust itself. Compared to a baseline implemented with a state-of-the-art malicious framework, Helen is able to achieve up to five orders of magnitude of performance improvement. Given the lack of efficient maliciously secure training protocols, we hope that our work on Helen will lead to further work on efficient systems with such strong security guarantees.

11 Acknowledgment

We thank the anonymous reviewers for their valuable reviews, as well as Shivaram Venkataraman, Stephen Tu, and Akshayaram Srinivasan for their feedback and discussions. This research was supported by NSF CISE Expeditions Award CCF-1730628, as well as gifts from the Sloan Foundation, Hellman Fellows Fund, Alibaba, Amazon Web Services, Ant Financial, Arm, Capital One, Ericsson, Facebook, Google, Huawei, Intel, Microsoft, Scotiabank, Splunk and VMware.


  • [1] bristolcrypto/spdz-2: Multiparty computation with SPDZ, MASCOT, and Overdrive offline phases. https://github.com/bristolcrypto/SPDZ-2. Accessed: 2018-10-31.
  • [2] VIFF, the Virtual Ideal Functionality Framework. http://viff.dk/, 2015.
  • [3] Health insurance portability and accountability act, April 2000.
  • [4] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), ACM, pp. 308–318.
  • [5] Alexandru, A. B., Gatsis, K., Shoukry, Y., Seshia, S. A., Tabuada, P., and Pappas, G. J. Cloud-based quadratic optimization with partially homomorphic encryption. arXiv preprint arXiv:1809.02267 (2018).
  • [6] Bater, J., Elliott, G., Eggen, C., Goel, S., Kho, A., and Rogers, J. Smcql: secure querying for federated databases. Proceedings of the VLDB Endowment 10, 6 (2017), 673–684.
  • [7] Ben-David, A., Nisan, N., and Pinkas, B. Fairplaymp: a system for secure multi-party computation. www.cs.huji.ac.il/project/Fairplay/FairplayMP.html, 2008.
  • [8] Ben-Or, M., Goldwasser, S., and Wigderson, A. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In Proceedings of the twentieth annual ACM symposium on Theory of computing (1988), ACM, pp. 1–10.
  • [9] Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. The million song dataset. In Ismir (2011), vol. 2, p. 10.
  • [10] Bittau, A., Erlingsson, U., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnes, J., and Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles (2017), ACM, pp. 441–459.
  • [11] Bogdanov, D., Laur, S., and Willemson, J. Sharemind: A Framework for Fast Privacy-Preserving Computations. 2008.
  • [12] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), CCS ’17.
  • [13] Bost, R., Popa, R. A., Tu, S., and Goldwasser, S. Machine learning classification over encrypted data. In Network and Distributed System Security Symposium (NDSS) (2015).
  • [14] Boudot, F. Efficient proofs that a committed number lies in an interval. In International Conference on the Theory and Applications of Cryptographic Techniques (2000), Springer, pp. 431–444.
  • [15] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. In Foundations and Trends in Machine Learning, Vol. 3, No. 1 (2010).
  • [16] Canetti, R. Security and composition of cryptographic protocols: a tutorial (part i). ACM SIGACT News 37, 3 (2006), 67–92.
  • [17] Carlini, N., Liu, C., Kos, J., Erlingsson, Ú., and Song, D. The secret sharer: Measuring unintended neural network memorization & extracting secrets. arXiv preprint arXiv:1802.08232 (2018).
  • [18] Chen, H., and Xiang, Y. The study of credit scoring model based on group lasso. Procedia Computer Science 122 (2017), 677 – 684. 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.
  • [19] Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017).
  • [20] Cleve, R. Limits on the security of coin flips when half the processors are faulty. In Proceedings of the eighteenth annual ACM symposium on Theory of computing (1986), ACM, pp. 364–369.
  • [21] Cock, M. d., Dowsley, R., Nascimento, A. C., and Newman, S. C. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (AISec).
  • [22] Corrigan-Gibbs, H., and Boneh, D. Prio: Private, robust, and scalable computation of aggregate statistics. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (2017).
  • [23] Costan, V., and Devadas, S. Intel sgx explained. IACR Cryptology ePrint Archive 2016 (2016), 86.
  • [24] Cramer, R., Damgård, I., and Nielsen, J. Multiparty computation from threshold homomorphic encryption. EUROCRYPT 2001 (2001), 280–300.
  • [25] Damgård, I. Efficient concurrent zero-knowledge in the auxiliary string model. In International Conference on the Theory and Applications of Cryptographic Techniques (2000), Springer, pp. 418–430.
  • [26] Damgård, I. On -protocols. Lecture Notes, University of Aarhus, Department for Computer Science (2002).
  • [27] Damgård, I., and Jurik, M. Client/server tradeoffs for online elections. In International Workshop on Public Key Cryptography (2002), Springer, pp. 125–140.
  • [28] Damgård, I., Pastro, V., Smart, N., and Zakarias, S. Multiparty computation from somewhat homomorphic encryption. In Advances in Cryptology–CRYPTO 2012. Springer, 2012, pp. 643–662.
  • [29] D’Angelo, G. M., Rao, D. C., and Gu, C. C. Combining least absolute shrinkage and selection operator (lasso) and principal-components analysis for detection of gene-gene interactions in genome-wide association studies. In BMC proceedings (2009).
  • [30] Dheeru, D., and Karra Taniskidou, E. UCI machine learning repository, 2017.
  • [31] Dictionaries, E. O. Coopetition.
  • [32] Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy, data processing inequalities, and statistical minimax rates. arXiv preprint arXiv:1302.3203 (2013).
  • [33] Faust, S., Kohlweiss, M., Marson, G. A., and Venturi, D. On the non-malleability of the fiat-shamir transform. In International Conference on Cryptology in India (2012), Springer, pp. 60–79.
  • [34] Fouque, P.-A., Poupard, G., and Stern, J. Sharing decryption in the context of voting or lotteries. In International Conference on Financial Cryptography (2000), Springer, pp. 90–104.
  • [35] Garay, J. A., MacKenzie, P., and Yang, K. Strengthening zero-knowledge protocols using signatures. In Eurocrypt (2003), vol. 2656, Springer, pp. 177–194.
  • [36] Gascón, A., Schoppmann, P., Balle, B., Raykova, M., Doerner, J., Zahur, S., and Evans, D. Privacy-preserving distributed linear regression on high-dimensional data. Cryptology ePrint Archive, Report 2016/892, 2016.
  • [37] Giacomelli, I., Jha, S., Joye, M., Page, C. D., and Yoon, K. Privacy-preserving ridge regression with only linearly-homomorphic encryption. Cryptology ePrint Archive, Report 2017/979, 2017. https://eprint.iacr.org/2017/979.
  • [38] Gilad-Bachrach, R., Dowlin, N., Laine, K., Lauter, K., Naehrig, M., and Wernsing, J. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning (2016), pp. 201–210.
  • [39] Goldreich, O., Micali, S., and Wigderson, A. How to play any mental game. In Proceedings of the nineteenth annual ACM symposium on Theory of computing (1987), ACM, pp. 218–229.
  • [40] Golub, G. H., and Van Loan, C. F. Matrix computations, vol. 3. JHU Press, 2012.
  • [41] Groth, J. Homomorphic trapdoor commitments to group elements. IACR Cryptology ePrint Archive 2009 (2009), 7.
  • [42] Halevy, A., Norvig, P., and Pereira, F. The unreasonable effectiveness of data. IEEE Intelligent Systems 24, 2 (Mar. 2009), 8–12.
  • [43] Hall, R., Fienberg, S. E., and Nardi, Y. Secure multiple linear regression based on homomorphic encryption. In Journal of Official Statistics (2011).
  • [44] Hunt, T., Zhu, Z., Xu, Y., Peter, S., and Witchel, E. Ryoan: A distributed sandbox for untrusted computation on secret data. In OSDI (2016), pp. 533–549.
  • [45] Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., and Wang, L. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), IEEE.
  • [46] Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., and Li, B. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. arXiv preprint arXiv:1804.00308 (2018).
  • [47] Juvekar, C., Vaikuntanathan, V., and Chandrakasan, A. Gazelle: A low latency framework for secure neural network inference. CoRR abs/1801.05507 (2018).
  • [48] Keller, M., Orsini, E., and Scholl, P. Mascot: faster malicious arithmetic secure computation with oblivious transfer. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), ACM, pp. 830–842.
  • [49] Keller, M., Pastro, V., and Rotaru, D. Overdrive: making spdz great again. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (2018), Springer, pp. 158–189.
  • [50] Kidd, A. C., McGettrick, M., Tsim, S., Halligan, D. L., Bylesjo, M., and Blyth, K. G. Survival prediction in mesothelioma using a scalable lasso regression model: instructions for use and initial performance using clinical predictors. BMJ Open Respiratory Research 5, 1 (2018).
  • [51] Kocher, P., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. Spectre attacks: Exploiting speculative execution. arXiv preprint arXiv:1801.01203 (2018).
  • [52] Lee, S., Shih, M.-W., Gera, P., Kim, T., Kim, H., and Peinado, M. Inferring fine-grained control flow inside sgx enclaves with branch shadowing. In 26th USENIX Security Symposium, USENIX Security (2017), pp. 16–18.
  • [53] Liu, J., Juuti, M., Lu, Y., and Asokan, N. Oblivious neural network predictions via minionn transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), ACM, pp. 619–631.
  • [54] Liu, K., Dolan-Gavitt, B., and Garg, S. Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185 (2018).
  • [55] McKeen, F., Alexandrovich, I., Berenzon, A., Rozas, C. V., Shafi, H., Shanbhogue, V., and Savagaonkar, U. R. Innovative instructions and software model for isolated execution. HASP@ ISCA 10 (2013).
  • [56] Mohassel, P., and Zhang, Y. Secureml: A system for scalable privacy-preserving machine learning. IACR Cryptology ePrint Archive 2017 (2017), 396.
  • [57] Narayan, A., and Haeberlen, A. Djoin: Differentially private join queries over distributed databases. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (2012), OSDI’12.
  • [58] Nikolaenko, V., Weinsberg, U., Ioannidis, S., Joye, M., Boneh, D., and Taft, N. Privacy-preserving ridge regression on hundreds of millions of records. In Security and Privacy (SP), 2013 IEEE Symposium on (2013), IEEE, pp. 334–348.
  • [59] Nikolaenko, V., Weinsberg, U., Ioannidis, S., Joye, M., Boneh, D., and Taft, N. Privacy-preserving ridge regression on hundreds of millions of records. In Security and Privacy (SP), 2013 IEEE Symposium on (2013), IEEE, pp. 334–348.
  • [60] Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT (1999), pp. 223–238.
  • [61] Papachristou, C., Ober, C., and Abney, M. A lasso penalized regression approach for genome-wide association analyses using related individuals: application to the genetic analysis workshop 19 simulated data. BMC Proceedings 10, 7 (Oct 2016), 53.
  • [62] Riazi, M. S., Weinert, C., Tkachenko, O., Songhori, E. M., Schneider, T., and Koushanfar, F. Chameleon: A hybrid secure computation framework for machine learning applications. Cryptology ePrint Archive, Report 2017/1164, 2017. https://eprint.iacr.org/2017/1164.
  • [63] Robbins, H., and Monro, S. A stochastic approximation method. In Herbert Robbins Selected Papers. Springer, 1985, pp. 102–109.
  • [64] Rouhani, B. D., Riazi, M. S., and Koushanfar, F. Deepsecure: Scalable provably-secure deep learning. CoRR abs/1705.08963 (2017).
  • [65] Roy, S., Mittal, D., Basu, A., and Abraham, A. Stock market forecasting using lasso linear regression model, 01 2015.
  • [66] Schuster, F., Costa, M., Fournet, C., Gkantsidis, C., Peinado, M., Mainar-Ruiz, G., and Russinovich, M. Vc3: Trustworthy data analytics in the cloud using sgx. In Security and Privacy (SP), 2015 IEEE Symposium on (2015), IEEE, pp. 38–54.
  • [67] Shmatikov, V., and Song, C. What are machine learning models hiding?
  • [68] Shokri, R., and Shmatikov, V. Privacy-preserving deep learning. In CCS (2015).
  • [69] Stoica, I., Song, D., Popa, R. A., Patterson, D., Mahoney, M. W., Katz, R., Joseph, A. D., Jordan, M., Hellerstein, J. M., Gonzalez, J. E., et al. A berkeley view of systems challenges for ai. arXiv preprint arXiv:1712.05855 (2017).
  • [70] Van Bulck, J., Minkin, M., Weisse, O., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Wenisch, T. F., Yarom, Y., and Strackx, R. Foreshadow: Extracting the keys to the intel sgx kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium. USENIX Association (2018).
  • [71] Wang, X., Ranellucci, S., and Katz, J. Global-scale secure multiparty computation. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), ACM, pp. 39–56.
  • [72] Yao, A. C. Protocols for secure computations. In Foundations of Computer Science, 1982. SFCS’08. 23rd Annual Symposium on (1982), IEEE, pp. 160–164.
  • [73] Zheng, W., Dave, A., Beekman, J. G., Popa, R. A., Gonzalez, J. E., and Stoica, I. Opaque: An oblivious and encrypted distributed analytics platform. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2017), pp. 283–298.

Appendix A ADMM derivations

Ridge regression solves a similar problem as LASSO, except with L2 regularization. Given a dataset $(\mathbf{X}, \mathbf{y})$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the feature matrix and $\mathbf{y} \in \mathbb{R}^{n}$ is the prediction vector, ridge regression optimizes $\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$. Splitting the weights into $\mathbf{w}$ and $\mathbf{z}$, we have

$$\min_{\mathbf{w}, \mathbf{z}} \; \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{z}\|_2^2 \quad \text{subject to} \quad \mathbf{w} = \mathbf{z}.$$

We first find the augmented Lagrangian

$$L_{\rho}(\mathbf{w}, \mathbf{z}, \boldsymbol{\nu}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{z}\|_2^2 + \boldsymbol{\nu}^{\top}(\mathbf{w} - \mathbf{z}) + \frac{\rho}{2} \|\mathbf{w} - \mathbf{z}\|_2^2,$$

where $\mathbf{w}$ and $\mathbf{z}$ are the primal weight vectors, and $\boldsymbol{\nu}$ is the dual weight vector. To simplify the equations, we replace $\boldsymbol{\nu}$ with the scaled dual variable $\mathbf{u}$, where $\mathbf{u} = \frac{1}{\rho}\boldsymbol{\nu}$. The update equations come out to

$$\begin{aligned}
\mathbf{w}^{k+1} &= (2\mathbf{X}^{\top}\mathbf{X} + \rho \mathbf{I})^{-1}\left(2\mathbf{X}^{\top}\mathbf{y} + \rho(\mathbf{z}^{k} - \mathbf{u}^{k})\right), \\
\mathbf{z}^{k+1} &= \frac{\rho}{2\lambda + \rho}\left(\mathbf{w}^{k+1} + \mathbf{u}^{k}\right), \\
\mathbf{u}^{k+1} &= \mathbf{u}^{k} + \mathbf{w}^{k+1} - \mathbf{z}^{k+1}.
\end{aligned}$$
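As a sanity check on this derivation, the ADMM iteration for the ridge formulation can be sketched in plaintext NumPy and compared against the closed-form ridge solution $(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y}$. This is an illustrative plaintext sketch only; in Helen, the analogous updates run over encrypted values with zero-knowledge proofs.

```python
import numpy as np

def ridge_admm(X, y, lam=1.0, rho=1.0, iters=200):
    # ADMM for ridge regression: minimize ||Xw - y||^2 + lam*||z||^2
    # subject to w = z, using the scaled dual variable u.
    d = X.shape[1]
    w, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    # The matrix in the w-update is constant across iterations,
    # so it can be inverted once up front.
    A_inv = np.linalg.inv(2 * X.T @ X + rho * np.eye(d))
    Xty2 = 2 * X.T @ y
    for _ in range(iters):
        w = A_inv @ (Xty2 + rho * (z - u))       # w-update
        z = rho / (2 * lam + rho) * (w + u)      # z-update
        u = u + w - z                            # dual update
    return z

# Compare against the closed-form ridge solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
w_admm = ridge_admm(X, y, lam=0.5, rho=1.0)
w_exact = np.linalg.solve(X.T @ X + 0.5 * np.eye(5), X.T @ y)
```

At the fixed point ($\mathbf{w} = \mathbf{z}$), the updates recover exactly the closed-form ridge solution, so `w_admm` and `w_exact` agree to numerical precision.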