Background
Introduction
Machine Learning (ML) has many applications in the biomedical domain, such as medical diagnosis and personalized medicine. Biomedical datasets are typically characterized by high dimensionality, i.e. a high number of features such as lab test results or gene expression values, and low sample size, i.e. a small number of training examples corresponding to e.g. patients or tissue samples. Adding to these challenges, valuable training data is often split between parties (data owners) who cannot openly share the data because of privacy regulations and concerns. Privacy-preserving solutions, using techniques such as secure Multi-Party Computation (MPC), therefore become important so that this data can still be used to train ML models, perform a diagnosis, and in some cases even derive genomic diagnoses [jagadeesh2017deriving].
We tackle the problem of training a binary classifier on high dimensional gene expression data held by different data owners, while keeping the training data private. This work is directly inspired by Track 4 of the iDASH 2019 secure genome analysis competition (http://www.humangenomeprivacy.org/2019/competitiontasks.html, accessed on Jan 19, 2020). The iDASH competition is a yearly international competition in which participants create and implement privacy-preserving protocols for applications with genomic data. The goal is to evaluate the best-known secure methods and to advance new techniques for solving real-world problems in handling genomic data. The 2019 edition had four tracks; Track 4 invited participants to design MPC solutions for collaborative training of ML models on data originating from multiple data owners. One of the competition datasets consists of 470 training examples (records) with 17,814 numeric features, while the other consists of 225 training examples with 12,634 numeric features. An initial 5-fold cross-validation analysis in the clear, i.e. without any encryption, indicated that in both cases logistic regression (LR) models are capable of yielding the level of prediction accuracy expected in the competition, prompting us to investigate MPC-based protocols for secure LR training.

The competition requirements implied the existence of multiple data owners who each send their training example(s) in an encrypted or secret shared form to data processors (computing nodes), as illustrated in Figure 1. The honest-but-curious data processors are not to learn anything about the data as they engage in computations and communications with each other. At the end, they disclose the trained classifier (in our case, the coefficients of the LR model) to the data owners. Since the data processors cannot learn anything about the values in the dataset, our protocol is applicable in a wide range of scenarios, independently of how the original data is split by ownership. Our protocol works in scenarios where the data is horizontally partitioned, i.e. when each data owner has different records of the data, such as data belonging to different patients. It also works in scenarios where the data is vertically partitioned, i.e. when each data owner has different features of the data, such as the expression values for different genes.
The main novelty points of our solution for private LR training over a distributed dataset are: (i) a new protocol for securely computing the activation function that avoids the use of full-fledged secure comparison protocols; (ii) a novel method for bit decomposing secret shared integers and bundling their instantiations; and (iii) several cryptographic engineering enhancements that together with the novel protocol for the activation function gave us the fastest privacy-preserving LR implementation in the world when run in local area networks (LANs). In summary, we designed a concrete solution for fast secure training of a binary classifier over gene expression data that meets the strict security requirements of the iDASH 2019 competition. For our largest dataset, we train a model that requires over 7 billion secure multiplications and the training completes in about 26.9 seconds in a LAN.
This paper significantly expands on a preliminary version of this result [PRIML2019], presented at a workshop without formal proceedings. This version adds a formal description of all protocols, security proofs and improved running times.
We first discuss below our work as compared to others. In the Section Methods, we present preliminary information on MPC, describe the secure subprotocols that are building blocks for our secure LR training protocol, and finally describe the protocol itself. In the Section Results we describe details of our implementation and runtime results for the overall protocol and microbenchmarks for our secure activation function protocol. In the Section Discussion, we note possible future work to improve and extend our results, and finally in the Section Conclusions we present our summary remarks.
Related Work
A variety of efforts have previously been made to train LR classifiers in a privacy-preserving way.
One scenario that was considered in previous works [bonte2018privacy, chen2018logistic, kim2018logistic] is the setting in which a data owner holds the data while another party (the data processor), such as a cloud service, is responsible for the model training. These solutions usually rely on homomorphic encryption, with the data owner encrypting and sending their data to the data processor who performs computations on the encrypted data without having to decrypt it.
When the data is held by multiple data owners, they can either execute an MPC protocol among themselves to train the model, or delegate the computation to a set of data processors that run an MPC protocol. It is the latter setting that we follow in this paper.
Existing MPC approaches to secure LR differ in the numerical optimization algorithms used for LR training and in the cryptographic primitives leveraged [el2012secure, mohassel2017secureml, nardi2012achieving, xie2016privlogit]. The SPARK protocol [el2012secure] uses additive homomorphic encryption (Paillier cryptosystem) and uses Newton-Raphson as the numerical optimization algorithm to find the values of the weights that maximize the log-likelihood. The SPARK protocol can use the actual logistic function without approximating it, at the cost of the plaintext data being horizontally partitioned and seen by the data processors. The two protocols from [nardi2012achieving] rely on the Newton-Raphson method, both approximate the logistic function, and both use additive secret sharing. The first protocol includes the use of Yao's garbled circuits to compute the approximation of the logistic function, while the second protocol uses a Taylor approximation and Euler's method. The PrivLogit method [xie2016privlogit] uses Yao's garbled circuits and Paillier encryption; their protocol uses the Newton-Raphson method and a constant Hessian approximation to speed up computation. However, this protocol relies on the plaintext data being horizontally partitioned and seen by the data processors, which, like the work in [el2012secure], would not align with the iDASH 2019 competition requirements. We also point out a protocol secure against active adversaries from SecureNN [wagh2019securenn] for computing a ReLU. While we compute a different function (clipped ReLU), we share the idea that the most significant bit of an input can tell us the output of the function.
The work closest to ours is SecureML [mohassel2017secureml], which was the fastest protocol for privately training LR models based on secure MPC prior to our work. SecureML separates the data owners from the data processors, and uses mini-batch gradient descent. The main novelty points of SecureML are a clipped ReLU activation function, a novel truncation protocol, and a combination of garbled circuits and secret sharing based MPC in order to obtain a good trade-off between communication, computation and round complexities. The SecureML protocol is evaluated on a dataset with up to 5,000 features, while, to the best of our knowledge, the existing runtime evaluation of all other approaches for MPC based LR training is limited to 400 features or less [el2012secure, nardi2012achieving, xie2016privlogit]. Like our solution, the SecureML protocol is split into an offline and an online phase (the offline phase can be executed before the inputs are known and is responsible for generating multiplication triples). The SecureML solution is based on two servers, while our solution is based on three servers, namely a party who pre-computes so-called multiplication triples in the offline stage, and two parties who actively compute the final result. If we exclude the preprocessing/offline stage from SecureML and exclude the predistribution of triples in our solution, we are left with protocols that work in exactly the same setting. We compare the runtime of both solutions in the Section Results.
A preliminary version of this work appeared in a workshop without formal proceedings [PRIML2019]. This paper is a substantially longer and more detailed description that includes security proofs, a detailed comparison with the state-of-the-art, and improved running times.
Methods
Logistic Regression
Logistic regression is a common Machine Learning algorithm for binary classification. The training data $D$ consists of training examples $d = (x, t)$, in which $x = (x_1, \ldots, x_n)$ is an $n$-dimensional numerical vector containing the values of the $n$ input attributes for example $d$, and $t \in \{0, 1\}$ is the ground truth class label. Each $x_i$ for $i = 1, \ldots, n$ is a real number.

As illustrated in Figure 2(a), we train a neuron to map the $x$'s to the corresponding $t$'s, correctly classifying the examples. The neuron computes a weighted sum of the inputs (the values of the weights are learned during training) and subsequently applies an activation function to it, to arrive at the output $y$. Note that, as is common in neural network training, we extend the input attribute vector with a dummy feature $x_0$ which has value 1 for all $x$'s. The traditionally used activation function for LR is the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$. Since the sigmoid function requires division and evaluation of an exponential function, which are expensive operations to perform in MPC, we approximate it with the activation function from [mohassel2017secureml], which is shown in Figure 2(b).

For training, we use the full gradient descent based algorithm shown in Protocol 1 to learn the weights for the LR model. On line 3, we choose not to use early stopping (a technique that uses a metric, such as the accuracy on a held-out validation data set, to detect when a model starts to overfit and stops training at that point) because in that case the number of iterations would depend on the values in the training data, hence leaking information [nardi2012achieving]. Instead, we use a fixed number of iterations during training.
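As a point of reference, the plaintext training procedure described above can be sketched as follows. This is a minimal illustration in the clear, not a transcription of Protocol 1: the function names, the learning rate, the averaging of the gradient, and the exact piecewise form of the activation (0 below -1/2, 1 above 1/2, linear in between, following the description in [mohassel2017secureml]) are assumptions made for the example.

```python
def clipped_activation(z):
    """Piecewise-linear stand-in for the sigmoid (assumed form): clip z + 1/2 to [0, 1]."""
    if z < -0.5:
        return 0.0
    if z > 0.5:
        return 1.0
    return z + 0.5


def train_lr(examples, labels, learning_rate=0.1, iterations=200):
    """Full-batch gradient descent with a fixed, data-independent number of iterations."""
    n = len(examples[0])
    weights = [0.0] * (n + 1)              # weights[0] pairs with the dummy feature 1
    for _ in range(iterations):            # no early stopping: iteration count is public
        delta = [0.0] * (n + 1)
        for x, t in zip(examples, labels):
            xs = [1.0] + list(x)           # prepend the dummy feature
            z = sum(w * v for w, v in zip(weights, xs))
            y = clipped_activation(z)
            for i, v in enumerate(xs):
                delta[i] += (t - y) * v    # accumulate the weight differentials
        weights = [w + learning_rate * d / len(examples)
                   for w, d in zip(weights, delta)]
    return weights


def predict(weights, x):
    """Classify as 1 when the activation output is at least 1/2."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1 if clipped_activation(z) >= 0.5 else 0
```

Because the loop bound is a constant, the running time and communication pattern of a secure version would not depend on the training data, which is the reason given above for avoiding early stopping.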
Our scenario
In the scenario considered in this work the data is not held by a single party that performs all the computation, but is distributed by the data owners to the data processors in such a way that no data processor has any information about the data in the clear. Nevertheless, the data processors would still like to compute an LR model without leaking any other information about the data used for the training. To achieve this goal, we will use techniques from MPC.
Our setup is illustrated in Figure 1. We have multiple data owners who each hold disjoint parts of the data that is going to be used for the training. This is the most general approach and covers the cases in which the data is horizontally partitioned (i.e. for each training sample, all the data for that sample is held by one of the data owners), vertically partitioned (for each feature, the values of that feature for all training samples are held by one of the data owners), and even arbitrary partitions. There are two data processors who collaborate to train an LR model using secure MPC protocols, and a trusted initializer (TI) that predistributes correlated randomness to the data processors in order to make the MPC computation more efficient. The TI is not involved in any other part of the execution, and does not learn any data from the data owners or data processors.
We next present the security model that is used and several secure building blocks, so that afterwards we can combine them in order to obtain a secure LR training protocol.
Security Model
The security model in which we analyze our protocol is the Universal Composability (UC) framework [FOCS:Canetti01] as it provides the strongest security and composability guarantees and is the gold standard for analyzing cryptographic protocols nowadays. Here we will only give a short overview of the UC framework (for the specific case of two-party protocols), and refer interested readers to the book of Cramer et al. [CDN2015] for a detailed explanation.
The main advantage of the UC framework is that the UC composition theorem guarantees that any protocol proven UC-secure can also be securely composed with other copies of itself and of other protocols (even with arbitrarily concurrent executions) while preserving its security. Such a guarantee is very useful since it allows the modular design of complex protocols, and is a necessity for protocols executing in complex environments such as the Internet.
The UC framework first considers a real world scenario in which the two protocol participants (the data processors from Figure 1, henceforth denoted Alice and Bob) interact between themselves and with an adversary and an environment (that captures all activity external to the single execution of the protocol that is under consideration). The environment gives the inputs to and gets the outputs from Alice and Bob. The adversary delivers the messages exchanged between Alice and Bob (thus modeling an adversarial network scheduling) and can corrupt one of the participants, in which case he gains control over it. In order to define security, an ideal world is also considered. In this ideal world, an idealized version of the functionality that the protocol is supposed to perform is defined. The ideal functionality receives the inputs directly from Alice and Bob, performs the computations locally following the primitive specification and delivers the outputs directly to Alice and Bob. A protocol executing in the real world is said to UC-realize an ideal functionality if for every adversary there exists a simulator such that no environment can distinguish between: (1) an execution of the protocol in the real world with participants Alice and Bob, and the adversary; and (2) an ideal execution with dummy parties (that only forward inputs/outputs), the ideal functionality, and the simulator.
Functionality (Figure 3: trusted initializer)
is parametrized by an algorithm for sampling the correlated randomness. Upon initialization, run this algorithm to sample a pair of correlated values, and deliver one to Alice and the other to Bob.
This work, like the vast majority of the privacy-preserving machine learning protocols in the literature, considers honest-but-curious, static adversaries. In more detail, the adversary chooses the party that he wants to corrupt before the protocol execution and he also follows the protocol instructions (but tries to learn additional information). We consider the trusted initializer model, in which a trusted initializer functionality (described in Figure 3) predistributes correlated randomness to Alice and Bob. (Using a setup assumption, like the trusted initializer, in the MPC protocol is a necessity in order to get UC-security [C:CanFis01, STOC:CLOS02]. Other possible setup assumptions to achieve UC-security include: a common reference string [C:CanFis01, STOC:CLOS02, C:PeiVaiWat08], the availability of a public-key infrastructure [FOCS:BCNP04], the random oracle model [TCC:HofMul04, EPRINT:BDDMN17b], the existence of noisy channels between the parties [SBSEG:DMN08, JIT:DGMN13], and the availability of tamper-proof hardware [EC:Katz07, ICITS:DowMulNil15].) A trusted initializer has often been used to enable highly efficient solutions both in the context of privacy-preserving machine learning [AISec:CDNN15, david2015efficient, fritchman2018, IEEETDSC:CDHK+17, NeurIPS2019] as well as in other applications, e.g., [r99, dowsley2010two, IEICE:DMOHIN11, ishai2013power, IJIS:TNDMIHO15, IEEEIFS:DDGM+16].
Simplifications: In our proofs the simulation strategy is simple and will be described briefly: all the messages look uniformly random from the recipient's point of view, except for the messages that open a secret shared value to a party, but these can be easily simulated using the output of the respective functionalities. Therefore a simulator, having the leverage of being able to simulate the trusted initializer functionality in the ideal world, can easily perform a perfect simulation of a real protocol execution, thereby making the real and ideal worlds indistinguishable for any environment. In the ideal functionalities the messages are public delayed outputs, meaning that the simulator is first asked whether they should be delivered or not (this is due to the modeling that the adversary controls the network scheduling). This fact as well as the session identifications are omitted from our functionalities' descriptions for the sake of readability.
Secret Sharing Based Secure Multi-Party Computation
Our MPC solution is based on additive secret sharing over a ring $\mathbb{Z}_q = \{0, 1, \ldots, q-1\}$. When secret sharing a value $x \in \mathbb{Z}_q$, Alice and Bob receive shares $x_A$ and $x_B$, respectively, that are chosen uniformly at random in $\mathbb{Z}_q$ with the constraint that $x = x_A + x_B \bmod q$. We denote the pair of shares by $[\![x]\!]_q$. All computations are modulo $q$ and the modular notation is henceforth omitted for conciseness. Note that no information about the secret value $x$ is revealed to a party holding only one share. The secret shared value can be revealed/opened to each party by combining both shares. Some operations on secret shared values can be computed locally with no communication. Let $[\![x]\!]_q$, $[\![y]\!]_q$ be secret shared values and $c$ be a constant. Alice and Bob can perform the following operations locally:

Addition ($z = x + y$): Each party locally adds its local shares of $x$ and $y$ in order to obtain a share of $z$. This will be denoted by $[\![z]\!]_q \leftarrow [\![x]\!]_q + [\![y]\!]_q$.

Subtraction ($z = x - y$): Each party locally subtracts its local share of $y$ from that of $x$ in order to obtain a share of $z$. This will be denoted by $[\![z]\!]_q \leftarrow [\![x]\!]_q - [\![y]\!]_q$.

Multiplication by a constant ($z = cx$): Each party multiplies its local share of $x$ by $c$ to obtain a share of $z$. This will be denoted by $[\![z]\!]_q \leftarrow c[\![x]\!]_q$.

Addition of a constant ($z = x + c$): Alice adds $c$ to her share $x_A$ of $x$ to obtain $z_A$, while Bob sets $z_B = x_B$. This will be denoted by $[\![z]\!]_q \leftarrow [\![x]\!]_q + c$.
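The local operations listed above can be illustrated with a short sketch. It assumes a ring of size $2^{64}$ and represents a sharing as a tuple `(alice_share, bob_share)` held in one process purely for demonstration; the helper names are hypothetical and not taken from the authors' implementation.

```python
import secrets

Q = 2 ** 64  # ring size (an illustrative choice)


def share(x):
    """Split x into two additive shares that sum to x modulo Q."""
    alice = secrets.randbelow(Q)
    return alice, (x - alice) % Q


def reconstruct(sharing):
    """Open a sharing by adding both shares modulo Q."""
    return (sharing[0] + sharing[1]) % Q


def add(x, y):
    """Each party adds its own two shares."""
    return (x[0] + y[0]) % Q, (x[1] + y[1]) % Q


def sub(x, y):
    """Each party subtracts its share of y from its share of x."""
    return (x[0] - y[0]) % Q, (x[1] - y[1]) % Q


def mul_const(c, x):
    """Each party multiplies its own share by the public constant c."""
    return (c * x[0]) % Q, (c * x[1]) % Q


def add_const(c, x):
    """Only Alice adds the public constant; Bob keeps his share unchanged."""
    return (x[0] + c) % Q, x[1]
```

For example, `reconstruct(add(share(5), share(7)))` opens to 12 regardless of the random shares chosen, and none of the four operations requires any message between the parties.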
Functionality (Figure 4: distributed matrix multiplication)
runs with Alice and Bob and is parametrized by the size of the ring and the dimensions of the two input matrices.
Input: Upon receiving a message from Alice/Bob with its shares of the two input matrices, verify that each share has the expected dimensions over the ring. If it does not, abort. Otherwise, record the shares, ignore any subsequent message from that party and inform the other party about the receipt.
Output: Upon receipt of the shares from both parties, reconstruct both matrices from the shares, compute their product and create a secret sharing of it to distribute to Alice and Bob: a corrupt party fixes its share of the output to any chosen matrix and the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
The secure multiplication of secret shared values cannot be done locally and involves communication between Alice and Bob. To obtain an efficient secure multiplication solution, we use the multiplication triples technique that was originally proposed by Beaver [beaver1997commodity]. We use a trusted initializer to predistribute the multiplication triples (which are a form of correlated randomness) to Alice and Bob. We use the same protocol for secure (matrix) multiplication of secret shared values as in [IEEETDSC:CDHK+17, Dowsley16], together with its special cases for the multiplication of scalars and for the inner product. As shown in [IEEETDSC:CDHK+17], the protocol (described in Protocol 2) UC-realizes the distributed matrix multiplication functionality (described in Figure 4) in the trusted initializer model.
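The multiplication triples technique can be sketched for the scalar case as follows. This is an illustrative single-process simulation of the standard Beaver construction, not a transcription of Protocol 2: the ring size, the function names, and the choice of which party adds the public term are assumptions.

```python
import secrets

Q = 2 ** 64  # ring size (an illustrative choice)


def share(x):
    """Split x into two additive shares modulo Q."""
    alice = secrets.randbelow(Q)
    return alice, (x - alice) % Q


def reconstruct(sharing):
    return (sharing[0] + sharing[1]) % Q


def trusted_initializer_triple():
    """The TI samples u, v at random and hands out sharings of u, v and w = u * v."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(u), share(v), share((u * v) % Q)


def secure_multiply(x, y, triple):
    """Multiply two shared scalars using one predistributed triple."""
    u, v, w = triple
    # Each party masks its shares locally; only the masked values d, e are opened.
    d = (x[0] - u[0] + x[1] - u[1]) % Q    # d = x - u, reveals nothing about x
    e = (y[0] - v[0] + y[1] - v[1]) % Q    # e = y - v, reveals nothing about y
    # x*y = w + d*v + e*u + d*e; the public product d*e is added by Alice only.
    z_alice = (w[0] + d * v[0] + e * u[0] + d * e) % Q
    z_bob = (w[1] + d * v[1] + e * u[1]) % Q
    return z_alice, z_bob
```

Each triple must be used for a single multiplication only, since reusing the masks `u` and `v` would let the opened values leak information about the inputs.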
Converting to Fixed-Point Representation
Each data owner initially needs to convert their training data to integers modulo $q$ so that they can be secret shared. As illustrated in Figure 5, each feature value $x$ is converted into a fixed-point approximation of $x$ using a two's complement representation for negative numbers. We define this new value as $Q(x)$. This conversion is shown in Equation (1):

$$Q(x) = \begin{cases} 2^{\lambda} - \lfloor 2^{a} \cdot |x| \rfloor & \text{if } x < 0 \\ \lfloor 2^{a} \cdot x \rfloor & \text{if } x \geq 0 \end{cases} \qquad (1)$$

Specifically, when we convert $Q(x)$ into its bit representation, we define the first $a$ bits from the right to hold the fractional part of $x$, the next $b$ bits to represent the non-negative integer part of $x$, and the most significant bit (MSB) to represent the sign (positive or negative). We define $\lambda$ to represent the total number of bits, such that the ring size is $q = 2^{\lambda}$. It is important to choose a $\lambda$ that is large enough to represent the largest number that can be produced during the LR protocol, including untruncated products of two fixed-point values (see Truncation). It is also important to choose a $b$ that is large enough to represent the maximum possible value of the integer part of all $x$'s (this is dependent on the data). This conversion and bit representation is shown in Figure 5.
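The conversion can be illustrated with a small encoder and decoder. This is a sketch under stated assumptions: a 64-bit ring, 12 fractional bits, rounding toward zero via the floor of the absolute value, and hypothetical function names `to_fixed` and `from_fixed`.

```python
import math

LAMBDA = 64   # total number of bits in the ring (assumed)
A = 12        # number of fractional bits (assumed)


def to_fixed(x, a=A, lam=LAMBDA):
    """Encode a real number as a ring element, two's complement for negatives."""
    scaled = math.floor(abs(x) * 2 ** a)
    if x >= 0:
        return scaled % 2 ** lam
    return (2 ** lam - scaled) % 2 ** lam


def from_fixed(v, a=A, lam=LAMBDA):
    """Decode a ring element back to a float, reading the top half as negative."""
    if v >= 2 ** (lam - 1):
        v -= 2 ** lam
    return v / 2 ** a
```

With these parameters, `to_fixed(1.5)` is 6144 (that is, 1.5 times $2^{12}$) and `to_fixed(-1.5)` is $2^{64} - 6144$; values whose fractional part needs more than 12 bits lose the excess precision.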
Truncation
When multiplying numbers that were converted into a fixed-point representation with $a$ fractional bits, the resulting product will end up with more bits representing the fractional part. For example, the fixed-point representations of $x$ and $y$, for $a$ fractional bits, are $x \cdot 2^{a}$ and $y \cdot 2^{a}$, respectively. The multiplication of both these terms results in $xy \cdot 2^{2a}$, showing that now $2a$ bits are representing the fractional part, which we must scale back down to $xy \cdot 2^{a}$ to do any further computations. In our solution, we use the two-party local truncation protocol for fixed-point representations of real numbers proposed in [mohassel2017secureml]. It does not involve any messages between the two parties; each party simply performs an operation on its own local share. This protocol almost always incurs an error of at most a bit flip in the least-significant bit. However, with a small probability, which depends on the number of fractional bits, the resulting value is completely random.

When this truncation protocol is performed on increasingly large data sets (in our case we run over 7 billion secure multiplications), the probability of an erroneous truncation becomes a real issue, one that was not significant in previous implementations. There are two phases in which truncation is performed: (1) when computing the dot product (inner product) of the current weights vector with a training example in line 7 of Protocol 1, and (2) when the weight differentials are adjusted in line 9 of Protocol 1. If a truncation error occurs during (1), the resulting erroneous value will be pushed into a reasonable range by the activation function and incur only a minor error for that round. If the error occurs during (2), an element of the weights vector will be updated to a completely random ring element and recovery from this error will be impossible. To mitigate this in experiments, we use 10 to 12 bits of fractional precision with a ring size of 64 bits, which keeps the probability of failure low. The number of truncations that need to be performed was also reduced in our implementation by waiting to perform truncation until it is absolutely required. For instance, instead of truncating each result of multiplication between an attribute and its corresponding weight, a single truncation can be performed at the end of the entire dot product. Additional error is incurred on the accuracy by the fixed-point representation itself. Through cross-validation with an in-the-clear implementation, we determined that 12 bits of fractional precision provide enough accuracy to make the output accuracy indistinguishable.
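The local truncation step can be sketched as below, following the construction described in [mohassel2017secureml]: one party shifts its share directly, while the other negates, shifts, and negates again. The parameter values and function names are assumptions for illustration, and both shares are held in one process only to make the example checkable.

```python
import random

LAMBDA = 64          # ring bit length (assumed, matching the 64-bit ring mentioned above)
Q = 1 << LAMBDA
A = 12               # fractional bits (assumed)


def share(value, rng):
    """Split a ring element into two additive shares."""
    alice = rng.randrange(Q)
    return alice, (value - alice) % Q


def truncate_locally(alice_share, bob_share, a=A):
    """Each party drops the a lowest bits of its own share; no messages are exchanged."""
    alice_out = alice_share >> a
    bob_out = (Q - (((Q - bob_share) % Q) >> a)) % Q
    return alice_out, bob_out


def to_signed(value):
    """Interpret a ring element as a two's complement signed integer."""
    return value - Q if value >= Q // 2 else value
```

For a product such as 6144 times 9216 (the encodings of 1.5 and 2.25 with 12 fractional bits), the reconstructed result is within one unit of 13824, the encoding of 3.375. The rare failure case arises when the random share happens to be smaller in magnitude than the value being truncated, which is why a large ring relative to the value size matters.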
Functionality (Figure 6: bit-decomposition)
runs with Alice and Bob and is parametrized by the bit-length of the value being converted from an additive sharing over the larger ring to additive bitwise sharings in $\mathbb{Z}_2$.
Input: Upon receiving a message from Alice or Bob with its share of the value, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct the value from the shares and, for each bit of its binary representation, distribute a new sharing of that bit. Before the output is delivered, the corrupt party fixes its shares of the output to any desired value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraints.
Functionality (Figure 7: secret sharing conversion) is parametrized by the bit-length of the ring in which the output is shared.
Input: Upon receiving a message from Alice/Bob with her/his share of the input bit, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct the bit, then create and distribute to Alice and Bob a secret sharing of it over the output ring. Before the output shares are delivered, a corrupt party fixes its share of the output to any constant value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Conversion of Sharings
For efficiency reasons, in some of the steps for securely computing the activation function we use secret sharings over $\mathbb{Z}_2$, while in others we use secret sharings over $\mathbb{Z}_q$. Therefore we need to be able to convert between the two types of secret sharings.
We use the two-party protocol from [IEEETDSC:CDHK+17] for performing the bit-decomposition of a secret-shared value into sharings of the individual bits of its binary representation. It works like the ripple carry adder arithmetic circuit, based on the insight that the difference between the sum of the two additive shares held by the parties and an "XOR-sharing" of that sum is the carry vector. As proven in [IEEETDSC:CDHK+17], the bit-decomposition protocol (described in Protocol 3) UC-realizes the bit-decomposition functionality described in Figure 6.
In our implementation we use a highly parallelized and optimized version of the bit-decomposition protocol in order to improve the communication efficiency of the overall solution. The optimizations are described in the Appendix.
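The ripple carry insight behind the bit-decomposition can be illustrated in the clear. The sketch below is not Protocol 3: it computes the bits of the sum of two shares with both shares visible, using the hypothetical function name `bits_of_sum`, whereas in the secure protocol each AND between bits held by different parties would be a secure multiplication in $\mathbb{Z}_2$.

```python
def bits_of_sum(alice_share, bob_share, nbits):
    """Return the bits (least significant first) of (alice_share + bob_share) mod 2**nbits."""
    out = []
    carry = 0
    for i in range(nbits):
        a = (alice_share >> i) & 1
        b = (bob_share >> i) & 1
        out.append(a ^ b ^ carry)                 # sum bit = XOR-sharing bit XOR carry
        carry = (a & b) ^ (carry & (a ^ b))       # carry into the next position
    return out


def bits_to_int(bits):
    """Reassemble an integer from bits given least significant first."""
    return sum(bit << i for i, bit in enumerate(bits))
```

The XOR of the two shares' bits at each position is already a valid bit sharing of everything except the carries, so only the carry vector has to be computed interactively.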
The opposite of a secure bit-decomposition is converting from a bit sharing to an additive sharing over a larger ring. In our secure activation function protocol, we require securely converting a bit sharing to an additive sharing over the larger ring. This is done using the protocol from [NeurIPS2019] (described in Protocol 4) that UC-realizes the secret sharing conversion functionality described in Figure 7.
Functionality (Figure 8: activation function)
Input: Upon receiving a message from Alice/Bob with her/his share of the input value, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct the input value, compute the result of the activation function on it, and then create and distribute to Alice and Bob a secret sharing of the result (using the fixed-point representation). Before the output shares are delivered, a corrupt party fixes its share of the output to any constant value. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Secure Activation Function
We propose a new protocol that evaluates the activation function $\rho$ from Figure 2(b) directly over additive shares and does not require full secure comparisons, which would have been more expensive. Instead of doing straightforward comparisons between $z$, $0.5$ and $-0.5$, we derive the result through checking two things: (i) whether $z' = z + 1/2$ is positive or negative; (ii) whether $z' \geq 1$. Both checks can be performed without using a full comparison protocol.

When $z'$ is bit decomposed, the most significant bit is 0 if $z'$ is non-negative and 1 if $z'$ is negative. In fact, if out of the $\lambda$ bits, the $a$ lowest bits are used to represent the fractional component and the next $b$ bits are used to represent the integer component, then the remaining $\lambda - a - b$ bits all have the same value as the most significant bit. We will use this fact in order to optimize the protocol by only performing a partial bit-decomposition and deducing whether $z'$ is positive or negative from the $(a+b+1)$-th bit.

In the case that $z'$ is negative, the output of $\rho$ is $0$. But, if $z'$ is positive, we need to determine whether $z' \geq 1$ in order to know if the output of $\rho$ should be fixed to $1$ or to $z'$. A positive $z'$ is such that $z' \geq 1$ if and only if at least one of the bits corresponding to the integer component of its representation is equal to 1, therefore we only need to analyze those bits to determine if $z' \geq 1$.

Our secure protocol is described in Protocol 5. The AND operation corresponds to multiplications in $\mathbb{Z}_2$. By the application of De Morgan's law, the OR operation is performed using the AND and negation operations. The successive multiplications can be optimized to only take a logarithmic number of rounds by using well-known techniques.

The activation function protocol UC-realizes the activation function functionality described in Figure 8. The correctness can be checked by inspecting the three possible cases: (i) if $z' \geq 1$, then the sign bit is 0 and the OR of the integer bits is 1 (since at least one of the bits representing the integer component of $z'$ will have a value 1). The output is thus $2^{a}$ (the fixed-point representation of 1); (ii) if $0 \leq z' < 1$, then the sign bit is 0 and the OR of the integer bits is 0, and therefore the output will be $z'$, which is the fixed-point representation of $z + 1/2$; (iii) if $z' < 0$, then the sign bit is 1 and the output will be a secret sharing representing zero as expected. The security follows trivially from the UC-security of the building blocks used and the fact that no secret sharing is opened.
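The bit-level logic of the activation function can be checked in the clear with the sketch below. It is not Protocol 5: it operates on an opened fixed-point value rather than on shares, and the parameter values (64-bit ring, 12 fractional bits, 20 integer bits) and the function name are assumptions. In the secure protocol, each bitwise AND would be a secure multiplication over shared bits and the final selection would be done on converted additive shares.

```python
LAMBDA = 64   # ring bit length (assumed)
A = 12        # fractional bits (assumed)
B = 20        # integer bits (assumed); inputs must satisfy |z + 1/2| < 2**B


def activation_from_bits(z_fixed, a=A, b=B, lam=LAMBDA):
    """Evaluate the clipped activation on a fixed-point ring element using only bit tests."""
    q = 1 << lam
    z_shift = (z_fixed + (1 << (a - 1))) % q        # z' = z + 1/2 in fixed point
    sign = (z_shift >> (a + b)) & 1                 # bit a+b+1 copies the MSB
    none_set = 1
    for i in range(a, a + b):
        none_set &= 1 - ((z_shift >> i) & 1)        # AND of the negated integer bits
    ge_one = 1 - none_set                           # De Morgan: OR = NOT(AND(NOT ...))
    # Select 0 if negative, 1 (encoded as 2**a) if z' >= 1, otherwise z' itself.
    return ((1 - sign) * (ge_one * (1 << a) + (1 - ge_one) * z_shift)) % q
```

With 12 fractional bits, the input 1024 (encoding 0.25) maps to 3072 (encoding 0.75), the input 8192 (encoding 2.0) maps to 4096 (encoding 1), and the encoding of -3.0 maps to 0.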
Secure Logistic Regression Training
We now present our secure LR training protocol that uses a combination of the previously mentioned building blocks.
Notice that in the full gradient descent technique described in Protocol 1, the only operations that cannot be performed fully locally by the data processors, i.e. on their own local shares, are:

The computation of the inner product in line 7

The activation function in line 7

The multiplication of the prediction error with the input attribute values in line 9
Our secure logistic regression training protocol (described in Protocol 6) shows how the secure building blocks described before can be used to securely compute these operations. The inner product is securely computed using the inner product protocol on line 5, and since this involves multiplication of numbers that are scaled to a fixed-point representation, we truncate the result using the truncation protocol. The activation function is securely computed using the secure activation function protocol on line 6. The multiplication of the prediction error with the input attribute values is done using batched secure multiplication on line 11. Since this also involves multiplication of numbers that are scaled, the result is truncated in line 14. A slight difference between the full gradient descent technique described in Protocol 1 and our protocol is that, instead of updating the weight differentials after every evaluation of the activation function, we batch together all activation function evaluations before computing the weight differentials. Since the activation function requires a bit-decomposition of the input, we can now make use of the efficient batch bit-decomposition protocol (see Appendix) within the activation function protocol.
The logistic regression training protocol UC-realizes the logistic regression training functionality described in Figure 9. The correctness is trivial and the security follows straightforwardly from the UC-security of the building blocks it uses.
Functionality (Figure 9: logistic regression training)
Input: Upon receiving a message from Alice/Bob with her/his shares of the set of training examples, record the shares, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, locally perform the same computational steps as the training protocol using the secret sharings, and take the resulting weight vector as the output. Before the output shares are delivered, a corrupt party can fix the shares that it will get, in which case the other shares are adjusted accordingly to still sum to the resulting weight vector. The output shares are delivered to the parties.
The following steps describe endtoend how to securely train a LR classifier:

The TI sends the correlated randomness needed for efficient secure multiplication to the data processors. Note that while our current implementation has the TI continuously sending the correlated randomness, it is possible for the TI to send all correlated randomness as the first step, after which it can leave and need not be involved during the rest of the protocol.

Each data owner converts the values in the set of training examples that it holds to a fixed-point representation as described in Equation 1. Each value is then split into two shares, which are sent to data processor 1 and data processor 2, respectively.

Each data processor receives the shares of data from the data owners. They now have secret sharings of the set of training examples . The learning rate and number of iterations are predetermined and public to both data processors.

The data processors collaborate to train the LR model. They both follow the secure logistic regression training protocol .

At the end of the protocol, each data processor holds shares of the model’s weights . Each data processor sends its shares to all of the data owners, who can then combine the shares to learn the weights of the logistic regression model.
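The sharing and recombination steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes additive sharing over the ring of 64-bit integers, and the `SplitMix64` generator is a non-cryptographic stand-in that a real deployment would replace with a cryptographically secure random source. All names are hypothetical.

```rust
use std::num::Wrapping;

type Share = Wrapping<u64>;

/// Non-cryptographic stand-in for a secure random generator (SplitMix64).
/// Assumption for illustration only; real masks must come from a CSPRNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Data owner: split a ring element into two additive shares,
/// one for each data processor.
fn share(secret: Share, rng: &mut SplitMix64) -> (Share, Share) {
    let mask = Wrapping(rng.next_u64());
    (mask, secret - mask)
}

/// Data owner: recombine the two shares returned by the data processors.
fn reconstruct(s1: Share, s2: Share) -> Share {
    s1 + s2
}

/// Data processor: addition of shared values is purely local.
fn add_local(a: Share, b: Share) -> Share {
    a + b
}
```

Each share on its own is uniformly distributed in the ring (given a proper random source), which is why a single data processor learns nothing from it. Only the sum of both shares reveals the value.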
Cryptographic Engineering Optimizations
Sockets and Threading
A single iteration of the LR protocol is highly parallelizable in three distinct segments: (1) computing the dot products between the current weights and the dataset, (2) computing the activation of each dot product result, and (3) computing the gradient and updating the weights. Each of these phases requires a large number of computations, none of which depends on the others. We take advantage of this by completing each phase with thread pools that can be configured for the machine running the protocol. With Rust’s ownership concept, it is possible to yield results from threads without message passing or reallocation. Hence, the code is constructed to transfer ownership of results at each phase back to the main thread, avoiding as much communication between threads as possible. Additionally, all threads complete socket communications by computing all intermediate results directly in the socket buffer, which is implemented as a union of a byte array and an unsigned 64-bit integer array. Each thread allocates this buffer on its stack, which circumvents the need for a shared memory block while also avoiding slower heap memory. In our trials, this configuration reduced running times significantly. Further, all modular arithmetic operations are handled implicitly with the Wrapping struct of the Rust standard library, which makes integer arithmetic wrap around on overflow. As long as the size of the ring over which the MPC protocols are performed is selected to align with a provided primitive bit width (i.e. 8, 16, 32, 64, 128), this construction makes it possible to omit computing the remainder after each arithmetic operation.
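The following sketch illustrates two of the ideas above: implicit reduction modulo 2^64 through `std::num::Wrapping`, and worker threads that hand owned results back to the main thread. It is an assumption-laden illustration rather than the authors' implementation: it uses `std::thread::scope` instead of a configurable thread pool, operates on plain values rather than shares, and the function names are hypothetical.

```rust
use std::num::Wrapping;
use std::thread;

type Ring = Wrapping<u64>;

/// Dot product over Z_{2^64}. With Wrapping, + and * wrap around on
/// overflow, so no explicit remainder computation is needed.
fn dot(w: &[Ring], x: &[Ring]) -> Ring {
    w.iter().zip(x).fold(Wrapping(0), |acc, (a, b)| acc + *a * *b)
}

/// Compute one dot product per row, splitting the rows across workers.
/// Each worker returns an owned Vec through its join handle, so the
/// results move to the main thread without channels or shared buffers.
fn batch_dot(w: &[Ring], rows: &[Vec<Ring>], n_threads: usize) -> Vec<Ring> {
    let n_threads = n_threads.max(1);
    let chunk = ((rows.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = rows
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|r| dot(w, r)).collect::<Vec<Ring>>()))
            .collect();
        // Join in spawn order so the output order matches the row order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

Because the rows are independent, the chunks can be processed in any order. Joining the handles in spawn order is what keeps the output aligned with the input rows.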
Results
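Table 1: 5-fold cross-validation accuracy and average runtime of secure LR training.

| Dataset | # features | # pos. samples | # neg. samples | # of iterations | 5-fold CV accuracy | avg. runtime |
|---|---|---|---|---|---|---|
| BC-TCGA | 17,814 | 422 | 48 | 10 | 99.58% | 2.52 sec |
| GSE2034 | 12,634 | 142 | 83 | 223 | 64.82% | 26.90 sec |

Table 2: Runtime comparison with SecureML.

| | BC-TCGA training (online) | GSE2034 training (online) | activation function (one evaluation) |
|---|---|---|---|
| Our work | 2.52 sec | 26.90 sec | 0.030 ms |
| SecureML | 12.73 sec | 49.95 sec | 0.057 ms |

Table 3: Runtime of the secure activation function for batches of evaluations.

| # evaluations | avg. runtime | runtime per activation (runtime/#eval) |
|---|---|---|
| | 9 ms | 0.035 ms |
| | 16 ms | 0.031 ms |
| 1024 | 30 ms | 0.029 ms |
| | 59 ms | 0.028 ms |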
We implemented the protocols from the methods section in Rust (https://bitbucket.org/uwtppml/idash2019) and experimentally evaluated them on the BC-TCGA and GSE2034 datasets of the iDASH 2019 competition. Both datasets contain gene expression data from breast cancer patients, with samples that are either normal tissue/non-recurrence samples (negative) or breast cancer tissue/recurrence tumor samples (positive) [xie2016comparison]. We trained LR models on both datasets with a learning rate . We use a fixed number of iterations for each dataset: 10 iterations for the BC-TCGA dataset and 223 iterations for the GSE2034 dataset. The accuracy of the resulting models, evaluated with 5-fold cross-validation, is presented in Table 1, along with the average runtime for training those models. It is important to note that these are the same accuracies that are obtained when training in the clear, i.e. there is no accuracy loss in the secure version.
We used integer precision , fractional precision and ring size (these choices were made based on experiments in the clear as mentioned in the previous section). We ran the experiments on AWS c5.9xlarge machines with 36 vCPUs and 72 GiB of memory. Each of the parties ran on a separate machine (connected with a Gigabit Ethernet network), which means that the results in Table 1 cover communication time in addition to computation time. The results show that our implementation can securely train models with state-of-the-art accuracy [xie2016comparison] on the BC-TCGA and GSE2034 datasets in about 2.52 seconds and 26.90 seconds, respectively.
A previous version of this implementation was submitted to the iDASH 2019 Track 4 competition. Nine of the 67 teams who entered Track 4 completed the challenge. Our solution was one of the three solutions that tied for first place. Our implementation trained on all of the features for both datasets (no feature engineering was done) and generated a model that gave the highest accuracy, with runtimes that were well within the competition’s limit of 24 hours. The implementation presented in the current work is further optimized relative to the iDASH version and achieves far better runtimes.
We note that while SecureML differs from our work in its setup and cryptographic primitives, it shares many similarities with ours and reports a fast runtime, so we find it valuable as a baseline for comparison. While SecureML does not originally use a TI to predistribute the multiplication triples, it would be easy to adapt it to use a TI for that purpose. Therefore, in order to have a fair comparison, we compare our protocol runtime against only SecureML’s online runtime (thus excluding its offline runtime). We ran the SecureML implementation on the same AWS machines using the same datasets (see Table 2 for runtime comparisons). For both datasets, our online phase runs faster than SecureML’s online phase, which trains BC-TCGA in 12.73 seconds and GSE2034 in 49.95 seconds.
We then compare online microbenchmark computation times. For the computation of the activation function, our run of the SecureML code reported around 0.057 ms to 0.059 ms for 1 activation, while our implementation completes 1024 evaluations in around 30 ms (0.029 ms per activation function). This makes our secure activation function implementation nearly twice as fast as SecureML’s. Additionally, it eliminates the overhead of switching between Yao gates and additive secret sharing. Furthermore, our activation function runs more efficiently (per evaluation) the more evaluations of it need to be computed, due to the design of the batch bit-decomposition protocol. This is illustrated in Table 3, where the calculated runtime per evaluation (runtime divided by number of evaluations) decreases as the number of evaluations increases.
Discussion
Our runtime experiments on securely training an LR model show that it is feasible to train on data that includes a large number of attributes, as is common with genomic data. Given the high dimensionality of the genomic data, an interesting direction for future work would be the design of MPC protocols for privacy-preserving feature reduction. Any kind of feature reduction would decrease the secure training runtime, possibly at the cost of a slight decrease in accuracy. We demonstrate this by choosing (in the clear) 54 features of the BC-TCGA dataset that were part of the 76-gene signature described in [wang2005gene]. Training on these 54 features, we get a 5-fold cross-validation accuracy of (training on all features produced ), and the average secure training time (of three runs) is 0.51 seconds, which is about a 2 second decrease from training on all 17,814 features. The genes in the GSE2034 dataset are not labeled in a way that lets us map them to the 76-gene signature to test the accuracy for a reduced number of features, but we test the runtime of training on 76 attributes and get an average of 6.71 seconds, which is about a 20 second decrease from training on all 12,634 features. This shows that if feature reduction can be performed, runtimes can be improved while still producing an accurate trained model.
Our main contribution is the proposal of what is, to the best of our knowledge, the fastest implementation and protocol for privacy-preserving training of logistic regression models. The main novelty is a new protocol for privately evaluating the activation function, which can be computed using only additive shares and MPC protocols, without using a protocol for secure comparison. We use as an approximation of the sigmoid function since that is what is traditionally used in LR training, but is also used as an activation function in neural networks. Therefore, our fast secure protocol for computing can also result in faster neural network training. While training neural networks is outside the scope of this paper, we note that our results can be applicable to those types of ML models as well.
Conclusions
In this paper, we have described a novel protocol for implementing secure training of LR over distributed parties using MPC. Our protocol and implementation present several novel points and optimizations compared to existing work, including: (i) a novel protocol for computing the activation function that avoids the use of full-fledged secure comparison protocols; (ii) a series of cryptographic engineering optimizations to improve the performance.
With our implementation, we can train on the BC-TCGA dataset with 17,814 features and 375 samples with 10 iterations in 2.52 seconds, and we can train on the GSE2034 dataset with 12,634 features and 179 samples with 223 iterations in 26.90 seconds. A less optimized version of this implementation tied for first place at the iDASH 2019 Track 4 competition when considering accuracy and efficiency. Our solution is particularly efficient for LANs, where we can perform 1024 secure computations of the activation function in about 30 ms. To the best of our knowledge, ours is the fastest protocol for privately training logistic regression models over local area networks.
List of abbreviations
ML: Machine learning; MPC: Multi-party computation; LR: Logistic regression; UC: Universal composability; MSB: Most significant bit; TI: Trusted initializer; LAN: Local area network
Appendix
Optimization of
Overview and Previous Work
The functionality (described in Methods) is easily realized as an adder circuit that takes as inputs each bit of the additive shares of a secret sharing in a large ring and outputs an “XOR-sharing” of the secret . First, each party regards its share of , denoted , as an XOR-shared secret and passes it to the adder circuit. The adder circuit then computes the carry vector which accounts for the rollover of binary addition. Adding this vector to all bitwise shares resolves the difference between and the bit-decomposed secret .
Naively, this carry vector can be obtained with linear communication complexity by means of ripple carry addition, as is described in Protocol 3. However, it is possible to achieve logarithmic communication complexity and even constant complexity [Toft2009ConstantRoundsAB] (though with worse performance than the logarithmic version for all reasonable bit lengths).
The highest performing realization of for realistic bit lengths is based on a speculative adder circuit [IEEETDSC:CDHK+17] in which at each layer the next set of carry bits is computed twice: once for the case that the previous carry bit was 0 and once for the case that it was 1. This protocol has rounds of communication and requires a total data transfer of bits.
We propose a new, highly optimised protocol based on a matrix composition network that reduces the number of communication rounds by 1 (or 2, in special cases) and requires a small fraction of the aforementioned data transfer cost.
Matrix composition network
To sum the binary numbers and , the th bit is given by , where In an alternate view, the carry can be seen to depend on two signals which in turn depend on and . () creates a new carry bit at the th position, and () perpetuates the previous carry bit, if it exists. In this representation, and . This sumofproducts form of the expression for lends itself to a matrix representation
When matrices in the form of are composed, the lower entries remain unchanged. This implies that
Therefore, to compute all , it is sufficient to compute the set of all matrix compositions
Note that it is not necessary to compute the th carry bit because depends on . Treating the carry-in to the 1st bit as the vector , all can be derived implicitly from the upper right-hand entry of (here, denotes the matrix composed of all matrices through , consecutively).
From the MPC perspective, this matrix composition requires two multiplications: and as seen in the equation below. The OR operation (+), which usually requires multiplication in MPC, is reduced to XOR based on the observation that and cannot both be true for a given .
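The composition rule can be checked in the clear with a short sketch. The notation here is an assumption chosen for illustration (the symbols of the original equations are not reproduced): for each bit position, `g` is the AND of the two input bits (generate) and `p` is their XOR (propagate), and a matrix is represented by its upper row `(p, g)` because the lower row never changes. Composing two such matrices takes the two AND operations named in the text, and XOR stands in for OR.

```rust
/// Upper row of a carry matrix for one bit position, or for a composed
/// range of positions: `p` propagates an incoming carry, `g` generates one.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Gp {
    p: bool,
    g: bool,
}

/// Signals of bit position `i` of the summands `a` and `b`.
fn signals(a: u64, b: u64, i: u32) -> Gp {
    let (ai, bi) = ((a >> i) & 1 == 1, (b >> i) & 1 == 1);
    Gp { p: ai ^ bi, g: ai & bi }
}

/// Compose the matrix of a more significant range (`hi`) with that of the
/// adjacent less significant range (`lo`). Two ANDs are needed; XOR can
/// replace OR because `p` and `g` are never both true.
fn compose(hi: Gp, lo: Gp) -> Gp {
    Gp { p: hi.p & lo.p, g: hi.g ^ (hi.p & lo.g) }
}

/// Carry vector of a + b with carry-in 0: bit i of the result is the
/// carry into position i, read from `g` of the composition of bits 0..i-1.
fn carries(a: u64, b: u64) -> u64 {
    let mut out = 0u64;
    let mut acc = signals(a, b, 0);
    for i in 1..64 {
        if acc.g {
            out |= 1 << i;
        }
        acc = compose(signals(a, b, i), acc);
    }
    out
}
```

XOR-ing the carry vector into the bitwise XOR of the summands gives their sum modulo 2^64, which is the property the bit-decomposition relies on. This sketch composes the matrices sequentially; the network described next reorganizes the same compositions into logarithmic depth.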
The entire set of matrix compositions can be realized in a logarithmic depth network by, at the th layer, computing all compositions that require fewer than compositions. To set up conditions that allow us to minimize the total data transfer, the constraint is added that each should be the composition of the “largest” matrix from the previous layer, , with the remainder . If does not exist in the network, it is added recursively following the same set of constraints.
Figure 10 shows an example with . This network is hereafter referred to as where is the highest order bit to decompose. The protocol description that follows considers only the case where , though the protocol functions the same for any . For instance, in Protocol 3, when using to find the of a secret, it is sufficient to set .
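A logarithmic-depth arrangement of the compositions can be sketched as below. This is a generic Sklansky-style parallel prefix layout chosen for illustration and is an assumption: it is not claimed to reproduce the exact network of Figure 10 or its data-transfer optimizations. It works on plain bits, and the names `prefix_network` and `carries_log_depth` are hypothetical.

```rust
/// Upper row (propagate, generate) of a carry matrix.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Gp {
    p: bool,
    g: bool,
}

/// Compose a more significant range (`hi`) with the adjacent lower range (`lo`).
fn compose(hi: Gp, lo: Gp) -> Gp {
    Gp { p: hi.p & lo.p, g: hi.g ^ (hi.p & lo.g) }
}

/// Compute all prefix compositions (entry j covers positions 0..=j) in
/// layers. All compositions within a layer are independent of each other,
/// so a layer corresponds to one round of batched multiplications.
/// Returns the prefixes and the number of layers used.
fn prefix_network(leaves: &[Gp]) -> (Vec<Gp>, usize) {
    let n = leaves.len();
    let mut cur = leaves.to_vec();
    let (mut span, mut depth) = (1usize, 0usize);
    while span < n {
        let prev = cur.clone();
        for j in 0..n {
            if j & span != 0 {
                // Last index of the lower half of j's block of size 2 * span.
                let lo = (j & !(span - 1)) - 1;
                cur[j] = compose(prev[j], prev[lo]);
            }
        }
        span *= 2;
        depth += 1;
    }
    (cur, depth)
}

/// Carry vector of a + b (carry-in 0) from the prefix network over the
/// 63 lowest positions; the carry out of the top bit is not needed.
fn carries_log_depth(a: u64, b: u64) -> (u64, usize) {
    let leaves: Vec<Gp> = (0..63)
        .map(|i| {
            let (ai, bi) = ((a >> i) & 1 == 1, (b >> i) & 1 == 1);
            Gp { p: ai ^ bi, g: ai & bi }
        })
        .collect();
    let (prefix, depth) = prefix_network(&leaves);
    let mut carries = 0u64;
    for (i, m) in prefix.iter().enumerate() {
        if m.g {
            carries |= 1u64 << (i + 1);
        }
    }
    (carries, depth)
}
```

For 64-bit inputs this layout uses six layers. Every composition in a layer reads only values from the previous layer, which is what lets a whole layer be batched into a single communication round in the secure setting.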
Efficiency discussion
The setup phase prior to the call to requires multiplications over to compute all . This corresponds to one communication round and bits of data transfer.
A call to has communication complexity corresponding to the depth of the network, , and multiplications over per layer, with fewer on the final layer when is not a power of 2. However, because the matrices at each node of are reused extensively and are known not to change value, the Beaver triples used to mask the matrices can be designed to contain redundancies that minimise the data transfer at each layer [mohassel2017secureml]. By reusing correlated randomness where information leakage is not possible, only masks need to be transferred at depth , for . At depth 0, there are masks, one for each matrix. Each matrix mask is 2 bits (one for each of the and bits), so the total data transfer is .
The recombination phase after is computed has only local computations and thus contributes nothing to the complexity.
Combining all phases, we see that has a communication cost of and a total data transfer cost of bits. Compared with the speculative adder, the number of communication rounds is decreased by 1 in all cases and by 2 in the case that is a power of 2. The total data transfer cost has roughly the data transfer rate of the previous work at . For higher bit lengths, the ratio quickly converges to near .
Implementation and Batching
can be implemented efficiently as a set of index pairs that correspond to the positions of the and bits that need to be combined at each layer. Once per layer, all products , can be computed in a single call to by taking the bitwise product between the concatenations , and splitting the result.
Extending to the case that many values need to be bit decomposed at the same time (as in Protocol 6), a vector of inputs can be decomposed “in parallel” by taking vertical slices over the and bits of each element and repacking them into a transposed form. In this way, each layer of can operate on a vector of matrices (represented as two lists of bit slices) to produce a vector of matrix compositions. This method has no effect on the number of rounds of communication and the total data transfer scales linearly with the length of the input vector.
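The repacking into a transposed form can be sketched in the clear as follows. The sketch assumes a batch of at most 64 values of 64 bits each, so that every bit position of the whole batch fits into one machine word. The function names are hypothetical, and a plain ripple of the carry is used in place of the secure network to keep the example short.

```rust
/// Transpose up to 64 values into 64 bit slices:
/// bit k of slices[i] is bit i of values[k].
fn to_slices(values: &[u64]) -> [u64; 64] {
    assert!(values.len() <= 64);
    let mut slices = [0u64; 64];
    for (k, v) in values.iter().enumerate() {
        for i in 0..64 {
            slices[i] |= ((*v >> i) & 1) << k;
        }
    }
    slices
}

/// Inverse of `to_slices` for a batch of `len` values.
fn from_slices(slices: &[u64; 64], len: usize) -> Vec<u64> {
    (0..len)
        .map(|k| (0..64).fold(0u64, |v, i| v | (((slices[i] >> k) & 1) << i)))
        .collect()
}

/// Add two batches element-wise using only word-level XOR and AND on the
/// slices. Each word operation acts on one bit position of every element.
fn batch_add(a: &[u64], b: &[u64]) -> Vec<u64> {
    assert_eq!(a.len(), b.len());
    let (sa, sb) = (to_slices(a), to_slices(b));
    let mut sum = [0u64; 64];
    let mut carry = 0u64; // one carry bit per batch element
    for i in 0..64 {
        let (p, g) = (sa[i] ^ sb[i], sa[i] & sb[i]);
        sum[i] = p ^ carry;
        carry = g ^ (p & carry);
    }
    from_slices(&sum, a.len())
}
```

In the transposed layout a single AND on two words performs the same bit product for every element of the batch, so batching adds no rounds and only increases the amount of data processed per round.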
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
The genomic dataset was available upon request during the iDASH 2019 competition.
Competing interests
The authors declare that they have no competing interests.
Funding
Rafael Dowsley is supported by the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Bureau in the Prime Minister’s Office.
Authors’ contributions
All authors worked together on the overall design of the solution in the clear and in private. DR designed the new cryptographic protocols for secure batch bit-decomposition and secure activation function. DR implemented the entire solution in the Rust programming language. DR was responsible for running the experiments of our work, and AT was responsible for running the experiments on SecureML. RD verified and wrote the functionality and security proofs of our protocols. JS provided in-the-clear model testing, and worked on the submission details to the iDASH competition. All authors discussed results and wrote the manuscript together. All authors have read and approved the manuscript.
Acknowledgements
Not applicable.