When Homomorphic Cryptosystem Meets Differential Privacy: Training Machine Learning Classifier with Privacy Protection

12/06/2018 ∙ by Xiangyun Tang, et al. ∙ Beijing Institute of Technology

Machine learning (ML) classifiers are invaluable building blocks that have been used in many fields. A high-quality training dataset collected from multiple data providers is essential to training accurate classifiers. However, this raises concerns about data privacy due to the potential leakage of sensitive information in the training dataset. Existing studies have proposed many solutions to privacy-preserving training of ML classifiers, but it remains a challenging task to strike a balance among accuracy, computational efficiency, and security. In this paper, we propose Heda, an efficient privacy-preserving scheme for training ML classifiers. By combining homomorphic cryptosystems (HC) with differential privacy (DP), Heda obtains a tradeoff between efficiency and accuracy and enables flexible switching among different tradeoffs by parameter tuning. In order to make such a combination efficient and feasible, we present novel designs based on both HC and DP: a library of building blocks based on partially HC is proposed to construct complex training algorithms without introducing a trusted third party or computational relaxation; a set of theoretical methods is proposed to determine the appropriate privacy budget and to reduce sensitivity. Security analysis demonstrates that our solution can construct complex ML training algorithms securely. Extensive experimental results show the effectiveness and efficiency of the proposed scheme.


I Introduction

Machine learning (ML) classifiers are widely used in many fields, such as spam detection, image classification, and natural language processing. Many studies have modeled user data and obtained satisfactory classifiers that meet accuracy requirements [19, 36]. The accuracy of a classifier obtained from supervised learning is closely related to the quality of the training dataset, in addition to well-designed ML algorithms. An experimental study with a dataset of 300 million images at Google [29] demonstrates that the performance of classifiers increases as the order of magnitude of training data grows. However, training datasets are usually held by multiple data providers and may contain sensitive information, so it is important to protect data privacy during the training of ML classifiers.

Consider the typical training process depicted in Figure 1. There are multiple data providers and a single data user. Upon receiving a dataset request from the data user, each data provider applies privacy-preserving mechanisms (e.g., encryption or perturbation) to its own dataset. Then, the data user trains an ML classifier on the data gathered from the multiple data providers. During this process, no data provider may learn the classifier, and the data user may not learn any sensitive information from the shared data.

More specifically, consider the following example of an advertisement recommendation task. In order to attract more consumers, a company wants to build a classifier to discern the most appropriate time for advertising. The training dataset used for constructing the classifier is extracted from consumer purchase behavior data recorded by several online shopping sites. The consumer data is confidential because it contains sensitive information about consumers. The online shopping sites agree to share their data with the company, but refuse to reveal any private information about their consumers. The company wants to construct a classifier based on the consumer data, but is unwilling to reveal the classifier to the online shopping sites. Ideally, the online shopping sites and the company run a privacy-preserving training algorithm, at the end of which the company learns the classifier parameters, and neither party learns anything else about the other party's input.

Fig. 1: Application Scenario. Each non-shaded rectangle represents a type of role. Each shaded box indicates private data that should be accessible to only one party: a protected dataset to a data provider, and the model to the data user. Each solid arrow indicates an algorithm or a process.

In general, supervised ML classifiers involve two phases: the training phase and the classification phase. A series of secure schemes for the classification phase have been proposed [6, 10]. In this paper, we focus on the training phase, that is, privacy-preserving training of ML classifiers (in this paper, ML classifiers and ML models are used interchangeably).

Existing solutions to training ML classifiers securely roughly rely on three types of techniques, namely secure multi-party computation (SMC), homomorphic cryptosystems (HC), and differential privacy (DP). SMC can construct many classifiers in theory, but it relies on a trusted third party to provide random numbers and results in a large number of interactions and redundant computations for protecting data privacy [15, 23]. HC (in this paper, we only consider partially HC due to the computational inefficiency of fully HC) allows operations on ciphertexts to be mapped to the corresponding plaintexts. Secure training solutions based on HC [17, 11] may suffer from low efficiency. In addition, since a partially HC only enables a single type of operation (e.g., addition or multiplication), HC-based solutions for training complex ML classifiers usually introduce a trusted third party (e.g., an authorization server [17, 10]) or use an approximate equation that simplifies the complex iteration formula [2, 3]. DP can resist an attacker with maximal background knowledge [5]; it ensures the security of the published data by adding noise. The computational efficiency of operations on perturbed data is significantly higher than that of operations on ciphertext [5, 9]. Nevertheless, the quality of the published dataset is reduced due to the introduced noise, and the accuracy of the resulting classifiers inevitably decreases.

As discussed above, HC is inefficient due to ciphertext-based computation but can obtain a classifier with lossless accuracy, whereas DP has high computational efficiency but leads to an inevitable loss of accuracy. Intuitively, we can take the strengths of HC and DP by adopting them simultaneously.

However, HC and DP are completely different systems: one encrypts data, and the other perturbs data. It is a challenging task to combine them. In particular, a partially HC only supports one type of operation, which sets a barrier to training ML classifiers that involve complex operations such as the power function, division, and square root. Furthermore, the noise added to sensitive data in DP determines both the accuracy of the classifiers and the privacy of the published data, so the third challenge is how to achieve high accuracy while ensuring privacy in DP.

In this paper, we propose Heda, an efficient privacy-preserving scheme for training ML classifiers. By combining HC with DP, Heda obtains a tradeoff between efficiency and accuracy and enables flexible switching among different tradeoffs by parameter tuning. Security analysis demonstrates that our building blocks can construct complex ML training algorithms. Extensive experimental results show the effectiveness and efficiency of the proposed scheme.

We address the above challenges by developing a set of key techniques.

We make the observation that different features in a dataset usually contribute differently to the accuracy of classifiers [21, 31] (without loss of generality, when facing the same training task, we assume that every dataset has been locally preprocessed and represented with the same feature vectors [32, 17]). For the features with high contributions, we apply HC so that the model parameters obtained from them are as accurate as possible. We apply DP to the remaining features to improve the computational efficiency. The contribution of each feature to training ML classifiers can be evaluated using readily available techniques [21].

To address the second challenge, we employ two homomorphic encryption primitives: the multiplicatively homomorphic RSA and the additively homomorphic Paillier. We carefully design a library of building blocks supporting complex operations such as the power function and the dot product, which can handle ML classifiers with complex training operations. We take Logistic Regression (LR) as an example to illustrate the power of our building blocks. The sigmoid function in the iterative formula of LR makes it difficult to construct a secure LR training algorithm based on HC; to the best of our knowledge, this is the first time a secure LR training algorithm has been constructed by HC without an authorization server or any approximation. (Section VI)

In the face of the third challenge, we develop a formal method to determine a reasonable privacy budget, and we reduce the sensitivity by using insensitive microaggregation. In this way we reduce the added noise and improve the usability of the noisy dataset published under DP. (Section V)

To the best of our knowledge, this is the first study that achieves privacy-preserving training of ML classifiers by jointly applying HC and DP in a single scheme. The rest of this paper is organized as follows. Section II describes related work, and Section III provides the background. Section IV describes the problem statement. Section V and Section VI present the designs based on DP and HC, respectively. Section VII describes the construction of Heda in detail. The security analysis is presented in Section VIII, and the evaluation results are provided in Section IX. Section X concludes this paper.

II Related Work

Our work is related to secure ML classifier algorithms, which can be broadly divided into two categories: privacy-preserving classification and privacy-preserving training. We review the literature on both subjects. Because Heda jointly applies HC and DP, and there are studies that combine HC with DP (although not for secure classifier training), we also discuss these works. We conclude with an analysis of our novelty.

II-A Privacy-Preserving ML Classification

A series of techniques have been developed for privacy-preserving ML classification. Wang et al. [33] proposed an encrypted image classification algorithm based on a multi-layer extreme learning machine that is able to classify encrypted images directly without decryption. They assumed the classifier had already been trained and that the classifier was not confidential. Graepel et al. [18] constructed several secure classification algorithms by HC, but the parameters of the trained classifiers are not confidential to the classifier users. Zhu et al. [37] proposed a secure nonlinear kernel SVM classification algorithm, which is able to keep users' health information and the healthcare provider's prediction model confidential.

Several works have designed general (non-application-specific) privacy-preserving protocols and explored a set of common classifiers using HC [6, 10]. Classification algorithms are usually simpler than training algorithms, so building blocks that suffice for classification algorithms can be powerless for complex training algorithms.

II-B Privacy-Preserving ML Classifier Training

Three techniques have been applied to privacy-preserving ML classifier training: SMC, HC, and DP. Constructing secure classifier training algorithms based on SMC relies on a large number of interactions and many redundant calculations to protect privacy, and it generally needs an authoritative third party to provide random number distribution services as well. In addition, existing practical SMC protocols for generic functions rely on heavy cryptographic machinery; applying them directly to model training algorithms would be inefficient [23, 4, 35].

HC is able to compute using only encrypted values. Employing HC, many secure algorithms have been developed for specialized ML training algorithms such as Support Vector Machine (SVM) [17, 27], LR [11, 18], decision trees [30], and Naive Bayes [24]. However, a partially HC only enables a single type of operation (e.g., addition or multiplication). In order to construct complex training algorithms, HC-based schemes usually need to rely on trusted third parties such as an Authorization Server [17, 10], or use an approximate equation to simplify the original complex iteration formula into a simple one [2, 3]. González et al. [17] developed a secure addition protocol and a secure subtraction protocol to construct a secure SVM training algorithm by employing Paillier, while some operations that are not supported by Paillier had to be implemented with the assistance of the Authorization Server in their scheme. Existing secure LR training algorithms implemented by HC are actually linear regression [11, 18], because the sigmoid function contains the power function and a division operation, which makes LR training algorithms harder to implement with HC than other ML training algorithms. Several works handled the sigmoid function with an approximate equation [2, 3].

Many secure ML classifier training algorithms have been explored with DP, such as decision trees [5], LR [9], and deep learning [1]. Blum et al. [5] proposed the first DP-based decision tree training algorithm on the SuLQ platform. Abadi et al. [1] applied DP perturbation in a deep learning algorithm, where noise was added to every step of the stochastic gradient descent. Due to the noise introduced by DP mechanisms, the quality of the datasets is reduced, and the accuracy of the trained models inevitably decreases. So the essential challenge for DP-based frameworks is to guarantee accuracy by reducing the added noise, especially for operations with high sensitivity [38]. According to the Laplace mechanism (cf. Definition 4), the privacy budget ε and the sensitivity Δf are the two important factors affecting the added noise. In many papers, the value of ε is merely chosen arbitrarily or assumed to be given [5, 1]. Lee et al. [22] explored the selection rules for ε, but they did not give a way to determine the value of the privacy budget. Soria et al. [28] proposed an insensitive microaggregation-based DP mechanism; they found that the amount of noise required to fulfill ε-DP can be reduced by insensitive microaggregation. Heda develops the insensitive microaggregation-based DP mechanism further and decreases the amount of noise required to fulfill ε-DP again.

II-C Combining Homomorphic Cryptosystems with Differential Privacy

Several works have studied combining HC with DP to solve specific security problems. Pathak et al. [26] proposed a scheme for composing a DP aggregate classifier from classifiers trained locally by separate, mutually untrusting parties, where HC was used for composing the trained classifiers. Yilmaz et al. [34] proposed a scheme for optimal location selection utilizing HC as the building block and employing DP to formalize privacy in statistical databases. Aono et al. [3] constructed a secure LR training algorithm via HC and achieved DP to protect the model parameters. These works generally construct a secure algorithm via HC and use DP to protect the algorithm results. As discussed above, constructing a secure algorithm via HC is inefficient, and a secure algorithm based on DP suffers an inevitable loss in accuracy. We aim to construct a secure classifier training algorithm jointly applying HC and DP in a single scheme to obtain a tradeoff between efficiency and accuracy.

II-D Novelty of Our Construction

Secure training algorithms based on HC have to handle datasets in ciphertext, where the time consumption is considerable but the accuracy can be guaranteed. Noisy datasets published by a DP mechanism are in plaintext; it is efficient to train a model on plaintext, but using the noisy dataset may lead to low accuracy. HC and DP thus have drawbacks as well as merits.

Heda takes the strengths of HC and DP to obtain a high-efficiency and high-accuracy privacy-preserving ML classifier training algorithm. Heda is the first to combine these two techniques to construct a privacy-preserving ML classifier training algorithm in a multi-party setting, where feature evaluation techniques are employed to determine how the two are combined. By combining HC with DP, Heda obtains a tradeoff between efficiency and accuracy and enables flexible switching among different tradeoffs by parameter tuning. Moreover, we develop a library of building blocks based on HC that is able to construct complex training algorithms; using these building blocks, this is the first time the sigmoid function has been handled in secure LR training based on HC without any approximate equation. We also extend the works of Lee et al. [22] and Soria et al. [28], giving a formula to determine the appropriate privacy budget and a solution with lower sensitivity.

III Preliminaries

III-A Notation

A dataset D is an unordered set of n records, with size |D| = n. (x_i, y_i) is the i-th record in dataset D, where x_i ∈ R^d, and y_i is the class label corresponding to x_i. x denotes the record set and y the label set. ω and b are the relevant parameters of the model trained by an ML algorithm. D^j is the subset corresponding to the j-th attribute in D. S is the vector of scores assigned to the features.
Cryptosystems define a plaintext space M and a ciphertext space C. In Heda, we employ two public-key cryptosystems, Paillier and RSA. [[m]] and [m] denote the ciphertext of m under Paillier and RSA, respectively.
DP is generally achieved by a randomized algorithm A. ε is the privacy budget in a DP mechanism. A query f maps dataset D to an abstract range. The maximal difference in the results of the query f is defined as the sensitivity Δf. D′ is a neighboring dataset of D.
Table I summarizes the notations used in the following sections.

Notation      Explanation
R             Set of real numbers
R^d           d-dimensional real vectors
D             Dataset
|D|           The size of D
n             Size of dataset
D′            Neighbor dataset
x             The record set in D
y             The label set in D
x_i           i-th record in the dataset
y_i           Class label
f             Query (functional operation)
d             Dataset dimension
M             Plaintext space
C             Ciphertext space
ω, b          Parameters of models
A             Randomized mechanism
k             The cluster size
p, q          n-bit primes
ε             Privacy budget
η             Noise
Δf            Sensitivity
[[m]]         Ciphertext under Paillier
[m]           Ciphertext under RSA
t             The number of encrypted features in D
S             The scores of features
Enc(m)        The encryption of m under a certain cryptosystem
D^j           The subset of the j-th attribute in D
TABLE I: Notations

III-B Homomorphic Cryptosystem

Cryptosystems are composed of three algorithms: key generation (Gen), which generates the keys; encryption (Enc), which encrypts a secret message; and decryption (Dec), which decrypts a ciphertext. Public-key cryptosystems employ a pair of keys (pk, sk): the public key pk (the encryption key) and the private key sk (the decryption key). Some cryptosystems have a homomorphic property that allows a set of operations to be performed on encrypted data without knowledge of the decryption key. The formal definition is given in Definition 1.

Definition 1

(Homomorphic) [20]. A public-key encryption scheme (Gen, Enc, Dec) is homomorphic if for all n and all (pk, sk) output by Gen(1^n), it is possible to define groups M, C (depending on pk only) such that:
(i) The message space is M, and all ciphertexts output by Enc_pk are elements of C.
(ii) For any m_1, m_2 ∈ M, any c_1 output by Enc_pk(m_1), and any c_2 output by Enc_pk(m_2), it holds that Dec_sk(c_1 · c_2) = m_1 · m_2.

In Heda, we employ two public-key cryptosystems, Paillier and RSA. Paillier possesses an additively homomorphic property, and RSA possesses a multiplicative one. For more details about Paillier and RSA, we refer the reader to [20].

Paillier. The security of Paillier is based on an assumption related to the hardness of factoring. Assuming a pair of ciphertexts [[m_1]] and [[m_2]] are under the same Paillier encryption scheme with public key (N, g), we have [[m_1]] · [[m_2]] = (g^{m_1} r_1^N)(g^{m_2} r_2^N) = g^{m_1 + m_2}(r_1 r_2)^N mod N^2, where r_1 and r_2 are the random values used for encryption. The additively homomorphic property of Paillier can be described as [[m_1]] · [[m_2]] = [[m_1 + m_2]].

RSA. Based on the notion of a one-way trapdoor function, RSA gives the actual implementation of the first public-key cryptosystem. RSA is a multiplicative HC because [m_1] · [m_2] = m_1^e · m_2^e = (m_1 · m_2)^e mod N, where e is the public exponent. The multiplicative homomorphic property of RSA can be described as [m_1] · [m_2] = [m_1 · m_2].
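To make the two homomorphic properties concrete, the following minimal Python sketch (not the paper's implementation; toy key sizes, textbook RSA without padding, for illustration only) checks that Paillier ciphertexts add under multiplication modulo N^2 and that RSA ciphertexts multiply under multiplication modulo N.

```python
from math import gcd
import random

# ---- toy Paillier (additively homomorphic), illustration only ----
p, q = 293, 433                                  # tiny primes, insecure on purpose
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lambda = lcm(p-1, q-1)
g = n + 1
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)              # mu = (L(g^lambda mod n^2))^-1 mod n

def paillier_enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2   # c = g^m * r^n mod n^2

def paillier_dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = paillier_enc(15), paillier_enc(27)
assert paillier_dec((c1 * c2) % n2) == 15 + 27    # Enc(a)*Enc(b) decrypts to a+b
assert paillier_dec(pow(c1, 3, n2)) == 3 * 15     # Enc(a)^k decrypts to k*a

# ---- toy textbook RSA (multiplicatively homomorphic), illustration only ----
rp, rq = 241, 307
rn, phi = rp * rq, (rp - 1) * (rq - 1)
e = 65537                                         # coprime to phi for these toy primes
d = pow(e, -1, phi)
rsa_enc = lambda m: pow(m, e, rn)
rsa_dec = lambda c: pow(c, d, rn)

a, b = 12, 5
assert rsa_dec((rsa_enc(a) * rsa_enc(b)) % rn) == a * b   # Enc(a)*Enc(b) decrypts to a*b
print("additive (Paillier) and multiplicative (RSA) homomorphism hold")
```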

III-C Differential Privacy

Definition 2

(Neighbor Dataset) [5]. The datasets D and D′ have the same attribute structure, and the symmetric difference between them is denoted as D Δ D′. We call D and D′ neighbor datasets if |D Δ D′| = 1.

Definition 3

(ε-Differential Privacy) [5]. A randomized mechanism A gives ε-DP if for every set of outputs O, and for any neighbor datasets D and D′, A satisfies: Pr[A(D) ∈ O] ≤ e^ε · Pr[A(D′) ∈ O].

A smaller ε represents a stronger privacy level [38]. When ε is equal to 0, for any neighbor datasets, the randomized mechanism A will output two results with the same probability distribution, which cannot reflect any useful information. If ε is selected as too large a value in a DP mechanism, it does not mean that privacy is actually enforced by the mechanism. A composition theorem for ε, named parallel composition (Theorem 1), is widely used.

Theorem 1

(Parallel Composition) [25]. Suppose we have a set of privacy mechanisms A = {A_1, ..., A_m}. If each A_i provides an ε_i-DP guarantee on a disjoint subset of the entire dataset, A provides max{ε_1, ..., ε_m}-DP.

The Laplace mechanism (Definition 4) is the basic DP implementation mechanism and is suitable for numerical data; it adds independent noise following the Laplace distribution to the true answer.

Definition 4

(Laplace mechanism) [14]. For a dataset D and a query function f with sensitivity Δf, the privacy mechanism A(D) = f(D) + η provides ε-DP, where η represents the noise sampled from a Laplace distribution with scale Δf/ε, i.e., η ∼ Lap(Δf/ε).

Definition 5

(Sensitivity) [5]. For a query f and a pair of neighbor datasets (D, D′), the sensitivity of f is defined as Δf = max_{D, D′} ||f(D) − f(D′)||_1. Sensitivity is only related to the type of query f; it considers the maximal difference between the query results.
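As a small illustration of Definition 4, the sketch below (a hypothetical query and bounds, not from the paper) adds Laplace noise with scale Δf/ε to a query answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, delta_f, epsilon):
    """Definition 4: release f(D) + eta, with eta ~ Lap(delta_f / epsilon)."""
    return true_answer + rng.laplace(loc=0.0, scale=delta_f / epsilon)

# Example query: the mean of an attribute known to lie in [0, 1] over n records.
# Replacing one record changes this mean by at most 1/n, so delta_f = 1/n.
values = rng.uniform(0.0, 1.0, size=1000)
print(laplace_mechanism(values.mean(), delta_f=1.0 / len(values), epsilon=0.5))
```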

IV Problem Statement

We are devoted to addressing the problem of securely training an ML classifier using protected private data gathered from different data providers. In this section, we give an overview of the system model and the roles involved in Heda. Then, we formally define the threat model and the security goal.

IV-A System Model

We target the application scenario illustrated in Figure 1. There are several data providers and one data user in our model. Each data provider holds its own sensitive dataset and a pair of keys (pk, sk), and protects its sensitive data by applying privacy-preserving mechanisms (e.g., the DP mechanism and HC). The data user holds his own keys (pk_u, sk_u). After obtaining permission, the data user requests the sensitive data from the data providers, and each data provider returns the protected data. By running a sequence of secure interactive protocols with several data providers, the data user obtains the classifier parameters encrypted under his own keys.

As discussed in Sections I and II, HC is able to construct accurate secure training algorithms, and the DP mechanism provides highly efficient secure training algorithms. However, constructing a secure ML training algorithm by HC is inefficient, and the model may be poor in accuracy if the training data is protected under the DP mechanism. We therefore want to take the strengths of HC and DP, and feature evaluation techniques are used to provide the right combination method. The overall idea of Heda is as follows:
1) Each data provider scores all features by feature evaluation techniques and divides its dataset into two parts according to the scores (see Section VII-A).
2) Each data provider applies privacy-preserving mechanisms to the two parts respectively: the low-scores part is published under the DP mechanism (see Section V); the high-scores part is encrypted by HC (see Section VI).
3) Upon receiving the query requests, each data provider sends the protected data to the data user.
4) The data user trains an ML classifier on these two protected sub-datasets (see Section VII-C).

IV-B Threat Model

The data user interacts with several data providers to obtain the protected data and performs training algorithms on the data. Each data provider tries to learn as much of the other data providers' sensitive data and of the data user's trained classifier as possible while honestly executing the pre-defined protocols. The data user follows the protocol honestly, but tries to infer the data providers' sensitive data as much as possible from the values he learns. As discussed above, we assume each participant is a passive (or honest-but-curious) adversary [16]: it follows the protocols but tries to infer others' privacy as much as possible from the values it learns.

IV-C Security Goal

In Heda, we allow any two or more parties to collude in order to steal the privacy of other participants. We make the following assumptions: each participant, as an honest-but-curious adversary, performs the protocol honestly but may be interested in the private information of other parties; any two or more participants may collude with each other. As passive adversaries, they follow the protocol but try to infer others' privacy as much as possible from the values they learn.

The aim of Heda is to keep the privacy of each participant and to compute the model parameters securely when facing honest-but-curious adversaries or any collusion. To be specific, the privacy of the data user is the model parameters, and the privacy of each data provider is its sensitive data. We specify our security goals as follows:

  1. When facing honest-but-curious adversaries, the data user's privacy and each data provider's privacy remain confidential.

  2. When facing any two or more parties colluding with each other, the data user's privacy and each data provider's privacy remain confidential.

V Accuracy and Privacy Design with Differential Privacy

DP ensures the security of the published data by adding noise. With insufficient noise, the security of the published data cannot be guaranteed, while excessive noise makes the data unusable. Obviously, the key to using DP in secure classifier training is to reduce the added noise while ensuring the security of the published data.

The two important parameters that determine the added noise are ε and Δf (cf. Definition 4). A bigger ε or a smaller Δf reduces the added noise. However, if ε is selected as too large a value, although the system has been built upon the DP framework, it does not mean that privacy is actually enforced by the system. Therefore, ε must be chosen in combination with specific requirements to achieve a balance between the security and the usability of the output results. On the other hand, Δf is only determined by the type of query function (cf. Definition 5).

In this section, we develop a formula for reasonably determining the appropriate ε in a DP mechanism, and we reduce Δf by using insensitive microaggregation.

V-A Selection of an Appropriate ε

In many papers, ε is chosen arbitrarily or assumed to be given, while the decision on ε should be made carefully with consideration of the domain and the acceptable range of disclosure risk. Lee et al. [22] explored the rule for ε, but they did not give a specific method for determining it. We give a method for determining ε. It is worth noting that, based on different criteria and backgrounds, ε can have different values, and we are trying to give a general one.

We follow some notation of Lee et al. [22]: an adversary who knows all the background knowledge tries to guess which record differs between D and D′. For each possible candidate record, the adversary maintains a prior belief and, for a given query response, a posterior belief that this candidate is the differing one. Lee et al. [22] obtain the upper bound of ε through a series of derivations as Formula 1 (cf. Section V in [22]),

(1)

where ρ is the probability that the adversary guesses successfully. Nevertheless, Lee et al. [22] did not give a method for setting ρ. We give a method for determining the upper bound of ρ (Proposition 1).

Proposition 1 (the upper bound of ρ for ε)

Let D^j be the subset of the j-th attribute in dataset D, and let c_j be the number of occurrences of the value with the highest frequency in D^j. Then c_j/n is the upper bound of ρ.

Proof 1 (Proof of Proposition 1)

ρ is the probability that the adversary successfully guesses which instance is the different one between D and D′. The DP mechanism assumes that the adversary has strong background knowledge, that is, he knows the value of each instance in D. The value with c_j occurrences is the most frequent one in D^j, so guessing it gives the adversary the highest probability of success. After the DP mechanism is applied, the adversary's probability of success should not be greater than the highest probability of a successful random guess, so the upper bound of ρ is c_j/n.

The upper bound of ε is obtained from each subset D^j by Formula 1; the dataset then provides ε-DP according to Theorem 1. Algorithm 1 details the steps for generating the appropriate ε on dataset D.

1: Input: D.
2: Output: The appropriate ε on dataset D.
3: for j = 1 to d do
4:     for i = 1 to m do
5:         Compute c_j and ρ_j in D^j.
6:     Obtain ε_j by Formula 1.
7: return ε.
Algorithm 1 Generating an Appropriate Value of ε
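A hedged sketch of Algorithm 1 follows; function and variable names are illustrative only. Only Proposition 1 is made concrete: the adversary's best guessing probability for attribute D^j is bounded by the relative frequency of its most common value. The mapping from that bound to an ε upper bound (Formula 1 of Lee et al. [22]) is deliberately left as a placeholder, since the formula itself is not reproduced here; how the per-attribute budgets compose is governed by Theorem 1.

```python
from collections import Counter

def rho_upper_bound(column):
    """Proposition 1: count of the most frequent value divided by the column size."""
    counts = Counter(column)
    return max(counts.values()) / len(column)

def epsilon_from_rho(rho):
    """Placeholder for Formula 1 (Lee et al. [22]); substitute the paper's bound here."""
    raise NotImplementedError("plug in Formula 1 from Lee et al. [22]")

def per_attribute_epsilons(columns):
    """One epsilon upper bound per attribute D^j. By Theorem 1 (parallel composition),
    mechanisms run on disjoint attribute subsets compose to the maximum of their budgets."""
    return [epsilon_from_rho(rho_upper_bound(col)) for col in columns]
```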

V-B Reducing Δf by Insensitive Microaggregation

According to Definition 4, the smaller the Δf, the less noise is added, and thereby the more usable the data is. In this subsection, we detail the solution for reducing Δf in Heda. The amount of noise required to fulfill ε-DP can be greatly reduced if the query is run on an insensitive microaggregation version of all attributes instead of on the raw input data [28].

V-B1 What is insensitive microaggregation

Microaggregation is used to protect microdata releases and works by clustering groups of individuals and replacing them by the group centroid. DP makes no assumptions about the adversary's background knowledge. Combining microaggregation with DP can help increase the utility of DP query outputs while making as few assumptions on the type of queries as microaggregation does [28]. However, if we modify one record in D, more than one cluster will generally differ from the original clusters. According to Definition 3, we expect that if we modify one record in D, each pair of corresponding clusters differs in at most a single record. Microaggregation that satisfies this property is called insensitive microaggregation (IMA). Soria et al. [28] give a formal definition of IMA: microaggregation is insensitive to the input data if and only if the distance function used for clustering induces a fixed sequence of total order relations defined over the domain of D [28].

The sequence of total orders is determined by a sequence of reference points. The reference points are the two boundary points P_1 and P_2, where P_1 takes the minimum value of every attribute and P_2 takes the maximum value of every attribute. The total order relation between two points x and x′ in Heda is: x precedes x′ if dist(x, P_1) ≤ dist(x′, P_1). Generating an IMA dataset is detailed in Algorithm 2.

1: Input: D; k is the cluster size.
2: Output: An IMA dataset on which DP can be performed.
3: Set i = 1.
4: while |D| ≥ 2k do
5:     Compute the boundary points P_1 and P_2.
6:     Form cluster C_i from the k nearest instances to P_1 in D according to the total order, remove them from D, and set i = i + 1.
7:     Form cluster C_i from the k nearest instances to P_2 in D according to the total order, remove them from D, and set i = i + 1.
8: Form the last cluster from the remaining records.
9: Compute the centroid of each cluster and use it to replace the records in that cluster.
10: return the IMA dataset.
Algorithm 2 Generating an IMA Dataset
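The sketch below illustrates Algorithm 2 under one concrete reading, with illustrative function names: a fixed total order induced by the distance to the lower boundary point, groups of k records taken alternately from the two extremes, leftovers forming the last cluster, and every record replaced by its cluster centroid.

```python
import numpy as np

def ima_clusters(X, k):
    """Split record indices into clusters of size k taken from the two ends of a fixed
    total order (distance to the component-wise minimum); leftovers form the last cluster."""
    p1 = X.min(axis=0)                                        # lower boundary point P_1
    order = np.argsort(np.linalg.norm(X - p1, axis=1), kind="stable")
    left, right, clusters = 0, len(order), []
    while right - left >= 2 * k:
        clusters.append(order[left:left + k])                 # k records nearest to P_1
        clusters.append(order[right - k:right])               # k records nearest to P_2
        left, right = left + k, right - k
    if right > left:
        clusters.append(order[left:right])                    # fewer than 2k remaining records
    return clusters

def ima_dataset(X, k):
    """Replace every record by the centroid of its cluster (the microaggregated release)."""
    Xa = X.astype(float).copy()
    for idx in ima_clusters(X, k):
        Xa[idx] = X[idx].mean(axis=0)
    return Xa
```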

V-B2 Determining the sensitivity

Following Definition 5, the sensitivity is determined on the original dataset D, while the sensitivity on an IMA dataset is smaller, which is formalized in Proposition 2. Algorithm 3 details the construction of our DP mechanism.

Proposition 2 (Δf in IMA)

Let f be a query function under an ε-DP mechanism returning the noised values corresponding to the j-th attribute of D. After obtaining the IMA dataset by Algorithm 2, the sensitivity of f with cluster size k is Δf = ⌈n/(2k)⌉ · Δ(D^j)/k, where Δ(D^j) is the maximal change of a single value in D^j.

Proof 2 (Proof of Proposition 2)

If the microaggregation algorithm is an IMA algorithm, then for every pair of datasets D and D′ differing in a single record, there is a bijection between the two sets of clusters such that each pair of corresponding clusters differs in at most a single record. So if the centroid is computed as the mean of the records in the same cluster, the maximum change in any centroid is at most Δ(D^j)/k. The modification of a single record may lead to modifications of the centroids of multiple clusters, and at most ⌈n/(2k)⌉ clusters are affected in D (⌈·⌉ denotes the ceiling function).
According to a distance function with a total order relation, the IMA algorithm iteratively takes sets of cardinality k from the extreme points until fewer than 2k records are left; these remaining records form the last cluster, which lies at the center of the total order sequence. Every cluster in D is ordered by the total order relation. A pair of databases differing in only one instance means the larger database contains just one additional row [13]. The numbers of clusters on the left and on the right of the last cluster are equal, as shown in Figure 2. If the differing record in D is on the left of the center and is located in some cluster, then the changed clusters are the clusters from that cluster to the center, and the maximum change for each changed cluster is Δ(D^j)/k; the clusters on the right side of the center will not change. The worst scenario occurs when the differing record is located in the leftmost cluster, which maximizes the number of changed clusters. The scenarios on the left and the right sides are symmetrical, so the number of changed clusters is at most ⌈n/(2k)⌉.

Fig. 2: Clusters in IMA

To make the sensitivity of the IMA dataset smaller than that of the original dataset, we let ⌈n/(2k)⌉ · Δ(D^j)/k ≤ Δ(D^j), from which we obtain the best cluster size k ≥ ⌈√(n/2)⌉. Soria et al. [28] assumed that D and D′ differ in a "modified" record: the modification changes the whole sequence obtained by the total order from the position of the modified record onwards, and they derived the sensitivity accordingly. However, Dwork et al. [13] state that a pair of adjacent datasets differ in only one row, meaning one is a subset of the other and the larger database contains just one additional row. Their sensitivity is therefore greater than ours, which reduces the usability of the dataset.

1: Input: D; k is the cluster size.
2: Output: An IMA ε-DP dataset.
3: Generate the appropriate ε on dataset D by Algorithm 1.
4: Obtain an IMA dataset from D by Algorithm 2.
5: Obtain the noise η by using ε and Δf (cf. the Laplace mechanism, Definition 4).
6: Add η to the IMA dataset.
7: return the IMA ε-DP dataset.
Algorithm 3 IMA ε-DP Mechanism
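Putting the pieces together, a sketch of Algorithm 3 might look as follows: microaggregate with cluster size k (the ima_dataset sketch above), then perturb each attribute with Laplace noise scaled by the reduced IMA sensitivity ⌈n/(2k)⌉·Δ(D^j)/k read from Proposition 2. The ε value is assumed to come from Algorithm 1; names are illustrative.

```python
import math
import numpy as np

def ima_dp_release(X, k, epsilon, rng=None):
    """IMA epsilon-DP release: centroids from Algorithm 2 plus per-attribute Laplace noise."""
    rng = rng or np.random.default_rng()
    Xa = ima_dataset(X, k)                         # microaggregated dataset (sketch above)
    n = len(X)
    changed = math.ceil(n / (2 * k))               # maximum number of affected clusters
    noisy = Xa.copy()
    for j in range(X.shape[1]):
        span = X[:, j].max() - X[:, j].min()       # Delta(D^j): largest change of one value
        sensitivity = changed * span / k           # Proposition 2
        noisy[:, j] += rng.laplace(0.0, sensitivity / epsilon, size=n)
    return noisy
```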

VI Privacy Design with Homomorphic Cryptosystem

A partially homomorphic encryption algorithm can only support one type of operation (e.g., addition or multiplication). Existing HC-based secure training algorithms need to rely on trusted third parties such as an Authorization Server [17, 10], or use an approximate equation to simplify the original complex iteration formula [2, 3].

We elaborately design a library of building blocks based on the multiplicatively homomorphic RSA and the additively homomorphic Paillier, which is able to construct complex secure ML training algorithms without needing an Authorization Server or any approximate equation. In order to illustrate the power of our building blocks, we construct a secure LR training algorithm (see Section VII-B); it is the first to handle the sigmoid function based on HC. In this section, we detail our building blocks. The security proofs for each building block are given in Section VIII.

ML training algorithms are computationally complex, so the building blocks need to support a range of choices, including which party gets the input, which party gets the output, and whether the input or output is encrypted. Table II shows the different conditions for the building blocks. Under all conditions, neither Alice nor Bob can obtain any useful information beyond what they are entitled to learn; the input and output of the other party remain confidential.

Conditions      Protocols
Condition 1     1, 2, 3, 5
Condition 2     1, 2, 4
Condition 3     6
Condition 4     7
TABLE II: The Conditions of Building Blocks

1) Secure addition and secure subtraction: Relying on Paillier's additively homomorphic property, it is straightforward to obtain the secure addition protocol (Protocol 1) and the secure subtraction protocol (Protocol 2).

1: Alice: a, (pk, sk).
2: Bob: b or [[b]].
3: Output (Bob): [[a + b]].
4: Alice sends [[a]] to Bob.
5: for i = 1 to d do
6:     Bob computes [[a_i + b_i]] = [[a_i]] · [[b_i]].
7: return [[a + b]] to Bob.
Protocol 1 Secure Addition Protocol
1: Alice: a, (pk, sk).
2: Bob: b or [[b]].
3: Output (Bob): [[a − b]].
4: Alice sends [[a]] to Bob.
5: for i = 1 to d do
6:     Bob computes [[a_i − b_i]] = [[a_i]] · [[b_i]]^{-1}.
7: return [[a − b]] to Bob.
Protocol 2 Secure Subtraction Protocol
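For intuition, here is how Bob's local computation in Protocols 1 and 2 can be realized with the toy Paillier functions from Section III-B: addition is a ciphertext product, and subtraction multiplies by the modular inverse of the subtrahend's ciphertext. This single-value sketch does not reproduce the exact message flow of the protocol listings.

```python
def secure_add(c_a, b_plain):
    """Bob turns Enc(a) and his plaintext b into Enc(a + b)."""
    return (c_a * paillier_enc(b_plain)) % n2

def secure_sub(c_a, b_plain):
    """Bob turns Enc(a) and his plaintext b into Enc(a - b) via the inverse ciphertext."""
    return (c_a * pow(paillier_enc(b_plain), -1, n2)) % n2

c = paillier_enc(40)
assert paillier_dec(secure_add(c, 7)) == 47
assert paillier_dec(secure_sub(c, 7)) == 33
```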

2) Secure dot product and secure multiplication: Using Paillier's additively homomorphic property, we can easily construct a secure dot product protocol (Protocol 3) that satisfies Condition 1 (i.e., [[w · x]] = ∏_i [[x_i]]^{w_i}). However, when Bob only has ciphertexts, he is unable to perform this computation, so Paillier fails to yield a secure dot product protocol that satisfies Condition 2. Moreover, because a Paillier ciphertext is usually very long (we discuss the key length setting in detail in Section IX), the computational complexity of exponentiating by a ciphertext-sized value would be awfully large. Therefore, when faced with Condition 2, we use RSA's multiplicative homomorphic property to construct the secure multiplication protocol (Protocol 4).

1: Alice: x, (pk, sk).
2: Bob: w and pk.
3: Output (Bob): [[w · x]].
4: Alice sends [[x]] to Bob.
5: Bob computes [[w · x]] = ∏_{i=1}^{d} [[x_i]]^{w_i}.
6: return [[w · x]] to Bob.
Protocol 3 Secure Dot Product Protocol
1: Alice: a, (pk, sk).
2: Bob: [b] and pk.
3: Output (Bob): [a · b].
4: Alice sends [a] to Bob.
5: Bob computes [a · b] = [a] · [b].
6: return [a · b] to Bob.
Protocol 4 Secure Multiplication Protocol
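Similarly, a sketch of Bob's computation in Protocol 3 (Paillier) and Protocol 4 (RSA), reusing the toy functions above: the dot product accumulates Enc(x_i)^{w_i}, and multiplication is a product of RSA ciphertexts. Inputs are assumed to be small non-negative integers for illustration.

```python
def secure_dot_product(enc_x, w):
    """Bob holds Paillier ciphertexts Enc(x_i) and plaintext weights w_i; he obtains
    Enc(w . x) as the product of Enc(x_i)^{w_i} mod n^2."""
    acc = paillier_enc(0)
    for c_xi, w_i in zip(enc_x, w):
        acc = (acc * pow(c_xi, w_i, n2)) % n2
    return acc

def secure_mult(enc_a, enc_b):
    """Bob holds two RSA ciphertexts and obtains the encryption of the product."""
    return (enc_a * enc_b) % rn

x, w = [3, 1, 4], [2, 5, 7]
assert paillier_dec(secure_dot_product([paillier_enc(v) for v in x], w)) == 2*3 + 5*1 + 7*4
assert rsa_dec(secure_mult(rsa_enc(6), rsa_enc(7))) == 42
```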

3) Secure power function: In order to cope with more complex training algorithms, we design Protocol 5, which satisfies Condition 1 under RSA and obtains an encrypted power securely.

1: Alice: x, (pk, sk).
2: Bob: his exponents and pk.
3: Output (Bob): the encrypted power.
4: Alice sends her ciphertexts to Bob.
5: Bob initializes the result.
6: In Bob:
7: for i = 1 to d do
8:     Let c_i be the i-th received ciphertext.
9:     for t = 1 to the i-th exponent do
10:         Update the intermediate result with c_i by Protocol 4.
11:     Accumulate the intermediate result into the final result by Protocol 4.
12: return the encrypted power to Bob.
Protocol 5 Secure Power Function Protocol

4) Securely changing the encryption cryptosystem: There are multiple participants in Heda. Different participants have their own encryption schemes (i.e., their own plaintext space and pair of keys (pk, sk)). Homomorphic operations can only be carried out in the same plaintext space. For completeness, we design two protocols that convert a ciphertext from one encryption scheme to another while maintaining the underlying plaintext: the first (Protocol 6) handles the conversion satisfying Condition 3, and the other (Protocol 7) handles the conversion satisfying Condition 4.

Proposition 3 (The security of building blocks)

Protocols 1 to 7 are secure in the honest-but-curious model.

1: Alice: (pk_A, sk_A) and pk_B.
2: Bob: the ciphertext of m to be converted.
3: Output (Bob): the ciphertext of m under the target scheme.
4: Bob uniformly picks a random blinding value and blinds the ciphertext by Protocol 5.
5: Bob sends the blinded ciphertext to Alice.
6: Alice decrypts it, obtains the blinded plaintext, re-encrypts it under the target scheme, and sends the result to Bob.
7: Bob removes the blinding by Protocol 3.
8: return the converted ciphertext to Bob.
Protocol 6 Converting Ciphertext (Condition 3)
1: Alice: (pk_A, sk_A) and the target public key.
2: Bob: the ciphertext of m to be converted.
3: Output (Bob): the ciphertext of m under the target scheme.
4: Bob uniformly picks a random blinding value and blinds the ciphertext by Protocol 1.
5: Bob sends the blinded ciphertext to Alice.
6: Alice decrypts it, obtains the blinded plaintext, re-encrypts it under the target scheme, and sends the result to Bob.
7: return the converted ciphertext to Bob.
Protocol 7 Converting Ciphertext (Condition 4)

VII Construction of Heda

In this section, we introduce the overall framework of Heda. In Heda, the data user is able to learn a model without learning anything about the sensitive data of the data providers, and no party other than the data user learns anything about the model. The security proof is given in Section VIII.

Heda is exhibited in Algorithm 4. We introduce the following details in this section: how to use feature evaluation techniques to divide a dataset into two parts (Algorithm 4, Step 2), how to construct a specific training algorithm using the building blocks (Algorithm 4, Step 4), and how to combine the DP mechanism with the building blocks (Algorithm 4, Step 4).

1: Input: the datasets, the data providers' keys (pk, sk), and the data user's keys (pk_u, sk_u).
2: Output: the model parameters (ω, b).
3: The data providers negotiate the feature scores S.
4: The data user initializes the model parameters.
5: According to S, each data provider divides all features into two parts: the high-scores part and the low-scores part.
6: Each data provider obtains the noisy dataset from the low-scores sub-dataset by Algorithm 3 and sends it to the data user.
7: The data user trains a model with the building blocks (Algorithm 5) combined with the noisy datasets.
8: return the model parameters to the data user.
Algorithm 4 Privacy-Preserving Training
Proposition 4 (The Security of Heda)

Algorithm 4 is secure in the honest-but-curious model.

VII-A Feature Partitioning

Obviously, it is best for each data provider to conduct the feature evaluation locally. The local computation does not require interaction with any other party, which guarantees the privacy of each data provider. In addition, each data provider can perform arbitrary computations on its sensitive data in plaintext with high efficiency. After the data providers who join Heda implement feature evaluation locally, they communicate with each other to negotiate the final scores S. Feature scores do not reveal the privacy of the datasets, so it is feasible for several data providers to share the scores of their datasets and negotiate the final feature scores.

According to the feature scores S, each data provider reorders its dataset by feature score, so that s_i is the score of the i-th feature and s_1 ≥ s_2 ≥ … ≥ s_d. Let the high-scores part contain the first t features; the remaining d − t features form the low-scores part.

We assume that the data user spends T_1 learning the classifier parameters on the low-scores part (the noisy dataset) and T_2 on the high-scores part (the encrypted dataset), so the total time is T = T_1 + T_2. Training a model in plaintext usually takes milliseconds, but training on ciphertext usually takes thousands of seconds or longer [10, 11]. There is a linear relationship between T_2 and t, i.e., T_2 = a · t + c, where a and c are two constants. The training time on the noisy dataset is much less than the time to train a model on the encrypted dataset. Formula 2 shows the total time consumption.

T = T_1 + a · t + c      (2)

Heda enables flexible switching among different tradeoffs between efficiency and accuracy by parameter tuning. With the parameter t, one is able to obtain the desired tradeoff between efficiency and accuracy: as t decreases, the total time consumption decreases accordingly, but when the number of dimensions assigned to the high-scores part is small, the accuracy is relatively low. According to the specific situation, t is set appropriately.

As for the selection of feature evaluation techniques, many excellent feature evaluation techniques have been studied [8, 21]. When facing datasets of different types, backgrounds, or magnitudes, different methods have their drawbacks as well as merits. We evaluate six widely used methods in our experiments: the Chi-square test, Kruskal-Wallis H (KW), Pearson correlation, Spearman correlation, Random Forest, and minimal Redundancy Maximal Relevance (mRMR). We are committed to finding the feature evaluation technique with the best robustness. After extensive experiments, we find that KW has the most stable effect when facing different datasets (see Section IX-D).
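A hedged sketch of the scoring-and-splitting step follows, assuming the Kruskal-Wallis H statistic as the score and a user-chosen number of top features routed to the HC path; the function and parameter names below are illustrative only.

```python
import numpy as np
from scipy.stats import kruskal

def kw_scores(X, y):
    """Kruskal-Wallis H statistic of each feature, with samples grouped by class label."""
    scores = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in np.unique(y)]
        scores.append(kruskal(*groups).statistic)
    return np.array(scores)

def partition_features(X, y, num_high):
    """Indices of the top-scoring features (to be encrypted with HC) and the rest (to be perturbed with DP)."""
    ranked = np.argsort(kw_scores(X, y))[::-1]
    return ranked[:num_high], ranked[num_high:]
```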

VII-B Constructing Specific Training Algorithms using Building Blocks

There is a rich variety of ML algorithms, and describing the implementation of the building blocks for each ML training algorithm would require space beyond the page limit. We use LR as an example (to maintain the continuity of our description, LR is also used as the example in the following) to illustrate how to construct secure model training algorithms with our building blocks.

The LR training algorithm is not the most complicated one compared to other ML classifier training algorithms. However, the iterative process of LR involves the sigmoid function σ(z) = 1/(1 + e^{−z}), which makes it difficult to implement on ciphertext. Most studies claimed to construct a secure LR training algorithm by HC but actually built secure linear regression training algorithms, or they handled the sigmoid function with an approximate equation (cf. Section II-B). To the best of our knowledge, ours is the first secure LR training algorithm constructed by HC. Our HC-based secure LR training algorithm needs only 3 interactions between the data user and a data provider throughout each iteration.

LR is a binary classifier that learns a pair of parameters ω and b, where ω ∈ R^d, so that the prediction for a record x is determined by ω^T x + b. LR uses the sigmoid function to associate the true label y with the predicted label ŷ = σ(ω^T x + b). Let β = (ω; b) and x̂ = (x; 1); the iteration formula of LR is shown in Formula (3). The steps of the LR training algorithm are as follows: (i) Initialize the learning rate α, a fixed number of iterations, and the model parameters β. (ii) Update β by Formula (3). (iii) If the maximum number of iterations or the minimum learning rate is reached, output β; otherwise, go to step (ii).

β ← β + α · Σ_{i=1}^{n} ( y_i − 1/(1 + e^{−β^T x̂_i}) ) · x̂_i      (3)

Each building block is designed in a modular way, so carrying out the secure LR training algorithm comes down to invoking the right modules. Suppose there are n data providers; Algorithm 5 specifies our secure LR training algorithm. In all execution steps of Algorithm 5, when protocols are called, the data provider plays the role of Alice, and the data user plays the role of Bob.
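For reference, the cleartext counterpart of the update that Algorithm 5 evaluates under encryption is sketched below: batch gradient ascent with the sigmoid, with the bias folded into β through a constant-1 feature. This is only a plaintext baseline; the secure version computes the same quantities with the building blocks instead of plaintext arithmetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_train_plaintext(X, y, alpha=0.1, iterations=200):
    """Plaintext LR training following Formula (3): beta <- beta + alpha * X^T (y - sigmoid(X beta))."""
    Xh = np.hstack([X, np.ones((len(X), 1))])   # append the constant-1 feature for the bias b
    beta = np.zeros(Xh.shape[1])
    for _ in range(iterations):
        beta += alpha * Xh.T @ (y - sigmoid(Xh @ beta))
    return beta
```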

Proposition 5

Algorithm 5 is secure in the honest-but-curious model.

1: Input: the datasets, the data providers' keys (pk, sk), and the data user's keys (pk_u, sk_u).
2: Output: the model parameters (ω, b).
3: The data user initializes a learning rate α, a fixed number of iterations, and β.
4: while the iteration count is within the fixed number of iterations or the minimum learning rate is not reached do
5:     The data user sends the protected current parameters to the data providers.
6:     for i = 1 to n do
7:         The data provider sends its encrypted data to the data user.
8:         The data user obtains the required ciphertext by Protocol 5 and Protocol 6 sequentially.
9:         The data user uniformly picks a random blinding value and computes the blinded ciphertext by Protocol 1 and Protocol 4 sequentially.
10:        The data user sends the blinded ciphertext to the data provider.
11:        The data provider decrypts it and sends the result to the data user.
12:        The data user obtains the corresponding ciphertext by Protocol 3.
13:        The data user obtains the corresponding ciphertext by Protocol 1 and Protocol 2 sequentially.
14:        The data user obtains the converted ciphertext by Protocol 7.
15:        The data user updates β by