1 Introduction
In machine learning, ensemble learning[19] refers to the strategy of combining multiple hypotheses to form a better one. The superior performance over a single base learning algorithm often leads to a better prediction. Many works have been conducted under the area and there are three conventional methods, the bagging[6], boosting[35, 16] and stacking[44]. Among all three, the boosting method focuses on training a set of base learners sequentially and assigning each base learner a weight, then the final decision is made by a weighted majority voting over all base learners. As an example, the Boosted Decision Tree (BDT) is of great popular and widely adopted in many different applications, like text mining[1], geographical classification[30] and finance[45]. Like other machine learning algorithm, the boosting algorithm is developed under the assumption of a centralized fashion, where all data has been collected together. However, in the big data era, the data is generated and stored more sparsely than before, which brings new challenges and difficulties for such algorithm, for instance, privacy protection is one of the most emerging demands in current society.
Data explosion results in tons of data are generated and held by the individual and entities, such as the personal image, financial records and census data. The privacy concern raises when the data leaves the hands of data owners and participates in some computations. The AOL search engine log[37] and Netflix prize contest[28] attacks prove such threat and show the needs that machine learning algorithms should be privacy aware. As a promising solution, Differential Privacy [14] gives a definition of the privacy and how it can be protected. A mechanism is said to be differentially private if the computation result of the data is robust to any change of any individual sample. And several differentially private machine learning algorithms[22] has been developed since then. A trusted third party is originally introduced to gather data from individual owners and is responsible to process the data privately. Recently, Local Differential Privacy (LDP)[10, 41] takes the concept into a local setting and a mechanism is defined to be local differentially private if the processing makes any two samples indistinguishable. An advantage with LDP is that it allows the data owners to perturb the input by themselves and the true values never leave the hands of the data owner. Thus unlike DP, there is no need of trusted third party anymore.
The indistinguishability of any two data samples brings a strong privacy guarantee for the data owners, and relieve the fear of information leakage. Thus developing the machine learning algorithm over such “private” data samples is of great value and importance. To the best of our knowledge, there is little work about developing the boosting algorithm with the protection of LDP. In this paper, we are eager to fill such gap by developing a privacypreserving boosting algorithm that satisfies LDP. More specifically, we consider such a problem: there are two types of parties, multiple data owners and a data user. Each data owner holds a set of training samples; the data user intends to fit a boosted classifier (e.g., BDT) with the samples from data owners. During the computation, the data owner perturbs their data using the mechanism that satisfies LDP and only pass such perturbed data to the data user. In the end, the data user should only learn the classifier without knowing or deriving the value of any individual sample from the classifier, thus the privacy of the data samples are protected.
In the meanwhile, the randomized perturbation mechanism brings noise into the training samples and the mechanism is said to satisfy LDP ( 0), where the smaller
is, the more privacy preserved and the more noise injected. To maintain the utility of the learned classifier, we rely on the statistical information of the perturbed samples (e.g., mean estimation), in which the adopted randomized mechanism provides an asymptotically optimal error bound. To demonstrate, we compare the utility of the learned classifier in terms of prediction capacity given different existing randomized mechanisms and show that the model learned by our algorithm effectively maintain a high utility.
Furthermore, beyond the widely adopted BDT classifier, the proposed boosting algorithm is capable to support other types of classifiers, as the original design of the boosting procedure. Thus we study other types of classifiers and analyze the strategies to support such classifiers in our boosting algorithm. In the experiment, three types of boosted classifiers are implemented and the performance of the classifiers are evaluated.
Overall, the contributions of our work are three folds:

We propose a privacypreserving boosting algorithm, and implement the widely adopted Boosted Decision Tree classifier. To the best of our knowledge, there are few works investigating the privacypreserving boosting mechanism with LDP protection and we provide a comprehensive study in this paper.

To maintain the utility, the learned classifier is built over the statistical information of the perturbed training samples, which results in an asymptotically optimal error bound. To demonstrate that, we compare multiple existing perturbation methods that satisfy LDP and show the superior performance of our method with both real and synthetic datasets.

Beyond the BDT, we also analyze how to support other classifiers with our boosting algorithm and summarize the type of data for corresponding classifiers, for instance, the Logistic Regression and Nearest Centroid Classifier. In the experiment, both boosted classifiers are also implemented and the utility of them are evaluated comprehensively.
The rest of the paper is organized as follow. The preliminary is in Section II. The problem definition and proposed solution are introduced in Section III. The experimental evaluation is given in Section IV. The related work is presented in Section V. Section VI provides the conclusion.
the dataset held by data owner  
the th sample and label held by data owner  
the value of the th sample of the th feature  
the perturbed output of  
number of sample held by data owner  
the number of data owners  
number of classes in the dataset  
dimensionality of the dataset  
probability of an event  
loss function  
number of base classifiers in the boosting algorithm  
the th base classifier fitted by data owner  
weights of the th samples  
the cross table of the th feature  
DP  differential privacy 
LDP  local differential privacy 
LR  logistic regression 
DT  decision tree 
BDT  boosted decision tree 
NCC  nearest centroid classifier 
MSE  mean squared error 
2 Preliminary
2.1 Differential Privacy
Differential Privacy (DP) attracts lots of attention in the privacy research community in the last decades, which provides a measurement of the information leakage of individual sample from the computation over an underlying dataset. Originally, DP considers the setting that a trusted data curator gathers data from multiple data owners and performs a computation from the data, like learn the mean value or find the maximum/minimum value. To ensure no one can reliably infer any individual sample from the computation result, the curator adds random noise to the result such that the released one would not change if any sample of the underlying data changes. Since no single sample can significantly affect the distribution, adversaries cannot infer the information corresponding to any individual sample confidently. Formally, given two data databases , it is said that are neighbors if they differ on at most one row. The definition of a differential private mechanism over is defined below:
Definition 1 ()Differential Privacy[14, 13]: A randomized mechanism is differentially private if for every two neighboring databases and for any ,
(1) 
where denotes the probability of an event, denotes the set of all possible outputs of the algorithm . The smaller are, the closer and are, and the stronger privacy protection gains. When , the mechanism satisfies DP, which is a stronger privacy guarantee than DP with .
2.2 Local Differential Privacy
Similar to DP, a setting of local privacy is investigated by Duchi et al. [10], namely Local Differential Privacy (LDP), which provides the plausible deniability to the data owner. LDP shifts the perturbation from the central site to the local data owner. It considers a scenario that there is no trusted third party, and an untrustworthy data curator needs to collect data from data owners and perform certain computations. The data owners are still willing to contribute their data, but the privacy of the data must be enforced. The formal definition is given below:
Definition 2 Local Differential Privacy[10, 41]: A randomized mechanism satisfies LDP if for any input and and for any :
(2) 
Comparing to DP, LDP provides another solution for the data owners. Instead of submitting the true values, the data owners perturb the true data with the mechanism that satisfies LDP and then release the perturbed ones. Thus LDP gives a stronger privacy protection, as the curator only sees the perturbed data and the true values of the data never leave the data owners’ hands.
2.3 AdaBoosting
Boosting[35]
is a widely used ensemble algorithm to reduce the bias and variance in supervised learning. Many boosting algorithms consist of constructing a classifier in an incremental fashion by adding weak classifiers to a pool, and using their weighted “vote” to determine the final classification. Usually, the weak classifier is also referred as a base classifier, the term “weak” means the classifier may have limited predication capacity, for example, the prediction result is slightly better than random guess. For consistency, we use base classifier in this paper, rather than weak classifier. AdaBoost
[16] is one of the best learning methods developed in the last two decades. In the algorithm, each base classifier is weighted based on its prediction capacity. Initially, all samples share the same equal weight. After a base classifier is added, the sample weights are readjusted such that the misclassified samples gain a higher weights and samples that are classified correctly lose weights. Thus, subsequent base classifiers focus more on the samples that were previously misclassified. And it has been proven[16] that the final model can converge to a strong learner. Given a multiclass classification problem, where there are a set of training data , and is the corresponding class label, the goal is to build a metaclassifier from the training data, which consists of a set of weighted base classifiers. The pseudo code of the SAMME [20] algorithm is presented in Alg. 1:With the return of Alg. 1, given a testing data , a class label is assigned by .
3 Problem Definition and Proposed Solution
3.1 Problem Definition
In this paper, we are interested in developing a privacypreserving boosting algorithm, As Fig.1 shows, we consider the following problem: Given data owners, each data owner holds a set of samples , and is the label associates with ; the untrustworthy data user would like to fit a boosted classifier with . Note in the paper, we assume that the data possessed by the data owner is represented as numerical values; for the categorical attribute that with distinct values, it can be transformed to
binary attributes using onehot encoding and then be proceeded accordingly.
The privacy constraint comes in twofolds at the data owner side, firstly, the data owner is not willing to share the value of individual sample to the data user; secondly, any inference of the individual data sample should be prevented from the intermediate communication messages. The symbols and notations used in this paper are summarized in Table. I.3.2 Threat Model
In the problem, we assume that the data user is untrustworthy and the data owners are honestbutcurious, in which every data owner is obliged to follow the protocol, but intentionally likes to extend their knowledge during the execution of the protocol. And the value of the individual sample is what each party intends to acquire beyond the final classifier. Thus an adversary could be either the data user, a participating data owner or an outside attacker, who intends to learn the value of data instances possessed by any data owner. It assumes that adversaries might have arbitrary background knowledge and there might be a collusion among them. The goal is to enforce the privacy of individual data instance while maintain the utility of the learned classifier. Furthermore, the untrustworthy data user could behave dishonestly, which would not compromise data owner’s privacy with our solution, but will hurt the utility of the learned classifier. Therefore, it is of the data user’s interest to correctly execute the algorithm. As such, our solution protects the privacy of the data instance from each data owner. Since the data owner is assumed to be honestbutcurious, data pollution attacks (e.g., data owners maliciously modify their inputs to bias the classifier learned by the data user) are beyond the scope of this paper.
3.3 PrivacyPreserving Boosting
We first provide an overview of our privacypreserving boosting algorithm, which is run by the data owner and data user cooperatively. Recall that boosting algorithm is design to combine multiple base classifiers into a strong one, and each base classifier is trained by the samples with readjusted weights. Our proposed algorithm follows the same idea, while intends to prevent the leakage of the true value of individual training sample at the data owner. In general, the pseudocode of the algorithm is given in Alg. 2:
The whole algorithm consists of three stages: 1), line 68, each data owner prepares a local share and perturb it accordingly, the representation of the local share is different according to the type of the base classifier, and it will be further demonstrated in later section; 2), line 910, data owner sends the perturbed local share to the data user, and data user aggregates each to obtain the base classifier for the current round, then the base classifier is sent back to each data owner; 3), line 1121, each data owner computes the weighted error rate and the sum of weights separately, then updates the weights of the local samples by collaborating with other data owners. Overall, line 621 completes a single round of the boosting algorithm, and line 23 outputs the final metaclassifier after such rounds. To compute the weighted error rates and sum of weights securely, data owners corroboratively run a secure sum computation protocol, like Shamir secret sharing[25, 36] in line 11,12 and 18. Similar MPC procedure has also been adopted in literature[32]. In our paper, the reason to adopt MPC here is that the error rates and weights are aggregated statistics of the samples held by each data owner and they should be kept secret to minimize the potential information leakage, while such statistics are independent to the value of the individual sample and they are not necessarily be perturbed. Next, we give an example of computing the sum of weights, data owner 1 adds a random number to and sends it to data owner 2. Data owner 2 adds a random number and to the received value from data owner 1 and sends it to data owner 3. The process continues till data owner adds a random number and to received value from data owner and sends it to data owner 1. Data owner 1 subtracts from the received value from data owner and sends back the result to data owner 2. Data owner 2 subtracts and sends back the result to data owner 3. Process continues till data owner subtracts . Finally, data owner gets the total sum of weights and distributes to all other data owners.
3.4 Local Share Perturbation
In the proposed algorithm, each data owner perturbs the local share and then sends the perturbed share to the data user to fit the base classifier. In our definition, data user should not be able to infer any individual sample from the received share. As mentioned previously, according to the definition of LDP, by seeing a perturbed data instance, it can’t distinguish it from the true data instance with a high confidence (controlled by the parameter ). Thus to protect the privacy of the individual data instance, the randomization method which satisfies LDP is adopted. In the meanwhile, one challenge is also raised as how the utility of the base classifier be maintained by building upon the perturbed data instances. To the best of our knowledge, the statistical information of the perturbed dataset could be maintained with an error bound[41, 11], which gives us a way to build the base classifier. Be more specifically, the statistical information (e.g., the estimated mean) of the perturbed local shares are used to build the base classifier, thus the utility of the classifier could be guaranteed. To achieve the goal, we first introduce the method[40] that is used to perturb the numerical attributes and show the error bound for the mean estimation. For ease of explanation, the individual sample held by a data owner, , is assumed to be perturbed here.
Alg. 3 takes a single numerical value as input and returns a perturbed value , where it confines to a relatively small domain (i.e., ) while it also allows to be close to (i.e., ) with reasonably large probability (i.e., ). For the mean estimation, with at least probability, it can be shown that , which is an error bound with the incurred noise and is asymptotically optimal[11]. In the case that the input domain is not within the range , namely, , the data owner computes and perturbs with Alg. 3, then the data owner releases , where is the perturbed output of . The assumption that is public information is commonly used in the literature[11].
when
is a multidimensional vector, namely,
, Alg. 4 is adopted, which selects attributes, and perturbs each with Alg. 3. The intuition behind Alg. 4 is to reduce the amount of the noise in the estimated mean of , where is the value that chosen to minimize the worstcase noise variance. Comparing to the solution that perturbing attributes with Alg. 3, where each attribute is evenly given a privacy budget of and the total amount of noise in the mean estimation is in , which is superlinear to ; it can be proven that, with Alg. 4, with at least probability, and it is still asymptotically optimal, where the full proof is referred to the original work[40].3.4.1 Privacy Budget Consumption
With Alg.4, we now take a look back of our proposed privacypreserving boosting algorithm and analyze the privacy and utility tradeoff. In Alg. 2, the final classifier is a combination of base classifiers with different voting weights, which are built after rounds of computation. In the nonprivate scenario, the data owner is able to participate in all rounds computation. However, in the LDP setting, the privacy budget is affected by the repeating involvements of a data owner. More specifically, suppose the local share released by the data owner in the th round satisfies differential privacy. By the composition property of Differential Privacy[26], if DP is required to be enforced for the data owner’s data, then it needs that . Considering that the data owner participates all rounds, it becomes . Then the amount of the noise contributed by each data owner becomes , which linearly depends on . Thus to reduce the injected noise, following the suggestion in the existing studies [18, 11], each data owner only participates in at most one round. In particular, as line 910 of Alg. 2 shows, in each round, the data user uniformly at random selects a group of data owners, and asks each of them to submit the perturbed local share. Then the data user aggregates perturbed local shares in each round, which results the amount of the noise in .
3.5 Building Base Classifier
After giving an overview of the privacypreserving boosting algorithm and the perturbation methods, we now turn to the exact classifier that to be fit. For a long time, Boosting Decision Tree (BDT) has been used to demonstrate the effectiveness of boosting, while the idea is actually applied to other types of classifiers. In the subsection, we give three examples of classifiers that could be supported and discuss how to build such classifiers with different types of local shares. Each final classifier consists of base classifiers with different weights, and the procedure to build each base classifier keeps the same but with adjusted sample weights.
3.5.1 Local Samples
A straightforward approach to build the base classifier is to contribute the data samples directly by each data owner, then the data user collects such input and build the classifier at a centralized site. However, to protect the privacy, the data samples should be perturbed before releasing. Due to the noise injected to each data instance, only the statistical information could be well estimated, such as frequency and mean estimation, which makes only certain types of classifiers could be supported. Nearest Centroid Classifier (NCC) is one such classifier could be built upon the perturbed data samples, where the observation is assigned with the label of the class of training samples whose centroid is nearest to that observation. For data owner , is the weighted sample, is the perturbed output of by Alg.4 and is submitted to the data user. For the data user, the centroid of class is computed as . Given an observation , the distance to each class centroid is computed and then the class label of closet centroid, , is assigned to .
3.5.2 Local Classifier
Contributing the perturbed data samples is a straightforward way, but the supported classifiers are limited due to that only the mean of the data samples could be well estimated. As an alternative, each data owner could locally compute an optimal classifier using their own samples. Instead of submitting the data samples, the individual classifier is then perturbed and contributed to the data user. More specifically, assuming the classifier is trained by optimizing the loss function that maps a dimensional parameter vector into a real number, where the parameter vector computed by data owner is identified as:
(3) 
Similarly, is the perturbed output of by Alg.4. Then the data user collects perturbed parameter vectors and obtains the final parameter vector:
(4) 
It is not hard to see that submitting the local classifiers is a superior solution comparing to submitting the training samples, while the accuracy of the estimation is affected by the number of collected samples. Given data owners, there are local classifiers and the estimated error of the aggregation is bounded by . However, each data owner may hold more than one training sample, assuming the total amount of the samples is , when , the estimated error will be far less than aggregation of parameters.
3.5.3 Local Statistic
Another type of base classifiers is built by the statistic information of the data samples. In this case, rather than individual samples or the local classifiers, the data owner computes the statistic of the data samples and submits to the data user. BDT is one such classifier could be built in this way, and we explain the fitting procedure here. BDT is one type of Decision Tree (DT) classifier, which makes the decision by examining a sequence of conditions and it is represented as a tree from the root node to the leaf node. The internal node of the tree represents a condition, and the corresponding value in the observation decides which branch to proceed, till the leaf nodes provides the final decision. Thus building a DT from the root to the leaves needs to pick the proper attribute for each internal node and the process repeats till the tree stop growing. The best attribute to split is determined by the impurity measurement before and after split over that attribute. The impurity measurements usually include information gain[34], Gini index[7] and misclassification error[38], and the former two are commonly used in the DT. Comparing to DT that usually resulted in multiple levels, BDT takes advantage of the decision stumps of one depth and the final classifier makes the final decision by considering the weighted votes of all stumps. And we choose the measurement of misclassification error in this paper, as the previous study[4] shows that the misclassification error fits well in the distributed environment. To summarize, the base classifier in BDT is the decision stump which has a binary branch and to find the best attribute to be split on this stump, we use the misclassification error as the criteria. Next, we describe how to privately find the best attribute based on the misclassification error and to build the decision stump.
For ease of illustration, assuming is a binary attribute, and there are two classes in the learning examples, the cross table of is a statistical summary and is represented as , where denotes the weighted number of examples with and . In short, for all samples with and belong to Class 1, instead of adding the counts, is the added weights of the corresponding samples. The misclassification error impurity of the is defined as , where is the weighted number of samples that . Then it can be shown that for each attribute, maximizing its gain is equivalent to minimizing the weighted sum of impurities[4]:
(5) 
Thus the best attribute is the one that has the minimum classification error, namely, . In our problem, the data user aggregates the cross tables of all attributes from the data owners and to determine the best attribute. More specifically, each data owner prepares two values for each attribute, and submits the perturbed values to the data user, e.g., . Upon receiving the statistics from data owners, the data user sums up the values and computes . Furthermore, since it is only required to decide the attribute that has minimum classification error, and the actual error of each attribute is not necessarily needed. Thus the data user could use the mean estimation to determine the best attributes, namely, . Thus the data user determines the best attributes which has the minimum classification error and broadcasts such attribute to all data owners. Knowing the best attribute, each data owner reports the majority of class labels for each branch to the data user (recall it is a binary attribute), thus the base classifier is determined by the best attribute and the majorities class labels for each branch.
4 Experiment
In this section, we evaluate the proposed privacypreserving boosting algorithm through three types of classifiers, Logistic Regression (LR), Nearest Centroid Classifier (NCC), Boosted Decision Tree (BDT); each classifier corresponds to one subsection of Section 3.5. The utility of the learned classifiers are assessed in terms of the prediction capacity over three public datasets, the MNIST, FashionMNIST^{1}^{1}1https://github.com/zalandoresearch/fashionmnist, and IPUMS dataset. The description of the datasets are given below,
Perturbation  1.0  3.0  5.0  7.0  9.0  
MNIST  Laplace  27.4852 + 0.7686  2.9896 + 0.1591  1.0848 + 0.0801  0.5387 + 0.0302  0.3333 + 0.0148 
Duchi  0.6149 + 0.0367  0.1566 + 0.0100  0.1300 + 0.0073  0.1347 + 0.0086  0.1350 + 0.0075  
PM  0.5104 + 0.0417  0.0381 + 0.0032  0.0299 + 0.0021  0.0139 + 0.0007  0.0129 + 0.0013  
FashionMNIST  Laplace  26.6123 + 0.8602  3.1190 + 0.1216  1.0818 + 0.0787  0.5538 + 0.0332  0.3353 + 0.0130 
Duchi  0.6189 + 0.0203  0.1614 + 0.0077  0.1334 + 0.0126  0.1389 + 0.0066  0.1344 + 0.0099  
PM  0.4861 + 0.0235  0.0361 + 0.0023  0.0280 + 0.0014  0.0134 + 0.0015  0.0120 + 0.0006  
Brazil  Laplace  11.9691 + 1.3442  1.2872 + 0.1661  0.5361 + 0.0904  0.2347 + 0.0295  0.1658 + 0.0286 
Duchi  0.1616 + 0.0227  0.0436 + 0.0070  0.0377 + 0.0075  0.0400 + 0.0059  0.0372 + 0.0027  
PM  0.2050 + 0.0200  0.0100 + 0.0021  0.0105 + 0.0024  0.0034 + 0.0012  0.0043 + 0.0006 
The MNIST dataset contains grayscale images of handwritten digits from 0 to 9, which has 60,000 samples for training and 10,000 samples for testing.
The FashionMNIST dataset contains the grayscale article images that each sample associates with a label from 10 classes. The dataset is intended to replace the overused MNIST dataset, and shares the same image size and structure of training and testing splits as the MNIST dataset.
The Brazil dataset contains census records of Brazil between 1970 and 2010 that from the Integrated Public Use Microdata Series[21] (IPUMS). The dataset has 12M records and 10 attributes, 2 of them are numerical (e.g., age and totalIncome) and the rest of them are categorical. In the experiment, the totalIncome is used as the dependent variable and is converted to binary attribute by mapping the value larger than mean value to 1, and 0 otherwise[40]. The categorical attributes with k distinct values are transformed to k binary attributes as well. After the transformation, the dimensionality of the dataset becomes 67.
For better representation, the MNIST and FashionMNIST dataset are preprocessed with the Histogram of Oriented Gradients (HOG) descriptors, where the descriptor provides the histograms of directions of gradients that are used to capture the edges and corners of an image, which results in a fixed length of floating vector. Then a Bag of Visual Words (BOVW) model is further associated with the HOG descriptors, where the KMeans clustering is performed over the generated descriptors and the center of each cluster is used as the visual dictionary’s vocabularies. Finally, the frequency histogram is made from the visual vocabularies and results in a fixed length of vector. By this way, it is flexible to tune the number of visual vocabularies of both datasets and provides the representation with various dimensionalities.
For a comprehensive study, we further compare to two other existing numerical attributes perturbation method, Laplace mechanism[12] and Duchi et al.’s mechanism. The Laplace mechanism perturbs the single numerical value by adding a random noise , in which follows a Laplace distribution that has 0 mean and scale . And it can be seen that the estimated mean of perturbed values is unbiased and the error is in . However, to perturb multiple attributes, each attribute evenly split the total budget , which is assigned individually and the error of the estimated mean of the multidimensional vector is superlinear to , which will be excessive given a large . Duchi et al.’s solution to perturb multiple numerical attributes[11] is also asymptotically optimal, but has a larger constant than Alg. 4, and the effect in terms of the prediction capacity is demonstrated in the experiments. For consistency, in the figures of this section, Laplace is referred to the Laplace mechanism, Duchi is referred to the Duchi et al.’s mechanism and PM is referred to Alg. 4.
4.1 Logistic Regression
We first present the evaluation result regarding LR implementation. In the experiment, the dataset is randomly shuffled and evenly split into trunks to simulate data owners. Then a local LR classifier is built with each trunk of data and the parameter of the local classifier is perturbed by Alg. 4 to satisfy LDP. The base classifier is attained by taking the mean value of a group of the perturbed classifiers and the prediction capacity is evaluated. Fig. 2 displays the LR accuracy for all three datasets. Each subfigure compares three perturbation methods, the horizontal axis specifies privacy budget , and the vertical axis shows the classification accuracy. From the figure, for all three perturbation methods, it can be seen that the accuracy improves as increases in general, as the incurred noise decreases, and PM among all three perturbation methods achieves the highest accuracy. The reason is that PM has the smallest error in the aggregation of the classifier parameters, and Table. II presents the corresponding MSE of the estimated mean of the parameters. Comparing to Laplace and Duchi, PM has the smallest MSE across all privacy budgets except one setting, and the magnitude of the MSE difference reflects the pattern in Fig. 2. For instance, for the MNIST and FashionMNIST datasets, the MSE of PM is one order of magnitude smaller than Duchi, and two order of magnitude smaller than Laplace, which shows the good utility of the learned base classifier.
Regarding the number of data owners, recall the total noise incurred in the mean estimation by PM is in , where is the number of samples. While aggregating the local classifiers from data owners, the noise is in , since each data owner contributes one vector of parameters to represent the local classifier. And the error becomes less as more data owners involve in the aggregation. To illustrate the point, we fix the amount of the privacy budget and varies the number of data owners to see the effect. Fig. 5 shows the result of the MNIST dataset and Fig. 5 shows the result of the Brazil dataset. In each figure, the horizontal axis specifies the number of data owners, the left vertical axis shows the classification accuracy and the right vertical axis provides the MSE of the mean of the parameter vectors of the classifiers. From Fig. 2, it concludes that Laplace performs worse than the other two perturbation methods, thus in the figures here we only display the comparison between PM and Duchi. From the MNIST dataset, it can be seen that PM always has less noise than Duchi, and the classifiers of former consistently achieve higher accuracy than the latter. From the Brazil dataset, it can be observed that both PM and Duchi achieves similar loss when the number of data owners grows over 10,000. And the experiment results agree with the theoretical analysis. To summarize, the MSE has a substantial drop from hundreds of data owners to tens of thousands of data owners, while classification accuracy improves with a small magnitude. The reason might be the incurred noise has a small influence to the magnitude of the coefficients of LR, which implies the coefficients of LR has a larger tolerance range to errors.
The experiments above demonstrate the utility of the base classifier in one round of Alg. 2, and we didn’t observe accuracy improvements by boosting multiple LR classifiers, the reason might due to the linearity of the model.
4.2 Nearest Centroid Classifier
In this subsection, we present the evaluation result of the NCC. Instead of contributing one vector from each data owner, for NCC, each data owner perturbs each individual data sample and submits all perturbed samples he held. The final classifier is learned by finding the centroid of each class after all perturbed samples have been collected. Fig. 3 shows the experiment result regarding three perturbation methods. And it can be seen that PM performs better than other two perturbation methods for NCC, however, the performance gap between PM and Duchi is larger than the LR, which implies that the parameters of NCC are more sensitive than LR since the decision boundary of NCC is determined by the Euclidean distance between the samples and the centroid, thus the magnitude of the noise has a bigger impact to the prediction capacity. For the Brazil dataset, it can be seen that with the large enough samples sizes (millions), the utility of the learned NCC is well preserved for even with a low privacy budget (1.0).
Fig. 6 presents the classification result regarding various number of collected samples, ranges from 1000 to 9000. Comparing to LR, the impact of the resulted error in the parameters is more significant, the reason is explained above.
Finally, we boost the NCC classifier with Alg. 2 and the result is depicted in Fig. 7. Each subfigure shows the result for one dataset, and boosting the NCC leads a accuracy gain when no noise is injected in general, which proves the effectiveness of the boosting algorithm. Due to the sample size, the number of base classifiers is limit under 10, as each data owner only contribute once during boosting. The figure also plots the accuracy under two privacy budgets, 3.0 and 5.0, and it can be seen that even though the injected noise hurts the utility of the learned base classifier in each round, as the misclassification rate is high compared to the noisefree result, the boosted NCC still gets lower misclassification rate than nonboosted NCC (the first round of boosting), which shows the effectiveness of the proposed algorithm.
4.3 Boosted Decision Tree
We present the experiment result of the BDT classifier in this subsection, the evaluation is performed over the MNIST and a synthetic dataset. From the MNIST dataset, we picked the confusable digit 4 and 9 for the classification; For the synthetic dataset, there are two classes and twenty attributes generated, ten of them are the meaningful attributes, the rest of them are generated as noninformative features that are the linear combination of the ten meaningful ones. And there are synthetic samples generated. Both datasets have a balanced number of positive and negative samples. In the experiment, the cross tables of all attributes from each trunk are computed and perturbed, then the cross tables from all trunks are aggregated and the best attribute is decided by the calculated minimum misclassification error. Thus a decision stump is built over the best attribute in each round. Due to the injected noise, the computed attribute from the perturbed cross tables might not be true best attribute. Thus we first investigate the topk hitting rate of the computed best attribute. Given the nonperturbed cross tables of all attributes, the attributes are sorted based on the misclassification error in an ascendant order. The topk hitting rate is the probability that the computed best attribute from the perturbed cross tables appears in the topk nonperturbed best attributes. The larger k is, the higher hitting rate is. When k is equal to the number of the total attributes, the hitting rate reaches the . In the experiment, the probability is computed over 100 runs.
Fig. 9 and Fig. 9 depict the top1 and top3 hitting rates regarding various privacy budgets for the MNIST binary and synthetic dataset. For a comparison, both Duchi and PM perturbation methods are evaluated, and the result are computed for building one decision stump. For both datasets, the topk hitting rates increase as increases, which implies less noise injected and the estimated error become less. For the binary MNIST dataset, the top3 hitting rates of PM perturbation reaches almost as grows to 5.0, which implies that the probability is around that the computed best attributes from the perturbed cross table is one of the top3 true best attributes; while the top1 hitting rate is around . The reason is that there are 1,000 data owners simulated in the experiment, and the resulted error is larger than the classification error difference among the top3 best attributes. Similarly, Fig. 9 shows the result that there are 10,000 data owners simulated from the synthetic dataset, and the top3 hitting rate is almost as reaches to 3.0. For the classification accuracy, Fig. 11 and 11 plots the misclassification rate as a function of the number of decision stumps in the classifier. For a comparison, the nonprivate BDT is also depicted in the figure, and it confirms the effectiveness of the boosting algorithm. For the BDT classifier built from the perturbed data, it can be seen that the it has a similar prediction capacity compared to the nonprivate one under various number of decision stumps. The result confirms that the privacypreserving BDT successfully maintains a similar utility gain as the nonprivate one.
5 Related Work
Local Differential Privacy (LDP) is firstly proposed by [10], in which it is proposed to eliminate the trusted curator and allow the data owner to control the information releasing in a private manner. Collecting the statistical information is one of great interests in the community, and a randomized response mechanism[43] was invented decades ago to be used in public domain survey. Recently, Wang et al. propose a framework[41] that generalizes several LDP protocols in the literature and analyzes each one regarding the task of frequency estimation. Bassily and Smith [3] develops an asymptotically optimal solution for the succinct histogram building over the categorical domain under LDP. And for the discrete inputs, Kairouz et al. [24] study the tradeoff between the utility and LDP in terms of family of extremal mechanisms. Two simple extremal mechanisms, the binary and randomized response mechanisms, are proposed and shown the theoretical guarantee of the utility. For the realworld application, RAPPOR[15] is the first that practically adopts LDP to collect the statistics from the end users, which is designed by Google. Apple and Microsoft also propose their own tools to use LDP in collecting usage, typing history[39] and telemetry data[9]. Lately, a Encode, Shuffle, Analysis (ESA) system architecture[5] is proposed to support a wide range of analysis tasks privately and is used in the software monitoring implementation. All above works focus on the applications based on frequency estimation, where the underlying data are more likely categorical or discrete, while our problem focuses on the numerical data.
Under LDP, several data mining and machine learning problems have been studied, such as probability distribution estimation
[31, 23, 27, 46], frequent itemset mining [42], Bayes learning[47] and clustering [29]. Furthermore, Stochastic Gradient Descent (SGD) that satisfying
LDP have been used in many applications. Wang et al.[40]proposes an piecewise mechanism for multidimension numerical attributes perturbation and demonstrate its usage in empirical risk minimization In the paper, the authors demonstrate the computation with three machine learning models, the linear regression, logistic regression and SVM, where multiple data owners locally compute and perturb the gradients of the parameters, then shares the noisy version to the aggregator. Another clientserver machine learning framework is proposed to learn the Generalized Linear models
[33], where the server simultaneous delivers the models to the distributed clients and asks for the parameter update. More specifically, the server maintains versions of the machine learning model, and randomly selects one version to update the parameters and replaces another version. To enforce the LDP, the client uses Laplace mechanism to perturb the parameters and returns the model to the server. Zhang et al. [48]consider the problem of multiparty deep learning, wherein multiple data owners train a local deep neural network model and then privately share the model for aggregation, the
LDP is enforced by the Laplace mechanism before each data owner share the local gradients. The difference between our work and these works is the machine learning model, we are interested in the privacy protection during the boosting procedure and the type of classifiers that could be supported under the privacyaware manner, for example, the BDT classifier, which is built by aggregating the statistical information and is hardly built through the SGD approach.Privacypreserving model sharing and aggregation has been studied in several literature. Pathak et al.[32] propose a protocol of composing a differentially private global classifier (e.g., Logistic Regression classifier) by aggregating the classifiers locally trained at the data owner. To protect each individual data sample, the secure protocol utilizes secure multiparty computation (SMC) to aggregate the local classifiers and the global classifier is then perturbed to satisfy the DP. Decision tree induction in the distributed setting has been studied in many applications, Caragea et al.[8] formulates the problem of learning from distributed data and proposes a generate strategy to transform the machine learning algorithm into distributed setting. As an example, for the decision tree induction, the local statistics are collected at the centralized site and then the best splitting attribute is decided by maximizing the gain from the aggregated statistics. Another work by BarOr et al.[2] optimizes the communication overhead of the exchanging statistics by filtering out the attributes based on the bounds of the information gain and develops a distributed hierarchical decision tree algorithm. Bhaduri et al.[4] develops a scalable distribute algorithm for decision tree induction in the peertopeer environment, which works in a completely asynchronous manner and has a low communication overhead. Comparing to regular decision tree, Boosted Decision Tree (BDT) limits the depth of the tree to one and shares the same strategy to split the node. Gambs et al.[17] introduces the privacypreserving boosting algorithm based on MPC, but the algorithm doesn’t consider the information disclosure of individual learning example, thus the technique (e.g., without adding noise for differential privacy) gives inadequate privacy protection, in contrast, our privacypreserving boosting algorithm provides a stronger solution that prevents the information leakage.
6 Conclusion
In this paper, we propose a privacypreserving boosting algorithm, and there are two types of parties in our scenario, a data user intends to run the boosting algorithm and learn a classifier and multiple data owners that are willing to contribute the data for the learning purpose but the with privacy protection. The proposed algorithm guarantees that the data user cannot learn the individual training sample from the data owner. The assurance is achieved by the Local Differentially Private (LDP) mechanism, where each data owner perturbs their local data share and then contributes the perturbed data to the data user. Due to the noise injected at individual training sample, only the statistic information of the aggregated data could be well preserved, e.g., mean estimations. With the proposed boosting algorithm, we implement the BDT classifier by utilizing the statistical information from the perturbed data and analyze the privacy and utility tradeoff. Furthermore, we also study two other types of classifiers that could be supported and the corresponding data shares to be adopted, e.g., Nearest Centroid Classifier and Logistic Regression. In the experiment, we investigate the utility of the learned classifiers in terms of the classification capacity through multiple real datasets, and compare to two existing perturbation methods, the result shows that the learned classifiers by our method effectively maintain a superior utility.
References
 [1] (1998) Text mining with decision rules and decision trees. Citeseer. Cited by: §1.
 [2] (2005) Hierarchical decision tree induction in distributed genomic databases. IEEE Transactions on Knowledge and Data Engineering. Cited by: §5.

[3]
(2015)
Local, private, efficient protocols for succinct histograms.
In
Proceedings of the 47 annual ACM symposium on Theory of computing
, Cited by: §5. 
[4]
(2008)
Distributed decisiontree induction in peertopeer systems.
Statistical Analysis and Data Mining: The ASA Data Science Journal
. Cited by: §3.5.3, §3.5.3, §5.  [5] (2017) Prochlo: strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Cited by: §5.
 [6] (1996) Bagging predictors. Machine learning. Cited by: §1.
 [7] (2017) Classification and regression trees. Routledge. Cited by: §3.5.3.
 [8] (2004) A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems. Cited by: §5.
 [9] (2017) Collecting telemetry data privately. In Advances in Neural Information Processing Systems, Cited by: §5.
 [10] (2013) Local privacy and statistical minimax rates. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), Cited by: §1, §2.2, §2.2, §5.
 [11] (2018) Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association. Cited by: §3.4.1, §3.4, §3.4, §4.
 [12] (2006) Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, Cited by: §4.

[13]
(2014)
Analyze gauss: optimal bounds for privacypreserving principal component analysis
. In 46th Annual ACM Symposium on Theory of Computing, Cited by: §2.1.  [14] (2006) Differential privacy. In 33 International Conference on Automata, Languages and Programming, Cited by: §1, §2.1.
 [15] (2014) Rappor: randomized aggregatable privacypreserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, Cited by: §5.
 [16] (1997) A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences. Cited by: §1, §2.3.
 [17] (2007) Privacypreserving boosting. Data Mining and Knowledge Discovery. Cited by: §5.
 [18] (2015) Crowdml: a privacypreserving learning framework for a crowd of smart devices. In IEEE 35th International Conference on Distributed Computing Systems, Cited by: §3.4.1.
 [19] (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis & Machine Intelligence. Cited by: §1.
 [20] (2009) Multiclass adaboost. Statistics and its Interface. Cited by: §2.3.
 [21] (2019) Integrated public use microdata series. Note: https://www.ipums.org Cited by: §4.
 [22] (2014) Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584. Cited by: §1.
 [23] (2016) Discrete distribution estimation under local privacy. Proceedings of the 33rd International Conference on Machine Learning. Cited by: §5.
 [24] (2014) Extremal mechanisms for local differential privacy. In Advances in neural information processing systems, Cited by: §5.
 [25] (2004) Privacypreserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge & Data Engineering. Cited by: §3.3.
 [26] (2007) Mechanism design via differential privacy. In FOCS, Cited by: §3.4.1.
 [27] (2018) Toward distribution estimation under local differential privacy with small samples. Proceedings on Privacy Enhancing Technologies. Cited by: §5.
 [28] (2008) Robust deanonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, Cited by: §1.
 [29] (2017) Clustering algorithms for the centralized and local models. arXiv preprint arXiv:1707.04766. Cited by: §5.
 [30] (2003) An assessment of the effectiveness of decision tree methods for land cover classification. Remote sensing of environment. Cited by: §1.
 [31] (2016) Locally differentiallyprivate distribution estimation. In IEEE International Symposium on Information Theory (ISIT), Cited by: §5.
 [32] (2010) Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, Cited by: §3.3, §5.
 [33] (2018) Differentiallyprivate ’draw and discard’ machine learning. arXiv preprint arXiv:1807.04369. Cited by: §5.
 [34] (1986) Induction of decision trees. Machine learning. Cited by: §3.5.3.
 [35] (1990) The strength of weak learnability. Machine learning. Cited by: §1, §2.3.
 [36] (1979) How to share a secret. Communications of the ACM. Cited by: §3.3.
 [37] (2007) Privacy protection in personalized search. In ACM Special Interest Group in Information Retrieval Forum, Cited by: §1.
 [38] (2018) Introduction to data mining. Pearson Education India. Cited by: §3.5.3.
 [39] (2017) Learning with privacy at scale. Cited by: §5.
 [40] (2019) Collecting and analyzing multidimensional data with local differential privacy. In IEEE 35th International Conference on Data Engineering (ICDE), Cited by: §3.4, §3.4, §4, §5, Algorithm 3, Algorithm 4.
 [41] (2017) Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium, Cited by: §1, §2.2, §3.4, §5.
 [42] (2018) Locally differentially private frequent itemset mining. In IEEE Symposium on Security and Privacy (SP), Cited by: §5.
 [43] (1965) Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association. Cited by: §5.
 [44] (1992) Stacked generalization. Neural networks. Cited by: §1.
 [45] (2017) A boosted decision tree approach using bayesian hyperparameter optimization for credit scoring. Expert Systems with Applications. Cited by: §1.
 [46] (2018) Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory. Cited by: §5.

[47]
(2019)
Locally differentially private naive bayes classification
. arXiv preprint arXiv:1905.01039. Cited by: §5.  [48] (2017) Private, yet practical, multiparty deep learning. In IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Cited by: §5.
Comments
There are no comments yet.