1 Introduction
Gradient Boosting Decision Trees (GBDTs) have achieved state-of-the-art results on many challenging machine learning tasks such as click prediction [21], learning to rank [3], and web page classification [19]. The algorithm builds a number of decision trees one by one, where each tree tries to fit the residual of the previous trees. With the development of efficient GBDT libraries [5, 14, 20, 26], the GBDT model has won many awards in recent machine learning competitions and has been widely used in both academia and industry [12, 29, 14, 13, 8].
Privacy issues have been a hot research topic recently [23, 25, 9, 15]. Given the popularity and wide adoption of GBDTs, a privacy-preserving GBDT algorithm is particularly timely and necessary. Differential privacy [7] was proposed to protect the individuals of a dataset. In a nutshell, a computation is differentially private if the probability of producing a given output does not depend much on whether a particular record is included in the input dataset. Differential privacy has been widely used in many machine learning models such as logistic regression [4] and neural networks [1, 2]. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Many practical differentially private models achieve good model utility by deriving tight sensitivity bounds and allocating privacy budget effectively.
In this paper, we study how to improve the model accuracy of GBDTs while preserving the strong guarantee of differential privacy. There have been some potential solutions for improving the effectiveness of differentially private GBDTs (e.g., [28, 16, 27]). However, they can suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model).
Sensitivity bounds: The previous studies on individual decision trees [10, 17, 16] bound the sensitivities by estimating the range of the function output. However, this method leads to very loose sensitivity bounds in GBDTs, because the range of the gain function output (Equation (3), introduced in Section 2) is related to the number of instances and can be very large for large datasets. Loose sensitivity bounds require more noise to reach a fixed privacy level, and cause huge accuracy loss.
Privacy budget allocations: There have been some previous studies on privacy budget allocations among different trees [16, 27, 28]. We can divide them into two kinds. 1) The first kind allocates the budget equally to each tree using the sequential composition [16, 27]. When the number of trees is large, the budget allocated to each tree is very small. The scale of the noise can be proportional to the number of trees, which causes huge accuracy loss. 2) The second kind gives disjoint inputs to different trees [28]. Then, each tree only needs to satisfy differential privacy using the parallel composition. When the number of trees is large, since the inputs to the trees cannot overlap, the number of instances assigned to a tree can be quite small. As a result, each tree is too weak to yield a meaningful learned model.
We design a new GBDT training algorithm to address the above-mentioned limitations.
In order to obtain a tighter sensitivity bound, we propose Gradient-based Data Filtering (GDF) to guarantee the bounds of the sensitivities, and further propose Geometric Leaf Clipping (GLC) to obtain a tighter bound on the sensitivities by taking advantage of the tree learning process in GBDT.

Combining both sequential and parallel compositions, we design a novel boosting framework that makes good use of both the privacy budget of GBDTs and the effect of boosting. Our approach satisfies differential privacy while improving the model accuracy with boosting.

We have implemented our approach (named DPBoost) based on a popular library called LightGBM [14]. Our experimental results show that DPBoost substantially outperforms the other approaches and can achieve competitive performance compared with the ordinary LightGBM.
2 Preliminaries
2.1 Gradient Boosting Decision Trees
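The GBDT is an ensemble model which trains a number of decision trees in a sequential manner. Formally, given a convex loss function $l$ and a dataset with $n$ instances and $d$ features $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ ($\mathbf{x}_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$), GBDT minimizes the following regularized objective [5].
$$\tilde{\mathcal{L}} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \qquad (1)$$
where $\Omega(f) = \frac{1}{2}\lambda \lVert V \rVert^2$ is a regularization term. Here $\lambda$ is the regularization parameter and $V$ is the leaf weight. Each $f_k$ corresponds to a decision tree. Forming an approximate function of the loss, GBDT minimizes the following objective function at the $t$-th iteration [24].
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} f_t^2(\mathbf{x}_i) \right] + \Omega(f_t) \qquad (2)$$
where $g_i = \partial_{\hat{y}^{(t-1)}} l(\hat{y}_i^{(t-1)}, y_i)$ is the first-order gradient statistic of the loss function. The decision tree is built from the root until reaching the maximum depth. Assume $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split. Letting $I = I_L \cup I_R$, the gain of the split is given by
$$G(I_L, I_R) = \frac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L| + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R| + \lambda} \qquad (3)$$
GBDT traverses all the feature values to find the split that maximizes the gain. If the current node does not meet the requirements of splitting (e.g., it has reached the max depth or the gain is smaller than zero), it becomes a leaf node and the optimal leaf value is given by
$$V(I) = -\frac{\sum_{i \in I} g_i}{|I| + \lambda} \qquad (4)$$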
Like the learning rate in stochastic optimization, a shrinkage rate [11] is usually applied to the leaf values, which can reduce the influence of each individual tree and leave space for future trees to improve the model.
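The split search described above can be sketched in a few lines of Python. This is a minimal illustration, not the LightGBM implementation: it assumes the first-order form in which the gain of a split is the sum, over both children, of (sum of gradients)² / (number of instances + λ), and the leaf value is the negated gradient sum divided by (number of instances + λ). The function names `leaf_value`, `split_gain`, and `best_split` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def leaf_value(grads: Sequence[float], lam: float) -> float:
    """Leaf value: negated gradient sum over (instance count + lambda)."""
    return -sum(grads) / (len(grads) + lam)


def split_gain(left: Sequence[float], right: Sequence[float], lam: float) -> float:
    """Gain of a split: squared gradient sum over (count + lambda), per child."""
    def term(g: Sequence[float]) -> float:
        return sum(g) ** 2 / (len(g) + lam)
    return term(left) + term(right)


def best_split(values: List[float], grads: List[float],
               lam: float) -> Tuple[Optional[float], float]:
    """Scan all thresholds on one feature; return (threshold, gain) of the best."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_threshold, best_gain = None, float("-inf")
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold separates equal feature values
        left = [grads[i] for i in order[:k]]
        right = [grads[i] for i in order[k:]]
        gain = split_gain(left, right, lam)
        if gain > best_gain:
            best_threshold, best_gain = (lo + hi) / 2.0, gain
    return best_threshold, best_gain
```

For example, with feature values `[1, 2, 3, 4]` and gradients `[-1, -1, 1, 1]`, the scan separates the two negative gradients from the two positive ones. The non-private search picks the maximum deterministically; the differentially private variant discussed later replaces this argmax with a randomized choice.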
2.2 Differential Privacy
Differential privacy [7] is a popular standard of privacy protection with provable privacy guarantee. It guarantees that the probability of producing a given output does not depend much on whether a particular record is included in the input dataset or not.
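Definition 1.
($\varepsilon$-Differential Privacy) Let $\varepsilon$ be a positive real number and $f$ be a randomized function. The function $f$ is said to provide $\varepsilon$-differential privacy if, for any two datasets $D$ and $D'$ that differ in a single record and any output $O$ of function $f$,
$$\Pr[f(D) \in O] \le e^{\varepsilon} \cdot \Pr[f(D') \in O] \qquad (5)$$
Here $\varepsilon$ is a privacy budget. To achieve $\varepsilon$-differential privacy, the Laplace mechanism and the exponential mechanism [6] are usually adopted; both add randomness calibrated to the sensitivity of a function.
Definition 2.
(Sensitivity) For any function $f: \mathcal{D} \to \mathbb{R}^k$, the sensitivity of $f$ w.r.t. $\mathcal{D}$ is
$$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert f(D) - f(D') \rVert_1 \qquad (6)$$
where $D$ and $D'$ have at most one different record.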
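Theorem 1.
(Laplace Mechanism) For any function $f: \mathcal{D} \to \mathbb{R}^k$, the Laplace mechanism for any dataset $D \in \mathcal{D}$
$$\hat{f}(D) = f(D) + \mathrm{Lap}(0, \Delta f / \varepsilon) \qquad (7)$$
where the noise is drawn from a Laplace distribution with mean zero and scale $\Delta f / \varepsilon$, provides $\varepsilon$-differential privacy.
Theorem 2.
(Exponential Mechanism) Given the utility function $u: (\mathcal{D} \times \mathcal{R}) \to \mathbb{R}$ with sensitivity $\Delta u$, the exponential mechanism for any dataset $D \in \mathcal{D}$,
$$\Pr[\text{output } r \in \mathcal{R}] \propto \exp\left(\frac{\varepsilon\, u(D, r)}{2 \Delta u}\right) \qquad (8)$$
provides $\varepsilon$-differential privacy.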
The above-mentioned mechanisms provide privacy guarantees for a single function. For an algorithm with multiple functions, there are two privacy budget composition theorems [6].
Theorem 3.
(Sequential Composition) If a series of functions $f = \{f_1, \ldots, f_m\}$, in which $f_i$ provides $\varepsilon_i$-differential privacy, are performed sequentially on a dataset, $f$ will provide $\sum_{i=1}^{m} \varepsilon_i$-differential privacy.
Theorem 4.
(Parallel Composition) If a series of functions $f = \{f_1, \ldots, f_m\}$, in which $f_i$ provides $\varepsilon_i$-differential privacy, are performed separately on disjoint subsets of the entire dataset, $f$ will provide $\max(\varepsilon_1, \ldots, \varepsilon_m)$-differential privacy.
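As a rough illustration of the two mechanisms, the sketch below uses only the Python standard library. It assumes the standard forms: Laplace noise with scale sensitivity / ε, and selection with probability proportional to exp(ε · utility / (2 · sensitivity)). The function names are hypothetical, and a production implementation would need a carefully vetted noise sampler.

```python
import math
import random
from typing import List, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Release value plus Laplace noise of scale sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def exponential_mechanism(candidates: Sequence, utilities: List[float],
                          sensitivity: float, epsilon: float,
                          rng: random.Random):
    """Pick a candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    top = max(utilities)  # shift by the max so the exponent never overflows
    weights = [math.exp(epsilon * (u - top) / (2.0 * sensitivity))
               for u in utilities]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Shifting the utilities by their maximum does not change the selection probabilities, since the common factor cancels in the normalization. A smaller ε flattens the weights (more randomness), while a larger ε concentrates the choice on the highest-utility candidate.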
3 Our Design: DPBoost
Given a privacy budget $\varepsilon$ and a dataset $D$ with $n$ instances and $d$ features, we develop a new GBDT training system named DPBoost to achieve $\varepsilon$-differential privacy while trying to reduce the accuracy loss. Moreover, like the setting in the previous work [1], we consider a strong adversary with full access to the model's parameters. Thus, we also provide differential privacy guarantees for each tree node.
Figure 1 shows the overall framework of DPBoost. We design a novel two-level boosting framework to exploit both sequential composition and parallel composition. Inside an ensemble, a number of trees are trained using disjoint subsets of data sampled from the dataset. Then, multiple such ensembles are trained in a sequential manner. To achieve differential privacy, sequential composition is applied between ensembles and parallel composition is applied inside an ensemble. Next, we describe our algorithm in detail, including the techniques to bound sensitivities and effective privacy budget allocations.
3.1 Tighter Sensitivity Bounds
The previous studies on individual decision trees [10, 17, 16] bound the sensitivities by estimating the range of the function output. For example, if the range of a function output is $[-\gamma, \gamma]$, then the sensitivity of the function is no more than $2\gamma$. However, the range of the gain function $G$ in Equation (3) is related to the number of instances, which can cause very large sensitivity if the dataset is large. Thus, instead of estimating the range, we strictly derive the sensitivities of the gain and of the leaf value ($\Delta G$ and $\Delta V$) according to Definition 2. Their bounds are given in the two lemmas below.
Lemma 1.
Letting $g^* = \max_{i \in D} |g_i|$, we have $\Delta G \le 3 g^{*2}$.
Proof.
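Consider two adjacent instance sets $I_1 = \{\mathbf{x}_i\}_{i=1}^{n}$ and $I_2 = I_1 \cup \{\mathbf{x}_s\}$ that differ in a single instance. Assume $I_1 = I_L \cup I_R$, where $I_L$ and $I_R$ are the instance sets of the left and right nodes respectively after a split. Without loss of generality, we assume that instance $\mathbf{x}_s$ belongs to the left node. We use $n_l$ to denote $|I_L|$. Then, we have
$$\Delta G = \left| \frac{\left(\sum_{i \in I_L} g_i + g_s\right)^2}{n_l + \lambda + 1} - \frac{\left(\sum_{i \in I_L} g_i\right)^2}{n_l + \lambda} \right| = \left| \frac{(n_l + \lambda) g_s^2 + 2 (n_l + \lambda) g_s \sum_{i \in I_L} g_i - \left(\sum_{i \in I_L} g_i\right)^2}{(n_l + \lambda + 1)(n_l + \lambda)} \right| \qquad (9)$$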
Let . When and , can achieve maximum. We have
(10)  
∎
Lemma 2.
Letting $g^* = \max_{i \in D} |g_i|$, we have $\Delta V \le \frac{g^*}{1+\lambda}$.
Proof.
The proof is similar to the proof of Lemma 1. The detailed proof is available in Appendix A. ∎
For ease of presentation, we call the absolute value of the gradient the 1-norm gradient. As we can see from Lemma 1 and Lemma 2, the sensitivities of nodes are related to the maximum 1-norm gradient. Since there is no a priori bound on the value of the gradients, we have to restrict the range of gradients. A potential solution is to clip the gradient by a threshold, which is often adopted in deep learning [1, 22]. However, in GBDTs, since the gradient is computed based on the target value, clipping the gradients means changing the target value, which can eventually lead to a huge accuracy loss. Here, we propose a new approach named gradient-based data filtering. The basic idea is to restrict the maximum 1-norm gradient by only filtering a very small fraction of the training dataset in each iteration.
Gradient-based Data Filtering (GDF)
At the beginning of the training, the gradient of instance $\mathbf{x}_i$ is initialized as $g_i = \frac{\partial l(\hat{y}, y_i)}{\partial \hat{y}}\big|_{\hat{y}=0}$. We let $g_l^* = \max_{y_p \in [-1, 1]} \left| \frac{\partial l(\hat{y}, y_p)}{\partial \hat{y}}\big|_{\hat{y}=0} \right|$, which is the maximum possible 1-norm gradient in the initialization. Note that $g_l^*$ is independent of the training data and only depends on the loss function (e.g., $g_l^* = 1$ for the square loss function). Since the loss function is convex (i.e., the gradient is monotonically nondecreasing), the values of the 1-norm gradients tend to decrease as the number of trees increases in the training. Consequently, as shown by the experimental results in Appendix B, most instances have a 1-norm gradient lower than $g_l^*$ during the whole training process. Thus, we can filter the training instances by the threshold $g_l^*$. Specifically, at the beginning of each iteration, we filter out the instances that have a 1-norm gradient larger than $g_l^*$ (i.e., those instances are not considered in this iteration). Only the remaining instances are used as the input to build a new differentially private decision tree in this iteration. Note that the filtered instances may still participate in the training of the later trees. With this gradient-based data filtering technique, we can ensure that the gradients of the used instances are no larger than $g_l^*$. Then, according to Lemma 1 and Lemma 2, we can bound the sensitivities of $G$ and $V$ as shown in Corollary 1.
Corollary 1.
By applying GDF in the training of GBDTs, we have $\Delta G \le 3 g_l^{*2}$ and $\Delta V \le \frac{g_l^*}{1+\lambda}$.
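The filtering step itself is simple. The following sketch assumes the square loss, whose gradient with respect to the prediction is (prediction − label), and a threshold of 1 on the 1-norm gradient; the function names and the example data are hypothetical.

```python
from typing import List, Sequence, Tuple


def square_loss_grad(pred: float, label: float) -> float:
    """Gradient of 0.5 * (pred - label)**2 with respect to pred."""
    return pred - label


def gdf_filter(preds: Sequence[float], labels: Sequence[float],
               threshold: float = 1.0) -> Tuple[List[int], List[float]]:
    """Return the indices kept for this iteration and all current gradients.

    Instances whose absolute gradient exceeds the threshold are skipped for
    this tree only; they are re-evaluated at the next iteration.
    """
    grads = [square_loss_grad(p, y) for p, y in zip(preds, labels)]
    kept = [i for i, g in enumerate(grads) if abs(g) <= threshold]
    return kept, grads


# Current predictions after some trees, and labels assumed to lie in [-1, 1].
kept, grads = gdf_filter([0.0, 0.0, 0.8, -0.9], [1.0, -1.0, -0.5, 0.4])
```

In this example the last two instances have absolute gradients above the threshold and are left out of the current tree, so every gradient that enters the gain and leaf computations is bounded by the threshold. Because filtering is recomputed per iteration, an instance dropped now can be used again once its gradient shrinks.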
In the following, we analyze the approximation error of GDF.
Theorem 5.
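Given an instance set $I$, suppose $I = I_f \cup I_f^c$, where $I_f$ is the filtered instance set and $I_f^c$ is the remaining instance set in GDF. Let $p = \frac{|I_f|}{|I|}$ and $\bar{g}_f = \frac{\sum_{i \in I_f} g_i}{|I_f|}$. We denote the approximation error of GDF on leaf values as $\xi_I = |V(I) - V(I_f^c)|$. Then, we have $\xi_I \le p\,(|\bar{g}_f| + g_l^*)$.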
Proof.
According to Theorem 5, we have the following discussions: (1) The upper bound of the approximation error of GDF does not depend on the number of instances. This good property allows small approximation errors even on large datasets. (2) Normally, most instances have gradient values lower than the threshold $g_l^*$ and the ratio $p$ is low, as also shown in Appendix B. Then, the approximation error is small in practice. (3) The approximation error may be large if $|\bar{g}_f|$ is big. However, the instances with a very large gradient are often outliers in the training dataset. Since GBDTs are trained to minimize the total loss, these outliers cannot be well learned by the trees anyway. Thus, it is reasonable to learn a tree by filtering those outliers.
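The bound of Theorem 5 can be checked with a short calculation. The sketch below is not the original proof. It assumes the leaf value has the form $V(I) = -\sum_{i \in I} g_i / (|I| + \lambda)$ with $\lambda \ge 0$, writes $I_f$ and $I_f^c$ for the filtered and remaining sets, $p = |I_f| / |I|$, and $\bar{g}_f$ for the mean gradient over $I_f$, assumes the bound has the form $\xi_I \le p\,(|\bar{g}_f| + g_l^*)$, and uses $|g_i| \le g_l^*$ for every remaining instance.

```latex
\begin{align*}
\xi_I &= \left| V(I) - V(I_f^c) \right|
       = \left| \frac{\sum_{i \in I_f} g_i + \sum_{i \in I_f^c} g_i}{|I| + \lambda}
              - \frac{\sum_{i \in I_f^c} g_i}{|I_f^c| + \lambda} \right| \\
      &= \left| \frac{|I_f|\, \bar{g}_f}{|I| + \lambda}
              - \frac{|I_f|}{|I| + \lambda} \cdot
                \frac{\sum_{i \in I_f^c} g_i}{|I_f^c| + \lambda} \right| \\
      &\le \frac{|I_f|}{|I| + \lambda}
           \left( |\bar{g}_f| + \frac{|I_f^c|\, g_l^*}{|I_f^c| + \lambda} \right)
       \le p \left( |\bar{g}_f| + g_l^* \right).
\end{align*}
```

The second line uses $|I| - |I_f^c| = |I_f|$ when combining the two fractions over the remaining set, and the last step uses $|I_f| / (|I| + \lambda) \le p$.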
Geometric Leaf Clipping (GLC)
GDF provides the same sensitivities for all trees. Since the gradients tend to decrease from iteration to iteration in the training process, there is an opportunity to derive a tighter sensitivity bound as the iterations go on. However, it is too complicated to derive the exact decreasing pattern of the gradients in practice. Also, as discussed in the previous section, gradient clipping with an inappropriate decaying threshold can lead to huge accuracy loss. We need a new approach for controlling this decaying effect across the learning of different trees. Note that, while the noise injected in the internal nodes influences the gain of the current split, the noise injected into the leaf value directly influences the prediction value. Here we focus on bounding the sensitivity of leaf nodes. Fortunately, according to Equation (4), the leaf values also decrease as the gradients decrease. Since the GBDT model trains a tree at a time to fit the residual of the trees that precede it, clipping the leaf nodes would mostly influence the convergence rate but not the objective of GBDTs. Thus, we propose adaptive leaf clipping to achieve a decaying sensitivity on the leaf nodes. Since it is impractical to derive the exact decreasing pattern of the leaf values in GBDTs, we start with a simple case and further analyze its findings in practice.
Theorem 6.
Consider a simple case in which each leaf has only one single instance during the GBDT training. Suppose the shrinkage rate is $\eta$. We use $V_t$ to denote the leaf value of the $t$-th tree in GBDT. Then, we have $|V_t| \le g_l^* (1-\eta)^{t-1}$.
Proof.
For simplicity, we assume the label of the instance is 1 and the gradient of the instance is initialized as . For the first tree, we have . Since the shrinkage rate is , the improvement of the prediction value on the first tree is . Thus, we have
(12)  
In the same way, we can get . ∎
Although the simple case in Theorem 6 may not fully reflect the decaying patterns of the leaf value in practice, it gives an insight into the reduction of the leaf values as the number of trees increases. The bounds on the leaf values across trees form a geometric sequence with base $g_l^*$ and common ratio $(1-\eta)$. Based on this observation, we propose geometric leaf clipping. Specifically, in the training of the tree in iteration $t$ in GBDTs, we clip the leaf values with the threshold $g_l^* (1-\eta)^{t-1}$ before applying the Laplace mechanism (i.e., $V_t' = V_t \cdot \min\left(1, g_l^* (1-\eta)^{t-1} / |V_t|\right)$). That means, if the magnitude of the leaf value is larger than the threshold, it is set to the threshold. Then, combining with Corollary 1, we get the following result on bounding the sensitivity on each tree in the training process.
Corollary 2.
With GDF and GLC, the sensitivity of leaf nodes in the tree of the $t$-th iteration satisfies $\Delta V_t \le \min\left(\frac{g_l^*}{1+\lambda},\ 2 g_l^* (1-\eta)^{t-1}\right)$.
We have conducted experiments on the effect of geometric clipping, which are shown in Appendix B. Our experiments show that GLC can effectively improve the performance of DPBoost.
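A minimal sketch of geometric leaf clipping followed by Laplace noise is given below. It rests on several assumptions that should be checked against the original formulas: the clipping threshold at iteration t (1-based) is g · (1 − η)^(t−1), where g is the initial gradient bound; clipping is symmetric around zero; and the leaf sensitivity is the smaller of g / (1 + λ) and twice the threshold. All function and parameter names are hypothetical.

```python
import random


def glc_threshold(g_bound: float, eta: float, t: int) -> float:
    """Geometric clipping threshold for the tree at iteration t (1-based)."""
    return g_bound * (1.0 - eta) ** (t - 1)


def clip_leaf(value: float, threshold: float) -> float:
    """Clip a leaf value into [-threshold, threshold] (symmetry is assumed)."""
    return max(-threshold, min(threshold, value))


def leaf_sensitivity(g_bound: float, lam: float, eta: float, t: int) -> float:
    """Smaller of the filtering-based bound and the clipping-based bound."""
    return min(g_bound / (1.0 + lam), 2.0 * glc_threshold(g_bound, eta, t))


def noisy_leaf(value: float, g_bound: float, lam: float, eta: float, t: int,
               eps_leaf: float, rng: random.Random) -> float:
    """Clip the leaf value, then add Laplace noise of scale sensitivity / eps."""
    clipped = clip_leaf(value, glc_threshold(g_bound, eta, t))
    scale = leaf_sensitivity(g_bound, lam, eta, t) / eps_leaf
    # Laplace(0, scale) as the difference of two exponential draws.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return clipped + noise
```

With a shrinkage rate of 0.1, the clipping-based bound overtakes the filtering-based one after a handful of trees, so later trees receive noticeably less noise for the same per-tree budget.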
3.2 Privacy Budget Allocations
As discussed in the Introduction, previous approaches [16, 27, 28] suffer from accuracy loss due to ineffective privacy budget allocations across trees. The accuracy loss can be even bigger when the number of trees in GBDT is large. For completeness, we first briefly present the mechanism for building a single tree with a given privacy budget $\varepsilon_t$, using an approach from the previous studies [17, 28]. Next, we present our proposed approach for budget allocation across trees in detail.
Budget Allocation for A Single Tree
Algorithm 1 shows the procedure of learning a differentially private decision tree. In the beginning, we use GDF (introduced in Section 3.1) to preprocess the training dataset. Then, the decision tree is built from the root until reaching the maximum depth. For the internal nodes, we adopt the exponential mechanism when selecting the split value. Considering the gain as the utility function, the feature value with higher gain has a higher probability of being chosen as the split value. For the leaf nodes, we first clip the leaf values using GLC (introduced in Section 3.1). Then, the Laplace mechanism is applied to inject random noise into the leaf values. For the privacy budget allocation inside a tree, we adopt the mechanism in the existing studies [17, 28]. Specifically, we allocate half of the privacy budget to the leaf nodes (i.e., $\varepsilon_{leaf} = \varepsilon_t / 2$), and then equally divide the remaining budget among the depths of the internal nodes (each level gets $\varepsilon_{nleaf} = \frac{\varepsilon_t}{2 \cdot Depth_{max}}$).
Theorem 7.
The output of Algorithm 1 satisfies $\varepsilon_t$-differential privacy.
Proof.
Since the nodes in one depth have disjoint inputs, according to the parallel composition, the privacy budget consumption in one depth only needs to be counted once. Thus, the total privacy budget consumption is no more than $\varepsilon_{nleaf} \cdot Depth_{max} + \varepsilon_{leaf} = \varepsilon_t$. ∎
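The budget split inside one tree can be written down directly. The sketch below assumes the allocation described in the text: half of the tree's budget goes to the leaves, and the other half is divided equally among the internal levels. The function name and return layout are hypothetical.

```python
from typing import Tuple


def single_tree_budget(eps_tree: float, max_depth: int) -> Tuple[float, float]:
    """Return (budget for the leaves, budget for each internal level)."""
    eps_leaf = eps_tree / 2.0
    eps_level = eps_tree / (2.0 * max_depth)
    return eps_leaf, eps_level


eps_leaf, eps_level = single_tree_budget(eps_tree=1.0, max_depth=6)
# Nodes on the same level see disjoint instances, so each level is charged
# once (parallel composition); levels and leaves add up (sequential composition).
total = eps_level * 6 + eps_leaf
```

With a maximum depth of 6, as used in the experiments, each internal level receives one twelfth of the tree's budget, and the levels plus the leaves together consume exactly the budget of the tree.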
Budget Allocation Across trees
We propose a two-level boosting structure called Ensemble of Ensembles (EoE), which can exploit both sequential composition and parallel composition to allocate the privacy budget between trees. Within each ensemble, we first train a number of trees with disjoint subsets sampled from the dataset $D$. Thus, the parallel composition is applied inside an ensemble. Then, multiple such ensembles are trained in a sequential manner using the same training set $D$. As a result, the sequential composition is applied between ensembles. Such a design can utilize the privacy budget while maintaining the effectiveness of boosting. EoE can also address a side effect of geometric leaf clipping, which in some cases restricts the leaf values too tightly as the number of iterations grows.
Algorithm 2 shows our boosting framework. Given the total number of trees $T$ and the number of trees inside an ensemble $T_e$, we can get the total number of ensembles $N_e = \lceil T / T_e \rceil$. Then, the privacy budget for each tree is $\varepsilon_e = \varepsilon / N_e$. When building the $t$-th differentially private decision tree, we first calculate the position of the tree in the ensemble as $t_e = t \bmod T_e$. Since the maximum leaf value of each tree is different in GLC, to utilize the contribution of the front trees, the number of instances allocated to a tree is proportional to its leaf sensitivity. Specifically, we randomly choose $\frac{|D|\, \eta\, (1-\eta)^{t_e}}{1 - (1-\eta)^{T_e}}$ unused instances from the dataset $I$ as the input, where $I$ is initialized to the entire training dataset $D$ at the beginning of an ensemble. Each tree is built using TrainSingleTree in Algorithm 1. All the $T$ trees are trained one by one, and these trees constitute our final learned model.
Theorem 8.
The output of Algorithm 2 satisfies $\varepsilon$-differential privacy.
Proof.
Since the trees in an ensemble have disjoint inputs, the privacy budget consumption of an ensemble is still $\varepsilon_e$ due to the parallel composition. Since there are $N_e$ ensembles in total, according to sequential composition, the total privacy budget consumption is $\varepsilon_e \cdot N_e = \varepsilon$. ∎
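The allocation across trees can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 2: the number of ensembles is the total number of trees divided by the trees per ensemble (rounded up), every tree receives the total budget divided by the number of ensembles, and the share of instances for the tree at position k (0-based) in an ensemble decays geometrically as η(1 − η)^k, normalized so that the shares of one ensemble cover the dataset. The function name, the 0-based indexing, and the example shrinkage rate of 0.1 are hypothetical, and rounding to whole instances is left out.

```python
import math
from typing import List, Tuple


def eoe_plan(total_eps: float, n_trees: int, trees_per_ensemble: int,
             n_instances: int, eta: float) -> Tuple[int, float, List[float]]:
    """Return (number of ensembles, budget per tree, instance share per position)."""
    n_ensembles = math.ceil(n_trees / trees_per_ensemble)
    # Trees inside an ensemble use disjoint data (parallel composition), so
    # each tree can spend the whole budget of its ensemble.
    eps_per_tree = total_eps / n_ensembles
    norm = 1.0 - (1.0 - eta) ** trees_per_ensemble
    sizes = [n_instances * eta * (1.0 - eta) ** k / norm
             for k in range(trees_per_ensemble)]
    return n_ensembles, eps_per_tree, sizes


n_ens, eps_tree, sizes = eoe_plan(total_eps=100.0, n_trees=1000,
                                  trees_per_ensemble=50,
                                  n_instances=10000, eta=0.1)
```

With 1000 trees, 50 trees per ensemble, and a total budget of 100, this gives 20 ensembles and a budget of 5 for each tree, which matches the boosting experiment setting in Section 4.1. A purely sequential scheme would leave each of the 1000 trees with a budget of 0.1.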
4 Experiments
In this section, we evaluate the effectiveness and efficiency of DPBoost. We compare DPBoost with three other approaches: 1) NP (the ordinary GBDT): train GBDTs without privacy concerns. 2) PARA: a recent approach [28] that adopts parallel composition to train multiple trees, and uses only half of the unused instances when training a differentially private tree. 3) SEQ: we extend the previous approach on decision trees [16] that aggregates differentially private decision trees using sequential composition. Since the original study does not provide sensitivity bounds in GBDTs [16], we set the sensitivities $\Delta G$ and $\Delta V$ in SEQ according to Corollary 1, using our GDF technique.
We implemented DPBoost based on LightGBM (https://github.com/microsoft/LightGBM), and the code is available in the supplementary material for reproducibility. Our experiments are conducted on a machine with one Xeon W-2155 10-core CPU. We use 10 public datasets in our evaluation. The details of the datasets are summarized in Table 1. There are eight real-world datasets and two synthetic datasets (i.e., synthetic_cls and synthetic_reg). The real-world datasets are available from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The synthetic datasets are generated using scikit-learn (https://scikit-learn.org/stable/datasets/index.html#sample-generators) [18]. We report test errors for the classification tasks and RMSE (root mean square error) for the regression tasks. The maximum depth is set to 6. The regularization parameter $\lambda$ is set to 0.1. We adopt the log loss function for classification and the square loss function for regression, and $g_l^* = 1$ for both cases. We use 5-fold cross-validation for model evaluation. The number of trees inside an ensemble is set to 50 in DPBoost. We have also tried other settings for the number of trees inside an ensemble (e.g., 20 and 40); these experiments are available in Appendix C.
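Table 1: Datasets used in the experiments.
datasets           #data      #features  task
adult              32,561     123        classification
real-sim           72,309     20,958     classification
covtype            581,012    54         classification
SUSY               5,000,000  18         classification
cod-rna            59,535     8          classification
webdata            49,749     300        classification
synthetic_cls      1,000,000  400        classification
abalone            4,177      8          regression
YearPredictionMSD  463,715    90         regression
synthetic_reg      1,000,000  400        regression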
4.1 Test Errors
We first set the number of ensembles to one in DPBoost and the number of trees to 50 for all approaches. Figure 2 shows the test errors of the four approaches with different privacy budgets. We have the following observations. First, SEQ performs very badly on all the datasets. The test errors are around 50% in the classification task. With only sequential composition, each tree in SEQ gets a very small privacy budget. The noise in SEQ is huge and leads to high test errors in the prediction. Second, DPBoost always outperforms PARA and SEQ, especially when the given budget is small. DPBoost outperforms PARA mainly because our tighter sensitivity bounds allow us to use smaller noise to achieve differential privacy. When the privacy budget is one, DPBoost can achieve 10% lower test errors on average in the classification task and a significant reduction in RMSE in the regression tasks. Moreover, the variance of DPBoost is usually close to zero. DPBoost is very stable compared with SEQ and PARA. Last, DPBoost can achieve competitive performance compared with NP. The results of DPBoost and NP are quite close in many cases, which shows the high model utility of our proposed differentially private design.
To show the effect of boosting, we increase the number of ensembles to 20 and the maximum number of trees to 1000. The privacy budget for each ensemble is set to 5. For fairness, the total privacy budget for SEQ and PARA is set to 100 to achieve the same privacy level as DPBoost. We choose the first five datasets as representatives. Figure 3 shows the convergence curves of the four approaches. First, since the privacy budget for each tree is still small, the errors of SEQ are very high. Second, since PARA takes half of the unused instances at each iteration, it can only train a limited number of trees until the unused instances are too few to train an effective tree (e.g., about 20 trees for dataset SUSY). Then, the curve of PARA quickly becomes almost flat and the performance cannot improve as the iterations grow, even given a larger total privacy budget. Last, DPBoost keeps reducing test errors as the number of trees increases. DPBoost can continue to decrease the accuracy loss and outperform PARA and SEQ even more, which demonstrates the effectiveness of our privacy budget allocation. Also, DPBoost can preserve the effect of boosting well.
Table 2: Average training time per tree (seconds) of DPBoost and NP.
datasets           DPBoost  NP
adult              0.019    0.007
real-sim           2.97     0.82
covtype            0.085    0.044
SUSY               0.38     0.32
cod-rna            0.016    0.009
webdata            0.032    0.013
synthetic_cls      1.00     0.36
abalone            2.95     2.85
YearPredictionMSD  0.38     0.12
synthetic_reg      0.96     0.36
4.2 Training Time Efficiency
We show the training time comparison between DPBoost and NP. The computation overhead of our approach mainly comes from the exponential mechanism, which computes a probability for each gain. Thus, this overhead depends on the number of split values and increases as the number of dimensions of training data increases. Table 2 shows the average training time per tree of DPBoost and NP. The setting is the same as the second experiment of Section 4.1. The training time per tree of DPBoost is comparable to NP in many cases (meaning that the overhead can be very small), or about 2 to 3 times slower than NP in other cases. Nevertheless, the training of DPBoost is very fast. The time per tree of DPBoost is no more than 3 seconds in those 10 datasets.
5 Conclusions
Differential privacy has been an effective mechanism for protecting data privacy. Since GBDT has become a popular and widely used training system for many machine learning and data mining applications, we propose a differentially private GBDT algorithm called DPBoost. It addresses the serious accuracy loss of previous works, which is caused by loose sensitivity bounds and ineffective privacy budget allocations. Specifically, we propose gradient-based data filtering and geometric leaf clipping to control the training process in order to tighten the sensitivity bound. Moreover, we design a two-level boosting framework that makes good use of both the privacy budget and the effect of boosting. Our experiments show the effectiveness and efficiency of DPBoost.
Acknowledgements
This work is supported by a MoE AcRF Tier 1 grant (T1 251RES1824) and a MOE Tier 2 grant (MOE2017T21122) in Singapore.
References
 [1] (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
 [2] (2018) Differentially private mixture of generative neural networks. IEEE Transactions on Knowledge and Data Engineering 31 (6), pp. 1109–1121.
 [3] (2010) From ranknet to lambdarank to lambdamart: an overview. Learning 11 (23581), pp. 81.
 [4] (2009) Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pp. 289–296.
 [5] (2016) Xgboost: a scalable tree boosting system. In SIGKDD, pp. 785–794.
 [6] (2014) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407.
 [7] (2011) Differential privacy. Encyclopedia of Cryptography and Security, pp. 338–340.
 [8] (2018) Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems, pp. 3551–3561.
 [9] (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333.
 [10] (2010) Data mining with differential privacy. In SIGKDD, pp. 493–502.
 [11] (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378.
 [12] (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9.
 [13] (2018) Dimboost: boosting gradient boosting decision tree to higher dimensions. In Proceedings of the 2018 International Conference on Management of Data, pp. 1363–1376.
 [14] (2017) Lightgbm: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
 [15] (2019) Federated learning systems: vision, hype and reality for data privacy and protection. arXiv preprint arXiv:1907.09693.
 [16] (2018) Differentially private classification with decision tree ensemble. Applied Soft Computing 62, pp. 807–816.
 [17] (2011) Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 493–501.
 [18] (2011) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
 [19] (2011) A machine learning approach to twitter user classification. In Fifth International AAAI Conference on Weblogs and Social Media.
 [20] (2018) CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648.
 [21] (2007) Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pp. 521–530.
 [22] (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
 [23] (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
 [24] (2017) Gradient boosted decision trees for high dimensional sparse output. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3182–3190.
 [25] (2018) Towards demystifying membership inference attacks. arXiv preprint arXiv:1807.09173.
 [26] (2019) ThunderGBM: fast GBDTs and random forests on GPUs. https://github.com/Xtra-Computing/thundergbm.
 [27] (2018) Collaborative ensemble learning under differential privacy. In Web Intelligence, Vol. 16, pp. 73–87.
 [28] (2018) InPrivate digging: enabling tree-based distributed data mining with differential privacy. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 2087–2095.
 [29] (2017) PSMART: parameter server based multiple additive regression trees system. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 879–880.
Appendix A Proof of Lemma 2
Proof.
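Consider two adjacent instance sets $I_1 = \{\mathbf{x}_i\}_{i=1}^{n}$ and $I_2 = I_1 \cup \{\mathbf{x}_s\}$ that differ in a single instance. We have
$$\Delta V = \left| \frac{\sum_{i \in I_1} g_i + g_s}{n + \lambda + 1} - \frac{\sum_{i \in I_1} g_i}{n + \lambda} \right| = \left| \frac{(n + \lambda) g_s - \sum_{i \in I_1} g_i}{(n + \lambda + 1)(n + \lambda)} \right| \qquad (13)$$
When $g_s = g^*$ and $g_i = -g^*$ for every $i \in I_1$, the above function can achieve maximum. Thus, we have
$$\Delta V \le \frac{(2n + \lambda)\, g^*}{(n + \lambda + 1)(n + \lambda)} \le \frac{g^*}{1 + \lambda} \qquad (14)$$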
∎
Appendix B Experimental Study on GDF and GLC
We use the regression tasks as examples to study the effect of gradient-based data filtering and geometric leaf clipping. Table 3 shows the number of training instances without and with gradient-based data filtering. As we can see, the percentage of the filtered instances is small, at most about 8%. Thus, the filtering strategy will not produce much approximation error according to Theorem 5.
Table 3: Number of training instances without and with GDF.
datasets           w/o GDF  w/ GDF  filtered ratio
abalone            3340     3292    1.44%
YearPredictionMSD  370902   340989  8.06%
synthetic_reg      800000   799999  0
Figure 4 shows the results with and without geometric leaf clipping in our DPBoost. Except for one case where the budget is equal to 1 on adult, geometric leaf clipping always improves the performance of DPBoost. The improvement is quite significant on YearPredictionMSD and synthetic_reg.
Appendix C Additional Experiments
Here we show the results for different numbers of trees inside an ensemble (i.e., 20 and 40). The results with one ensemble are shown in Figure 5 and Figure 6. As we can see, DPBoost always outperforms SEQ and PARA, especially when the given budget is small. Furthermore, the accuracy of DPBoost is close to that of NP. Then we set the maximum number of trees to 1000. The results are shown in Figure 7 and Figure 8. DPBoost still exploits the effect of boosting well.