SecureGBM: Secure Multi-Party Gradient Boosting

11/27/2019 ∙ by Zhi Feng, et al.

Federated machine learning systems have been widely used to facilitate joint data analytics across distributed datasets owned by different parties that do not trust each other. In this paper, we propose a novel Gradient Boosting Machines (GBM) framework, SecureGBM, built upon a multi-party computation model based on semi-homomorphic encryption, where every involved party can jointly obtain a shared Gradient Boosting Machines model while protecting their own data from potential privacy leakage and inferential identification. More specifically, our work focuses on a "dual-party" secure learning scenario with two parties: each party owns a unique view (i.e., attributes or features) of the same group of samples, while only one party owns the labels. In such a scenario, neither feature nor label data may be shared with the other party. To achieve this goal, we first extend LightGBM, a well-known implementation of tree-based GBM, by covering its key operations for training and inference with SEAL homomorphic encryption schemes. However, the performance of this re-implementation is significantly bottlenecked by the explosive inflation of the communication payloads, as the ciphertexts grow with the length of the plaintexts. We therefore propose to use stochastic approximation techniques to reduce the communication payloads while accelerating the overall training procedure in a statistical manner. Our experiments using real-world data showed that SecureGBM can well secure the communication and computation of the LightGBM training and inference procedures for both parties, while losing less than 3% AUC, on a wide range of benchmark datasets.


I Introduction

Multi-party federated learning [chen2006algebraic] has become one of the most popular machine learning paradigms, thanks to the increasing trend of distributed data collection, storage, and processing, as well as its privacy-preserving benefits in many kinds of applications. In most multi-party machine learning applications, "no raw data sharing" is an important pre-condition: the model should be trained using all data stored on distributed machines (i.e., parties) without any cross-machine raw data sharing. A wide range of machine learning models and algorithms, including logistic regression [pathak2010multiparty], sparse discriminant analysis [bian2017multi, tian2016communication], and stochastic gradient-based learners [jayaraman2018distributed, xing2015petuum, ormandi2013gossip], have been re-implemented on distributed computing, encryption, and privacy-preserving computation/communication platforms, so as to incorporate secure computation paradigms [chen2006algebraic].

Backgrounds and Related Work. Existing efforts mainly work on the implementation of efficient federated learning systems. Two parallel computation paradigms, data-centric and model-centric [xing2015petuum, zhou2008large, dean2012large, tsianos2012consensus, smyth2009asynchronous, ormandi2013gossip], have been proposed. On each machine, a data-centric algorithm first estimates the same set of parameters (of the model) using the local data, then aggregates the estimated parameters via model averaging for global estimation. The model with aggregated parameters is considered the trained model based on the overall data (from multiple parties); before aggregation, these parameters can easily be estimated in parallel across machines. Meanwhile, model-centric algorithms require multiple machines to share the same loss function with "updatable parameters", and allow each machine to update the parameters in the loss function using the local data so as to minimize the loss. Based on this characteristic, model-centric algorithms commonly update the parameters sequentially, so the additional time consumed by the updates can be a bottleneck for specific applications. Even so, compared with data-centric methods, model-centric methods usually achieve better performance, as they directly minimize the risk of the model [xing2015petuum, ormandi2013gossip]. To advance the distributed performance of linear classifiers, Tian et al. [tian2016communication] proposed a data-centric sparse linear discriminant analysis algorithm, which leverages the advantage of parallel computing.

In terms of multi-party collaboration, federated learning algorithms can be categorized into two types: data separation and view separation. For data separation, the algorithms are assumed to learn from distributed datasets, where each dataset consists of a subset of samples of the same type [xing2015petuum, bian2017multi, tian2016communication, jayaraman2018distributed]. For example, hospitals are usually required to collaboratively learn a model to predict patients' future diseases by classifying their electronic medical records, where all hospitals follow the same scheme to collect patients' medical records while every hospital only covers a part of the patients. In this case, federated learning improves the overall performance of learning by incorporating the private datasets owned by different parties, while ensuring privacy and security [xing2015petuum, ormandi2013gossip, jayaraman2018distributed]. While the existing data/computation parallelism mechanisms were usually motivated to improve federated learning under data separation settings, federated learning systems under view separation settings are seldom considered.

Our Work. We mainly focus on view separation settings of federated learning, which assume the data views of the same group of samples are separately held by multiple parties who do not trust each other. For example, the healthcare, finance, and insurance records of the same group of healthcare users are usually stored in the data centers of healthcare providers, banks, and insurance companies separately. Healthcare users usually need recommendations on healthcare insurance products according to their health and financial status, while healthcare insurance companies need to learn from large-scale healthcare data together with personal financial data to build such recommendation models. However, due to laws and regulations on data privacy, it is difficult for these three parties to share their data with each other to learn such a predictive model. Federated learning under view separation models is therefore highly desirable. In this work, we aim at view separation federated learning algorithms using Gradient Boosting Machines (GBM) as the classifier. GBM is studied here as it can deliver decent prediction results and can be interpreted by human experts for joint data analytics and cross-institute data understanding purposes.

Our Contributions. We summarize the contributions of the proposed SecureGBM algorithm in the following aspects.

  • Firstly, we study and formulate the federated learning problem under (semi-)homomorphic encryption settings, while assuming the data owned by the two parties are not sharable. More specifically, in this paper, we assume each party owns a unique private view of the same group of samples, while the labels of these samples are monopolized by one party. To the best of our knowledge, this is the first study on tree-based Gradient Boosting Machine classifiers addressing 1) the two-party security constraint, 2) efficient model-centric learning with views separated by two parties but labels "monopolized" by one, and 3) the trade-off between statistical accuracy and the communication cost caused by statistical learning over encrypted communication.

  • Secondly, to achieve these goals, we design the SecureGBM algorithm, which re-implements the vanilla gradient-boosting tree based learners using semi-homomorphic encrypted computation operators offered by Microsoft SEAL. More specifically, SecureGBM first replaces the addition and multiplication operators used in the gradient-boosting trees with secured operators based on semi-homomorphic computation; then SecureGBM re-designs a new set of binary comparison operators (i.e., $\leq$ or $>$) that cannot be exploited by attackers to exactly recover the ground truth through searching with the comparison operators (e.g., binary search).

  • Furthermore, we observe a trade-off between statistical accuracy and communication cost for GBM training. One can use a stochastic gradient boosting mechanism to update the training model with a mini-batch of data per round, so that the communication cost per round is reduced significantly, in a quadratic manner. However, compared to vanilla gradient boosting machines, additional rounds of the training procedure might be needed by such stochastic gradient boosting to achieve equivalent performance. In this way, SecureGBM trades off statistical accuracy against communication complexity using mini-batch sampling strategies, so as to enjoy low communication costs and an accelerated training procedure.

  • Finally, we evaluate SecureGBM using a large-scale real-world user profile dataset and several benchmark datasets for classification. The results show that SecureGBM can compete with state-of-the-art Gradient Boosting Machines, including LightGBM, XGBoost, and the vanilla re-implementation of LightGBM based on Microsoft SEAL.

The rest of the paper is organized as follows. In Section II, we review gradient-boosting tree based classifiers and the implementation of LightGBM, then introduce the problem formulation of our work. In Section III, we propose the framework of SecureGBM and present the details of the SecureGBM algorithm. In Section IV, we evaluate the proposed algorithms using the real-world user profile dataset and the benchmark datasets, and compare SecureGBM with baseline centralized algorithms. Finally, we discuss and conclude the paper in Section V.

II Preliminary Studies and Problem Definitions

In this section, we first present preliminary studies, then introduce the design goals of the proposed system as technical problem definitions.

II-A Gradient Boosting and LightGBM

As an ensemble learning technique, the Gradient Boosting classifier trains and combines multiple weak prediction models, such as decision trees, for better generalization performance [friedman2001greedy, friedman2002stochastic]. The key idea of gradient boosting is to consider the boosting procedure as optimization over a certain cost function [breiman1997arcing]. As a result, the gradient descent directions for loss function minimization can be transformed into decision trees that are obtained sequentially to improve the classifier.

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each data point $x_i$ is associated with a label $y_i$, the problem of gradient boosting is to learn a function $F^*$ from the set of all possible hypotheses $\mathcal{F}$ while minimizing the expectation of loss over the data distribution, such that

$F^* = \operatorname{argmin}_{F \in \mathcal{F}} \; \mathbb{E}\left[\ell(F(x), y)\right] \qquad (1)$

where $\ell(F(x), y)$ refers to the prediction loss of $F(x)$ with respect to the label $y$. More specifically, gradient boosting intends to minimize the loss function and obtain $F^*$ in a gradient descent way, such that

$F_t = F_{t-1} \oplus \epsilon \cdot f_t \qquad (2)$

where $F_{t-1}$ refers to the model learned through the $(t-1)^{th}$ iteration, $f_t$ refers to the decision tree learned as the descent direction in the $t^{th}$ iteration based on the model already obtained and the training dataset, $\epsilon$ refers to the learning rate of gradient boosting, namely the weight of $f_t$ in the ensemble of learners, the operator $\oplus$ refers to the ensemble of the $F_{t-1}$ and $f_t$ models, and $F_t$ refers to the result of the $t^{th}$ iteration. More specifically, the computation of $f_t$ mainly addresses $y_i - F_{t-1}(x_i)$ for every $(x_i, y_i)$ in the training dataset, i.e., the error between the model already estimated and the corresponding label. Note that in the first iteration, the algorithm starts from $F_0$, a vanilla decision tree learned from the dataset. With totally $T$ iterations, the algorithm obtains the final model $F_T$ as the output.
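For concreteness, the following is a minimal single-machine sketch of Eq. (2) in Python; this is not the authors' code, and it uses squared loss for brevity where the paper's classification setting uses cross-entropy. Each iteration fits a small regression tree to the residuals of the current ensemble.

```python
# Minimal sketch of Eq. (2): each iteration fits a small regression tree
# to the residuals of the current ensemble (squared loss for brevity).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

eps, T, trees = 0.1, 50, []          # learning rate, iterations, ensemble
F = np.full(len(y), y.mean())        # F_0: a constant base model

for t in range(T):
    residual = y - F                 # negative gradient of squared loss
    f_t = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residual)
    trees.append(f_t)
    F = F + eps * f_t.predict(X)     # F_t = F_{t-1} (+) eps * f_t

print("training MSE:", float(((y - F) ** 2).mean()))
```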

Recently, gradient boosting classifiers have attracted further attention from both application and algorithmic perspectives. For example, they have won the KDDCup 2016 [sandulescu2016predicting] and many other competitions, such as those on Kaggle (https://medium.com/@gautam.karmakar/xgboost-model-to-win-kaggle-e12b35cd1aad). Gradient boosting trees and their variants have been used as major baselines for a great number of classification/regression tasks with decent results, ranging from genetic data analytics to click-through prediction [nielsen2016tree]. In terms of algorithm implementation, XGBoost [chen2016xgboost] and LightGBM [ke2017lightgbm] have been proposed to further improve the performance of gradient boosting trees; the two works follow similar gradient boosting mechanisms for decision tree training while making significant contributions to scalability and efficiency.

II-B Homomorphic Encryption Models

To ensure security and privacy during computation, homomorphic encryption (HE) has been proposed as a set of operations that work on encrypted data while producing encrypted results. More importantly, the results obtained can be decrypted to match the "true results" of the corresponding operations on plaintexts [gentry2010computing, vaikuntanathan2011computing]. Homomorphic encryption comprises multiple types of encryption schemes, such as partially homomorphic encryption (PHE), fully homomorphic encryption (FHE), and pre-fully homomorphic encryption (Pre-FHE), that can perform different classes of computations over encrypted data [armknecht2015guide]. The progress along these lines of research has been well surveyed in [halevi2017homomorphic].

As early as 1978, the tentative idea of building a fully homomorphic encryption scheme was proposed just after the publication of the RSA algorithm [demillo1978foundations]. Thirty years later, in 2009, Gentry sketched the first fully homomorphic encryption scheme based on lattice cryptography [gentry2009fully]. One year later, van Dijk et al. presented the second fully homomorphic encryption scheme [van2010fully], based on Gentry's work but without relying on ideal lattices. The second generation of FHE started around 2011, with fundamental techniques developed by Brakerski et al. [brakerski2014leveled, brakerski2014efficient], from which the homomorphic cryptosystems currently in use stem. Thanks to these innovations, the second generation of FHE tends to be much more efficient than the first generation, and has been applied to many applications.

Later, Gentry et al. proposed a new technique for building fully homomorphic encryption schemes, namely GSW, which avoids the use of expensive “relinearization” computation in homomorphic multiplication [gentry2013homomorphic]. Brakerski et al. observed that, for certain types of circuits, the GSW cryptosystem features an even slower growth rate of noise, and hence better efficiency and stronger security [brakerski2014lattice].

As fully homomorphic encryption is computationally expensive, most practical secure systems have indeed been implemented in a partially homomorphic fashion [halevi2017homomorphic], where only parts of the computation are covered by homomorphic encryption. In this work, we hope to secure the computation and communication of federated learning through partially homomorphic encryption. Our proposed method uses ciphertexts to protect parts of the computations and communications in gradient boosting tree learning.
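As a concrete illustration of partially (additively) homomorphic computation, the sketch below uses the python-paillier library (`phe`) as a stand-in for the SEAL schemes actually used in this paper: sums and scalar products are computed directly on ciphertexts and only then decrypted. All values here are illustrative.

```python
# Illustrative additively homomorphic operations with python-paillier,
# standing in for the SEAL-based operators of SecureGBM.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

grad_a = [0.12, -0.40, 0.33]                 # one party's plaintext values
enc = [public_key.encrypt(g) for g in grad_a]

# The other party can aggregate and scale without seeing the plaintexts.
enc_sum = enc[0] + enc[1] + enc[2]           # ciphertext + ciphertext
enc_scaled = enc_sum * 0.1                   # ciphertext * plaintext scalar

print(round(private_key.decrypt(enc_scaled), 6))  # -> 0.005
```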

II-C Problems and Overall Design Goals

In this work, we intend to design a novel federated gradient boosting tree classifier that can learn from view-separated data in a distributed manner while avoiding leakage of private data.

The Federated Learning Problem - Suppose two training datasets $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$ are owned by two parties $\mathcal{A}$ and $\mathcal{B}$ respectively, who hope to collaboratively learn one model but do not trust each other. The schemes of the two datasets are $D_{\mathcal{A}} = (I_{\mathcal{A}}, X_{\mathcal{A}}, Y_{\mathcal{A}})$ and $D_{\mathcal{B}} = (I_{\mathcal{B}}, X_{\mathcal{B}})$, where

  • $I_{\mathcal{A}}$ and $I_{\mathcal{B}}$ refer to the identity sets of samples in the two datasets respectively. When $I_{\mathcal{A}} \cap I_{\mathcal{B}} \neq \emptyset$, the two datasets share a subset of samples but with different views (i.e., features);

  • $X_{\mathcal{A}}$ and $X_{\mathcal{B}}$ refer to the feature sets of samples in the two datasets respectively. More specifically, we denote $X_{\mathcal{A}}^{i}$ as the set of features of the $i^{th}$ sample in $D_{\mathcal{A}}$, and further denote $x_{i,j}$ as the $j^{th}$ feature of the $i^{th}$ sample;

  • $Y_{\mathcal{A}}$ refers to the label set of the data samples in $D_{\mathcal{A}}$. More specifically, each label $y_i \in \mathcal{C}$, where $\mathcal{C}$ refers to the set of classes. As was mentioned, in the settings discussed in this paper, only one party would monopolize the label information.

With the above settings, our federated learning problem consists of the two parts as follows.

Training - We need a dual-party learning algorithm with secure communication and computation schemes that can train tree-based Gradient Boosting Machines based on $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$ with respect to the following restrictions:

  • Training Set Identity Protection: $I_{\mathcal{A}}$ would not be obtained by the party $\mathcal{B}$, while the information about $I_{\mathcal{B}}$ would be kept from the party $\mathcal{A}$;

  • Training Set Feature Security: The training procedure needs to avoid leaking $X_{\mathcal{A}}$ and $Y_{\mathcal{A}}$ to $\mathcal{B}$, and $X_{\mathcal{B}}$ to the party $\mathcal{A}$.

Testing - Given two testing datasets $T_{\mathcal{A}}$ and $T_{\mathcal{B}}$ owned by the two parties $\mathcal{A}$ and $\mathcal{B}$ respectively, following the same schemes as the training datasets. We need an online inference algorithm, where the party $\mathcal{A}$ can initialize the inference procedure using the identity $i$ of the sample for prediction, and the party $\mathcal{A}$ can obtain the prediction result of that sample (i.e., $\hat{y}_i$) through the secure inference procedure, with respect to the following restrictions:

  • Testing Set Identity Protection: the identities of the testing samples of $\mathcal{A}$ would not be obtained by the party $\mathcal{B}$, while the information about the identities held by $\mathcal{B}$ would be kept from $\mathcal{A}$;

  • Testing Set Feature Security: The inference procedure needs to avoid leaking the testing features and labels of $\mathcal{A}$ to $\mathcal{B}$, and the testing features of $\mathcal{B}$ to the party $\mathcal{A}$.

In our research, we intend to design PHE-based encryption schemes to protect the training and inference procedures (derived from LightGBM [ke2017lightgbm]) and meet the above security goals.

Fig. 1: The Training Procedure of SecureGBM

III Frameworks and Algorithms Design

In this section, we present the framework design of SecureGBM with key algorithms used.

III-A Overall Framework Design

The overall framework of SecureGBM consists of two parts — training and inference, where, given the distributed datasets, the training procedure obtains the distributed parameter models for the tree classifiers of SecureGBM and the inference procedure predicts the labels using the indices of samples.

III-A1 Statistically Accelerated Training Procedure

Given the training datasets $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$ distributed over the two parties, as shown in Figure 1, the training procedure learns the ensemble of decision trees for Gradient Boosting Machines with distributed parameters in a secure and statistically efficient way. More specifically, the training procedure incorporates an accelerated iterative process with a specific initialization as follows.

  • Initialization - The owner of $D_{\mathcal{A}}$ invokes the procedure to initialize the whole training process. First of all, SecureGBM performs a secure join operation to align the shared samples stored in $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$ by matching $I_{\mathcal{A}}$ and $I_{\mathcal{B}}$ under Partially Homomorphic Encryption (PHE) settings. Later, based on the data in $D_{\mathcal{A}}$, including both features $X_{\mathcal{A}}$ and labels $Y_{\mathcal{A}}$, SecureGBM learns a decision tree $F_0$ as the base model, which only uses features in $X_{\mathcal{A}}$, for initialization. Please see Section III-B1 for the detailed design and implementation of the Secure Join Operation for sample alignment based on PHE.

With the model initialized, SecureGBM takes a statistically accelerated iterative process for GBM training, where each iteration uses mini-batch sampling to reduce the computational/communication costs [friedman2002stochastic]. Specifically, each iteration (e.g., the $t^{th}$ iteration, $1 \le t \le T$) consists of the following three steps; a minimal single-machine mock of this loop is sketched after the list.

  • Batched Secure Inference - Given the shared samples in $I_{\mathcal{A}} \cap I_{\mathcal{B}}$, SecureGBM first randomly selects a subset of samples $S_t \subseteq I_{\mathcal{A}} \cap I_{\mathcal{B}}$, where $|S_t| = b$ and $b$ refers to the batch size. With the model already estimated, denoted as $F_{t-1}$, SecureGBM then obtains the "soft prediction" results of all samples in $S_t$ through secure inference under PHE settings, such that

    $\hat{y}_i = F_{t-1}(X_{\mathcal{A}}^{i}, X_{\mathcal{B}}^{i}), \quad \forall i \in S_t \qquad (3)$

    where $\hat{y}_i$ refers to the inference result based on the features from both datasets. Please see Section III-A2 for the PHE-based implementation of the inference procedure.

  • Residual Error Estimation - As was mentioned, both the labels $y_i$ and the soft prediction results $\hat{y}_i$ are $|\mathcal{C}|$-dimensional vectors, where $|\mathcal{C}|$ refers to the number of classes. Then, SecureGBM estimates the residual errors of the current model using the cross-entropy loss (i.e., its negative gradient), as follows:

    $r_i = y_i - \operatorname{softmax}(\hat{y}_i), \quad \forall i \in S_t \qquad (4)$

    Note that to secure the labels, $y_i$ and $r_i$ for all $i \in S_t$ are stored only at the owner of $D_{\mathcal{A}}$.

  • Secure Tree Creation and Ensemble - Given the estimated residual errors $r_i$, SecureGBM boosts the learned model by creating a new decision tree $f_t$ that fits the residual errors using the features of both datasets in an additive manner. SecureGBM then ensembles $f_t$ with the current model $F_{t-1}$ and obtains the new model $F_t$ through gradient boosting [friedman2001greedy]. As was mentioned in Eq. (2), a specific learning rate $\epsilon$ is given as the weight for model ensembling. Please see Section III-B2 for the detailed design and implementation of the Secure Splits Search Operation for decision tree creation based on PHE.
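Below is the promised single-machine mock of this three-step iteration, with the secure join, PHE operators, and cross-party communication all elided; the binary logistic setup and all sizes are illustrative assumptions, not the SecureGBM implementation.

```python
# Mock of one SecureGBM training iteration: sample a mini-batch S_t,
# estimate cross-entropy residuals, fit a new tree, ensemble it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6)); y = (X[:, 0] > 0).astype(float)
F = np.zeros(len(y))                          # current model scores
eps, batch = 0.1, 200                         # learning rate, |S_t|

for t in range(30):
    S = rng.choice(len(y), size=batch, replace=False)  # mini-batch S_t
    y_hat = sigmoid(F[S])                     # Eq. (3): batched inference
    residual = y[S] - y_hat                   # Eq. (4): CE negative gradient
    f_t = DecisionTreeRegressor(max_leaf_nodes=8).fit(X[S], residual)
    F = F + eps * f_t.predict(X)              # boost the ensemble

print("train acc:", float(((sigmoid(F) > 0.5) == y).mean()))
```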

III-A2 Inference Procedure via Execution Flow Compilation

Given the model learned after $T$ iterations, denoted as $F_T$, the inference component first compiles all the trees into a distributed secure execution flow, where the nodes of every decision tree are assigned to the corresponding parties respectively. As shown in Figure 2, all communications, computations, and binary comparisons are protected through SEAL-based homomorphic encryption schemes. With the secure distributed execution flow, given the index of a sample, e.g., $i$, SecureGBM runs the inference procedure over the execution flow. Please see Section III-B3 for the design and implementation of the PHE-based Binary Comparison Operator. Note that in our research, we assume the party $\mathcal{B}$ has no way to access the labels of training and testing samples, securing the monopoly of the label information at the party $\mathcal{A}$ side. To protect the label information through inference, the result of the PHE-based Binary Comparison Operator (i.e., true or false) is secured and cannot be deciphered by the party $\mathcal{B}$.
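To make the compiled execution flow concrete, here is a hedged plaintext-only sketch (all names and the tree layout are hypothetical; in SecureGBM the comparisons themselves are encrypted): each split node is pinned to the party owning its feature, and only that party evaluates the comparison during the walk.

```python
# Plaintext sketch of a cross-party compiled decision tree: each split
# node is evaluated only by the party that owns the split feature.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    party: str = ""                  # "A" or "B": who evaluates this split
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch when feature <= threshold
    right: Optional["Node"] = None
    leaf_value: Optional[float] = None

def infer(node, views):
    """Walk the tree; `views` maps party -> its private feature vector."""
    while node.leaf_value is None:
        x = views[node.party]        # only the owning party reads its view
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.leaf_value

tree = Node("A", 0, 0.5,
            left=Node("B", 0, -1.0, left=Node(leaf_value=0.2),
                      right=Node(leaf_value=0.7)),
            right=Node(leaf_value=0.9))
print(infer(tree, {"A": [0.3], "B": [0.4]}))  # -> 0.7
```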

Furthermore, for a sample with index $i$ that is contained in $I_{\mathcal{A}}$ only, i.e., $i \in I_{\mathcal{A}} \setminus I_{\mathcal{B}}$, SecureGBM first learns a comprehensive GBM (a LightGBM classifier) based on the dataset $D_{\mathcal{A}}$, using the features $X_{\mathcal{A}}$ and labels $Y_{\mathcal{A}}$ only. With such a model, SecureGBM makes predictions for all such samples using the features in $X_{\mathcal{A}}$.

Fig. 2: Execution Flow Compilation for the Inference Procedure

Please note that the overall framework of SecureGBM is derived from the vanilla implementation of LightGBM [ke2017lightgbm], while most of the calculation and optimization for gradient boosting trees [friedman2001greedy] has been preserved under the coverage of partially homomorphic encryption.

III-B Key Algorithms

Here, we present the detailed design of several key algorithms.

III-B1 Secure Join for Sample Alignment

To align the samples with identical identities across the two index sets $I_{\mathcal{A}}$ and $I_{\mathcal{B}}$ for training (and across the testing index sets for inference), SecureGBM needs to obtain the intersection of the two index sets in a private and secure manner. Specifically, we adopt the private set intersection (PSI) algorithms proposed in [pinkas2014faster, pinkas2018scalable] to achieve this goal. The security and privacy enforcement of the proposed component relies heavily on a technique called Oblivious Transfer Extension (OT Extension), which supports fast and private data transfer for small payloads with limited overhead [keller2015actively]. The use of OT Extension avoids time-consuming error-correcting codes and instead accelerates the secure data transmission by leveraging a pseudo-random code. We also tried other OT-extension-based private set intersection algorithms, such as the one using a Bloom Filter [dong2013private], but their speed and scalability were not as good as [pinkas2014faster, pinkas2018scalable].
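For intuition only, the toy Diffie-Hellman-style PSI sketch below shows what the secure join computes; it is not the OT-extension protocol of [pinkas2014faster, pinkas2018scalable] that SecureGBM actually adopts, and its parameters are illustrative. Each party blinds hashed identities with a private exponent; since the double blinding commutes, only shared identities collide.

```python
# Toy DH-style PSI: H(x)^(ab) == H(x)^(ba), so the doubly blinded sets
# intersect exactly on shared identities and reveal nothing else here.
import hashlib, secrets

P = 2**255 - 19  # prime modulus, illustrative choice

def h(identity):  # hash an identity string into the group
    return int.from_bytes(hashlib.sha256(identity.encode()).digest(), "big") % P

def blind(ids, secret):
    return {pow(h(i), secret, P) for i in ids}

a, b = secrets.randbelow(P - 2) + 1, secrets.randbelow(P - 2) + 1
ids_A = {"u1", "u2", "u3"}; ids_B = {"u2", "u3", "u4"}

A_twice = {pow(v, b, P) for v in blind(ids_A, a)}  # A blinds, B re-blinds
B_twice = {pow(v, a, P) for v in blind(ids_B, b)}  # B blinds, A re-blinds

print(len(A_twice & B_twice))  # -> 2 shared identities
```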

III-B2 Secure Splits Search for Tree Creation

For each round of iteration (e.g., the $t^{th}$ iteration), SecureGBM needs to create a decision tree of size $s$ (here $s$ refers to the number of learned split nodes in the tree) to fit the "residual errors" of the model already estimated, $F_{t-1}$, so as to enable the gradient boosting mechanism. Specifically, we adopt the leaf-wise tree growth mechanism derived from LightGBM [ke2017lightgbm] to learn the tree, where SecureGBM vertically grows the tree using totally $s$ rounds of computation/communication, always picking the leaf node with the maximal "residual error" reduction to grow.

In each round of computation for decision tree creation, for the party $\mathcal{A}$ owning features $X_{\mathcal{A}}$ and labels $Y_{\mathcal{A}}$, SecureGBM searches new splits using the raw data. Similar to vanilla LightGBM, SecureGBM selects the "best" split, i.e., the one with the maximal residual error reduction over the samples in the mini-batch, as the candidate split on the party $\mathcal{A}$ side. This candidate split is then compared against the "best" splits from the party $\mathcal{B}$ for the final split selection of this round.

On the other hand, for the party $\mathcal{B}$, which only possesses features $X_{\mathcal{B}}$, SecureGBM first proposes a set of potential splits over $X_{\mathcal{B}}$ (in a random or unsupervised manner), and sends the potential classification results of the mini-batch samples under every proposed split to the party $\mathcal{A}$. Note that the potential classification results are formed into multiple sets of samples (or sample indices), categorized according to their results. Such sets are encrypted as private sets to protect the privacy of the label information from the party $\mathcal{B}$. Certain secure domain isolation is used to protect the splits [liu2015thwarting].

Then, on the party $\mathcal{A}$ side, SecureGBM estimates the residual errors of each split proposed by the party $\mathcal{B}$ using their potential classification results. Specifically, SecureGBM leverages the aforementioned private set intersection algorithm to estimate the overlap between the sample sets categorized by potential classification results and the true labels, in order to obtain the prediction results and estimate the accuracy [pinkas2014faster, pinkas2018scalable]. Finally, SecureGBM selects the split (the best split from $\mathcal{A}$ versus the best from $\mathcal{B}$) that further lowers the residual error as the split of this round and "adds" it to the decision tree.

To further secure the privacy of the label information, the splits at the party $\mathcal{B}$ are deployed in an isolated domain, and the party $\mathcal{B}$ cannot obtain the decision making results of the splits. Please refer to the section below for the implementation of the binary comparison.
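The toy sketch below shows only the scoring logic at the party $\mathcal{A}$ side, with PSI and encryption elided; the thresholds, sizes, and the variance-gain criterion are illustrative assumptions rather than the paper's exact split objective. Party $\mathcal{B}$ ships the sample partitions its candidate splits induce, and party $\mathcal{A}$ ranks them by residual-error reduction.

```python
# Hypothetical mock of the split search: B proposes candidate splits as
# sample partitions; A scores each partition against the residuals.
import numpy as np

def variance_gain(residuals, left_mask):
    """Residual-error reduction of splitting one leaf into left/right."""
    left, right = residuals[left_mask], residuals[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    sse = lambda v: float(((v - v.mean()) ** 2).sum())
    return sse(residuals) - (sse(left) + sse(right))

rng = np.random.default_rng(0)
residuals = rng.normal(size=256)          # held by party A
features_B = rng.normal(size=(256, 4))    # held by party B

# B proposes (feature, threshold) splits and, in SecureGBM, would ship
# only encrypted private sets rather than these raw boolean masks.
proposals = [(j, t) for j in range(4) for t in (-0.5, 0.0, 0.5)]
masks = {p: features_B[:, p[0]] <= p[1] for p in proposals}

best = max(masks, key=lambda p: variance_gain(residuals, masks[p]))
print("chosen split: feature %d <= %.1f" % best)
```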

III-B3 Secure Binary Comparison for Decision Making

As was mentioned, SecureGBM operates an isolated domain over the machines at the party $\mathcal{B}$, where the computation and comparison criteria for decision making are all stored; this domain is trusted by both parties. To further secure the label information and prediction results during inference, SecureGBM uses public keys generated by the party $\mathcal{A}$ to encrypt the decision making results from $\mathcal{B}$, while the public keys keep being updated per inference task.
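For intuition, a tiny sketch of this key-refresh idea, again with python-paillier standing in for the actual SEAL-based operator and with the key direction following the reading above (an assumption, not a confirmed detail of the paper): the isolated domain emits only ciphertexts under $\mathcal{A}$'s fresh public key, so $\mathcal{B}$ relays decisions it cannot read.

```python
# Toy key-refresh illustration: each inference task gets a fresh keypair
# held by party A; the boolean split outcome crosses B as a ciphertext.
from phe import paillier

pub_A, priv_A = paillier.generate_paillier_keypair(n_length=1024)

decision = 1                          # split outcome inside isolated domain
wire = pub_A.encrypt(decision)        # what party B observes and relays
print(priv_A.decrypt(wire) == 1)      # only A recovers True/False
```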

III-C Discussion and Remarks

In this section, we justify the proposed framework and algorithms from the cost and learning perspectives.

Communication Costs - In SecureGBM, we replace the gradient boosting mechanism used by GBM with stochastic gradient boosting [friedman2002stochastic], in order to accelerate the learning procedure by lowering the computational/communication costs per iteration. Let $n$ denote the total number of aligned samples shared by $\mathcal{A}$ and $\mathcal{B}$, and let $b$ denote the size of the mini-batch drawn in each iteration.

In each iteration, both GBM and SecureGBM need to create an $s$-sized decision tree after $s$ rounds of communication between the two parties. For each round of such communication, GBM and SecureGBM need to exchange data with payload sizes of $O(n^2)$ and $O(b^2)$, respectively. In this way, the cost of communication per iteration is $O(s \cdot n^2)$ for GBM and $O(s \cdot b^2)$ for SecureGBM.

Statistical Acceleration - To simplify our analysis of the statistical performance, we make a mild assumption that treats the learning procedures of LightGBM and SecureGBM as gradient descent (GD) and stochastic gradient descent (SGD) based loss minimization over a certain convex loss [friedman2001greedy, mason2000boosting, friedman2002stochastic]. Under mild convexity and smoothness conditions, GD and SGD converge to the minimum of the loss function at error convergence rates [shapiro1996convergence, shalev2009stochastic, shamir2013stochastic] of $O(1/T)$ and $O(1/\sqrt{T})$ respectively, where $T$ denotes the number of iterations. More discussion can be found in [mason2000boosting]. While the costs per iteration are $O(s \cdot n^2)$ and $O(s \cdot b^2)$ respectively, we can roughly conclude that a trade-off exists between statistical performance and communication complexity for SecureGBM training.
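A back-of-envelope illustration of this trade-off, under the convexity assumptions just stated and with all sizes hypothetical: to reach error $\epsilon$, GD-style boosting needs on the order of $1/\epsilon$ iterations at $O(s \cdot n^2)$ payload each, while SGD-style boosting needs on the order of $1/\epsilon^2$ iterations at $O(s \cdot b^2)$ each.

```python
# Hypothetical totals: full-batch boosting vs. mini-batch boosting.
def total_cost(rounds, per_round_payload):
    return rounds * per_round_payload

s, n, b, eps = 30, 10_000, 100, 0.01      # illustrative sizes only
gd_cost  = total_cost(1 / eps,      s * n ** 2)   # ~1/eps rounds
sgd_cost = total_cost(1 / eps ** 2, s * b ** 2)   # ~1/eps^2 rounds
print(f"GD : {gd_cost:.3e}")              # -> 3.000e+11
print(f"SGD: {sgd_cost:.3e}")             # -> 3.000e+09, 100x cheaper here
```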

IV Experiments and Results

In this section, we report the experiments that evaluate SecureGBM, and compare its performance with baseline methods including vanilla LightGBM, XGBoost, and other downstream classifiers.

                             Sparse               Adult                Phishing
Methods                    Training  Testing    Training  Testing    Training  Testing
SecureGBM                  93.227    66.220     92.465    90.080     62.855    61.823
Using the Aggregated Datasets from A and B:
LightGBM-(A,B)             96.102    68.528     92.199    90.145     67.994    63.430
XGBoost-(A,B)              93.120    67.220     91.830    89.340     67.090    61.990
LIBSVM-Linear-(A,B)        73.490    64.560     58.641    59.280     50.073    50.980
LIBSVM-RBF-(A,B)           79.850    63.210     75.549    72.060     52.789    47.479
Using the Single Dataset at A:
LightGBM-A                 N/A*      N/A*       89.849    88.052     64.693    59.743
XGBoost-A                  65.170    57.370     89.490    87.620     64.070    59.740
LIBSVM-Linear-A            52.360    50.675     66.293    34.347     50.007    50.489
LIBSVM-RBF-A               56.740    52.380     72.909    55.076     50.248    50.306
Using the Dataset that Aggregates Features from B and Labels from A (does not exist in the real case):
LightGBM-B*                96.102    68.528     85.708    84.587     62.396    58.929
XGBoost-B*                 93.190    67.390     85.700    85.410     61.720    58.420
LIBSVM-Linear-B*           67.480    60.990     46.527    46.840     50.627    48.567
LIBSVM-RBF-B*              78.230    64.880     56.927    74.987     50.336    50.415
TABLE I: Overall Classification AUC (%) Comparison (N/A*: during the experiments, LightGBM reported a failure to train the model, as the features of the given dataset are too sparse to learn.)

IV-A Experimental Setups

In this section, we present the datasets, the baseline algorithms, and the experimental settings of our evaluation study.

IV-A1 Datasets

In our study, we evaluate SecureGBM using the following three datasets.

  • Sparse - This is a private dataset consisting of 11,371 users' de-anonymized financial records, where each sample comes with 8,922 extremely sparse features and a binary label. These features are separately owned by two parties: the bank holds 5,000 features and the real estate loaner owns the remaining 3,922 features, while the bank owns the label information about bankruptcy. The goal on this dataset is to predict the bankruptcy of a user, incorporating the sparse features distributed over the two parties. As the dataset is quite large, we set the mini-batch size to 1% of the overall training set. Note that the labels in Sparse are extremely imbalanced, with most samples negative.

  • Adult - This is an open-access dataset consisting of 27,739 web pages' information, where each web page comes with 123 features and 1 binary label (whether the web page contains adult contents). We randomly split the features into two sets, with 61 and 62 features respectively. As this dataset is quite small, we use the whole dataset for each iteration, i.e., the mini-batch is 100% of the overall training set.

  • Phishing - This is an open-access dataset consisting of 92,651 web pages' information, where each web page comes with 116 features and 1 binary label (whether the web page carries phishing risk). We randomly split the features into two sets, each with 58 features. As this dataset is comparatively large, we use a mini-batch covering a small fraction of the overall training set per iteration.

IV-A2 Baseline Algorithms with Settings

To understand the advantages of SecureGBM, we compare it with the baseline algorithms described below.

  • LightGBM - We consider the vanilla implementation of LightGBM as a key baseline, with two settings: LightGBM-$\mathcal{A}$ and LightGBM-($\mathcal{A}$,$\mathcal{B}$). LightGBM-$\mathcal{A}$ refers to the LightGBM trained using the features and labels in the dataset $D_{\mathcal{A}}$ only, while LightGBM-($\mathcal{A}$,$\mathcal{B}$) refers to the vanilla distributed LightGBM trained using both datasets $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$, without encryption protection. Finally, we also include a baseline LightGBM-$\mathcal{B}$* that might not exist in real-world settings: LightGBM-$\mathcal{B}$* aggregates the features from the party $\mathcal{B}$ and the label information from $\mathcal{A}$ as the training dataset. The comparison with LightGBM-$\mathcal{A}$ and LightGBM-$\mathcal{B}$* shows the information gain of federated learning beyond a model trained by any single party.

  • XGBoost - Following the above settings, we include two settings: XGBoost-$\mathcal{A}$ and XGBoost-($\mathcal{A}$,$\mathcal{B}$). XGBoost-$\mathcal{A}$ refers to the vanilla XGBoost trained using the features and labels in the dataset $D_{\mathcal{A}}$ only, while XGBoost-($\mathcal{A}$,$\mathcal{B}$) refers to the vanilla XGBoost trained by aggregating both datasets from $\mathcal{A}$ and $\mathcal{B}$ in a centralized manner. Similarly, XGBoost-$\mathcal{B}$*, trained by aggregating the features from the party $\mathcal{B}$ and the label information from $\mathcal{A}$, is given as a baseline to demonstrate the information gain of collaboration.

  • LIBSVM - Following the same settings, we include two settings: LIBSVM-$\mathcal{A}$ and LIBSVM-($\mathcal{A}$,$\mathcal{B}$). LIBSVM-$\mathcal{A}$ refers to the vanilla LIBSVM trained using the features and labels in the dataset $D_{\mathcal{A}}$ only, while LIBSVM-($\mathcal{A}$,$\mathcal{B}$) refers to the vanilla LIBSVM trained using both datasets from $\mathcal{A}$ and $\mathcal{B}$ in a centralized manner. Similarly, LIBSVM-$\mathcal{B}$*, trained by aggregating the features from the party $\mathcal{B}$ and the label information from $\mathcal{A}$, is given as a baseline. More specifically, LIBSVM with the RBF kernel and the linear SVM are both used here.

Note that in all experiments, 80% of the samples are used for training and the remaining 20% are held out for testing. The training and testing sets are randomly selected for 5-fold cross-validation. The default learning rate for LightGBM, XGBoost, and SecureGBM is set to 0.1.
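As a hedged mock of this evaluation protocol (synthetic data; this is not the authors' harness, and only the 80/20 split, the 0.1 learning rate, and the 200 boosting iterations are taken from the text), the snippet below measures training and testing AUC for a centralized LightGBM baseline:

```python
# Mock of the evaluation protocol: 80/20 split, AUC on both partitions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = lgb.LGBMClassifier(learning_rate=0.1, n_estimators=200).fit(X_tr, y_tr)

for name, Xs, ys in [("training", X_tr, y_tr), ("testing", X_te, y_te)]:
    auc = roc_auc_score(ys, model.predict_proba(Xs)[:, 1])
    print(f"{name} AUC: {100 * auc:.3f}%")
```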

Fig. 12: The Comparison of Training AUC and F1-Score per Iteration, SecureGBM vs. LightGBM-($\mathcal{A}$,$\mathcal{B}$), with panels (a)-(c) on Sparse, (d)-(f) on Adult, and (g)-(i) on Phishing. Note: labels in the Sparse dataset (personal bankruptcy status) are imbalanced, with most samples negative; in this case, the learned models are usually imbalanced with very low recall and F1-score.
Fig. 22: The Comparison of Testing AUC and F1-Score per Iteration, SecureGBM vs. LightGBM-($\mathcal{A}$,$\mathcal{B}$), with panels (a)-(c) on Sparse, (d)-(f) on Adult, and (g)-(i) on Phishing. Note: labels in the Sparse dataset (personal bankruptcy status) are imbalanced, with most samples negative; in this case, the learned models are usually imbalanced with very low recall and F1-score.

IV-B Overall Performance

To evaluate the overall accuracy of SecureGBM, we measured the Area Under Curve (AUC) of the SecureGBM prediction results and compared it to the baseline algorithms. Table I presents the overall comparison of the AUC achieved by these classifiers under the same settings. SecureGBM, LightGBM, and LIBSVM were all trained using 200 iterations, and we measured the AUC on the training and testing datasets.

The results in Table I show that, compared to LightGBM-($\mathcal{A}$,$\mathcal{B}$), SecureGBM achieved similar training and testing AUC under the same settings, while significantly outperforming LightGBM-$\mathcal{A}$, which used the single dataset of $\mathcal{A}$. Furthermore, under both settings, LightGBM performed better than XGBoost in terms of testing AUC. Similar observations can be made from the comparisons between LIBSVM and LightGBM. In short, it is reasonable to conclude that multi-party gradient boosting over two distributed datasets can significantly improve performance over models that use the dataset from the party $\mathcal{A}$ only.

Furthermore, the comparison with LightGBM-$\mathcal{B}$* shows that, except for the experiments based on the Sparse dataset, SecureGBM significantly outperforms the setting that aggregates the features from $\mathcal{B}$ and the labels from $\mathcal{A}$. For the Sparse dataset, one can easily observe that LightGBM-$\mathcal{A}$ failed to train the model when using the dataset at $\mathcal{A}$ only, as the features in $X_{\mathcal{A}}$ are too sparse to learn. The comparison between LightGBM-($\mathcal{A}$,$\mathcal{B}$) and LightGBM-$\mathcal{B}$* further demonstrates that incorporating the features in $X_{\mathcal{A}}$ cannot improve the performance of LightGBM learning on this dataset. For the same reason, SecureGBM performed slightly worse than LightGBM-$\mathcal{B}$*, with marginal testing AUC degradation.

We conclude that SecureGBM boosts the testing accuracy of learners from the party $\mathcal{A}$'s perspective, as (1) SecureGBM consistently outperforms LightGBM-$\mathcal{A}$, XGBoost-$\mathcal{A}$, and the other learners that use the dataset at $\mathcal{A}$ only; and (2) the algorithms that aggregate the datasets from both sides, such as LightGBM-($\mathcal{A}$,$\mathcal{B}$) or LightGBM-$\mathcal{B}$*, perform only marginally better than SecureGBM, while sacrificing the data privacy of the two parties.

IV-C Case Studies

To further understand the performance of SecureGBM, we traced the models obtained after each iteration and analyzed them from both accuracy and efficiency perspectives.

IV-C1 Trends of Accuracy Improved per Iteration

Figure 12 presents the comparison of training AUC and F1-score per iteration between SecureGBM and vanilla LightGBM. More specifically, we evaluated the performance under several settings of the tree size $s$ (as LightGBM and SecureGBM use the leaf-wise growth strategy, $s$ is equivalent to the depth of each decision tree learned), where we clearly observed the error convergence of the models.

It has been observed that, in most cases, the training F1-score gradually improves with an increasing number of iterations. For the Sparse and Adult datasets, the overall trends of AUC and F1-score for LightGBM and SecureGBM were almost the same under all settings, even though, for the Sparse dataset, SecureGBM only used 1% of the training data as the mini-batch for the model update per iteration (while LightGBM used the whole set). Furthermore, even though SecureGBM did not perform as well as LightGBM on the Phishing dataset under some settings, it still achieved decent performance, like LightGBM, under an appropriate setting of $s$. These observations are quite encouraging: the use of mini-batches seems not to hurt the learning progress of SecureGBM on these datasets, given appropriate settings. The curves of SecureGBM show more jitter, due to the use of stochastic approximation for statistical acceleration. Similar observations were obtained in the comparison of testing AUC and F1-score per iteration, shown in Figure 22.

# Samples     1,000    2,000    4,000    8,000    16,000
SecureGBM     10.20    10.50    11.40    11.75    14.30
LightGBM       0.16     0.32     0.70     1.41     2.45
XGBoost        0.51     0.73     1.09     2.20     4.30
TABLE II: Time Consumption per Iteration (seconds) on a Synthesized Dataset over a Varying Number of Training Samples

IV-C2 Time Consumption over Scale of Problem

To test the time consumption of SecureGBM over a varying scale of the problem, we synthesized a dataset based on Sparse with increasing numbers of samples. As shown in Table II, the time consumption per iteration of the SecureGBM training procedure is significantly longer than that of LightGBM and XGBoost.

We estimate the slowdown ratio of SecureGBM as the ratio between the time consumption per iteration of SecureGBM and that of the vanilla baselines. The slowdown ratio ranges from around 3x to 64x in this experiment. Furthermore, as the number of samples increases, the slowdown ratio of SecureGBM decreases significantly: the ratio is around 63.75x when comparing SecureGBM to LightGBM with 1,000 training samples, but only 5.8x with 16,000 training samples. We can conclude that SecureGBM is reasonably time efficient thanks to the statistical acceleration strategies used, and that it becomes more and more efficient as the scale of the training set increases. The experiments were carried out using two workstations with 8-core Xeon CPUs and 16 GB memory, interconnected by a 100 MBit cable with 1.55 ms latency.
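For verification, the slowdown ratios quoted above can be recomputed directly from the figures in Table II:

```python
# Recompute the SecureGBM-vs-LightGBM slowdown ratios from Table II.
samples   = [1_000, 2_000, 4_000, 8_000, 16_000]
securegbm = [10.20, 10.50, 11.40, 11.75, 14.30]
lightgbm  = [0.16, 0.32, 0.70, 1.41, 2.45]

for n, s_t, l_t in zip(samples, securegbm, lightgbm):
    print(f"{n:>6} samples: {s_t / l_t:5.2f}x slowdown")
# -> 63.75x at 1,000 samples, down to ~5.84x at 16,000 samples
```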

V Discussion and Conclusion

In this work, we present SecureGBM, a secure multi-party (re-)design of LightGBM [ke2017lightgbm], where we assume the view (i.e., the set of features) of the same group of samples has been split into two parts owned by two parties separately. To collaboratively train a model while preserving the privacy of the two parties, a group of partially homomorphic encryption (PHE) computation models and multi-party computation protocols is used to cover the key operators of distributed LightGBM learning and inference over the two parties. As the use of PHE and multi-party computation models causes huge computational and communication overheads, statistical acceleration strategies are proposed to lower the communication cost while preserving the statistical accuracy of the learned model through stochastic approximation. With such statistical acceleration strategies, SecureGBM becomes more and more efficient, with a decreasing slowdown ratio, as the scale of the training dataset increases.

The experiments based on several large real-world datasets show that SecureGBM can achieve decent testing accuracy (i.e., AUC and F1-score), as good as vanilla LightGBM based on the aggregated datasets from the two parties, with a tolerable training time (5x to 64x slowdown) and without compromising data privacy. Furthermore, the ablation study comparing SecureGBM to the learners that use the single dataset from one party showed that such collaboration between the two parties indeed improves accuracy.

References