Federated Forest

05/24/2019 ∙ by Yang Liu, et al.

Most real-world data are scattered across different companies or government organizations and cannot be easily integrated under data privacy regulations such as the European Union's General Data Protection Regulation (GDPR) and China's Cyber Security Law. This data-islands situation, together with data privacy and security concerns, poses two major challenges for applications of artificial intelligence. In this paper, we tackle these challenges and propose a privacy-preserving machine learning model, called Federated Forest, which is a lossless counterpart of the traditional random forest method, i.e., it achieves the same level of accuracy as the non-privacy-preserving approach. Based on it, we developed a secure cross-regional machine learning system that allows a model to be jointly trained over clients in different regions holding the same user samples but different attribute sets, processing the data stored on each of them without exchanging raw data. A novel prediction algorithm is also proposed that largely reduces the communication overhead. Experiments on both real-world and UCI data sets demonstrate that the Federated Forest is as accurate as the non-federated version, and the efficiency and robustness of our proposed system have been verified. Overall, our model is practical, scalable and extensible for real-life tasks.


1 Introduction

Artificial intelligence has made great progress in recent years thanks to the large amounts of data collected in different domains. Unfortunately, data has also become the largest bottleneck for the deployment of AI methods. In real-world applications, big data are scattered across different companies or government organizations and stored in the form of data islands; in other words, data cannot be shared across domains. For companies, data is among their most important assets and cannot be easily shared. Governments' data are highly secured and mostly unutilized. Moreover, people are now highly sensitive about data privacy: data breaches happen regularly, and most countries have either enacted or are drafting data privacy legislation. In 2018, the European Union enacted the General Data Protection Regulation (GDPR) (Regulation, 2016). The GDPR gives individuals more control over their personal data and imposes strict principles and full transparency on how businesses handle these data. Any tracking or recording of personal data must be authorized by the customer before collection, and businesses must clearly state their intentions and plans for the data. Faced with these difficulties and restrictions, the question becomes whether it is worth investing effort to make use of the scattered data.

The answer is yes. Academia, companies and governments could all benefit from resolving the data-islands situation. Joint models can improve many current services and products and support more potential applications, including but not limited to medical studies, targeted marketing, urban anomaly detection and risk management, as shown in Figure 1. For example, banks could train joint models with e-commerce companies to achieve precise customer profiling and improve their marketing strategies. Government organizations could work with ride-hailing companies to better understand a city's daily traffic flow and adjust road planning accordingly.

Figure 1: New Era of Machine Learning

Consequently, the question becomes how we can train such joint models. Faced with the challenges of data islands and data privacy and security, currently available methods cannot completely solve the problem, so developing new methods to bridge the gap between real-world applications and data islands has become urgent. In 2016, a new approach named federated learning (McMahan et al., 2016; Konečnỳ et al., 2016; Konecnỳ et al., 2016) was proposed, which mainly focuses on building privacy-preserving machine learning models when data are distributed in different places. Federated learning provides a new way to look at these problems and has shown the possibility of real-life applications.

Inspired by their work, we propose a novel privacy-preserving tree-based machine learning model, named Federated Forest (FF). Based on it, we developed a secure cross-regional machine learning system, which is capable of conquering the challenges described above. Our contributions are four-fold:


  • Secured privacy. Data privacy is fully protected by redesigning the tree-building algorithms, applying encryption methods and establishing a trusted third-party server. The content and amount of exchanged information are kept to a minimum, and each participant is blind to the others.

  • Lossless (accurate). Our model is based on the methodologies of CART (Breiman et al., 1984) and bagging (Breiman, 1996), and fits the vertical federated setting. We experimentally proved that our model can achieve the same level of accuracy as the non-federated approach that brings the data into one place.

  • Efficiency. An efficient communication mechanism was implemented with MPI (Barney, 2019) for sharing the intermediate values. A fast prediction algorithm was designed whose cost is only weakly correlated (scale-free) with the number of domains and trees, the maximum tree depth and the sample size.

  • Practicability and scalability. Our model supports both classification and regression tasks, and is strongly practical, extensible and scalable for real-life applications. The experiments on real-world data sets have demonstrated our model's accuracy, efficiency and robustness.

2 Related Work

2.1 Federated Learning

Federated learning (McMahan et al., 2016; Konečnỳ et al., 2016; Konecnỳ et al., 2016) was first proposed to solve the problem that rich data are generated on user devices, but regulations make it difficult to build models from those data. The solution is to keep the data on user devices and train a shared model by aggregating locally computed intermediate results of neural networks. In (Chen et al., 2018), a new recommender system was proposed which applies federated learning to meta-learning. Federated learning has also been applied to multi-task problems in (Smith et al., 2017), and a loss-based AdaBoost method was developed in (Huang et al., 2018). (Hardy et al., 2017a) introduced a vertically-aggregated federated learning method: each data provider possesses unique features, sample IDs are aligned between them, and they jointly learn a logistic regression model that secures data privacy while keeping the model accurate. In addition, a modular benchmarking framework for federated settings was presented in (Caldas et al., 2018). Although much research has emerged, the definition of federated learning remained blurry until the work of (Yang et al., 2019), which categorized current federated learning methods into three types: horizontal federated learning, vertical federated learning and federated transfer learning. Following this survey, the same team introduced a framework known as secure federated transfer learning (Liu et al., 2018) to build models for a target-domain party by leveraging rich labels from a source-domain party, where the data sets of the two parties differ in both sample space and feature space. In (Cheng et al., 2019), the tree-boosting method was revisited and applied to the vertical federated setting; a lossless framework was proposed that keeps the information of each private data provider from being revealed. In (Zhuo et al., 2019), a novel reinforcement learning approach was presented that respects the privacy requirement and builds a Q-network for each agent with the help of other agents. To make federated machine learning more practical, these teams are pushing to build a Federated AI Ecosystem in which partners can fully exploit their data's value and promote vertical applications. An IEEE standard, Guide For Architectural Framework And Application Of Federated Machine Learning (Group, 2019), has also been initiated and is being drafted.

2.2 Data Privacy Protection

In federated learning, two major encryption methods are applied to protect data privacy and security: differential privacy (Dwork, 2006) and homomorphic encryption (Gentry, 2009). The idea of differential privacy is to add properly calibrated noise to the algorithm or the data, with examples including (Geyer et al., 2017; McMahan et al., 2017). This approach does not affect computational efficiency much but may weaken model performance. Homomorphic encryption supports secure multiplication and addition on encrypted data, such that once the result is decrypted, it matches the output of the same operations on the corresponding raw data. The works of (Hardy et al., 2017b; Le et al., 2018; Kim et al., 2018a) all used this approach. Homomorphic encryption has two major drawbacks. First, the algorithm's complexity is high, making frequent use intensely time consuming. Second, it does not support non-linear functions, such as the sigmoid and logarithmic functions, so approximations are necessary: (Hardy et al., 2017a) used a Taylor expansion to approximate the sigmoid function, and (Kim et al., 2018b) used the least squares method. In theory these approaches work, but in our practice the results were not ideal.

3 Problem Formulation

3.1 Data Distribution

In our work, we focus on vertical federated learning problems, in which all participants have the same sample space but different feature spaces, as shown in Figure 2. Consider each company or government organization as a regional data domain, denoted as $\mathcal{D}_i$; the overall data domain is then $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_M$, where $i \in \{1, \dots, M\}$ and $M$ is the number of regional domains. We denote the feature space of $\mathcal{D}_i$ as $\mathcal{F}_i$; the entire feature space is then $\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2 \cup \cdots \cup \mathcal{F}_M$. During the modeling process, all features' true names are encoded to protect privacy. For any $\mathcal{F}_i$ and $\mathcal{F}_j$, if $i \neq j$, then $\mathcal{F}_i \cap \mathcal{F}_j = \emptyset$. In our work, all domains have the same number of samples, and the sample IDs are aligned across domains. One master machine is deployed as the parameter server, and multiple client machines are used, each holding one regional data domain. The labels are provided by one of the clients and copied to the master and the other clients in encrypted form. Two things to notice here: 1) In reality, $M$ is usually small; even $M = 5$ means five different organizations modeling together, which would be rare. The model design could be totally different for large $M$. 2) We do not discuss methods for ID alignment, since it is a separate research topic addressed in work such as (Nock et al., 2018). The notations used in this paper are summarized in Table 3.
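To make the setting concrete, the following is a minimal sketch of a vertical partition with $M = 2$ domains and aligned sample IDs. This is our own illustration, not the paper's code; the column groupings, variable names and encoding scheme are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Toy data: one logical table of N samples whose features are spread over M = 2 domains.
N = 1000
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "id": np.arange(N),                   # sample IDs, aligned across domains
    "f1": rng.normal(size=N),             # features held by client 1 (e.g., a bank)
    "f2": rng.normal(size=N),
    "f3": rng.integers(0, 5, size=N),     # feature held by client 2 (e.g., e-commerce)
    "label": rng.integers(0, 2, size=N),  # labels provided by one client only
})

# Vertical split: same sample space, disjoint feature spaces (F1 ∩ F2 = ∅).
client1 = full[["id", "f1", "f2", "label"]]   # the label-holding client
client2 = full[["id", "f3"]]

# Feature names are encoded before anything is shared with the master,
# so the master never learns their real meaning.
encoded_names = {name: f"x{k}" for k, name in enumerate(["f1", "f2", "f3"])}
```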

3.2 Problem Statement

The formal statement of the problem is given below:

Given: Regional domain $\mathcal{D}_i$ and encrypted label $\vec{y}$ on each client $i$, $i \in \{1, \dots, M\}$.

Learn: A Federated Forest, such that for each tree in the forest: 1) a complete tree model $\mathcal{T}$ is held on the master; 2) a partial tree model $\mathcal{T}_i$ is stored on each client $i$, $i \in \{1, \dots, M\}$.

Constraint: The performance (accuracy, F1-score, MSE, etc.) of the Federated Forest must be comparable to that of the non-federated random forest.

4 Methodology

Figure 2: Federated Forest

Here we present the framework of Federated Forest, which is based on the CART tree (Breiman et al., 1984) and bagging (Breiman, 1996) and is able to deal with both classification and regression problems. The framework is shown in Figure 2, and details of the algorithms are given in the following subsections.

4.1 Model Building

Algorithm.

In our work, each tree is built by all parties working together, and the tree structure is stored on the master node and every client. However, each client's tree only stores the split information for its own features. We first present the client-side Federated Forest algorithm in Algorithm 1, and in Algorithm 2 we describe how the master coordinates the modeling process.

Input : Data set $\mathcal{D}_i$ on client $i$;
Local features $\mathcal{F}_i$;
Encrypted label $\vec{y}$;
Output : Partial Federated Forest model $\mathcal{T}_i$ on client $i$
while tree_build is True do
        Receive the selected features and sample IDs for current tree building;
        Function TreeBuild (, , )
               Create empty tree node;
               if the pre-pruning condition is satisfied then
                      Mark current node as leaf node;
                      Assign leaf label by voting;
                      return leaf node;
              ;
                if the selected features on this client are not empty then
                       Compute the impurity improvement for each selected local feature and find the local maximum;
                       Record the local best split feature and split threshold;
               Send the encrypted local best impurity improvement to master;
               if receive the split message from master then
                      /* Global best split feature is from itself */
                      is_selected True;
                      Split samples and send sample indices of left and right subtrees to master;
              else
                     Receive sample indices of left and right subtrees;
               left_subtree ← TreeBuild (…);
                right_subtree ← TreeBuild (…);
               if is_selected is True then
                      Save and split threshold to tree node;
              Save subtrees to tree node;
               return tree node;
       Append current tree to forest;
return Partial Federated Forest Model on Client ;
ALGORITHM 1 Federated Forest – Client

Following the bagging paradigm, the master node first randomly selects a subset of features and samples from the entire data, and then privately notifies each client of its own selected features and the sample IDs. For example, if ten features are chosen by the master and client 1 possesses only three of them, then client 1 will learn only that these three features were selected; it will never know how many features were chosen globally, let alone what they were. During tree construction, the pre-pruning conditions are checked frequently, and if they are satisfied, the clients and master create leaf nodes accordingly.

If the termination condition is not triggered, all clients enter the splitting state, and the best split feature of the current tree node is selected by comparing impurity improvements. First, each client finds its local optimal split feature. The master then collects all local optima and their corresponding impurity improvements, from which the global best feature is found. Second, the master notifies the client that provided the global best feature. That client splits the samples and sends the data partition results (the sample IDs falling into the left and right subtrees) to the master for distribution. For the current tree node, only the client providing the best split feature saves the details of this split; the other clients only learn that the selected feature is not theirs, and the split information, such as the threshold and split feature, remains unknown to them. Last, the subtrees are created recursively and the current tree node is returned. During modeling, once the child tree nodes are created successfully, the parent node no longer needs to keep the sample IDs for the subtrees; if the connection goes down before that, the modeling can easily be recovered from the break point.
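To illustrate the split selection step, here is a schematic, non-federated sketch of our own (not the authors' implementation): each client would run something like local_best_split on its own columns, and the master would take the arg-max of the gathered gains. In the real system the gather would be an MPI collective over encrypted values; the function and variable names below are illustrative assumptions.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def local_best_split(X, y):
    """Client side: best (feature index, threshold, improvement) over local features."""
    n = len(y)
    parent = gini(y)
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:       # candidate thresholds
            left = X[:, j] <= t
            gain = parent - (left.sum() / n) * gini(y[left]) \
                          - ((~left).sum() / n) * gini(y[~left])
            if gain > best[2]:
                best = (j, t, gain)
    return best

def global_best(client_reports):
    """Master side: pick the client with the largest (encrypted, in the real
    system) impurity improvement; only gains travel, never raw data."""
    gains = [gain for (_, _, gain) in client_reports]
    return int(np.argmax(gains))   # index of the client owning the best split
```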

Input : Sample IDs of the entire data set;
Encoded feature space $\mathcal{F}$;
Encrypted label $\vec{y}$;
Output : Complete Federated Forest model $\mathcal{T}$
/*Build trees for forest recurrently*/
while tree_build is True do
        Broadcast the randomly selected sample IDs;
        Randomly select features from $\mathcal{F}$ and send each client $i$ its own subset privately;
        Function  TreeBuild (, , )
               Create empty tree node;
               if the pre-pruning condition is satisfied then
                      Mark current node as leaf node;
                      Assign leaf label by voting;
                      return leaf node;
               Receive the encrypted local best impurity improvements and related information from all clients;
                Take the global maximum and notify the corresponding client;
                Receive the split indices from that client and broadcast them;
                left_subtree ← TreeBuild (…);
                right_subtree ← TreeBuild (…);
               Save subtrees and split info to tree node;
               return tree node;
       Append current tree to forest;
       
return Complete Federated Forest Model;
ALGORITHM 2 Federated Forest – Master

Model Storage.

A tree predictive model is composed of two parts: the tree structure and the split information, such as the feature and threshold used at each split. Since the forest is built with all clients working together, the structure of each tree is the same on every client, but for a given tree node a client may or may not store the detailed split information. Only the master server stores the complete model. For each tree node, a client stores the corresponding split threshold only if it provided the split feature; otherwise, it stores nothing at that node except the node structure itself. We denote the complete tree, the one saved on the master, as $\mathcal{T}$, and the tree without full details stored by the $i$-th client as $\mathcal{T}_i$. Since the tree structure is consistent, $\mathcal{T}$ and every $\mathcal{T}_i$ share the same node skeleton and the same leaf node set $\mathcal{L}$. The complete tree is the union of all partial trees: $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \cdots \cup \mathcal{T}_M$.
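A minimal sketch of this storage idea follows: every party keeps the full skeleton, but the split details at a node are populated only on the party that owns the split feature. The class and field names are our own illustration, not the authors' data structures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    # The skeleton (children / leaf flag) is identical on master and clients.
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None
    is_leaf: bool = False
    leaf_label: Optional[int] = None
    # Split details exist only where the owning feature lives: on the master's
    # copy of T, and on exactly one client's partial tree T_i.
    split_feature: Optional[str] = None    # encoded feature name
    split_threshold: Optional[float] = None

def has_split_info(node: TreeNode) -> bool:
    """True iff this party stored the split for this (non-leaf) node."""
    return node.split_feature is not None
```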

4.2 Model Prediction

Input : Partial Federated Forest model $\mathcal{T}_i$ saved on the $i$-th client;
Encoded features on the $i$-th client;
Test set on the $i$-th client;
Output : Sample IDs $S_i(l)$ of each leaf $l$ of $\mathcal{T}_i$, $l \in \mathcal{L}$
while TreePrediction is True do
        Function TreePredict (, , )
               if is_leaf is True then
                     Return sample IDs and leaf label;
              else
                       if $\mathcal{T}_i$ keeps the split info of current node then
                             Split samples into subtrees by the threshold;
                              left_subtree ← TreePredict (…);
                              right_subtree ← TreePredict (…);
                             
                      else
                              /* Unknown split: samples enter both subtrees */
                              left_subtree ← TreePredict (…);
                              right_subtree ← TreePredict (…);
                             
                      Return left and right subtrees;
               Send all leaf sample sets $S_i(l)$ to master;
              
       
ALGORITHM 3 Federated Forest Prediction – Client

Under the vertical federated setting (Yang et al., 2019), the classical approach to prediction involves multiple rounds of communication between the master and the clients, even for a single sample. When the number of trees, the maximum tree depth or the sample size is large, the communication required for prediction becomes a serious burden. To address this problem, we designed a novel prediction method that takes advantage of our distributed model storage strategy and needs only one round of collective communication per tree, or even per forest. We first present the client-side prediction algorithm in Algorithm 3, and in Algorithm 4 we describe how the master server coordinates the clients to obtain the final predictions.

First, each client uses its locally stored model to predict samples. For the tree $\mathcal{T}_i$ on the $i$-th client, each sample enters from the root node and finally falls into one or several leaf nodes of the binary tree. As a sample travels through a node, if the model stores the split information at this node, the sample is routed into the left or right subtree by checking the split threshold; if the model has no split information at this node, the sample enters both the left and right subtrees simultaneously.

Second, the path determination is performed recursively until each sample falls into one or several leaf nodes. When this process finishes, each leaf node of tree $\mathcal{T}_i$ on client $i$ holds a batch of samples. We use $S_i(l)$ to represent the samples that fall into leaf node $l$ of the tree model $\mathcal{T}_i$, where $l \in \mathcal{L}$ and $\mathcal{L}$ is the set of leaf nodes of the tree.
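The traversal rule can be sketched as follows; this is our paraphrase of Algorithm 3, reusing the illustrative TreeNode and has_split_info from the storage sketch above. Leaves are keyed by their root-to-leaf path, which is identical on every party because all parties share the tree skeleton; that keying is our own device.

```python
def collect_leaf_sets(node, ids, X, leaf_sets, path=""):
    """Route sample IDs through a partial tree T_i; a sample may reach several leaves.

    node      -- TreeNode of the partial model (see sketch above)
    ids       -- list of sample IDs currently at this node
    X         -- dict: sample ID -> {encoded feature name: value}
    leaf_sets -- dict: path key (e.g. "LRL") -> set of sample IDs, i.e. S_i(l)
    """
    if not ids:
        return
    if node.is_leaf:
        leaf_sets.setdefault(path, set()).update(ids)
        return
    if has_split_info(node):
        left  = [i for i in ids if X[i][node.split_feature] <= node.split_threshold]
        right = [i for i in ids if X[i][node.split_feature] >  node.split_threshold]
    else:
        left = right = ids            # unknown split: enter both subtrees
    collect_leaf_sets(node.left,  left,  X, leaf_sets, path + "L")
    collect_leaf_sets(node.right, right, X, leaf_sets, path + "R")
```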

Input : Sample IDs of test set
Output :  Prediction of Federated Forest
while TreePrediction is True do
        Gather all $S_i(l)$ from the clients;
        Obtain $S(l) = \bigcap_{i=1}^{M} S_i(l)$, where $l \in \mathcal{L}$;
        Return the label of leaf $l$ for the samples in $S(l)$, for every $l \in \mathcal{L}$;
Calculate forest predictions by voting on the results of all trees;
return Final Predictions;
ALGORITHM 4 Federated Forest Prediction – Master

Third, for each leaf $l$, the master takes the intersection $S(l) = \bigcap_{i=1}^{M} S_i(l)$. The sample set owned by each leaf node of the complete tree $\mathcal{T}$ is then already associated with the final predictions. We give a formal proposition so that our new prediction method is mathematically defined:

Proposition 1.

Samples fall into one or multiple leaves on each partial tree $\mathcal{T}_i$. For any leaf $l$ of the complete tree $\mathcal{T}$, the sample IDs in leaf $l$ can be obtained by taking the intersection of the per-client leaf sets: $S(l) = \bigcap_{i=1}^{M} S_i(l)$.

The proof is provided in the appendix (Proof of Proposition 1). After obtaining the label values for each sample on all trees, we can easily compute the final predictions. In this approach, we need only one round of communication per tree, or even a single round for the entire forest.
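On the master side, Proposition 1 turns prediction into one set intersection per leaf. A minimal sketch under the same illustrative data structures as above (leaf keys are root-to-leaf paths; leaf_labels is assumed known to the master from the complete tree):

```python
def predict_tree(per_client_leaf_sets, leaf_labels):
    """Combine per-client leaf sets into final per-sample predictions for one tree.

    per_client_leaf_sets -- list over the M clients of {leaf key: set of sample IDs}
    leaf_labels          -- {leaf key: label of that leaf on the complete tree}
    """
    predictions = {}
    for leaf_key, label in leaf_labels.items():
        # S(l) = intersection over all M clients of S_i(l)   (Proposition 1)
        s = set.intersection(*(c.get(leaf_key, set()) for c in per_client_leaf_sets))
        for sample_id in s:
            predictions[sample_id] = label
    return predictions
```

Forest-level predictions then follow by majority vote (classification) or averaging (regression) over the per-tree predictions, so a single gather of leaf sets per tree suffices.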

4.3 Privacy Protection

Here we categorize our privacy-protection efforts into five parts:

Identities. In real-world tasks, we often face situations where sample IDs are tied to persons' real identities, so we must encrypt the identities before ID alignment. An example approach is the following: first, all clients use an agreed hash method to transform the sample IDs into hashed IDs; then the Message-Digest Algorithm 5 (MD5) is applied to the hashed IDs to generate irreversibly encoded IDs.
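A minimal sketch of this two-stage scheme is shown below. The salt and the choice of SHA-256 for the agreed first-stage hash are our illustrative assumptions; the text only prescribes an agreed hash followed by MD5.

```python
import hashlib

SALT = b"shared-secret-agreed-offline"   # illustrative; agreed upon by all clients

def encode_id(raw_id: str) -> str:
    """Two-stage ID encoding: an agreed salted hash, then MD5, so the output is
    identical on every client (enabling alignment) but cannot be inverted back
    to the real identity."""
    stage1 = hashlib.sha256(SALT + raw_id.encode("utf-8")).hexdigest()
    return hashlib.md5(stage1.encode("utf-8")).hexdigest()

# Every party derives the same opaque ID independently, e.g.:
# encode_id("user-42") yields the same digest on client 1 and client 2.
```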

Labels. For classification problems, even if labels are encoded, the true values can still be guessed, especially for binary classification. For regression problems, labels can be encrypted with homomorphic encryption, but modeling then becomes extremely time consuming. In practical tasks, there is a trade-off between security protection and computational efficiency.

Features. On each client, local features are encoded before being given to the master for global feature sampling, so the master never knows the real meaning of the features.

Communication. Encryption methods such as RSA and AES can be applied to secure everything (model intermediate values, sample indices, etc.) communicated during training and prediction.

Model Storage. The entire model is distributed across all clients. For each node, a client stores the corresponding split information only if the split feature resides on its local machine; otherwise, it stores only the structure of the current node. Clients know nothing about each other, including whose features were selected and at which tree nodes. The master can optionally keep a copy of the entire model.

5 Experimental Studies

5.1 Experimental Setup

In this section, we used 9 benchmark data sets, including one real-world data set, target marketing, and 8 public data sets from UCI (Dua and Graff, 2017; Sakar et al., 2019; Fernandes et al., 2015; Hamidieh, 2018), as shown in Table 1. Different sample sizes and feature spaces were considered, and the accuracy, efficiency and robustness of our proposed framework were tested on both classification and regression problems. In our experiments we did not pursue absolute accuracy; instead, we tested whether the performance of our method is at the same level as the non-federated approach, i.e., lossless. The target marketing data set was collected from two totally different domains: one from an e-commerce company, contributing 84 features, and the other from a bank, providing 11 features. All sensitive information was protected before modeling. Three main series of experiments were conducted: experiments with two data providers, experiments with multiple data providers, and an analysis of prediction efficiency. The details of each test are given in the following subsections.

5.2 Experiments with Two-Party Scenarios

In this part, the public UCI data sets were vertically and randomly separated along the feature dimension and placed on two different client servers ($M = 2$), each containing half of the feature space of the original data. The target marketing data set was likewise placed on two client servers, each containing several business domains. The experiments in this section are summarized as follows:


  • Federated Logistic/Linear Regression (F-LR): We jointly trained logistic/linear regression models, where the data are kept locally and the model is partly stored on each client.

  • Non-Federated Forest (NonFF): All data were integrated together for Random Forest modeling.

  • Random Forest 1 (RF1): Partial data from client 1 was used to build a random forest model.

  • Random Forest 2 (RF2): Partial data from client 2 was used to build a random forest model.

  • Federated Forest (FF): Our proposed model, in which two parties jointly learn a random forest; data are kept locally and the model is partly stored on each client.

We conducted experiments on both classification and regression problems and present the accuracy and RMSE results in Table 1. The performance of RF1 and RF2 was clearly worse than that of NonFF and FF: both RF1 and RF2 amount to modeling with data from a single business domain, and the insufficient feature space results in an imperfect view of the global knowledge. We also found that in most tests the regression models did not perform very well. For the test on target marketing, since direct aggregation of data between the two institutions was not allowed, we only ran RF1, RF2, F-LR and FF. The results show that FF performs as expected, and better accuracy is achieved by building models across different domains.

Classification RF1 RF2 F-LR NonFF FF p-value
target marketing 0.870 0.848 0.862 - -
ionosphere 0.864 0.828 0.873 0.211
spambase 0.844 0.831 0.873 0.065
parkinson (Sakar et al., 2019) 0.849 0.849 0.829 0.744
kdd cup 99 0.974 0.965 -
waveform 0.745 0.743 - 0.029
gene 0.975 0.975 - 0.229
Regression RF1 RF2 F-LR NonFF FF p-value
year prediction 10.47 10.72 9.56 0.058
Superconduct Hamidieh (2018) 19.74 17.49 17.52 3 0.186
Table 1: Classification and regression experiments

For most of the data sets, NonFF and FF outperformed the other methods. In our method, each tree is built by processing globally over every regional domain, which is equivalent to the tree built by aggregating the raw data together. A Z-test was applied to verify the losslessness of our method compared with NonFF; its null hypothesis is that the means of the two populations are equal at a given level of significance. For each data set, 40 rounds of tests of NonFF and FF were performed, and the p-value of each Z-test is given in Table 1. If $p \geq 0.05$, the null hypothesis cannot be rejected at the 0.05 level, and there is no significant difference between the outputs of NonFF and FF. If $0.01 \leq p < 0.05$, the null hypothesis cannot be rejected at the 0.01 level; statistically, we consider this a slight but acceptable difference. The null hypothesis should be rejected if $p < 0.01$, indicating a significant difference between the means. Examining the p-values, six data sets show no significant difference between NonFF and FF, and for the remaining data sets the differences are slight. No null hypotheses were rejected.

Overall, we can safely confirm that the Federated Forest is a lossless solution for both classification and regression problems, which achieves the same performance as the non-federated random forest.

5.3 Experiments with Multi-Party Scenario

In this part, we ran tests on the parkinson data set to verify whether the Federated Forest can conjoin more than two domains effectively and achieve a reasonable improvement in accuracy. We chose parkinson because it already contains eight clearly categorized sub-domains. For the tests of training and prediction efficiency, we duplicated the data ten times. In each test we added one more domain to the federated model and recorded the accuracy, training time and prediction time. As shown in Figure 3, the accuracy of the Federated Forest improved consistently. The training time grew almost linearly with the number of domains, which is expected because all features must be examined during tree building. For the prediction time, although more domains and features were added, the difference in execution time was negligible. The results demonstrate that our new prediction algorithm is very effective when handling multiple regional domains.

Figure 3: Accuracy & Execution Time vs. Number of Domains

5.4 Prediction Efficiency

In this part, we compared the efficiency of our new prediction method with the classical prediction approach, using the target marketing, spambase and waveform data sets as examples. We ran each test 20 times and report the average results in Figures 4, 5 and 6. The solid lines with dot markers represent the classical prediction method, and the dashed lines with x markers represent our proposed prediction method.

Figure 4: Prediction Time vs. Number of Estimators

First, we set the maximum tree depth to 4 and varied the number of estimators from 8 to 32; the results are shown in Figure 4. Our method produced a strong improvement in prediction efficiency: although the execution time of both methods increased linearly with the number of estimators, the slopes differed dramatically. The classical method requires multiple rounds of communication at each node during prediction, whereas our method needs only one round of communication per tree.

Figure 5: Prediction Time vs. Max Depth

Second, we set the number of estimators to 8 and adjusted the maximum tree depth from 4 to 16. As shown in Figure 5, our method again outperformed the classical prediction method. As the maximum tree depth increased, the growth rate of the prediction time of both methods gradually slowed and stabilized; this is because with a large maximum depth, tree building may stop early due to pre-pruning, so the actual tree depth is smaller. In our method, no matter how deep the tree is or how many leaf nodes are created, communication is executed only once per tree.

Figure 6: Prediction Time vs. Test Sample Size

Finally, we fixed the number of estimators and the maximum tree depth and varied the test sample rate from 0.1 to 0.4, as shown in Figure 6. The classical approach has a strong linear correlation with the sample size, so its execution time shows a linear growth trend, whereas the execution time of our method changed very slowly, showing that our method is robust to the prediction sample size.

Overall, our new prediction method has proven to be highly efficient.

6 Conclusions

In this paper, we proposed a novel tree-based machine learning model, called Federated Forest, which is lossless with respect to model accuracy and protects data privacy. Based on it, we developed a secure cross-regional machine learning system that allows a model to be jointly trained across different clients holding the same user samples but different attribute sets. The raw data on each client are neither exposed to nor exchanged with other clients during modeling. A novel prediction algorithm was proposed that largely reduces the communication overhead and improves prediction efficiency. Data privacy is secured by redesigning the tree algorithms, deploying encryption methods and establishing a trusted third-party server; raw data are never directly exchanged, only a limited amount of intermediate values between the parties. Experiments on both real-world and UCI data sets showed strong performance on classification and regression tasks, and the proposed Federated Forest proved as accurate as a non-federated random forest that requires gathering the data in one place. The efficiency and robustness of our proposed system have also been verified. Overall, the Federated Forest overcomes the challenges of data islands and privacy protection in a brand-new way, and it can be deployed for real-world applications.

Acknowledgement

Special thanks to Chentian Jin for valuable discussions and feedback.

References

Appendix

Reproducibility

Our model is implemented with Python 3.6, Scikit-learn 0.20, Numpy 1.15.4, python-paillier 1.4.1 and mpi4py 3.0.0. We train and evaluate our model on servers, each with 4 CPU cores, running CentOS 7.0. The information on all data sets used is given in Table 2.

Classification Size Features Classes
target marketing 156198 95(11/84) 2
ionosphere 351 34 2
spambase 4601 57 2
parkinson (Sakar et al., 2019) 756 754 2
kddcup99 4M 42 23
waveform 5000 21 3
gene 801 20531 5
Regression Size Features Range
year prediction 515345 90 1922-2011
Superconduct Hamidieh (2018) 21263 81 0.0002-185
Table 2: Data sets

Pseudo-code for FF-Regressor

The main difference between the regression and classification problems lies in the generation of the leaf node result and the final predictions. The following is the pseudo-code for the regression problem; the differences from the classification version are in line 7 of Algorithm 5, line 9 of Algorithm 6 and line 5 of Algorithm 8, as also sketched in code right after this paragraph.
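As a minimal illustration of that single difference, the sketch below (our own, with an assumed task flag) computes a leaf's value by majority vote for classification and by averaging for regression:

```python
import numpy as np

def leaf_value(y_leaf, task):
    """Leaf output: majority vote for classification, mean for regression.
    This is the only place the regression algorithms differ from the
    classification versions."""
    if task == "classification":
        values, counts = np.unique(y_leaf, return_counts=True)
        return values[np.argmax(counts)]
    return float(np.mean(y_leaf))
```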

Input : Data set $\mathcal{D}_i$ on client $i$;
Local features $\mathcal{F}_i$;
Homomorphically encrypted label $\vec{y}$;
Output : Partial Federated Forest model $\mathcal{T}_i$ on client $i$
1 while tree_build is True do
2       Receive the selected features and sample IDs for current tree building;
3       Function TreeBuild (, , )
4             Create empty tree node;
5             if the pre-pruning condition is satisfied then
6                   Mark current node as leaf node;
7                   Assign leaf label by averaging;
8                   return leaf node;
9             end if
10            ;
11             if the selected features on this client are not empty then
12                   Compute the impurity improvement for each selected local feature and find the local maximum;
13                   Record the local best split feature and split threshold;
14             end if
15            Send the encrypted local best impurity improvement to master;
16             if receive the split message from master then
17                   /* Global best split feature is from itself */
18                   is_selected True;
19                   Split samples and send sample indices of left and right subtrees to master;
20            else
21                   Receive sample indices of left and right subtrees;
22             end if
23            left_subtree ← TreeBuild (…);
24             right_subtree ← TreeBuild (…);
25             if is_selected is True then
26                   Save and split threshold to tree node;
27             end if
28            Save subtrees to tree node;
29             return tree node;
30       end
31      Append current tree to forest;
32       return Partial Federated Forest Model on Client ;
33      
34 end while
ALGORITHM 5 Federated Forest – Client
Input : Sample IDs of the entire data set;
Encoded feature space $\mathcal{F}$;
Encrypted label $\vec{y}$;
Output : Complete Federated Forest model $\mathcal{T}$
1 /*Build trees for forest recurrently*/
2 while tree_build is True do
3       Broadcast the randomly selected sample IDs;
4       Randomly select features from $\mathcal{F}$ and send each client $i$ its own subset privately;
5       Function  TreeBuild (, , )
6             Create empty tree node;
7             if the pre-pruning condition is satisfied then
8                  Mark current node as leaf node;
9                   Assign leaf label by averaging;
10                   return leaf node;
11             end if
12            Receive the encrypted local best impurity improvements and related information from all clients;
13             Take the global maximum and notify the corresponding client;
14             Receive the split indices from that client and broadcast them;
15             left_subtree ← TreeBuild (…);
16             right_subtree ← TreeBuild (…);
17             Save subtrees and split info to tree node;
18             return tree node;
19       end
20      Append current tree to forest;
21       return Complete Federated Forest Model;
22 end while
ALGORITHM 6 Federated Forest – Master
Input : Partial Federated Forest model $\mathcal{T}_i$ saved on the $i$-th client;
Encoded features on the $i$-th client;
Test set on the $i$-th client;
Output : Sample IDs $S_i(l)$ of each leaf $l$ of $\mathcal{T}_i$, $l \in \mathcal{L}$
1 while TreePrediction is True do
2       Function TreePredict (, , )
3             if is_leaf is True then
4                  Return sample IDs and leaf label;
5            else
6                   if $\mathcal{T}_i$ keeps the split info of current node then
7                         Split samples into subtrees based on threshold;
8                         left_subtree ← TreePredict (…);
9                         right_subtree ← TreePredict (…);
10                        
11                  else
12                         left_subtree ← TreePredict (…);
13                         right_subtree ← TreePredict (…);
14                        
15                   end if
16                  Return left and right subtrees;
17                  
18             end if
19            Send all leaf sample sets $S_i(l)$ to master;
20       end
21      return all leaf sample sets $S_i(l)$;
22 end while
ALGORITHM 7 Federated Forest Prediction – Client
Input : Sample IDs of test set $X$;
Output :  Prediction of Federated Forest
1 while TreePrediction is True do
2       Gather all $S_i(l)$ from the clients;
3       Obtain $S(l) = \bigcap_{i=1}^{M} S_i(l)$, where $l \in \mathcal{L}$;
4       Return the label of leaf $l$ for samples in $S(l)$, for every $l \in \mathcal{L}$;
5 end while
6 Calculate forest predictions by averaging the results of all trees;
return Final Predictions;
ALGORITHM 8 Federated Forest Prediction – Master

Notations In Proof


  • Sample IDs are denoted as $ID$, and $S_i(l)$ contains the sample IDs that fall into leaf $l$ of tree $\mathcal{T}_i$. $S(l)$ denotes the sample set of leaf node $l$ in the complete binary tree model $\mathcal{T}$.

  • The test sample set is $X$, and a single sample is $s$.

  • $\mathcal{P}_i(s)$ is the set of decision-making paths along which sample $s$ goes through the binary tree $\mathcal{T}_i$ to fall into leaf nodes. For the tree $\mathcal{T}_i$, it is possible that $s$ falls into more than one leaf, due to our model storage strategy.

  • $P(s, l)$ is the decision-making path along which sample $s$ goes through the complete binary tree $\mathcal{T}$ to fall into leaf node $l$. For the complete tree $\mathcal{T}$, if a sample falls into one leaf, it cannot fall into another; that is, for any leaves $l$ and $l'$ in $\mathcal{T}$, $S(l) \cap S(l') = \emptyset$.

  • The complete tree on the master is defined as $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \cdots \cup \mathcal{T}_M$.

  • Detailed descriptions of the notations are given in Table 3.

Notation Description
$M$ : number of regional domains
$\mathcal{D}_i$ : data set held by client $i$
$N$ : total number of samples in training
$\mathcal{D}$ : entire data set
$\mathcal{F}_i$ : feature space of $\mathcal{D}_i$
$\mathcal{F}$ : entire feature space, $\mathcal{F} = \mathcal{F}_1 \cup \cdots \cup \mathcal{F}_M$
$\vec{y}$ : labels
$\mathcal{T}_i$ : partial decision/regression tree stored on the $i$-th client
$\mathcal{T}$ : complete tree
$\mathcal{L}$ : leaf node set of the entire tree
$l$ : leaf node of the current tree, $l \in \mathcal{L}$
$LCA(l, l')$ : lowest common ancestor of $l$ and $l'$ in $\mathcal{T}$
$ID$ : sample IDs of the entire data set
$S_i(l)$ : sample IDs that fall into leaf $l$ of tree $\mathcal{T}_i$
$S(l)$ : sample IDs that fall into leaf $l$ of complete tree $\mathcal{T}$
$s$ : single test sample
$X$ : entire test sample set
$\mathcal{P}_i(s)$ : set of decision-making paths of sample $s$ on $\mathcal{T}_i$
$P(s, l)$ : decision-making path of sample $s$ on $\mathcal{T}$
$h$ : maximum tree depth

Table 3: Notations

Proof of Proposition 1

For the prediction process, samples go through each client tree and fall into one or multiple leaves. For any leaf $l$ of the complete tree $\mathcal{T}$, the sample IDs in leaf $l$ can be obtained by taking the intersection of the per-client leaf sets: $S(l) = \bigcap_{i=1}^{M} S_i(l)$.

Proof.

In order to prove $S(l) = \bigcap_{i=1}^{M} S_i(l)$, we prove both inclusions.

Proof of $S(l) \subseteq \bigcap_{i=1}^{M} S_i(l)$:

For any sample $s$ in leaf $l$ of the complete tree $\mathcal{T}$, let $P(s, l)$ denote its decision-making path from the root to the leaf node. For the model $\mathcal{T}_i$ on each client $i$, if the model stores split information at the current node, the threshold determines whether the sample enters the left or the right subtree; if it does not, the sample enters both subtrees simultaneously. Therefore, for sample $s$, its decision-making path on the complete tree must be contained among its decision-making paths on any client: $P(s, l) \in \mathcal{P}_i(s)$, which is equivalent to $s \in S_i(l)$. Hence $s \in S_i(l)$ for every $i \in \{1, \dots, M\}$, which proves $S(l) \subseteq \bigcap_{i=1}^{M} S_i(l)$.

Proof of $\bigcap_{i=1}^{M} S_i(l) \subseteq S(l)$:

Assume that a sample $s \in \bigcap_{i=1}^{M} S_i(l)$ does not belong to leaf node $l$ but belongs to another leaf $l'$ in the complete model $\mathcal{T}$; that is, $s \notin S(l)$ and $s \in S(l')$ with $l' \neq l$.

By the inclusion proved above, $s \in S(l')$ implies $s \in \bigcap_{i=1}^{M} S_i(l')$.

That is to say, sample $s$ falls into both leaf node $l$ and leaf node $l'$ in every model $\mathcal{T}_i$ stored on the clients.

In the same binary tree structure, the path from a node up to the root is fixed and unique.

Under the complete tree structure, the paths from leaves $l$ and $l'$ up to the root meet at the lowest common ancestor node $LCA(l, l')$, which exists and is unique.

Since $s$ enters both subtrees of $LCA(l, l')$ on every client, no client stores the split information of the node $LCA(l, l')$.

This contradicts $\mathcal{T} = \mathcal{T}_1 \cup \mathcal{T}_2 \cup \cdots \cup \mathcal{T}_M$, i.e., the fact that the split information of every internal node of the complete tree is stored on some client.

Therefore the hypothesis does not hold, and $\bigcap_{i=1}^{M} S_i(l) \subseteq S(l)$.

In summary, $S(l) = \bigcap_{i=1}^{M} S_i(l)$.

Communication Complexity Analysis

Here we give a brief analysis of communication complexity. There are three main types of communication during training, where $M$ is the number of regional domains:


  • Send and receive. The master sends the randomly selected features to each client in every turn of tree building, and the client holding the global optimal feature sends the sample split indices of this feature to the master when building the node. The communication complexity is $O(M)$.

  • Broadcast. The master broadcasts sample indices for each tree node construction. The communication complexity is $O(M)$.

  • Gather. The master gathers and compares the impurity improvements of the features in every turn of node building. It also gathers the sample sets of all leaves of each tree stored by the clients during prediction. The communication complexity is $O(M)$.

Since the maximum depth is $h$, a tree has at most $2^h - 1$ internal nodes and $2^h$ leaf nodes. Taking the process of building one tree as an example, the communication complexity of the whole system in the training phase is $O(M \cdot 2^h)$. For the prediction phase, if not optimized, the communication complexity is $O(M \cdot 2^h)$ per tree; with our optimized method, it is $O(M)$ per tree.
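As a back-of-envelope illustration of these counts (our own reading, assuming $O(M)$ messages per collective event; the shapes of the formulas, not the constants, are the point):

```python
def training_messages(M, h):
    """Rough upper bound on messages for building one tree: each of the at
    most 2**h - 1 internal nodes needs one gather of local bests (M messages),
    one notification, and one broadcast of split indices (M messages)."""
    internal_nodes = 2 ** h - 1
    return internal_nodes * (2 * M + 1)

def prediction_messages_classical(n_trees, h, M):
    """Classical prediction: up to one communication round per internal node
    of every tree."""
    return n_trees * (2 ** h - 1) * M

def prediction_messages_ours(n_trees, M):
    """Proposed method: a single gather (M messages) per tree, independent of
    depth and test sample size."""
    return n_trees * M
```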