1 Introduction
Artificial intelligence has made great progress in recent years, thanks to the large amounts of data collected across different domains. Unfortunately, data has also become the largest bottleneck for deploying AI methods. In real-world applications, big data are scattered across different companies and government organizations and stored as data islands; in other words, data from different domains cannot be shared. For companies, data is among the most important assets and cannot be easily shared. Government data are highly secured and mostly left unutilized. Besides, people are now highly sensitive about data privacy: data breaches happen occasionally, and most countries now have data privacy legislation either enacted or being drafted. In 2018, the European Union enacted the General Data Protection Regulation (GDPR) (Regulation, 2016). The GDPR gives individuals more control over their personal data and imposes strict principles and full transparency on how businesses handle these data. Any tracking or recording of personal data must be authorized by the customer before collection, and businesses must clearly state their intentions and plans for the data. Faced with these difficulties and restrictions, the question becomes whether it is worth investing effort to make use of the scattered data.
The answer is yes. Academia, companies, and governments could all benefit from resolving the data-island situation. Joint models can improve many current services and products and support more potential applications, including but not limited to medical studies, targeted marketing, urban anomaly detection, and risk management, as shown in Figure 1. For example, banks could train joint models with e-commerce companies to achieve more precise customer profiling and improve their marketing strategies. Government organizations could work with ride-hailing companies to better understand a city's daily traffic flow and adjust road planning accordingly. Consequently, the question becomes how we can train such joint models. Faced with the challenges of data islands and data privacy and security, the currently available methods cannot completely solve these problems, so developing new methods to bridge the gap between real-world applications and data islands becomes an urgent task. In 2016, a new approach named federated learning (McMahan et al., 2016; Konečnỳ et al., 2016; Konecnỳ et al., 2016) was proposed, which focuses on building privacy-preserving machine learning models when data are distributed across different places. Federated learning provides a new way to look at these problems and has shown the feasibility of real-life applications.
Inspired by their work, we propose a novel privacy-preserving tree-based machine learning model, named Federated Forest (FF). Based on it, we developed a secure cross-regional machine learning system that is capable of overcoming the challenges described above. Our contributions are fourfold:


Secured privacy. Data privacy is fully protected by redesigning the tree-building algorithms, applying encryption methods, and establishing a trusted third-party server. The content and amount of exchanged information are kept to a minimum, and each participant is blind to the others.

Lossless (accurate). Our model is based on the methodologies of CART (Breiman et al., 1984) and bagging (Breiman, 1996) and fits the vertical federated setting. We experimentally show that our model achieves the same level of accuracy as the non-federated approach that brings the data into one place.

Efficiency. An efficient communication mechanism based on MPI (Barney, 2019) was implemented for sharing intermediate values. A fast prediction algorithm was designed that is only weakly correlated with (effectively independent of) the number of domains and trees, the maximum tree depth, and the sample size.

Practicability and scalability. Our model supports both classification and regression tasks, and is practical, extensible, and scalable for real-life applications. Experiments on real-world data sets demonstrate our model's accuracy, efficiency, and robustness.
2 Related Work
2.1 Federated Learning
Federated learning (McMahan et al., 2016; Konečnỳ et al., 2016; Konecnỳ et al., 2016) was first proposed to solve the problem that rich data are generated on user devices, but regulations make it difficult to build models from those data. The solution is to keep the data on user devices and train a shared model by aggregating locally computed intermediate results of neural networks. Chen et al. (2018) proposed a new recommender system that applies federated learning to meta-learning. Federated learning has also been applied to multi-task problems (Smith et al., 2017), and a loss-based AdaBoost method was developed in (Huang et al., 2018). Hardy et al. (2017a) introduced a vertically aggregated federated learning method, in which each data provider possesses unique features and sample IDs are aligned between providers; they jointly learned a logistic regression model that secures data privacy while keeping the model accurate. In addition, a modular benchmarking framework for federated settings was presented in (Caldas et al., 2018). Although many research results have appeared, the definition of federated learning remained blurry until the work of (Yang et al., 2019), which categorized current federated learning methods into three types: horizontal federated learning, vertical federated learning, and federated transfer learning. Following this survey, the same team introduced a secure federated transfer learning framework (Liu et al., 2018) to build models for a target-domain party by leveraging rich labels from a source-domain party, where the data sets of the two parties differ in both sample space and feature space. Cheng et al. (2019) revisited tree boosting and applied it to the vertical federated setting, proposing a lossless framework that keeps the information of each private data provider from being revealed. Zhuo et al. (2019) presented a novel reinforcement learning approach that respects privacy requirements and builds a Q-network for each agent with the help of other agents. To make federated machine learning more practical, these efforts are pushing toward a Federated AI Ecosystem in which partners can fully exploit the value of their data and promote vertical applications. An IEEE standard, Guide for Architectural Framework and Application of Federated Machine Learning (Group, 2019), has also been initiated and is being drafted.
2.2 Data Privacy Protection
In federated learning, two major encryption approaches are applied to protect data privacy and security: differential privacy (Dwork, 2006) and homomorphic encryption (Gentry, 2009). The idea of differential privacy is to add properly calibrated noise to the algorithm or the data, with examples including (Geyer et al., 2017; McMahan et al., 2017). This approach does not affect computational efficiency much but may weaken model performance. Homomorphic encryption supports secure multiplication and addition on encrypted data: once the result is decrypted, it matches the output of the same operations on the corresponding raw data. The works of (Hardy et al., 2017b; Le et al., 2018; Kim et al., 2018a) all used this approach. Homomorphic encryption has two major drawbacks. First, its algorithmic complexity is high, so frequent use is intensely time-consuming. Second, it does not support non-linear functions, such as the sigmoid and logarithmic functions, so approximations are necessary. Hardy et al. (2017a) used a Taylor expansion to approximate the sigmoid function, and Kim et al. (2018b) used a least-squares method. In theory these approaches can work, but in our practice the results were not ideal.
3 Problem Formulation
3.1 Data Distribution
In our work, we focus on vertical federated learning problems, in which all participants share the same sample space but have different feature spaces, as shown in Figure 2. Consider each company or government organization as a regional data domain, denoted $\mathcal{D}_i$; the overall data domain is $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_M\}$, where $M$ is the number of regional domains. We denote the feature space of $\mathcal{D}_i$ as $\mathcal{F}_i$; the entire feature space is $\mathcal{F} = \mathcal{F}_1 \cup \dots \cup \mathcal{F}_M$. During the modeling process, all features' true names are encoded to protect privacy. For any $i$ and $j$ with $i \neq j$, the feature spaces are disjoint: $\mathcal{F}_i \cap \mathcal{F}_j = \emptyset$. In our work, all domains have the same number of samples, and the sample IDs are aligned across domains. One master machine is deployed as the parameter server, and multiple client machines are used, each holding one regional data domain. The labels are provided by one of the clients, which we assume to be client 1; they are then copied in encrypted form to the master and the other clients. Two things to note here: 1) In reality, $M$ is usually small; even $M = 5$ means five different organizations modeling together, which would be rare. The model design could be quite different for large $M$. 2) We do not discuss methods for ID alignment, since that is a separate research topic, addressed in work such as (Nock et al., 2018). The notation used in this paper is summarized in Table 3.
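To make the setting concrete, here is a minimal sketch (in Python with NumPy) of a vertical partition: the three-domain layout, the feature indices, and the data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 6 samples, 5 features, with sample IDs aligned across domains.
X = rng.normal(size=(6, 5))
sample_ids = np.arange(6)

# Vertical partition: each regional domain holds a disjoint feature subset
# but the full (aligned) sample space. The column assignment is hypothetical.
domain_features = {1: [0, 1], 2: [2], 3: [3, 4]}
domains = {i: X[:, cols] for i, cols in domain_features.items()}

# Every domain sees all sample IDs but only its own columns.
assert all(d.shape[0] == len(sample_ids) for d in domains.values())
# Feature spaces are pairwise disjoint and jointly cover the full space.
all_cols = sorted(c for cols in domain_features.values() for c in cols)
assert all_cols == list(range(X.shape[1]))
```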
3.2 Problem Statement
The formal statement of the problem is given as below:
Given: A regional domain $\mathcal{D}_i$ and the encrypted labels on each client $i$, $i = 1, \dots, M$.
Learn: A Federated Forest such that, for each tree in the forest: 1) a complete tree model is held on the master; 2) a partial tree model is stored on each client $i$, $i = 1, \dots, M$.
Constraint: The performance (accuracy, F1-score, MSE, etc.) of the Federated Forest must be comparable to that of the non-federated random forest.
4 Methodology
Here we present the framework of Federated Forest, which is based on the CART tree (Breiman et al., 1984) and bagging (Breiman, 1996) and is able to handle both classification and regression problems. The framework is shown in Figure 2, and the details of the algorithms are given in the following subsections.
4.1 Model Building
Algorithm.
In our work, each tree is built by all parties working together, and the tree structure is stored on the master node and on every client. However, each client's tree only stores the split information for its own features. We first present the client-side Federated Forest algorithm in Algorithm 1; Algorithm 2 describes how the master coordinates the modeling process.
Following the bagging paradigm, the master node first randomly selects a subset of features and samples from the entire data set. The master then privately notifies each client of the selected sample IDs and of those selected features the client owns. For example, if ten features are chosen by the master and client 1 possesses three of them, then client 1 only learns that these three features were selected. It never learns how many features were chosen globally, let alone what those features are. During tree construction, the pre-pruning conditions are checked frequently; if they are satisfied, the clients and master create leaf nodes accordingly.
If the termination condition is not triggered, all clients enter the splitting state, and the best split feature for the current tree node is selected by comparing impurity improvements. First, each client finds its locally optimal split feature. The master then collects all local optima and their corresponding impurity improvements, from which the globally best feature is found. Second, the master notifies the client that provided the globally best feature. That client splits the samples and sends the resulting data partition (the sample IDs falling into the left and right subtrees) to the master for distribution. For the current tree node, only the client that provides the best split feature saves the details of this split; the other clients only learn that the selected feature was not contributed by themselves, and the split information, such as the threshold and split feature, remains unknown to them. Last, the subtrees are created recursively and the current tree node is returned. During modeling, once the child tree nodes are created successfully, the parent node no longer needs to keep the sample IDs for the subtrees; if the connection goes down, modeling can easily be recovered from the break point.
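The split-selection step above can be sketched as follows. This is a simplified local illustration, not the paper's implementation: the impurity measure (weighted variance reduction, as for regression CART), the client layout, and all names are assumptions, and real MPI communication is replaced by plain function calls.

```python
import numpy as np

def local_best_split(X_local, y, feature_names):
    """Client side: best (feature, threshold, impurity improvement) over local features."""
    best = (None, None, -np.inf)
    parent_impurity = y.var() * len(y)
    for j, name in enumerate(feature_names):
        # Candidate thresholds: every unique value except the maximum,
        # so both children are always non-empty.
        for t in np.unique(X_local[:, j])[:-1]:
            left, right = y[X_local[:, j] <= t], y[X_local[:, j] > t]
            gain = parent_impurity - (left.var() * len(left) + right.var() * len(right))
            if gain > best[2]:
                best = (name, t, gain)
    return best

# Master side: collect only (feature, threshold, gain) proposals -- never raw
# data -- and pick the global winner; the winning client then partitions IDs.
rng = np.random.default_rng(1)
y = rng.normal(size=20)
clients = {
    "client_1": (rng.normal(size=(20, 2)), ["f0", "f1"]),
    "client_2": (rng.normal(size=(20, 3)), ["f2", "f3", "f4"]),
}
proposals = {cid: local_best_split(X, y, names) for cid, (X, names) in clients.items()}
winner = max(proposals, key=lambda cid: proposals[cid][2])
```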
Model Storage.
A tree predictive model consists of two parts: the tree structure and the split information (the feature and threshold used at each split). Since the forest is built with all clients working together, the structure of each tree is the same on every client. However, for a given tree node, a client may or may not store the detailed split information; only the master server stores the complete model. For each tree node, a client stores the corresponding split threshold only if it provided the split feature; otherwise, it stores nothing at that node except the node structure. We denote the complete set of tree nodes saved on the master by $T$, and the set of tree nodes (without full details) stored by the $i$th client by $T_i$. Since the tree structure is consistent, $|T| = |T_i|$ and $T^{leaf} = T_i^{leaf}$, where $T^{leaf}$ is the set of leaf nodes. The complete tree is the union of all partial trees: $T = \bigcup_{i=1}^{M} T_i$.
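A minimal sketch of this storage scheme, assuming a hypothetical hand-built three-node tree and made-up feature names ("age", "income"): structure is replicated everywhere, while split details stay with the owning client.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    node_id: int
    feature: Optional[str] = None    # split feature; None on non-owning clients
    threshold: Optional[float] = None
    is_leaf: bool = False

# Complete tree on the master: every internal node carries its split details.
master = {0: Node(0, "age", 30.0),
          1: Node(1, "income", 5e4),
          2: Node(2, is_leaf=True)}

def client_view(master_tree, owned_features):
    """Keep the structure of every node; keep split details only for owned features."""
    view = {}
    for nid, n in master_tree.items():
        if n.is_leaf or n.feature in owned_features:
            view[nid] = n
        else:
            view[nid] = Node(nid)  # structure only, split info withheld
    return view

# A hypothetical bank client that owns only the "income" feature.
bank = client_view(master, {"income"})
```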
4.2 Model Prediction
Under the vertical federated setting (Yang et al., 2019), the classical prediction approach involves multiple rounds of communication between the master and the clients, even for a single sample. When the number of trees, the maximum tree depth, and the sample size are large, the communication required for prediction becomes a serious burden. To address this problem, we designed a novel prediction method that takes advantage of our distributed model-storage strategy: it needs only one round of collective communication per tree, or even per forest. We first present the client-side prediction algorithm in Algorithm 3; Algorithm 4 describes how the master server coordinates the clients to obtain the final predictions.
First, each client uses its locally stored model to predict samples. For a tree on the $i$th client, each sample enters at the root node and, following the binary tree, finally falls into one or several leaf nodes. As the sample passes through each node, if the model stores the split information at that node, the sample is routed into the left or right subtree by checking the split threshold; if the model has no split information at that node, the sample enters both the left and right subtrees simultaneously.
Secondly, this path determination is performed recursively until each sample has fallen into one or several leaf nodes. When this process finishes, each leaf node of the tree on client $i$ holds a batch of samples. We use $S_i(l)$ to represent the set of samples that fall into leaf node $l$ of the partial tree $T_i$, where $l \in T^{leaf}$, the set of leaf nodes of the tree.
Thirdly, for each leaf $l$, the master takes the intersection $S(l) = \bigcap_{i=1}^{M} S_i(l)$. The sample sets owned by each leaf node of the complete tree are then already associated with the final predictions. We give a formal proposition so that the new prediction method is mathematically defined:
Proposition 1.
For samples that fall into one or multiple leaves of each partial tree $T_i$, and for any leaf $l$ of the complete tree $T$, the sample IDs in leaf $l$ can be obtained by taking the intersection of the per-client leaf sets: $S(l) = \bigcap_{i=1}^{M} S_i(l)$.
The proof is provided in the Appendix ("Proof of Proposition 1"). After obtaining the label values for each sample on all trees, we can easily compute the final predictions. In this approach, only one round of communication is needed per tree, or even just one round for the entire forest.
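The traversal-plus-intersection scheme can be sketched as follows. The five-node tree, the two clients, and the samples are hypothetical, and the master's "communication" is reduced to a plain per-leaf intersection of the clients' leaf sets, in the spirit of Proposition 1.

```python
def predict_leaves(tree, samples):
    """Client side: route every sample; unknown splits send it down BOTH subtrees."""
    leaves = {}
    def walk(nid, sid, x):
        node = tree[nid]
        if node["leaf"]:
            leaves.setdefault(nid, set()).add(sid)
        elif node["split"] is None:            # split owned by another client
            walk(node["l"], sid, x); walk(node["r"], sid, x)
        else:
            walk(node["l"] if node["split"](x) else node["r"], sid, x)
    for sid, x in samples.items():
        walk(0, sid, x)
    return leaves

# Hypothetical tree: root split owned by client A, inner split owned by client B.
A = {0: {"leaf": False, "split": lambda x: x[0] <= 0, "l": 1, "r": 2},
     1: {"leaf": False, "split": None, "l": 3, "r": 4},
     2: {"leaf": True}, 3: {"leaf": True}, 4: {"leaf": True}}
B = {0: {"leaf": False, "split": None, "l": 1, "r": 2},
     1: {"leaf": False, "split": lambda x: x[1] <= 0, "l": 3, "r": 4},
     2: {"leaf": True}, 3: {"leaf": True}, 4: {"leaf": True}}

samples = {0: (-1, -1), 1: (-1, 1), 2: (1, 0)}
per_client = [predict_leaves(A, samples), predict_leaves(B, samples)]

# Master side: intersect the per-leaf sample sets over all clients. Each
# sample ends up in exactly the leaf the complete tree would assign it to.
final = {leaf: set.intersection(*(c.get(leaf, set()) for c in per_client))
         for leaf in (2, 3, 4)}
# final == {2: {2}, 3: {0}, 4: {1}}
```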
4.3 Privacy Protection
We categorize our privacy-protection efforts into five parts:
Identities. In real-world tasks, we often face situations where sample IDs are tied to persons' real identities. Because of this, we must encrypt the identities before ID alignment. One possible approach is the following: first, all clients use an agreed hash method to transform the sample IDs into new hashed IDs; then the Message-Digest Algorithm 5 (MD5) is applied to the hashed IDs to generate irreversibly encrypted IDs.
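A minimal sketch of this two-step scheme; the pre-shared value and the choice of SHA-256 as the agreed first-stage hash are illustrative assumptions.

```python
import hashlib

SHARED_SECRET = b"agreed-offline"   # hypothetical value agreed among the clients

def encrypt_id(raw_id: str) -> str:
    # Step 1: agreed (keyed) hash, so outsiders cannot replay plain IDs.
    hashed = hashlib.sha256(SHARED_SECRET + raw_id.encode()).hexdigest()
    # Step 2: MD5 over the hashed ID, as described, yielding the aligned key.
    return hashlib.md5(hashed.encode()).hexdigest()

# Clients encrypt independently; equal raw IDs yield equal aligned keys,
# so ID alignment works without exchanging real identities.
assert encrypt_id("user_42") == encrypt_id("user_42")
assert encrypt_id("user_42") != encrypt_id("user_43")
```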
Labels. For classification problems, even if the labels are encoded, the true values can still be guessed, especially in binary classification. For regression problems, labels can be encrypted with homomorphic encryption, but modeling then becomes extremely time-consuming. In practical tasks, there is a trade-off between security protection and computational efficiency.
Features. On each client, local features are encoded before being given to the master for global feature sampling, so the master never knows the real meaning of the features.
Communication. Encryption methods such as RSA and AES can be applied to secure everything communicated during training and prediction (model intermediate values, sample indices, etc.).
Model Storage. The entire model is distributed across all clients. For each node, a client stores the corresponding split information only if the split feature resides on its local machine; otherwise, it stores only the structure of the current node. Clients know nothing about each other, including whose features were selected and at which tree nodes. The master can optionally keep a copy of the entire model.
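Several of the mechanisms above (encrypted labels, secured communication) rely on additively homomorphic encryption, discussed in Section 2.2. The following toy Paillier sketch, with deliberately tiny fixed primes and therefore no real security, only illustrates the add-under-encryption property:

```python
import math
import secrets

# Toy Paillier keypair with tiny primes -- for illustration ONLY, not secure.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)        # since L(g^lam mod n^2) = lam when g = n + 1

def encrypt(m: int) -> int:
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    L = (pow(c, lam, n2) - 1) // n      # L(x) = (x - 1) / n
    return (L * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
c = (encrypt(12) * encrypt(30)) % n2
assert decrypt(c) == 42
```

As Section 2.2 notes, real deployments pay a heavy computational cost for such schemes, which motivates this paper's design of exchanging only minimal intermediate values instead of encrypting everything.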
5 Experimental Studies
5.1 Experimental Setup
In this section, we use 9 benchmark data sets: one real-world data set, target marketing, and 8 public data sets from UCI (Dua and Graff, 2017; Sakar et al., 2019; Fernandes et al., 2015; Hamidieh, 2018), as shown in Table 1. Different sample sizes and feature spaces were considered, and the accuracy, efficiency, and robustness of the proposed framework were tested on both classification and regression problems. In our experiments we did not pursue absolute accuracy; instead, we tested whether our method performs at the same level as the non-federated approach, i.e., whether it is lossless. The target marketing data set was collected from two entirely different domains: one from an e-commerce company, with 84 features, and one from a bank, which provided 11 features. All sensitive information was protected before modeling. Three main series of experiments were conducted: experiments with two data providers, experiments with multiple data providers, and an analysis of prediction efficiency. The details of each test are given in the following subsections.
5.2 Experiments with TwoParty Scenarios
In this part, the public UCI data sets were vertically and randomly split along the feature dimension and placed on two different client servers ($M = 2$), each containing half of the feature space of the original data. The target marketing data set was likewise placed on two client servers, each containing several business domains. The experiments in this section are summarized as follows:


Federated Logistic/Linear Regression (FLR): We jointly trained logistic/linear regression models, where data are kept locally and the model is partly stored on each client.

Non-Federated Forest (NonFF): All data were integrated together for random forest modeling.

Random Forest 1 (RF1): Only the partial data from the first client was used to build a random forest model.

Random Forest 2 (RF2): Only the partial data from the second client was used to build a random forest model.

Federated Forest (FF): Our proposed model, in which the two parties jointly learn a random forest. Data were kept locally and the model was partly stored on each client.
We conducted experiments on both classification and regression problems and report accuracy and RMSE in Table 1. The performance of RF1 and RF2 was clearly worse than that of NonFF and FF. Both RF1 and RF2 can be regarded as modeling with data from a single business domain, and the insufficient feature space resulted in an incomplete view of the global knowledge. We also found that in most tests the regression models did not perform very well. For the test on target marketing, since direct aggregation of data between the two institutions was not allowed, we only ran RF1, RF2, FLR, and FF. The results show that FF performs as expected and that better accuracy is achieved by building models across different domains.
Classification  RF1  RF2  FLR  NonFF  FF  pvalue 
target marketing  0.870  0.848  0.862      
ionosphere  0.864  0.828  0.873  0.211  
spambase  0.844  0.831  0.873  0.065  
parkinson (Sakar et al., 2019)  0.849  0.849  0.829  0.744  
kdd cup 99  0.974  0.965    
waveform  0.745  0.743    0.029  
gene  0.975  0.975    0.229  
Regression  RF1  RF2  FLR  NonFF  FF  pvalue 
year prediction  10.47  10.72  9.56  0.058  
superconduct (Hamidieh, 2018)  19.74  17.49  17.52  3  0.186 
For most of the data sets, NonFF and FF outperformed the other methods. In our method, each tree is built by processing globally over every regional domain, which is equivalent to a tree built by aggregating the raw data together. A Z-test was applied to verify the losslessness of our method relative to NonFF; the null hypothesis is that the means of the two populations are equal at a given significance level. For each data set, 40 rounds of tests of NonFF and FF were performed, and the p-value of each Z-test is given in Table 1. If $p > 0.05$, the null hypothesis cannot be rejected at the 0.05 level and there is no significant difference between the outputs of NonFF and FF. If $0.01 < p \le 0.05$, the null hypothesis cannot be rejected at the 0.01 level; statistically, we consider this range of p-values to indicate a slight but acceptable difference. The null hypothesis is rejected, with a significant difference between the means, if $p \le 0.01$. Examining the p-values, six data sets show no significant difference between the results of NonFF and FF, and for the remaining data sets the differences are slight; no null hypothesis was rejected. Overall, we can safely conclude that the Federated Forest is a lossless solution for both classification and regression problems, achieving the same performance as the non-federated random forest.
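The Z-test procedure can be sketched as follows; the two accuracy samples below are synthetic stand-ins for the 40 recorded rounds, not the paper's actual results.

```python
import math

def two_sample_z_test(a, b):
    """Two-sided Z-test for equality of the means of two result populations."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    z = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
    # Standard normal CDF via erf; two-sided p-value.
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - phi)

# Synthetic accuracies over 40 rounds with nearly identical means.
non_ff = [0.8730 + 0.001 * ((i * 7) % 5 - 2) for i in range(40)]
ff     = [0.8729 + 0.001 * ((i * 3) % 5 - 2) for i in range(40)]
z, p = two_sample_z_test(non_ff, ff)
# Here p > 0.05, so we fail to reject equal means at the 0.05 level.
```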
5.3 Experiments with MultiParty Scenario
In this part, we ran tests on the parkinson data set to verify whether the Federated Forest can effectively conjoin more than two domains and achieve a reasonable improvement in accuracy. We chose parkinson because it already contains eight clearly categorized sub-domains. For the tests of training and prediction efficiency, we duplicated the data ten times. In each test we added one domain to the federated model and recorded the accuracy and the training and prediction times. As shown in Figure 3, the accuracy of the Federated Forest improved consistently. The training time was almost linear in the number of domains, which is expected because all features are examined during tree building. For the prediction time, although more domains and features were added, the difference in execution time was negligible. These results demonstrate that our new prediction algorithm is very effective when handling multiple regional domains.
5.4 Prediction Efficiency
In this part, we compared the efficiency of our new prediction method with the classical prediction approach, using the target marketing, spambase, and waveform data sets as examples. All tests were run 20 times and we report the average results, shown in Figures 4, 5, and 6. Solid lines with dot markers represent the classical prediction method, and dashed lines with x markers represent our proposed prediction method.
Prediction Time vs. Number of Estimators
First, we set the maximum tree depth to 4 and varied the number of estimators from 8 to 32; the results are shown in Figure 4. Our method yields a strong improvement in prediction efficiency. Although the execution time of both methods increased linearly with the number of estimators, the slopes differed dramatically: the classical method requires multiple rounds of communication at each node during prediction, whereas our method requires only one round of communication per tree.
Second, we set the number of estimators to 8 and varied the maximum tree depth from 4 to 16. As shown in Figure 5, our method outperformed the classical prediction method again. As the maximum tree depth increases, the growth rate of the prediction time for both methods gradually slows and stabilizes: with a large maximum depth, tree building may stop early due to pre-pruning, so the actual tree depth is smaller. In our method, no matter how deep the tree is or how many leaf nodes are created, communication is executed only once per tree.
Finally, we fixed the number of estimators and the maximum tree depth and varied the test sample rate from 0.1 to 0.4, as shown in Figure 6. Because the classical approach is strongly linearly correlated with the sample size, its execution time grows linearly, whereas the execution time of our method changes very slowly, showing that our method is robust to the prediction sample size.
Overall, our new prediction method proved to be highly efficient.
6 Conclusions
In this paper, we proposed a novel tree-based machine learning model, called Federated Forest, which is lossless with respect to model accuracy and protects data privacy. Based on it, we developed a secure cross-regional machine learning system that allows a model to be jointly trained across different clients holding the same user samples but different attribute sets. The raw data on each client are neither exposed nor exchanged with other clients during modeling. We also proposed a novel prediction algorithm that greatly reduces communication overhead and improves prediction efficiency. Data privacy is secured by redesigning the tree algorithms, deploying encryption methods, and establishing a trusted third-party server: raw data are never directly exchanged, only a limited amount of intermediate values between the parties. Experiments on both real-world and UCI data sets show superior performance in classification and regression tasks, and the proposed Federated Forest was shown to be as accurate as a non-federated random forest that requires gathering all the data in one place. The efficiency and robustness of the proposed system were also verified. Overall, the Federated Forest overcomes the challenges of data islands and privacy protection in a brand-new way, and it can be deployed in real-world applications.
Acknowledgement
Special thanks to Chentian Jin for valuable discussions and feedback.
References
 Barney (2019) Blaise Barney. 2019. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. Available at https://computing.llnl.gov/tutorials/mpi.
 Breiman (1996) Leo Breiman. 1996. Bagging Predictors. Machine Learning 24, 2 (01 Aug 1996), 123–140. https://doi.org/10.1023/A:1018054314350
 Breiman et al. (1984) L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. 1984. Classification and Regression Trees. Taylor & Francis.
 Caldas et al. (2018) Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2018. LEAF: A Benchmark for Federated Settings. arXiv:cs.LG/1812.01097
 Chen et al. (2018) Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. 2018. Federated MetaLearning for Recommendation. arXiv:cs.LG/1802.07876
 Cheng et al. (2019) Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Tianjian Chen, and Qiang Yang. 2019. SecureBoost: A Lossless Federated Learning Framework. arXiv:cs.LG/1901.08755
 Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Dwork (2006) Cynthia Dwork. 2006. Differential Privacy. In Automata, Languages and Programming, Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–12.
 Fernandes et al. (2015) Kelwin Fernandes, Pedro Vinagre, and Paulo Cortez. 2015. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. In Progress in Artificial Intelligence, Francisco Pereira, Penousal Machado, Ernesto Costa, and Amílcar Cardoso (Eds.). Springer International Publishing, Cham, 535–546.
 Gentry (2009) Craig Gentry. 2009. A fully homomorphic encryption scheme. Ph.D. Dissertation. Stanford University. crypto.stanford.edu/craig.
 Geyer et al. (2017) Robin C. Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially Private Federated Learning: A Client Level Perspective. arXiv:cs.CR/1712.07557
 Group (2019) Federated Machine Learning Working Group. 2019. P3652.1  Guide for Architectural Framework and Application of Federated Machine Learning. Available at https://standards.ieee.org/project/3652_1.html.
 Hamidieh (2018) Kam Hamidieh. 2018. A datadriven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science 154 (2018), 346 – 354. https://doi.org/10.1016/j.commatsci.2018.07.052
 Hardy et al. (2017a) Stephen Hardy, Wilko Henecka, Hamish IveyLaw, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2017a. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv:cs.LG/1711.10677
 Hardy et al. (2017b) Stephen Hardy, Wilko Henecka, Hamish IveyLaw, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2017b. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv:cs.LG/1711.10677
 Huang et al. (2018) Li Huang, Yifeng Yin, Zeng Fu, Shifa Zhang, Hao Deng, and Dianbo Liu. 2018. LoAdaBoost:LossBased AdaBoost Federated Machine Learning on medical Data. arXiv:cs.LG/1811.12629
 Kim et al. (2018b) Miran Kim, Yongsoo Song, Shuang Wang, Yuhou Xia, and Xiaoqian Jiang. 2018b. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR medical informatics 6, 2 (2018), e19.

 Kim et al. (2018a) Sangwook Kim, Masahiro Omori, Takuya Hayashi, Toshiaki Omori, Lihua Wang, and Seiichi Ozawa. 2018a. Privacy-Preserving Naive Bayes Classification Using Fully Homomorphic Encryption. In Neural Information Processing, Long Cheng, Andrew Chi Sing Leung, and Seiichi Ozawa (Eds.). Springer International Publishing, Cham, 349–358.
 Konecnỳ et al. (2016) Jakub Konecnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:cs.LG/1610.02527
 Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated Learning: Strategies for Improving Communication Efficiency. arXiv:cs.LG/1610.05492

 Le et al. (2018) Trieu Phong Le, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai. 2018. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption. IEEE Transactions on Information Forensics & Security PP, 99 (2018), 1–1.
 Liu et al. (2018) Yang Liu, Tianjian Chen, and Qiang Yang. 2018. Secure Federated Transfer Learning. arXiv:cs.LG/1812.03337
 McMahan et al. (2016) H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. CommunicationEfficient Learning of Deep Networks from Decentralized Data. arXiv:cs.LG/1602.05629
 McMahan et al. (2017) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2017. Learning Differentially Private Recurrent Language Models. arXiv:cs.LG/1710.06963
 Nock et al. (2018) Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2018. Entity Resolution and Federated Learning get a Federated Resolution. arXiv:cs.DB/1803.04035
 Regulation (2016) General Data Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ) 59, 188 (2016), 294.
 Sakar et al. (2019) C Okan Sakar, Gorkem Serbes, Aysegul Gunduz, Hunkar C Tunc, Hatice Nizam, Betul Erdogdu Sakar, Melih Tutuncu, Tarkan Aydin, M Erdem Isenkul, and Hulya Apaydin. 2019. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing 74 (2019), 255–263.
 Smith et al. (2017) Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. 2017. Federated Multi-Task Learning. arXiv:cs.LG/1705.10467
 Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 12.
 Zhuo et al. (2019) Hankz Hankui Zhuo, Wenfeng Feng, Qian Xu, Qiang Yang, and Yufeng Lin. 2019. Federated Reinforcement Learning. arXiv:cs.LG/1901.08277
Appendix
Reproducibility
Our model is implemented with Python 3.6, Scikit-learn 0.20, NumPy 1.15.4, python-paillier 1.4.1 and mpi4py 3.0.0. We train and evaluate our model on servers, each with 4 CPU cores, running CentOS 7.0. The information of all data sets used is given in Table 2.
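The pinned versions above can be captured in a single install command, sketched below. One assumption on our part: python-paillier is published on PyPI under the package name `phe`; the version numbers themselves are taken directly from the text.

```shell
# Pin the environment described above (Python 3.6 assumed to be active).
# "phe" is the assumed PyPI name of the python-paillier library.
pip install scikit-learn==0.20.0 numpy==1.15.4 phe==1.4.1 mpi4py==3.0.0
```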
Classification | Size | Features | Classes
---------------|------|----------|--------
target marketing | 156198 | 95 (11/84) | 2
ionosphere | 351 | 34 | 2
spambase | 4601 | 57 | 2
parkinson (Sakar et al., 2019) | 756 | 754 | 2
kddcup99 | 4M | 42 | 23
waveform | 5000 | 21 | 3
gene | 801 | 20531 | 5

Regression | Size | Features | Range
-----------|------|----------|------
year prediction | 515345 | 90 | 1922–2011
Superconduct (Hamidieh, 2018) | 21263 | 81 | 0.00021–185
Pseudocode for FFRegressor
The main difference between the regression and classification problems lies in the generation of the leaf node result and the final predictions. The following is the pseudocode for the regression problem, where the differences from the classification problem are in line 7 of Algorithm 5, line 9 of Algorithm 6 and line 5 of Algorithm 8.
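As a minimal illustration of that difference (our own sketch, not the paper's Algorithms 5–8; function names are ours): a classification leaf emits the majority class of the samples it holds, while a regression leaf emits their mean target value.

```python
import numpy as np

def leaf_value_classification(labels):
    # Classification leaf: predict the majority class among the samples in the leaf.
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def leaf_value_regression(targets):
    # Regression leaf: predict the mean target value of the samples in the leaf.
    return float(np.mean(targets))

print(leaf_value_classification(np.array([0, 1, 1, 1, 0])))  # majority class
print(leaf_value_regression(np.array([1.0, 2.0, 3.0])))      # mean target
```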
Notations in Proof

- Sample IDs are denoted as $I$, and $I_l^i$ contains the sample IDs which fall into leaf $l$ of tree $T_i$. $I_l$ denotes the sample set of leaf node $l$ in the complete binary tree model $T$.

- The test sample set is $X$, and a single sample is $x$.

- $P_x^i$ is the set of decision making paths of sample $x$ that goes through the binary tree $T_i$ to fall into the leaf nodes of $T_i$. For the tree $T_i$, it is possible that $x$ falls into more than one leaf, due to our model storage strategy.

- $p_x$ is the decision making path of the sample $x$ that goes through the complete binary tree $T$ to fall into a leaf node in $T$. For the complete tree $T$, if a sample falls into one leaf, then it cannot fall into another leaf. That is, for any leaves $l_1$ and $l_2$ in $T$, $I_{l_1} \cap I_{l_2} = \emptyset$.

- The complete tree on the master is defined as $T = \bigcup_{i=1}^{M} T_i$.

Detailed descriptions of notations are shown in Table 3.
Notation | Description
---------|------------
$M$ | number of regional domains
$D_i$ | data set held by client $i$
$N$ | total number of samples in training
$D$ | entire data set
$F_i$ | feature space of $D_i$
$F$ | entire feature space of $D$, $F = \bigcup_{i=1}^{M} F_i$
$Y$ | labels
$T_i$ | partial decision/regression tree stored on the $i$-th client
$T$ | complete tree
$L$ | leaf node set of the entire tree
$l$ | leaf node of the current tree, $l \in L$
$\mathrm{LCA}(l_1, l_2)$ | lowest common ancestor of $l_1, l_2$ in $T$
$I$ | the sample IDs of the entire data set
$I_l^i$ | the sample IDs which fall into leaf $l$ of tree $T_i$
$I_l$ | the sample IDs which fall into leaf $l$ of complete tree $T$
$x$ | single test sample
$X$ | entire test sample set
$P_x^i$ | the set of decision making paths of sample $x$ on $T_i$
$p_x$ | decision making path of sample $x$ on $T$
$h$ | maximum tree depth

Proof of Proposition 1
For the prediction process, samples go through each client tree and fall into one or multiple leaves. For any leaf $l$ of the complete tree $T$, the sample IDs in leaf $l$ can be obtained by taking the intersection of the $I_l^i$, that is, $I_l = \bigcap_{i=1}^{M} I_l^i$.
Proof.
In order to prove $I_l = \bigcap_{i=1}^{M} I_l^i$, we will prove: $I_l \subseteq \bigcap_{i=1}^{M} I_l^i$ and $\bigcap_{i=1}^{M} I_l^i \subseteq I_l$.
Proof of $I_l \subseteq \bigcap_{i=1}^{M} I_l^i$:
For any sample $x$ in the leaf $l$ of the complete tree $T$, $x \in I_l$, and $p_x$ denotes its decision making path from the root to leaf node $l$. For the model $T_i$ on each client $i$, if the model stores split information at the current node, the threshold determines whether the sample enters the left or the right subtree; if the model does not store split information at this node, the sample enters the left and right subtrees simultaneously. Therefore for sample $x$, its decision making path on the complete tree must be contained in one of its decision making paths on any client $i$: there exists $p \in P_x^i$ with $p_x \subseteq p$, which is equivalent to $x \in I_l^i$. Because of this we can safely say that $x \in I_l^i$ for every $i \in \{1, \dots, M\}$. Then we can prove that $I_l \subseteq \bigcap_{i=1}^{M} I_l^i$.
Proof of $\bigcap_{i=1}^{M} I_l^i \subseteq I_l$:
Assume that a sample $x \in \bigcap_{i=1}^{M} I_l^i$ does not belong to leaf node $l$ but belongs to $l'$ in the complete model $T$, that is, $x \notin I_l$ and $x \in I_{l'}$ with $l' \neq l$.
By the above proof, $x \in \bigcap_{i=1}^{M} I_{l'}^i$.
That is to say, sample $x$ falls into the leaf nodes $l$ and $l'$ at the same time in every model $T_i$ stored on a client.
In the same binary tree structure, the path from a child node to the root node is fixed and unique.
Under the complete tree structure, the paths from the leaf nodes $l$ and $l'$ up to the root node are $p_l$ and $p_{l'}$, and their lowest common ancestor node exists and is uniquely given by $a = \mathrm{LCA}(l, l')$.
So every $T_i$ must send $x$ into both subtrees of $a$, which can only happen if no client stores the split information of the node $a$.
This contradicts the fact that the split information of every internal node of $T$, including $a$, is stored on some client ($T = \bigcup_{i=1}^{M} T_i$).
Therefore the hypothesis does not hold.
In summary, we can prove $I_l = \bigcap_{i=1}^{M} I_l^i$.
∎
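Proposition 1 can be sanity-checked numerically. The sketch below is our own toy construction (the heap node layout, split thresholds, and the two-client partition are all assumptions, not the paper's setup): it builds a depth-2 complete tree, partitions its split information across two clients, routes samples through each partial tree entering both subtrees wherever split information is missing, and checks that intersecting the per-client leaf sets recovers the complete tree's leaf assignment.

```python
import numpy as np

# Complete binary tree in heap layout: internal nodes 0, 1, 2; leaves 3..6.
SPLITS = {0: (0, 0.5), 1: (1, 0.5), 2: (1, 0.5)}  # node -> (feature index, threshold)

def route(x, node, known):
    """Return the set of leaves that sample x can reach given split info `known`."""
    if node >= 3:                          # leaf node
        return {node}
    if node in known:                      # split stored here: deterministic routing
        f, t = known[node]
        child = 2 * node + 1 if x[f] <= t else 2 * node + 2
        return route(x, child, known)
    # Split information missing on this client: the sample enters both subtrees.
    return route(x, 2 * node + 1, known) | route(x, 2 * node + 2, known)

def leaf_ids(X, known):
    """Map each leaf to the set of sample IDs that fall into it under `known`."""
    out = {leaf: set() for leaf in (3, 4, 5, 6)}
    for i, x in enumerate(X):
        for leaf in route(x, 0, known):
            out[leaf].add(i)
    return out

rng = np.random.default_rng(0)
X = rng.random((50, 2))
# Split information partitioned across two clients (no client sees the whole tree).
clients = [{0: SPLITS[0], 1: SPLITS[1]}, {2: SPLITS[2]}]
true = leaf_ids(X, SPLITS)                 # leaf sets of the complete tree
parts = [leaf_ids(X, c) for c in clients]  # per-client (multi-leaf) sets
for leaf in (3, 4, 5, 6):
    assert true[leaf] == parts[0][leaf] & parts[1][leaf]   # Proposition 1
print("intersection recovers the complete-tree leaf sets")
```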
Communication Complexity Analysis
Here we give a brief analysis of communication complexity. There are mainly three types of communication during training, where $M$ is the number of regional domains:


- Send and receive. The master sends randomly selected features to each client in every turn of tree building, and the client that holds the globally optimal feature sends the sample split indices of this feature back to the master when building the node. The communication complexity is $O(M)$ per node.

- Broadcast. The master broadcasts sample indices to all clients for each tree node construction. The communication complexity is $O(M)$ per node.

- Gather. The master gathers and compares the impurity improvements of candidate features from all clients at every turn of node building. It also gathers the sample sets of all leaves on each tree stored by the clients in the prediction process. The communication complexity is $O(M)$ per operation.
Since the maximum depth is $h$, in a tree there are at most $2^h - 1$ intermediate nodes and $2^h$ leaf nodes. Taking the process of building one tree as an example, the communication complexity of the whole system in the training phase is $O(2^h M)$. For the prediction phase, if not optimized, the communication complexity is $O(2^h M)$; otherwise, the optimized communication complexity is $O(M)$.
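The training-phase figure can be turned into a concrete message count under a simple per-node accounting (our own back-of-the-envelope assumption: $M$ feature-candidate sends, one split-index reply, $M$ broadcast messages and $M$ gathered impurity values per internal node):

```python
def training_messages(M, h):
    """Rough message count for building one tree of maximum depth h with M clients.

    Assumed per-internal-node cost: M sends + 1 reply + M broadcasts + M gathers.
    """
    internal_nodes = 2 ** h - 1
    return internal_nodes * (3 * M + 1)

# The count scales linearly in M and exponentially in h, i.e. O(2^h * M).
print(training_messages(4, 3))   # 7 internal nodes * 13 messages each
print(training_messages(4, 6))   # 63 internal nodes * 13 messages each
```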