I Introduction
As an emerging machine learning paradigm, federated learning (FL) enables data owners to collaboratively train models by sharing gradients instead of raw data. The core idea of federated learning is to let each client perform computations locally on its data to obtain certain intermediate results (e.g., gradients) and then exchange these results with other clients in a secure manner. Existing FL work mainly focuses on the horizontal setting, including designing different model aggregation algorithms [DBLP:journals/isci/ChenRT18, DBLP:conf/aistats/McMahanMRHA17, DBLP:conf/nips/DinhTN20] or addressing non-IID data issues [DBLP:conf/aaai/HuangCZWLPZ21, DBLP:conf/iclr/LiJZKD21].
There are only a few studies on vertical federated learning (VFL). In VFL, the participants share the same example ID space but differ in feature space. Existing VFL research mainly implements different machine learning algorithms, such as decision trees [DBLP:journals/pvldb/WuCXCO20, DBLP:journals/corr/abs-1901-08755] and deep learning [DBLP:journals/corr/abs-2008-10838, DBLP:conf/ijcai/ZhangWWXP18], in a privacy-preserving setting. Nonetheless, existing VFL algorithms have the following problems. First, some methods (e.g., [DBLP:conf/ijcai/ZhangWWXP18]) involve a large amount of data interaction between the active party and the passive party and use homomorphic encryption to ensure data security, which requires substantial computing resources. Second, existing VFL methods only consider static scenarios, that is, the participants hold all of their data from the beginning and the data does not change. However, in real life, data usually grows dynamically, so the overlapping samples between participants in VFL continue to increase. Intuitively, machine learning methods designed for static scenarios can be updated by fine-tuning, but fine-tuning only works under the assumption that the distributions of the new and old data are similar. This assumption does not always hold: in many scenarios, the distribution of new data differs from that of the original data. In these cases, updating the model by fine-tuning encounters the problem of catastrophic forgetting. Specifically, when the new data's distribution differs from the old data's, the model must acquire knowledge from a non-stationary data distribution, and the new knowledge interferes with the old. Fine-tuning then causes the model to overwrite or forget the knowledge learned from the old data.
To alleviate these problems, we propose a novel VFL method for dynamic data named Dynamic Vertical Federated Learning (DVFL for short). Compared to previous methods, the contributions of our work are:
- DVFL is suitable for dynamic scenarios of vertical federated learning: participants do not acquire all of their data at the beginning, the data increases dynamically, and the distribution of the new data is not necessarily the same as that of the old data.
- In DVFL, model training is performed locally as much as possible, which reduces the interaction between parties and thereby improves data security and model efficiency.
- DVFL does not require participants to share their original data or data encoded by a single neural network.
To evaluate the performance of DVFL in different scenarios, we conducted extensive experiments on benchmark datasets. The results show that the performance of DVFL in static scenarios is comparable to that of the baseline methods, and that it is both efficient and effective in dynamic scenarios.
II Related Work
Vertical Federated Learning (VFL) refers to federated learning in the setting where the parties have different feature spaces. Unlike horizontal federated learning, where each client can calculate the loss independently, VFL requires multiple parties to jointly compute and optimize the loss function within a secure and confidential framework. Existing VFL methods can be divided into linear-based, tree-based, kernel-based, and neural network-based methods. Linear model-based VFL methods include [DBLP:journals/iacr/GasconSB0DZE16, DBLP:journals/corr/abs-1711-10677, DBLP:conf/sp/MohasselZ17]. They use hybrid MPC (secure multi-party computation) protocols [DBLP:conf/focs/Yao82b] or additive homomorphic encryption [DBLP:conf/eurocrypt/BrickellY87] for secure linear model training. Tree-based VFL models include [DBLP:journals/pvldb/WuCXCO20, DBLP:journals/corr/abs-1901-08755]. They enable the participating parties to collaboratively build a tree or a forest without information leakage by designing special protocols. Kernel-based VFL methods include [DBLP:series/lncs/DangGH20, DBLP:conf/kdd/GuDLH20]; they approximate the kernel function and update the prediction function in a federated manner with specially designed gradients. Neural network-based methods include [DBLP:journals/corr/abs-2008-10838, DBLP:conf/ijcai/ZhangWWXP18]. In these methods, the active and passive parties jointly calculate the loss to optimize the model parameters, and homomorphic encryption is often used to ensure information security.
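As an illustration of the primitive these methods build on, the sketch below shows additive homomorphic encryption with the open-source python-paillier library (package `phe`): an untrusted party can add encrypted values without seeing them. The gradient values are hypothetical; this sketches the primitive, not any specific VFL protocol.

```python
# A minimal sketch of additively homomorphic encryption with python-paillier.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

g1 = public_key.encrypt(0.25)    # e.g., one party's encrypted partial gradient
g2 = public_key.encrypt(-0.10)   # another party's encrypted contribution

encrypted_sum = g1 + g2          # addition performed directly on ciphertexts
print(private_key.decrypt(encrypted_sum))   # ~0.15; only the key holder sees it
```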
III Problem Statement
We consider the problem of dynamic vertical federated learning. Let $D$ be the dataset distributed over the parties, whose examples are aligned using encrypted entity alignment techniques [DBLP:journals/corr/abs-1803-04035]. The active party A holds a dataset $X^A$ and the labels $Y$, where each label $y \in \{1, \dots, C\}$ and $C$ is the number of classes. The passive party B holds a dataset whose size increases over time. At timestamp $t$, party B holds the dataset $X^B_t$, where $X^B_{t-1} \subseteq X^B_t$ and $t \in \{0, 1, \dots, T\}$. The increased data of party B from timestamp $t-1$ to $t$ is $\Delta X^B_t = X^B_t \setminus X^B_{t-1}$. Our goal is to design an algorithm that satisfies the following restrictions.
- $X^A$ and $X^B_t$ cannot be exposed to each other.
- Party A uses the data of party B under the privacy-protection setting to help improve the performance of the classification model.
- The proposed algorithm should be able to adapt to the dynamic changes of the passive party's dataset. At each timestamp $t$, even if the data distribution in $\Delta X^B_t$ differs from that in $X^B_{t-1}$, the algorithm should adjust its parameters in a computationally efficient way.
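To make the setting concrete, the following sketch (assuming NumPy; the shapes, the random arrival pattern, and the helper names `delta_x_b`/`x_b` are all illustrative) simulates party A's fixed data alongside party B's incrementally arriving, aligned data.

```python
# A minimal sketch of the dynamic VFL data layout described above.
import numpy as np

rng = np.random.default_rng(0)
n, T = 5000, 5
x_a = rng.standard_normal((n, 50))        # party A: fixed features
y = rng.integers(0, 2, size=n)            # party A: labels
x_b_full = rng.standard_normal((n, 50))   # party B: aligned features, revealed over time
arrival = rng.integers(0, T + 1, size=n)  # timestamp at which each row reaches party B

def delta_x_b(t):
    """New rows of party B's data arriving at timestamp t (Delta X^B_t)."""
    return x_b_full[arrival == t]

def x_b(t):
    """All of party B's data available up to timestamp t (X^B_t)."""
    return x_b_full[arrival <= t]

assert x_b(T).shape[0] == n               # at t = T, party B holds all of its data
```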
IV Experimental Setup
IV-A Dataset
We choose 4 benchmark datasets used in previous studies.
- Breast Cancer Wisconsin (BCW) [bennett1992robust]: The features describe characteristics of the cell nuclei present in the image of a fine needle aspirate (FNA) of a breast mass. Note that BCW is an imbalanced dataset; the ratio of the positive class to the negative class is around 2:8. The dataset is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin.
- Default of Credit Card Clients (DCC) [DBLP:journals/eswa/YehL09a]: This dataset contains information on credit card clients in Taiwan from April 2005 to September 2005. The dataset is available at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
- Epsilon (EPS): EPS is a dataset of mock data, and EPS5k is a modified version of EPS used in FATE. The dataset is available at https://github.com/FederatedAI/FATE/blob/master/examples/data/README.md#epsilon_5k.
- Human Activity Recognition (HAR) [DBLP:conf/sensys/StisenBBPKDSJ15]: This dataset was built from the recordings of 30 study participants performing activities of daily living. The data is available at https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.
The statistics of the datasets are shown in Table I.
TABLE I: Statistics of the datasets.

Dataset | Train Samples | Test Samples | Features | Classes
---|---|---|---|---
BCW | 453 | 114 | 32 | 2
DCC | 24,000 | 8,000 | 24 | 2
EPS5k | 4,000 | 1,000 | 100 | 2
HAR | 8,239 | 2,060 | 561 | 6
IV-B Parameter Setting
The parameter settings in our experiments are as follows. The length of the representation in party $p$ is denoted $d_p$. The encoder in party A is implemented by a one-layer neural network; the number of hidden units is 100 for DCC and EPS5k, and 500 for BCW and HAR. REN is implemented by a four-layer neural network in which each layer has 40 hidden units. The perturbation magnitudes for DCC, BCW, EPS5k, and HAR are 0.6, 1, 0.6, and 0.5, respectively. The batch size is set to 128. The learning rate is 0.005 for the experiments on DCC, BCW, and EPS5k, and 0.001 for those on HAR. A temperature scalar and an additional parameter are also used.
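As a concrete reading of these sizes, here is a minimal PyTorch sketch. The representation length `D_P`, the input dimension `N_IN`, and the activation functions are assumptions (the text omits the exact value of $d_p$), and REN is rendered only by its stated shape of four layers with 40 hidden units.

```python
import torch
import torch.nn as nn

D_P = 64    # hypothetical representation length; the exact d_p is omitted above
N_IN = 50   # hypothetical number of input features held by party A

# Party A's encoder: one hidden layer, 100 units (DCC/EPS5k) or 500 (BCW/HAR).
encoder_A = nn.Sequential(
    nn.Linear(N_IN, 100),
    nn.ReLU(),              # activation assumed; not specified in the text
    nn.Linear(100, D_P),
)

# REN: four layers with 40 hidden units each (input/output sizes assumed).
ren = nn.Sequential(
    nn.Linear(D_P, 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, D_P),
)

# Learning rate 0.005 (DCC/BCW/EPS5k) or 0.001 (HAR), batch size 128.
optimizer = torch.optim.Adam(
    list(encoder_A.parameters()) + list(ren.parameters()), lr=0.005
)
```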
V Results
We now analyze the results to answer several research questions. Macro-P, Macro-R, and Macro-F1 are used as our evaluation metrics since BCW and DCC are label-imbalanced datasets. We use 5-fold cross-validation in our experiments. Our experiments are conducted on a Linux machine with an NVIDIA GTX 1080 GPU.
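For reference, the macro-averaged metrics can be computed with scikit-learn as below; the labels are toy values for illustration only.

```python
# Macro-averaged precision/recall/F1, suited to label-imbalanced datasets.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]   # toy predictions
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Macro-P={p:.4f}  Macro-R={r:.4f}  Macro-F1={f1:.4f}")
```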
V-A RQ1: How does DVFL perform compared to other VFL methods in static settings?
TABLE II: Results in static scenarios.

Dataset | Metric | Non-Fed (without B) | Non-Fed (with B) | Hetero-NN | Hetero-SBt | DVFL
---|---|---|---|---|---|---
BCW | P | 0.8479 | 0.9664 | 0.9320 | 0.9444 | 0.9484
 | R | 0.9369 | 0.9667 | 0.9436 | 0.9563 | 0.9461
 | F1 | 0.8399 | 0.9661 | 0.9372 | 0.9497 | 0.9465
DCC | P | 0.6938 | 0.7320 | 0.5819 | 0.6478 | 0.7151
 | R | 0.6477 | 0.6715 | 0.6874 | 0.7572 | 0.6719
 | F1 | 0.6624 | 0.6902 | 0.5913 | 0.6731 | 0.6859
EPS5k | P | 0.5523 | 0.6133 | 0.6051 | 0.5967 | 0.6085
 | R | 0.5523 | 0.6103 | 0.6160 | 0.5967 | 0.6052
 | F1 | 0.5521 | 0.6074 | 0.5969 | 0.5967 | 0.6025
HAR | P | 0.6659 | 0.9009 | 0.5015 | 0.8722 | 0.8982
 | R | 0.6669 | 0.8989 | 0.5160 | 0.8712 | 0.8947
 | F1 | 0.6483 | 0.8980 | 0.4293 | 0.8715 | 0.8936
We use the following methods as baselines in static scenarios (i.e., $T = 0$).
- Hetero-NN [DBLP:conf/ijcai/ZhangWWXP18]: Hetero-Neural Network is a neural network-based VFL method implemented in FATE (https://github.com/FederatedAI/FATE). For each dataset we use, FATE provides corresponding parameter settings for Hetero-NN (https://github.com/FederatedAI/FATE/tree/master/examples/benchmark_quality/hetero_nn), and we adopt these settings in our experiments.
- Hetero-SBt [DBLP:journals/corr/abs-1901-08755]: Hetero-SecureBoost is a decision tree-based VFL method implemented in FATE. For each dataset we use, FATE provides corresponding parameter settings for Hetero-SBt (https://github.com/FederatedAI/FATE/tree/master/examples/benchmark_quality/hetero_sbt), and we adopt these settings in our experiments.
- Non-federated without party B: This model consists of an auto-encoding module and a classification module, implemented the same as the corresponding modules in DVFL, but it only uses the data on party A for prediction. Its result can be regarded as the lower bound of DVFL.
- Non-federated with party B: This model consists of the same auto-encoding and classification modules as DVFL, but the encoded data of party A and party B are simply concatenated and then fed into the classifier (see the sketch after this list). Its result can be (roughly) regarded as the upper bound of DVFL.
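The sketch below illustrates the "Non-federated with party B" baseline, assuming PyTorch. The encoders and classifier head are hypothetical stand-ins for DVFL's auto-encoding and classification modules, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_a, d_b, n_classes = 64, 64, 2   # assumed representation lengths

encoder_a = nn.Linear(50, d_a)    # party A's raw features -> representation
encoder_b = nn.Linear(50, d_b)    # party B's raw features -> representation
classifier = nn.Linear(d_a + d_b, n_classes)

x_a = torch.randn(128, 50)        # a batch of party A's features
x_b = torch.randn(128, 50)        # the aligned batch of party B's features

# Upper-bound baseline: concatenate both encoded views, ignoring privacy.
z = torch.cat([encoder_a(x_a), encoder_b(x_b)], dim=1)
logits = classifier(z)
```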
The results are displayed in Table II. From the table, we make the following observations.
First, compared with the other VFL methods, DVFL obtains the best F1 scores on three datasets (DCC, EPS5k, and HAR), while Hetero-SBt has the best F1 score on BCW. In general, the more complex the prediction task (e.g., more features or more classes), the more significant the advantage of DVFL.
Second, Hetero-SBt has the highest recall on the label-imbalanced datasets DCC and BCW. This is because Hetero-SBt is a tree-based method whose hierarchical structure allows it to learn signals from both classes. However, the precision of the tree-based method is lower than that of the neural network-based approaches, which lowers its overall F1 score.
Third, the performance of Hetero-NN is relatively poor, partly because it involves many encryption and decryption operations. With limited computing resources, it can only support simple models (e.g., fewer neural network layers and hidden units), which is insufficient for complex datasets.
V-B RQ2: Does DVFL perform well on dynamic data with different data distributions?
TABLE III: macro-F1 of different update methods under four dynamic data distribution modes (EPS5k).

Mode | Timestamp | Class Ratio (Pos:Neg) | Retrain | Fine-Tune | DVFL (Ours) | Joint Training
---|---|---|---|---|---|---
Random | 0 | 16.7% : 16.7% (5:5) | 0.572 | 0.572 | 0.572 | 0.572
 | 1 | 23.3% : 10.0% (7:3) | 0.462 | 0.460 | 0.547 | 0.586
 | 2 | 20.0% : 13.3% (6:4) | 0.550 | 0.522 | 0.546 | 0.555
 | 3 | 3.3% : 30.0% (1:9) | 0.343 | 0.347 | 0.607 | 0.593
 | 4 | 13.3% : 20.0% (4:6) | 0.581 | 0.561 | 0.580 | 0.602
 | 5 | 23.3% : 10.0% (7:3) | 0.520 | 0.450 | 0.598 | 0.585
Asc vs Des | 0 | 20.0% : 20.0% (1:1) | 0.598 | 0.598 | 0.598 | 0.598
 | 1 | 28.8% : 3.2% (9:1) | 0.406 | 0.368 | 0.585 | 0.604
 | 2 | 22.4% : 9.6% (7:3) | 0.542 | 0.453 | 0.561 | 0.540
 | 3 | 16.0% : 16.0% (5:5) | 0.602 | 0.607 | 0.584 | 0.603
 | 4 | 9.6% : 22.4% (3:7) | 0.501 | 0.469 | 0.562 | 0.615
 | 5 | 3.2% : 28.8% (1:9) | 0.378 | 0.332 | 0.613 | 0.606
Parallel | 0 | 50% : 20% (5:2) | 0.545 | 0.545 | 0.545 | 0.545
 | 1 | 10% : 16% (5:8) | 0.409 | 0.524 | 0.590 | 0.564
 | 2 | 10% : 16% (5:8) | 0.562 | 0.547 | 0.609 | 0.612
 | 3 | 10% : 16% (5:8) | 0.336 | 0.543 | 0.611 | 0.634
 | 4 | 10% : 16% (5:8) | 0.541 | 0.508 | 0.629 | 0.573
 | 5 | 10% : 16% (5:8) | 0.572 | 0.558 | 0.605 | 0.616
Uniform | 0 | 25% : 25% (1:1) | 0.599 | 0.599 | 0.599 | 0.599
 | 1 | 15% : 15% (1:1) | 0.579 | 0.602 | 0.580 | 0.586
 | 2 | 15% : 15% (1:1) | 0.575 | 0.609 | 0.590 | 0.610
 | 3 | 15% : 15% (1:1) | 0.574 | 0.614 | 0.587 | 0.602
 | 4 | 15% : 15% (1:1) | 0.580 | 0.612 | 0.580 | 0.607
 | 5 | 15% : 15% (1:1) | 0.584 | 0.619 | 0.587 | 0.603
We evaluate the performance of DVFL under differently distributed data streams on the EPS5k dataset. EPS5k is a binary classification dataset. In this experiment, we assume that the data of party B arrives in $T$ batches: at timestamp $t$, party B obtains the new data $\Delta X^B_t$. Our task is to use $\Delta X^B_t$ and the corresponding data in party A to train the classifier of DVFL in a privacy-preserving manner.
To measure the performance of DVFL under different distributions of the data stream, we use the following four modes of data distribution (a simulation sketch follows the list):
- Random: At each timestamp, the ratio of positive to negative examples in the new data is random.
- Asc vs Des: Over time, the proportion of one class in the new data gradually increases while that of the other gradually decreases; in our runs, the positive ratio descends from 9:1 to 1:9 (Table III).
- Parallel: At each timestamp, the ratio of positive to negative examples in the new data is the same but imbalanced.
- Uniform: At each timestamp, the ratio of positive to negative examples in the new data is 1:1.
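The following sketch (assuming NumPy; the batch size and exact ratios are illustrative) shows one way such label streams could be simulated: each timestamp draws a batch of party B's labels with the mode's positive:negative ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_labels(n, pos_ratio):
    """Draw n labels with the given positive-class probability."""
    return rng.binomial(1, pos_ratio, size=n)

T, n = 5, 800
modes = {
    "random":   rng.uniform(0.1, 0.9, size=T),  # arbitrary ratio per timestamp
    "asc_des":  np.linspace(0.9, 0.1, T),       # positives 9:1 -> 1:9, as in Table III
    "parallel": np.full(T, 5 / 13),             # constant but imbalanced (5:8)
    "uniform":  np.full(T, 0.5),                # balanced 1:1 at every timestamp
}
streams = {m: [batch_labels(n, p) for p in ratios] for m, ratios in modes.items()}
```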
Finally, we use the trained classifier to classify the examples in the test set and record the results; in the test set, the ratio of positive to negative examples is roughly 1:1. We use the following model update methods as our baselines (a sketch of the three update strategies follows the list):
- Fine-tuning: uses the new dataset to tune the current classifier with a small learning rate (0.1 times the original learning rate).
- Joint Training: uses all previously seen data to train a new classifier, which should yield the best possible result in most cases.
- Retrain: uses only the newly arrived dataset to train a new classifier.
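The sketch below contrasts the three update strategies, assuming PyTorch. The classifier architecture and training loop are illustrative stand-ins, not DVFL's actual modules; `x`/`y` are float feature and long label tensors.

```python
import copy
import torch
import torch.nn as nn

def make_classifier(d_in=100, n_classes=2):
    return nn.Linear(d_in, n_classes)       # toy classifier for illustration

def train(model, x, y, lr, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def update(strategy, model, new, old, lr=0.005):
    x_new, y_new = new
    if strategy == "retrain":                # fresh model, new data only
        return train(make_classifier(), x_new, y_new, lr)
    if strategy == "fine_tune":              # current model, 0.1x learning rate
        return train(copy.deepcopy(model), x_new, y_new, 0.1 * lr)
    if strategy == "joint":                  # fresh model, all data seen so far
        x_old, y_old = old
        x, y = torch.cat([x_old, x_new]), torch.cat([y_old, y_new])
        return train(make_classifier(), x, y, lr)
    raise ValueError(strategy)
```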
The results are shown in Table III. From the table, we make the following observations.
First, across the different modes, DVFL performs much better than retraining and fine-tuning, especially when the distribution of the new data differs greatly from that of the old data. This indicates that DVFL adapts better to distribution changes in dynamic data.
Second, DVFL also performs well in the modes where the data distribution is relatively stable (i.e., parallel and uniform). However, the performance differences between methods in these modes are small, especially in the uniform mode; in theory, fine-tuning, joint training, and DVFL should perform essentially the same under the uniform mode.
Third, when the distributions of the training and test data are similar, joint training performs best. However, its training time is much longer than that of the other methods, because at each timestamp joint training works on the entire dataset accumulated so far, while the other methods only work on the new data. Note that the performance of joint training in Table III is not the best in all cases: the ratio of positive to negative examples in the test set is close to 1:1, but since the data in party B grows dynamically, the label distributions of the test data and the training data differ at some specific timestamps. When $t = T$, party B has obtained all of its data.
[Fig. 1: Results of the random mode on BCW (left) and DCC (right).]
To further evaluate the performance of DVFL on other datasets, we tested the random mode on BCW and DCC. As shown in Fig. 1, the results on these two datasets are consistent with those on EPS5k.
V-C RQ3: How does DVFL perform when the number of clients in the passive party increases?
TABLE IV: Performance of DVFL on EPS5k with different numbers of passive clients.

Passive client # | P | R | F1
---|---|---|---
1 | 0.6300 | 0.6698 | 0.6280
2 | 0.5927 | 0.7382 | 0.5921
4 | 0.6036 | 0.7138 | 0.6084
6 | 0.5964 | 0.6934 | 0.5964
8 | 0.6078 | 0.6629 | 0.5839
10 | 0.5803 | 0.7637 | 0.5821
To evaluate the scalability of DVFL, we test its performance when the passive party consists of multiple clients. Specifically, we measure the scalability of DVFL on the EPS5k dataset with the number of passive clients ranging from 1 to 10, where each passive client holds an equal portion of $X^B$. The results are shown in Table IV. As the table shows, the performance of the system remains stable as the number of passive clients increases.
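One plausible reading of this setup, sketched below with NumPy, is an even split of party B's feature space across the passive clients; the shapes and the helper name `split_features` are illustrative assumptions.

```python
import numpy as np

x_b = np.random.randn(5000, 100)   # party B's data (e.g., EPS5k's 100 features)

def split_features(x, n_clients):
    """Give each passive client an equal (or near-equal) slice of the features."""
    return np.array_split(x, n_clients, axis=1)

shards = split_features(x_b, 4)
print([s.shape for s in shards])   # [(5000, 25), (5000, 25), (5000, 25), (5000, 25)]
```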
VI Conclusion
This paper proposes Dynamic Vertical Federated Learning (DVFL), a vertical federated learning method for dynamic data. Specifically, we use feature representation estimation and correction to enhance the data representation in the active party and then train a classifier on the active party for classification. DVFL is applicable to both dynamic and static scenarios. In a dynamic scenario, the data of the passive party grows dynamically, and the distribution of the data arriving at each timestamp may differ. The experimental results show that, under various distribution changes of dynamic data, DVFL is significantly better than fine-tuning and retraining in most cases. The performance of DVFL is slightly worse than joint training, but joint training is much slower than DVFL. A static scenario can be regarded as a special case of a dynamic scenario in which party B obtains all of its data at the beginning. Experimental results show that the performance of DVFL in the static scenario is also competitive with the baseline methods.