
DVFL: A Vertical Federated Learning Method for Dynamic Data

by Yuzhi Liang, et al.

Federated learning, which addresses the problem of data islands by connecting multiple computational devices into a decentralized system, has become a promising paradigm for privacy-preserving machine learning. This paper studies vertical federated learning (VFL), which addresses scenarios where collaborating organizations share the same set of users but hold disjoint features. Contemporary VFL methods are mainly used in static scenarios, where the active party and the passive party hold all their data from the beginning and the data does not change. However, real-life data often changes dynamically. To alleviate this problem, we propose a new vertical federated learning method, DVFL, which adapts to dynamic changes in data distribution through knowledge distillation. In DVFL, most of the computation is performed locally to improve data security and model efficiency. Our extensive experimental results show that DVFL not only obtains results close to existing VFL methods in static scenarios, but also adapts to changes in data distribution in dynamic scenarios.





I Introduction

As an emerging machine learning paradigm, federated learning (FL) enables data owners to collaboratively train models by sharing gradients instead of raw data. The core idea of federated learning is to let each client perform calculations locally on its data to obtain intermediate results (e.g., gradients) and then exchange those results with other clients in a secure manner. Existing FL work mainly focuses on the horizontal setting, including designing different model aggregation algorithms [DBLP:journals/isci/ChenRT18, DBLP:conf/aistats/McMahanMRHA17, DBLP:conf/nips/DinhTN20] or addressing non-IID data issues [DBLP:conf/aaai/HuangCZWLPZ21, DBLP:conf/iclr/LiJZKD21].

There are also a few studies on vertical federated learning (VFL). In VFL, the participants share the same example ID space but differ in feature space. Existing VFL research mainly implements different machine learning algorithms, such as decision trees [DBLP:journals/pvldb/WuCXCO20, DBLP:journals/corr/abs-1901-08755] and deep learning [DBLP:journals/corr/abs-2008-10838, DBLP:conf/ijcai/ZhangWWXP18], in a privacy-preserving setting. Nonetheless, existing VFL algorithms have two problems. First, some existing methods (e.g., [DBLP:conf/ijcai/ZhangWWXP18]) involve a large amount of data interaction between the active party and the passive party and rely on homomorphic encryption to keep the data secure, which requires substantial computing resources. Second, existing VFL methods only consider static scenarios, that is, the participants in the federated learning hold all the data from the beginning and the data does not change. However, in real life, data usually grows dynamically, so the set of overlapping samples between the VFL participants keeps growing. Intuitively, machine learning methods designed for static scenarios can be updated by fine-tuning, but fine-tuning only works under the assumption that the new and old data are similarly distributed. This assumption does not always hold. In many scenarios, the distribution of the new data is different from that of the original data. In these cases, updating the model by fine-tuning runs into the problem of catastrophic forgetting. Specifically, when the distribution of the new data differs from that of the old data, the model must acquire knowledge from a non-stationary data distribution, and the new knowledge interferes with the old knowledge. Fine-tuning then causes the model to overwrite or forget the knowledge learned from the old data.
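The forgetting effect described above can be illustrated with a small sketch. It is not part of DVFL: the 1-D logistic model, the toy data points, and the step counts are all hypothetical choices made only to show how fine-tuning on a batch with a shifted label distribution can erase what was learned from the old data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(w, b, data, lr, steps):
    """Full-batch gradient descent on the logistic loss for a 1-D model."""
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(w, b, data):
    """Share of examples whose predicted class (threshold 0.5) matches the label."""
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


# Hypothetical toy data: the old batch is balanced, the new batch holds
# negatives only, located where old positives used to be likely.
old_data = [(-1.0, 0), (1.0, 1)] * 5
new_data = [(0.5, 0)] * 10

base_lr = 0.5
w, b = train(0.0, 0.0, old_data, base_lr, steps=200)
acc_before = accuracy(w, b, old_data)

# Fine-tune on the new batch only, starting from the current weights and
# using a smaller learning rate (0.1 times the original one).
w, b = train(w, b, new_data, 0.1 * base_lr, steps=2000)
acc_after = accuracy(w, b, old_data)

print(acc_before, acc_after)
```

In this toy setting the model classifies the old data perfectly before fine-tuning and drops to chance level on it afterwards, because the update only sees negatives and shifts the decision boundary past the old positives.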

To alleviate the problems mentioned above, we propose a novel VFL method for dynamic data named Dynamic Vertical Federated Learning (DVFL for short). Compared to previous methods, the contributions of our work are:

  • DVFL is suitable for dynamic scenarios of vertical federated learning, that is, participants do not acquire all the data at the beginning, the data increases dynamically, and the data distribution of the new data is not necessarily the same as that of the old data.

  • In DVFL, model training is performed locally as much as possible, which can reduce the interaction between parties, thereby improving data security and model efficiency.

  • DVFL does not require participants to share their original data or data encoded by a single neural network.

To evaluate the performance of DVFL in different scenarios, we conducted extensive experiments on benchmark datasets. The results show that the performance of DVFL in static scenarios is comparable to that of the baseline methods, and that it is both efficient and effective in dynamic scenarios.

II Related Work

Vertical Federated Learning (VFL) refers to federated learning in the setting where the parties have different feature spaces. Unlike horizontal federated learning, in which each client can calculate the loss independently, VFL requires multiple parties to jointly compute and optimize the loss function within a secure and confidential framework. Existing VFL methods can be divided into linear-based methods, tree-based methods, kernel-based methods, and neural network-based methods. Linear model-based VFL methods include [DBLP:journals/iacr/GasconSB0DZE16, DBLP:journals/corr/abs-1711-10677, DBLP:conf/sp/MohasselZ17]. They use a hybrid MPC (secure multi-party computation) protocol [DBLP:conf/focs/Yao82b] or additive homomorphic encryption [DBLP:conf/eurocrypt/BrickellY87] for secure linear model training. Tree-based VFL models include [DBLP:journals/pvldb/WuCXCO20, DBLP:journals/corr/abs-1901-08755]. They enable participating parties to collaboratively build a tree or a forest without information leakage by designing special protocols. Kernel-based VFL methods include [DBLP:series/lncs/DangGH20, DBLP:conf/kdd/GuDLH20]. They approximate the kernel function and update the prediction function in a federated manner using specially designed gradients. Neural network-based methods include [DBLP:journals/corr/abs-2008-10838, DBLP:conf/ijcai/ZhangWWXP18]. In these methods, the active and passive parties jointly calculate the loss to optimize the parameters. Homomorphic encryption is often used to ensure information security.

III Problem Statement

We consider the problem of dynamic vertical federated learning. The dataset is distributed across different parties, and the examples are aligned using encrypted entity alignment techniques [DBLP:journals/corr/abs-1803-04035]. The active party A holds a dataset and the corresponding labels, each of which belongs to one of a fixed number of classes. The passive party B holds a dataset whose size increases over time: at each timestamp, party B holds all the data it has received so far, and the increase between two consecutive timestamps is the newly arrived data. Our goal is to design an algorithm that satisfies the following restrictions.

  1. The data of party A and the data of party B cannot be exposed to each other.

  2. Party A uses the data of party B under the privacy protection setting to help improve the performance of the classification model.

  3. The proposed algorithm should be able to adapt to the dynamic changes of the passive dataset. At each timestamp, even if the distribution of the newly arrived data is different from that of the earlier data, the proposed algorithm should adjust its parameters in a computationally efficient way.
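As a rough illustration of this setting, the sketch below models the two parties as plain Python objects. The class names, field names, and toy values are hypothetical, and a simple set intersection stands in for the encrypted entity alignment step; no privacy mechanism is implemented here.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveParty:
    """Party A: holds its own feature columns and the labels."""
    features: dict  # sample_id -> list of feature values
    labels: dict    # sample_id -> class index


@dataclass
class PassiveParty:
    """Party B: holds other feature columns; its data grows over time."""
    features: dict = field(default_factory=dict)

    def receive(self, increment: dict) -> set:
        """Add newly arrived rows and return the IDs that are new to party B."""
        new_ids = set(increment) - set(self.features)
        self.features.update(increment)
        return new_ids


def overlapping_ids(a: ActiveParty, b: PassiveParty) -> set:
    """IDs present on both sides (plain stand-in for encrypted entity alignment)."""
    return set(a.features) & set(b.features)


party_a = ActiveParty(
    features={i: [float(i), float(i) * 2] for i in range(6)},
    labels={i: i % 2 for i in range(6)},
)
party_b = PassiveParty()

# Party B's data arrives in three increments; the usable overlap grows each time.
overlap_sizes = []
for increment in ({0: [0.1], 1: [0.2]}, {2: [0.3]}, {3: [0.4], 4: [0.5], 5: [0.6]}):
    party_b.receive(increment)
    overlap_sizes.append(len(overlapping_ids(party_a, party_b)))

print(overlap_sizes)  # [2, 3, 6]
```

The growing overlap is what a dynamic VFL method has to absorb at each timestamp without revisiting all earlier data.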

IV Experimental Setup

IV-A Dataset

We choose four benchmark datasets used in previous studies. The statistics of the datasets are shown in Table I.

Dataset   Train samples   Test samples   Features   Classes
BCW       453             114            32         2
DCC       24,000          8,000          24         2
EPS5k     4,000           1,000          100        2
HAR       8,239           2,060          561        6
Table I: Dataset statistics

IV-B Parameter Setting

The parameter settings in our experiments are as follows. The length of the representation in party p , . The encoder in party A is implemented by a one-layer neural network. The number of hidden units in this network is 100 for datasets DCC and EPS5k, and 500 for datasets BCW and HAR. REN is implemented by a 4-layer neural network in which each layer has 40 hidden units. The perturbing magnitudes for datasets DCC, BCW, EPS5k, and HAR are 0.6, 1, 0.6, and 0.5, respectively. The batch size is set to 128. The learning rate is 0.005 for the experiments on DCC, BCW, and EPS5k, and 0.001 for the experiments on HAR. Temperature scalar , parameter .
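For reference, the stated settings can be collected into a small configuration helper. This is only a sketch: the dictionary keys and the function name `settings_for` are hypothetical, and values that the text above does not state (representation length, temperature) are left as `None` rather than guessed.

```python
# Per-dataset values copied from the parameter-setting paragraph.
ENCODER_HIDDEN_UNITS = {"DCC": 100, "EPS5k": 100, "BCW": 500, "HAR": 500}
PERTURBING_MAGNITUDE = {"DCC": 0.6, "BCW": 1.0, "EPS5k": 0.6, "HAR": 0.5}
LEARNING_RATE = {"DCC": 0.005, "BCW": 0.005, "EPS5k": 0.005, "HAR": 0.001}


def settings_for(dataset):
    """Return the experiment settings for one dataset (key names are hypothetical)."""
    return {
        "encoder_layers": 1,
        "encoder_hidden_units": ENCODER_HIDDEN_UNITS[dataset],
        "ren_layers": 4,
        "ren_hidden_units_per_layer": 40,
        "perturbing_magnitude": PERTURBING_MAGNITUDE[dataset],
        "batch_size": 128,
        "learning_rate": LEARNING_RATE[dataset],
        # Not stated in the text, so deliberately left unset.
        "representation_length": None,
        "temperature": None,
    }


har = settings_for("HAR")
print(har["encoder_hidden_units"], har["learning_rate"])  # 500 0.001
```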

V Results

We now analyze the results to answer several research questions. Macro-P, Macro-R, and Macro-F1 are used as our evaluation metrics since BCW and DCC are label-imbalanced datasets. We use 5-fold cross-validation in our experiments. Our experiments are conducted on a machine running Linux with an NVIDIA 1080 GPU.
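To make the choice of metrics concrete, the sketch below computes macro-averaged precision, recall, and F1 in plain Python. It assumes the common convention in which macro-F1 is the unweighted mean of the per-class F1 scores; the paper does not spell out its exact formula, and the function name and toy labels are hypothetical.

```python
def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0  # no predictions for c -> 0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Imbalanced toy labels: 8 negatives, 2 positives, and a classifier that
# always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
print(accuracy, round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
# 0.8 0.4 0.5 0.4444
```

The majority-class predictor reaches 80% accuracy but only about 0.44 macro-F1, which is why macro-averaged metrics are the more informative choice on label-imbalanced datasets.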

V-A RQ1: How does DVFL perform compared to other VFL methods in static settings?

Dataset  Metric  Non-Fed      Non-Fed   Hetero-NN  Hetero-SBt  DVFL
                 (without B)  (with B)
BCW      P       0.8479       0.9664    0.9320     0.9444      0.9484
BCW      R       0.9369       0.9667    0.9436     0.9563      0.9461
BCW      F1      0.8399       0.9661    0.9372     0.9497      0.9465
DCC      P       0.6938       0.7320    0.5819     0.6478      0.7151
DCC      R       0.6477       0.6715    0.6874     0.7572      0.6719
DCC      F1      0.6624       0.6902    0.5913     0.6731      0.6859
EPS5k    P       0.5523       0.6133    0.6051     0.5967      0.6085
EPS5k    R       0.5523       0.6103    0.6160     0.5967      0.6052
EPS5k    F1      0.5521       0.6074    0.5969     0.5967      0.6025
HAR      P       0.6659       0.9009    0.5015     0.8722      0.8982
HAR      R       0.6669       0.8989    0.5160     0.8712      0.8947
HAR      F1      0.6483       0.898     0.4293     0.8715      0.8936
Table II: Performance comparison of VFL methods in static scenarios

We use the following methods as baselines in static scenarios (i.e., =0).

  • Hetero-NN [DBLP:conf/ijcai/ZhangWWXP18]: Hetero-Neural Network is a neural network-based VFL method implemented in FATE. For each dataset we use, Hetero-NN has corresponding parameter settings in FATE (_quality/hetero_nn), so we use these settings in our experiments.

  • Hetero-SBt [DBLP:journals/corr/abs-1901-08755]: Hetero-Secure Boost is a decision tree-based VFL method implemented in FATE. For each dataset we use, Hetero-SBt has corresponding parameter settings in FATE (_quality/hetero_sbt), so we use these settings in our experiments.

  • Non-federated without party B: This model consists of an auto-encoding module and a classification module. Both modules are implemented in the same way as the auto-encoding module and classification module in DVFL, but the model only uses the data on party A for prediction. The result can be regarded as the lower bound of DVFL.

  • Non-federated with party B: This model consists of an auto-encoding module and a classification module. Both modules are implemented in the same way as the auto-encoding module and classification module in DVFL, but the encoded data of party A and party B is simply concatenated and then fed into the classifier. The result of this model can be (roughly) regarded as the upper bound of DVFL.

The results are displayed in Table II. From the table, we have the following observations.

First, compared with the other VFL methods, DVFL obtains the best F1 scores on three datasets (DCC, EPS5k, and HAR), while Hetero-SBt has the best F1 score on BCW. In general, when the prediction task is more complex (e.g., more features or more types of labels), the advantages of DVFL are more significant.

Second, we can notice that Hetero-SBt has the highest recall rate on the label imbalanced datasets DCC and BCW. This is because Hetero-SBt is a tree-based method whose hierarchical structure allows it to learn signals from both classes. However, the precision of the tree-based method is lower than that of the neural network-based approach, which affects the overall F1 score.

Third, the performance of Hetero-NN is not good, partly because it involves many encryption and decryption operations. With limited computing resources, it can only support simple models (such as fewer neural network layers and hidden units), which is insufficient for complex datasets.

V-B RQ2: Does DVFL perform well in dynamic data with different data distributions?

Mode        Timestamp  Class Ratio (Pos:Neg)   macro-F1
                                               Retrain  Fine-Tune  DVFL (Ours)  Joint Training
Random      0          16.7% : 16.7% (5:5)     0.572    0.572      0.572        0.572
Random      1          23.3% : 10.0% (7:3)     0.462    0.460      0.547        0.586
Random      2          20.0% : 13.3% (6:4)     0.550    0.522      0.546        0.555
Random      3          3.3% : 30.0% (1:9)      0.343    0.347      0.607        0.593
Random      4          13.3% : 20.0% (4:6)     0.581    0.561      0.580        0.602
Random      5          23.3% : 10.0% (7:3)     0.520    0.450      0.598        0.585
Asc vs Des  0          20.0% : 20.0% (1:1)     0.598    0.598      0.598        0.598
Asc vs Des  1          28.8% : 3.2% (9:1)      0.406    0.368      0.585        0.604
Asc vs Des  2          22.4% : 9.6% (7:3)      0.542    0.453      0.561        0.540
Asc vs Des  3          16% : 16% (5:5)         0.602    0.607      0.584        0.603
Asc vs Des  4          9.6% : 22.4% (3:7)      0.501    0.469      0.562        0.615
Asc vs Des  5          3.2% : 28.8% (1:9)      0.378    0.332      0.613        0.606
Parallel    0          50% : 20% (5:2)         0.545    0.545      0.545        0.545
Parallel    1          10% : 16% (5:8)         0.409    0.524      0.590        0.564
Parallel    2          10% : 16% (5:8)         0.562    0.547      0.609        0.612
Parallel    3          10% : 16% (5:8)         0.336    0.543      0.611        0.634
Parallel    4          10% : 16% (5:8)         0.541    0.508      0.629        0.573
Parallel    5          10% : 16% (5:8)         0.572    0.558      0.605        0.616
Uniform     0          25% : 25% (1:1)         0.599    0.599      0.599        0.599
Uniform     1          15% : 15% (1:1)         0.579    0.602      0.580        0.586
Uniform     2          15% : 15% (1:1)         0.575    0.609      0.590        0.610
Uniform     3          15% : 15% (1:1)         0.574    0.614      0.587        0.602
Uniform     4          15% : 15% (1:1)         0.580    0.612      0.580        0.607
Uniform     5          15% : 15% (1:1)         0.584    0.619      0.587        0.603
Table III: Performance comparison of different model update methods in dynamic scenarios

We evaluated the performance of DVFL under differently distributed data streams on the EPS5k dataset, which is a binary classification dataset. In our experiment, we assume that the data of party B arrives in several increments (timestamps 0 to 5 in Table III). At each timestamp, party B obtains a new dataset. Our task is to use this newly arrived data and the corresponding data in party A to train the classifier of DVFL in a privacy-preserving manner.

To measure the performance of DVFL under the different distribution of data streams, we use the following 4 modes of data distributions:

  • Random: For each timestamp, the ratio of positive and negative examples in the new data is random.

  • Asc vs Des: Over time, the number of positive examples in the new data gradually increases, while the number of negative examples gradually decreases.

  • Parallel: At each timestamp, the ratio of positive and negative examples in the new data is the same but imbalanced.

  • Uniform: At each timestamp, the ratio of positive and negative examples in the new data is 1:1.
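The four modes can be sketched as schedules of positive-class fractions, one per timestamp. This is only an illustration under stated assumptions: the function names, the 10% to 90% range for the ascending mode, the random draw in steps of 10%, and the fixed 5:8 ratio for the parallel mode are hypothetical choices, not the exact sampling procedure used in the experiments.

```python
import random


def positive_fractions(mode, steps, rng):
    """Fraction of positive examples in each new batch, one entry per timestamp."""
    if mode == "random":
        # A fresh ratio between 1:9 and 9:1 at every timestamp.
        return [rng.randint(1, 9) / 10 for _ in range(steps)]
    if mode == "asc_vs_des":
        # Positives rise linearly from 10% to 90% while negatives fall.
        return [0.1 + 0.8 * i / max(steps - 1, 1) for i in range(steps)]
    if mode == "parallel":
        # The same imbalanced ratio (here 5:8) at every timestamp.
        return [5 / 13] * steps
    if mode == "uniform":
        # Balanced 1:1 at every timestamp.
        return [0.5] * steps
    raise ValueError(f"unknown mode: {mode}")


def split_counts(batch_size, pos_fraction):
    """Number of positive and negative examples to draw for one batch."""
    n_pos = round(batch_size * pos_fraction)
    return n_pos, batch_size - n_pos


rng = random.Random(0)  # fixed seed so the random schedule is reproducible
schedules = {
    mode: positive_fractions(mode, 5, rng)
    for mode in ("random", "asc_vs_des", "parallel", "uniform")
}
for mode, fractions in schedules.items():
    print(mode, [split_counts(600, f) for f in fractions])
```

Each `(n_pos, n_neg)` pair says how many positive and negative rows a new batch of party B would contain at that timestamp.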

Finally, we use the trained classifier to classify the items in the test set and record the results. In the test set, the ratio of positive and negative examples is roughly 1:1. We use the following model update methods as our baselines:

  • Fine-tuning: Fine-tuning uses the new dataset to tune the current classifier with a small learning rate (0.1 times the original learning rate);

  • Joint Training: Joint training uses all the data seen so far to train a new classifier, which should give the highest achievable result in most cases.

  • Retrain: Retrain uses only the newly arrived dataset to train a new classifier.
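The difference between the three baselines comes down to which data each update sees, whether it starts from the current classifier, and which learning rate it uses. The sketch below records exactly that; the function name `training_plan`, the dictionary keys, and the toy batch sizes are hypothetical, and no actual model is trained.

```python
def training_plan(method, batches, t, base_lr):
    """Describe what one update at timestamp t trains on.

    batches[i] is the list of examples that arrived at timestamp i.
    """
    if method == "retrain":
        # New classifier, newly arrived data only.
        return {"data": list(batches[t]), "warm_start": False, "lr": base_lr}
    if method == "fine_tune":
        # Current classifier, newly arrived data only, 0.1x learning rate.
        return {"data": list(batches[t]), "warm_start": True, "lr": 0.1 * base_lr}
    if method == "joint":
        # New classifier, all data seen up to and including timestamp t.
        seen = [example for batch in batches[: t + 1] for example in batch]
        return {"data": seen, "warm_start": False, "lr": base_lr}
    raise ValueError(f"unknown method: {method}")


# Toy stream: 100 examples at timestamp 0, then 30 more at each later timestamp.
batches = [list(range(0, 100)), list(range(100, 130)), list(range(130, 160))]
plans = {
    method: training_plan(method, batches, t=2, base_lr=0.005)
    for method in ("retrain", "fine_tune", "joint")
}
sizes = {method: len(plan["data"]) for method, plan in plans.items()}
print(sizes)  # {'retrain': 30, 'fine_tune': 30, 'joint': 160}
```

The growing size of the joint-training set is the reason its per-timestamp training cost keeps increasing, while retraining and fine-tuning only ever touch the newest batch.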

The results are in Table III. From the table, we have the following observations.

First, in different modes, DVFL performs much better than retraining and fine-tuning, especially when the distribution of the new data is very different from that of the old data (e.g., a macro-F1 of 0.607 versus 0.343 and 0.347 at timestamp 3 of the random mode). This means that DVFL adapts better to changes in the distribution of dynamic data.

Second, in the mode where the data distribution is relatively stable (e.g., parallel, uniform), DVFL also has good performance. However, the performance difference between different methods in these modes is small, especially in the uniform mode. In theory, the performance of fine-tuning, joint training, and DVFL in uniform mode is basically the same.

Third, when the distributions of the training data and the test data are similar, joint training performs best. However, the training time required for joint training is much longer than that of the other methods. This is because, at each timestamp, joint training works on the entire dataset, while the other methods only work on the new data. It is worth noting that joint training is not the best in all cases in Table III. This is because the ratio of positive to negative examples in the test set is close to 1:1, but since the data in party B is dynamically increasing, the label distribution of the training data differs from that of the test data at some timestamps. At the final timestamp, party B has obtained all the data.

Fig. 1: Comparison of different model update methods on (a) BCW and (b) DCC (random mode)

To further evaluate the performance of DVFL on other datasets, we tested it in random mode on BCW and DCC. As shown in Fig. 1, the results on these two datasets are consistent with the results on EPS5k.

V-C RQ3: How does DVFL perform when the number of clients in the passive party increases?

Passive clients  P       R       F1
1                0.6300  0.6698  0.6280
2                0.5927  0.7382  0.5921
4                0.6036  0.7138  0.6084
6                0.5964  0.6934  0.5964
8                0.6078  0.6629  0.5839
10               0.5803  0.7637  0.5821
Table IV: Scalability evaluation of DVFL

To evaluate the scalability of DVFL, we tested its performance when the passive party has multiple clients. Specifically, we measure the scalability of DVFL on the EPS5k dataset, with the number of passive clients ranging from 1 to 10. Each client on the passive side has of . The results are shown in Table IV. As the number of clients on the passive party increases, the performance of the system remains stable: F1 stays between 0.5821 and 0.6280.

VI Conclusion

This paper proposes Dynamic Vertical Federated Learning (DVFL), a vertical federated learning method for dynamic data. Specifically, we use feature representation estimation and correction to enhance the data representation in the active party and then train a classifier on the active party for classification. DVFL is applicable to both dynamic and static scenarios. In a dynamic scenario, the data of the passive party increases dynamically, and the distribution of the data arriving at each timestamp may be different. The experimental results show that, under different distribution changes of dynamic data, DVFL is significantly better than fine-tuning and retraining in most cases. The performance of DVFL is slightly worse than that of joint training, but joint training is much slower than DVFL. A static scenario can be regarded as a special case of a dynamic scenario in which party B obtains all the data from the beginning. Experimental results show that the performance of DVFL in the static scenario is also competitive with the baseline methods.

References