A new machine learning paradigm, federated learning DBLP:journals/corr/McMahanMRA16, has emerged as an attractive solution to the data silo and privacy problem. The original federated learning framework focuses on enabling a large number of parties (devices) to collaboratively train a model without sharing their personal data. This framework is also referred to as horizontal federated learning (HFL) DBLP:journals/corr/abs-1902-04885. It was later extended DBLP:journals/corr/abs-1902-04885 to cross-organizational collaborative learning problems in which parties share the same users but hold different sets of features; this scenario is classified as vertical federated learning (VFL) DBLP:journals/corr/abs-1902-04885; Hardy2017PrivateFL.
Existing architectures for VFL still face several critical challenges. Communication overhead is a major bottleneck, because privacy-preserving computations such as Homomorphic Encryption (HE) Rivest1978; Acar:2018:SHE:3236632.3214303 and Secure Multi-party Computation (SMPC) Yao:1982:PSC:1382436.1382751 are typically applied to transmitted data, and privacy-preserving communication and computation are required at every iteration. In DBLP:journals/corr/McMahanMRA16, it is demonstrated experimentally that multiple local updates can be performed in HFL with federated averaging (FedAvg), effectively reducing the number of communication rounds. Whether such a multiple-local-update strategy is feasible in the VFL scenario is unknown, because in VFL each party possesses only a subset of the features and only one party has the label information.
In this paper, we propose an algorithm named Federated Stochastic Block Coordinate Descent (FedBCD), in which parties continuously perform local model updates (in either a parallel or sequential manner) and only need to synchronize occasionally. Block coordinate (gradient) descent (BCD) is a classical optimization algorithm bertsekas99 and has been extensively applied in areas such as signal/image processing and machine learning meisam14nips; Razaviyayn12SUM; WataoYinBCD; peng15; niu11; Beck13; Wright15; hong15busmm_spm. However, BCD and its variants have not been applied to the FL setting. We demonstrate that the communication cost can be significantly reduced by adopting FedBCD, and we provide a comprehensive convergence analysis and experimental evaluation.
2 Problem Definition
Suppose data owners collaboratively train a machine learning model based on a set of data . Suppose that the feature vector can be further decomposed into blocks , where each block belongs to one owner. Without loss of generality, assume that the labels are located in party . Let us denote the data set as , for , , and (where denotes the set ). Then the collaborative training problem can be formulated as
where denotes the training parameters of the th party; ; denotes the total number of training samples; and denote the loss function and regularizer, and is the hyperparameter. For a mini-batch of data , we use to denote its loss function.
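One way to write an objective consistent with this description is sketched below. All symbols are notation assumed here for illustration: K parties, N training samples, theta_k for the parameters of party k, Theta for their concatenation, f for the loss, gamma for the regularizer, lambda for the hyperparameter, and D_i for the i-th sample's data across all parties. Applying the regularizer to each party's block separately is also an assumption.

```latex
\min_{\Theta}\; \mathcal{L}(\Theta; \mathcal{D})
  \;=\; \frac{1}{N} \sum_{i=1}^{N} f(\theta_1, \dots, \theta_K; \mathcal{D}_i)
  \;+\; \lambda \sum_{k=1}^{K} \gamma(\theta_k),
\qquad \Theta = [\theta_1; \dots; \theta_K].
```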
A direct approach to optimize (1) is to use the vanilla stochastic gradient descent (SGD) algorithm given below
where denotes the stochastic partial gradient w.r.t. for (1). denotes information required from other parties to compute . We refer to the federated implementation of vanilla SGD as FedSGD, which requires pairwise communication of intermediate results at every iteration. This could be very inefficient, especially when is large or the task is communication-heavy.
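To make the per-iteration communication pattern concrete, the following minimal sketch simulates FedSGD for a feature-partitioned logistic regression in plain Python. The two-party synthetic data, the logistic loss, and the helper names (`partial_logits`, `make_data`, `fedsgd`, `accuracy`) are illustrative assumptions rather than the paper's implementation. No privacy-preserving layer is applied, and the residuals are computed centrally for brevity.

```python
import math
import random


def sigmoid(z):
    # Clip the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def partial_logits(theta, rows):
    """A party's intermediate result: its contribution to each sample's logit."""
    return [sum(t * x for t, x in zip(theta, row)) for row in rows]


def make_data(n=400, seed=1):
    """Synthetic two-party data (assumed): party A holds 2 features, party B holds 1."""
    rng = random.Random(seed)
    block_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    block_b = [[rng.gauss(0, 1)] for _ in range(n)]
    labels = [1.0 if a[0] - a[1] + 2.0 * b[0] > 0 else 0.0
              for a, b in zip(block_a, block_b)]
    return [block_a, block_b], labels


def fedsgd(blocks, labels, steps=300, lr=0.5, batch=16, seed=0):
    """Simulate FedSGD; return the parameter blocks and the communication rounds."""
    rng = random.Random(seed)
    thetas = [[0.0] * len(block[0]) for block in blocks]
    rounds = 0
    for _ in range(steps):
        idx = rng.sample(range(len(labels)), batch)
        rows = [[block[i] for i in idx] for block in blocks]
        # Communication happens at every iteration: all parties exchange
        # their partial logits for the current mini-batch.
        shared = [partial_logits(th, r) for th, r in zip(thetas, rows)]
        rounds += 1
        # Residuals sigmoid(z) - y, computed centrally here for brevity.
        resid = [sigmoid(sum(s[j] for s in shared)) - labels[i]
                 for j, i in enumerate(idx)]
        # Each party updates only its own parameter block.
        for th, r in zip(thetas, rows):
            for d in range(len(th)):
                th[d] -= lr * sum(e * x[d] for e, x in zip(resid, r)) / batch
    return thetas, rounds


def accuracy(thetas, blocks, labels):
    """Training accuracy of the joint model formed by all parameter blocks."""
    parts = [partial_logits(th, b) for th, b in zip(thetas, blocks)]
    preds = [1.0 if sum(p) > 0 else 0.0 for p in zip(*parts)]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In this simulation the number of communication rounds equals the number of gradient steps, which is the cost that motivates performing several local steps per round.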
3 The Proposed FedBCD Algorithms
In the parallel version of our proposed algorithm, called FedBCD-p, at each iteration each party performs Q consecutive local gradient updates in parallel before the parties communicate their intermediate results to each other; see Algorithm 1. This "multi-local-step" strategy is strongly motivated by our practical implementation (shown in the Experiments section), where we found that performing multiple local steps can significantly reduce the overall communication cost. Further, the strategy resembles the FedAvg algorithm in HFL, where each agent performs multiple local steps to update the full features. In the same spirit, a sequential version of the algorithm, termed FedBCD-s, allows the parties to update their local parameters sequentially, where each update consists of Q local updates without inter-party communication.
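The parallel variant can be sketched in the same spirit. The simulation below is a minimal illustration under stated assumptions, not the paper's Algorithm 1. It assumes a logistic loss, synthetic two-party data, reuse of the same mini-batch for all Q local steps of a round, and labels visible to the simulation; no encryption is applied. The names `fedbcd_p`, `make_data`, and `accuracy` are hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))


def dot_rows(theta, rows):
    """Contribution of one party's features to each sample's logit."""
    return [sum(t * x for t, x in zip(theta, row)) for row in rows]


def make_data(n=400, seed=1):
    """Synthetic two-party data (assumed): party A holds 2 features, party B holds 1."""
    rng = random.Random(seed)
    block_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    block_b = [[rng.gauss(0, 1)] for _ in range(n)]
    labels = [1.0 if a[0] - a[1] + 2.0 * b[0] > 0 else 0.0
              for a, b in zip(block_a, block_b)]
    return [block_a, block_b], labels


def fedbcd_p(blocks, labels, Q, total_local_steps=300, lr=0.5, batch=16, seed=0):
    """Each round: one exchange of intermediate results, then Q local steps per party."""
    rng = random.Random(seed)
    thetas = [[0.0] * len(block[0]) for block in blocks]
    rounds = 0
    for _ in range(total_local_steps // Q):
        idx = rng.sample(range(len(labels)), batch)
        rows = [[block[i] for i in idx] for block in blocks]
        # The only communication in this round: exchange partial logits.
        shared = [dot_rows(th, r) for th, r in zip(thetas, rows)]
        rounds += 1
        updated = []
        for k, (theta, r) in enumerate(zip(thetas, rows)):
            theta = list(theta)
            # Stale contribution of the other parties, frozen during local steps.
            others = [sum(s[j] for m, s in enumerate(shared) if m != k)
                      for j in range(batch)]
            for _ in range(Q):
                own = dot_rows(theta, r)  # recomputed locally, no communication
                resid = [sigmoid(o + w) - labels[i]
                         for o, w, i in zip(others, own, idx)]
                for d in range(len(theta)):
                    theta[d] -= lr * sum(e * x[d] for e, x in zip(resid, r)) / batch
            updated.append(theta)
        thetas = updated  # parties ran in parallel from the same synced state
    return thetas, rounds


def accuracy(thetas, blocks, labels):
    """Training accuracy of the joint model formed by all parameter blocks."""
    parts = [dot_rows(th, b) for th, b in zip(thetas, blocks)]
    preds = [1.0 if sum(p) > 0 else 0.0 for p in zip(*parts)]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

With the total number of local steps held fixed, setting Q to 1 recovers the per-iteration exchange of FedSGD, while a larger Q divides the number of communication rounds by Q. A sequential variant would run the loop over parties one at a time and refresh the shared results between parties.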
4 Convergence Analysis
Due to space limitations, our analysis focuses on FedBCD-p. Let denote the iteration index, where one round of local updates is performed in each iteration; let denote the latest iteration before in which synchronization was performed. Further, we use the "global" variable to collect the most recently updated parameters of each node at each iteration.
Assumption 1 (A1): Lipschitz Gradient. Assume that the loss function satisfies the following:
Assumption 2 (A2): Uniform Sampling. Assume that the data is partitioned into mini-batches , each with size ; at a given iteration, is sampled uniformly from these mini-batches.
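The sampling scheme in A2 can be illustrated with a short sketch. The function names are hypothetical, and dropping leftover samples so that all mini-batches have equal size is an assumption of this sketch.

```python
import random


def partition_minibatches(num_samples, batch_size, seed=0):
    """Shuffle sample indices once and cut them into equal-sized mini-batches.

    Leftover samples that do not fill a whole batch are dropped so that every
    mini-batch has the same size (an assumption of this sketch).
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    usable = num_samples - num_samples % batch_size
    return [indices[i:i + batch_size] for i in range(0, usable, batch_size)]


def sample_minibatch(batches, rng):
    """Draw one mini-batch uniformly at random from the fixed partition."""
    return rng.choice(batches)
```

Because the partition is fixed in advance, every mini-batch is selected with the same probability at each iteration.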
Theorem 1. Under the Lipschitz Gradient (A1) and Uniform Sampling (A2) assumptions, when the step size in the FedBCD algorithm satisfies , then for all , we have the following bound:
where denotes the global minimum of problem (1).
Remark 1. If we pick , , with any fixed the convergence speed is . This indicates that to achieve the same error compared with FedSGD, the communication rounds in the proposed algorithm can be reduced by a factor . To the best of our knowledge, this is the first time that such an rate has been proven for any algorithm with multiple local steps designed for the feature-partitioned collaborative learning problem.
Remark 2. If we consider the impact of the number of nodes and pick , , then the convergence speed is . This indicates that the proposed algorithm slows down as the number of parties involved grows.
MIMIC-III. MIMIC-III johnson2016mimic is a large database comprising information related to patients admitted to critical care units at a large tertiary care hospital. Following the data processing procedures of harutyunyan2017multitask, we obtain 714 features. We partition each sample vertically by its clinical features and perform an in-hospital mortality prediction task. We refer to this task as MIMIC-LR.
NUS-WIDE. The NUS-WIDE dataset Chua09nus-wide:a consists of low-level image features and text tag features extracted from Flickr images. We put 634 low-level image features on party B and 1000 textual tag features with ground-truth labels on party A. The objective is to perform the federated transfer learning (FTL) task studied in DBLP:journals/corr/abs-1812-03337. We refer to this task as NUS-FTL.
Default-Credit. The Default-Credit dataset Yehcredit2009 consists of credit card records. In our experiments, party A has labels and 18 features including six months of payment and bill balance data, whereas party B has 15 features of user profile data. We perform an FTL task, which we refer to as Credit-FTL.
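The vertical partitioning used in these tasks amounts to splitting each sample's feature vector by columns. The sketch below uses the 18/15 split described for Default-Credit. The helper name `vertical_split` is hypothetical, and assigning the first columns to party A is an assumption, since the actual column order of the dataset is not specified here.

```python
def vertical_split(rows, num_features_a):
    """Split every sample column-wise between two parties.

    Party A receives the first `num_features_a` features of each sample and
    party B receives the rest; both keep the samples in the same order so that
    row i on each side refers to the same user.
    """
    part_a = [row[:num_features_a] for row in rows]
    part_b = [row[num_features_a:] for row in rows]
    return part_a, part_b


# Toy table with 3 samples and 33 features each (placeholder values).
samples = [[float(i * 100 + j) for j in range(33)] for i in range(3)]
party_a, party_b = vertical_split(samples, 18)
```

Keeping the sample order aligned on both sides matters here, because the parties hold different features of the same users.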
5.1 Experimental Results
For all experiments, we adopt a decaying learning rate strategy with , where is optimized for each experiment. We observe similar convergence for FedBCD-p (Figure 1) and FedBCD-s (Figure 1) for various values of Q. By reasonably increasing the number of local iterations, we can reduce the overall communication cost by reducing the total number of communication rounds required. As we increase the number of parties to five and seventeen, the proposed method still performs well when we increase the local iterations for multiple parties. FedBCD-p is slightly slower than in the two-party case, but the impact of the number of nodes is very mild. To further investigate the relationship between the convergence rate and the number of local iterations Q, we evaluate the FedBCD-p algorithm on NUS-FTL over a large range of Q. Figure 1 illustrates that FedBCD-p achieves the best AUC with the fewest communication rounds when . For each target AUC, there exists an optimal Q. This shows that one needs to select Q carefully to achieve the best communication efficiency, as suggested by Theorem 1. Figure 1 also shows that for very large numbers of local iterations, FedBCD-p cannot converge to the AUC of . This phenomenon is also supported by Theorem 1: if Q is too large, the right-hand side of (3) may not go to zero.
Proximal Gradient Descent
We add a proximal term TianLi2019 when calculating gradients to alleviate potential divergence when the number of local iterations is large. We denote the proximal version of FedBCD-p as FedPBCD-p, and apply FedPBCD-p with to NUS-FTL for , 50 and 100. Figure 1 illustrates that if Q is too large, FedBCD-p fails to converge to optimal solutions, whereas FedPBCD-p converges faster and reaches a higher test AUC than the corresponding FedBCD-p.
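A minimal sketch of how a proximal term can limit drift during local steps is given below. It assumes the common form in which the local objective is augmented by (mu/2) times the squared distance to the parameters at the last synchronization. Whether FedPBCD-p uses exactly this form is an assumption, and the names `proximal_gradient`, `local_steps`, and `mu` are hypothetical.

```python
def proximal_gradient(grad, theta, theta_sync, mu):
    """Gradient of loss(theta) + (mu / 2) * ||theta - theta_sync||^2."""
    return [g + mu * (t - s) for g, t, s in zip(grad, theta, theta_sync)]


def local_steps(theta_sync, grad_fn, Q, lr, mu):
    """Run Q local gradient steps starting from the last synchronized parameters.

    `grad_fn(theta)` returns the local loss gradient; `mu = 0` disables the
    proximal term and reduces to plain local gradient descent.
    """
    theta = list(theta_sync)
    for _ in range(Q):
        g = proximal_gradient(grad_fn(theta), theta, theta_sync, mu)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


# Toy local loss 0.5 * (theta - 10)^2, whose minimizer is far from the sync point 0.
def toy_grad(theta):
    return [t - 10.0 for t in theta]


free_run = local_steps([0.0], toy_grad, Q=200, lr=0.1, mu=0.0)
prox_run = local_steps([0.0], toy_grad, Q=200, lr=0.1, mu=1.0)
```

In this toy problem the unregularized run moves all the way to the local minimizer, while the proximal run settles between the synchronized point and the local minimizer. This is the intended effect when Q is large and the information from other parties is stale.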
Implementation with HE
We investigate the efficiency of FedBCD-p running on an industrial VFL platform, FATE (https://github.com/FederatedAI/FATE), with homomorphic encryption (HE) applied, using the Credit-FTL task. Note that carefully selecting Q may reduce communication rounds but may also introduce computational overhead, because the total number of local iterations may increase. Table 1 shows that FedBCD-p with larger Q requires fewer communication rounds and less total training time: computation time increases mildly, while the number of communication rounds is reduced by more than 70 percent.
In this paper, we propose a framework that significantly reduces the number of communication rounds, a major bottleneck for vertical federated learning (VFL). We prove that the algorithm achieves global convergence with a decaying learning rate and a proper choice of Q.