1 Introduction
Federated learning (FL) is an emerging machine learning framework where multiple clients (e.g., mobile devices or organizations) collaboratively train a machine learning (ML) model McMahan et al. (2017). FL must address several new challenges, including the difficulty of synchronizing multiple clients, the heterogeneity of data, and the privacy and security of clients’ data and, in some settings, of their local models. Because of these challenges, classic ML methods cannot be applied directly Kairouz et al. (2019).
A popular FL setting that partitions data among clients is called horizontal FL (HFL). Each client has data of a different set of subjects, and the data of every client have the same set of features Konečnỳ et al. (2016); Li et al. (2020). Examples of such data include smartphone users’ word-typing histories (from the same word dictionary), which are stored on individual devices and analyzed by the same features McMahan et al. (2017). One can apply HFL to learn a model for word or sentence completion. In the setting known as vertical FL (VFL), it is the features that are partitioned among clients, and all the clients share a common set of subjects Chen et al. (2020); Liu et al. (2019). Using more features generally helps build a more accurate model. For example, VFL can help an insurance company better predict someone’s risk using not just this person’s records at this company but also records from multiple other insurance businesses.
Once training is completed, HFL and VFL also have different prediction processes. In HFL, the jointly trained model is typically shared among the clients, so each client performs predictions alone. In VFL, while a client can predict using its local model based on its local features, more accurate predictions are made when more clients work together and use their jointly learned model that takes all the available features.
Motivation and main challenge. In practice, however, each client may possess only some subjects and some features. It is possible that no client has all features or all subjects. This is the case for financial institutions such as insurance providers, banks, and stock services providers, each of which may serve just a fraction of all customers and have only their partial records. This setting has been referred to as hybrid FL Konečnỳ et al. (2016); Li et al. (2020), and it is the setting we focus on in this paper. Both HFL and VFL are special cases of hybrid FL. Compared to HFL and VFL, hybrid FL has its unique challenges. Some specific ones pertaining to algorithm design are:

Client customized models. In hybrid FL, each client’s local data contain only a subset of the features, but at inference time some clients may need to handle data that have the full set of features. So the server is often required to maintain a copy of the full feature.

Limited data sharing. In typical HFL, the clients do not share their local data or labels, but in VFL, the labels are either made available to the server Chen et al. (2020) or stored in a designated client Liu et al. (2019). A Hybrid FL system needs to deal with both types of clients, so it is desirable that the training algorithm can operate without requiring the server to access any data, including the labels.

Sample synchronization. A typical issue with VFL (in which each client has some features of all the training samples) is that, when updating a given subset of features, all the clients need to draw the same (mini-batch of) samples; this problem is exacerbated in the hybrid FL system because not all the clients have all the samples. An ideal algorithm should work without any synchronization on the samples among the clients.
All the above points will directly translate to specific challenges when we design optimization algorithms, and it will become clear that none of the existing FL methods can meet all these requirements.
Related work on HFL. In HFL, a common algorithm is FedAvg Konečnỳ et al. (2016), which adopts the computation-then-aggregation strategy: the clients locally perform a few steps of model updates, and then the server aggregates the updated local models and averages them before sending the updated global model back. Beyond model communication, MIME Karimireddy et al. (2020) and SCAFFOLD Karimireddy et al. (2019) also send local gradients and other statistics to the server to achieve better convergence. Furthermore, PFNM Yurochkin et al. (2019) and FedMA Wang et al. (2020) use a parameter-matching-based strategy in place of the model averaging step to obtain better global model performance, and they do not require the global model to have the same size as the local models. All HFL algorithms assume that the clients’ data have the same size and format.
Related work on VFL. In VFL, the features, and thus the models, are separated on different clients Hardy et al. (2017); Ma et al. (2019); Liu et al. (2019); Chen et al. (2020). There are relatively few works on VFL. Federated Block Coordinate Descent (FedBCD) Liu et al. (2019) uses a parallel BCD-like algorithm to optimize the local blocks and transmits the information that the other clients need to compute their local gradients. Vertical Asynchronous Federated Learning (VAFL) Chen et al. (2020) assumes that the server holds the global inference model while local clients train the feature extractors that deal with the local features.
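The computation-then-aggregation strategy described above can be illustrated with a minimal sketch. It is not the implementation of any cited method: the scalar least-squares model, the toy data, the function names `local_sgd` and `fedavg_round`, and the choice of weighting clients by their sample counts are all assumptions made for illustration.

```python
import random

def local_sgd(w, data, lr=0.1, steps=5):
    """Run a few SGD steps on the scalar least-squares loss 0.5 * (w * x - y) ** 2."""
    for _ in range(steps):
        x, y = random.choice(data)   # draw one local sample
        w -= lr * (w * x - y) * x    # stochastic gradient step
    return w

def fedavg_round(w_global, client_data, lr=0.1, steps=5):
    """One communication round: local computation, then server-side averaging."""
    # Computation: every client starts from the current global model.
    local_models = [local_sgd(w_global, data, lr, steps) for data in client_data]
    # Aggregation: average the local models, weighted by local sample counts (assumed).
    sizes = [len(data) for data in client_data]
    return sum(n * w for n, w in zip(sizes, local_models)) / sum(sizes)

random.seed(0)
# Hypothetical toy data: every client observes y = 2 * x on different inputs.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0)],
    [(1.5, 3.0), (1.0, 2.0), (0.5, 1.0)],
]
w = 0.0
for _ in range(50):
    w = fedavg_round(w, clients)
```

In this toy problem all clients share the same underlying model, so the averaged model approaches the common slope; with heterogeneous clients the local models would drift apart between rounds.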
Our contributions. We summarize our main contributions as follows.

We propose a hybrid FL model that captures many key aspects of collaborative learning scenarios, where neither the subject set nor the feature set is necessarily complete at a client. Such a formulation can be tailored to meet different requirements for specific hybrid FL models. To our knowledge, this is the first concrete hybrid FL model in the literature.

We develop a convergent hybrid FL algorithm that enables knowledge transfer among clients, which at the same time maintains data locality and improves communication efficiency (by removing the sample synchronization requirement).

We evaluate the performance of the hybrid FL algorithm on real data and show that the federated model it learns achieves accuracy comparable to that of a model trained in a centralized setting.
2 Problem Formulation
We consider a total of samples, written as , where each has features . There are clients, where each has some samples and their partial features. The features have indices , sample data have indices , and clients have indices . If the th agent has the th sample, we write , where denotes the set of features known to the th agent.
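To make the data layout concrete, the following sketch records, for each client, which samples it owns and which features it knows, and then classifies the resulting setting. The client names, the index sets, and the helper `fl_setting` are hypothetical and serve only as an illustration of the partition described above.

```python
# Hypothetical toy example: client name -> (owned sample indices, known feature indices)
clients = {
    "bank":    ({0, 1, 2}, {0, 1}),
    "insurer": ({1, 2, 3}, {1, 2}),
    "broker":  ({0, 3},    {0, 2}),
}
ALL_SAMPLES = {0, 1, 2, 3}
ALL_FEATURES = {0, 1, 2}

def fl_setting(clients, all_samples, all_features):
    """Classify a partition as horizontal, vertical, or hybrid."""
    full_features = all(feats == all_features for _, feats in clients.values())
    full_samples = all(samps == all_samples for samps, _ in clients.values())
    if full_features:
        return "horizontal"   # clients differ only in the samples they own
    if full_samples:
        return "vertical"     # clients differ only in the features they know
    return "hybrid"           # neither samples nor features are complete everywhere

# Jointly, the clients still cover every sample and every feature.
covered_samples = set().union(*(samps for samps, _ in clients.values()))
covered_features = set().union(*(feats for _, feats in clients.values()))
print(fl_setting(clients, ALL_SAMPLES, ALL_FEATURES))
```

In this toy partition no single client holds all samples or all features, yet their union is complete, which is the situation the hybrid setting targets.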
A Generic Formulation. Consider a hybrid FL model consisting of an inference model and some feature extractors. For a given agent , if it has enough samples that are related to feature , then it will learn a feature extractor , which is approximately the same as a global feature extractor located at the server, i.e., . On the other hand, if it does not have enough samples containing feature , then it will not participate in learning . For a given agent , it will process an input sample by going through an embedding .
The embedding vector then goes through the inference model , also learned by agent , and produces an output (label). During the aggregation step, the server may also create global copies of the aggregated models . The setting discussed above is illustrated in Figure 1. Using the above notation, let us first set up a high-level problem formulation as below:
(1) 
where measures the accuracy of using the embedding and the local model to predict , and is a generic regularizer that encodes the prior knowledge about the global and local models. Such a formulation is general enough to capture several special hybrid FL settings. For example, it is easy to see that the existing VFL and HFL settings are both its special cases. Next, we will demonstrate how problem (1) can be customized to a practical hybrid FL problem.
A Feature Matching Based Hybrid FL Formulation. Specifically, we assume that both features and labels are private, and they are not shared with other clients nor with the server. Further, assume that the feature extractors for the same feature are approximately consensual, that is, we require for every agent who updates the th feature extractor. In this case, we denote the concatenates as , and denote the local data set as . Then the objective function of (1) can be separated into a sum of the following local objectives:
(2) 
Here indicates the local regularizer for client that enforces the consensus among the local feature extractors and regularizes the difference of the local inference model . Our main design effort will be devoted to finding the proper regularizer , which should have the following desired properties: 1) It helps enforce the consensus of and ; 2) It facilitates the learning of a global inference model from the local inference models . Since these two tasks are relatively separable, it is natural to express as:
(3) 
Notice that we can use any reasonable distance function to construct since and have the same dimension. However, it is not straightforward to construct , for the following reasons. First, the dimension of is much larger than each individual since is the inference model that takes all the feature extractors as inputs. Second, it is not easy to identify the relationship between different ’s parameters and combine them to yield a global .
To deal with these challenges, we adopt the matched averaging idea proposed by Yurochkin et al. (2019); Wang et al. (2020), which is used in the HFL setting to dynamically match the neurons of local models to build a global model. More specifically, it is assumed that the global and local are related through a linear mapping , where such a mapping should be dynamically optimized. One special case of is a matching matrix containing only one nonzero entry in each row. Using such a matching matrix ensures that each neuron in the global model is a linear combination of a set of most closely related neurons in the local models. While the idea of model matching has been explored in the recent works Yurochkin et al. (2019); Wang et al. (2020), here we design a matching strategy specifically for hybrid FL. Hence, the main contribution of this work is the overall problem formulation, and the model matching is just one integral component of it. By using the model matching strategy, the two regularizers in (1) can be expressed as follows:
(4) 
where measures the distance between the models. It is important to note that the matching patterns ’s have to be optimized as well. This leads to the following Hybrid FL problem:
(5)  
s.t. 
In this formulation, we jointly minimize the classification loss and the consensus loss, and is used to balance between the two losses. When is very small, the clients focus on training the local models, which can differ from each other. When is large, the emphasis is put on learning an accurate global model by integrating local information.
3 The Proposed Algorithm
We propose an algorithm for solving (5). The problem contains parameter blocks and , the global parameters , and the global parameters and , so we can update each of them given the others. The problem related to the local parameters , , is:
(6) 
the problem related to global feature extractors ’s is:
(7) 
and the third block related to the global inference model is:
(8) 
In view of its block structure, we propose a block coordinate descent type algorithm called Hybrid Federated Matched Averaging (HyFEM) to solve this problem, which is summarized in Algorithm 1.
In each iteration, the local clients first fix and , and optimize the local objective function (6) with gradient-based local solvers such as SGD. The form of this local problem is similar to FedProx Li et al. (2020), so we call it HyFEM-Prox. Then the server aggregates the local feature extractors and optimizes (7). For some common choices of distance function, such as the squared Euclidean norm, the problem has a closed-form solution. Finally, the server aggregates the local inference models and optimizes the global model matching problem (8). This subproblem can be optimized by another iterative procedure: a) randomly pick an index and apply the Hungarian matching algorithm to find fixing the other ’s; b) update by fixing . After a few rounds of updates, we obtain the matched global model and the corresponding matching pattern. In practice, the matching is performed layer by layer. Dummy neurons with zero weights are padded to match the sizes of different models. Due to space limitations, we skip some technical details.
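The two server-side steps can be sketched on a toy single-layer example. This is an illustration under several assumptions, not the authors’ implementation: neurons are plain weight vectors, all local layers have the same size (so no dummy-neuron padding is needed), every client is re-matched in every round instead of one randomly picked client, and an exhaustive search over permutations stands in for the Hungarian algorithm, which is only viable for tiny layers. All function names are hypothetical.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def consensus_average(local_vectors):
    """Closed-form minimizer of sum_i ||theta - theta_i||^2: the coordinate-wise mean."""
    n = len(local_vectors)
    return [sum(column) / n for column in zip(*local_vectors)]

def best_matching(global_neurons, local_neurons):
    """perm[j] is the local neuron assigned to global neuron j (brute-force search)."""
    k = len(global_neurons)
    return min(
        permutations(range(k)),
        key=lambda p: sum(sq_dist(global_neurons[j], local_neurons[p[j]]) for j in range(k)),
    )

def matched_average(local_models, rounds=3):
    """Alternate between (a) matching and (b) averaging the matched neurons."""
    global_neurons = [list(v) for v in local_models[0]]  # initialize from the first client
    perms = [tuple(range(len(global_neurons)))] * len(local_models)
    for _ in range(rounds):
        # (a) with the global model fixed, find each client's matching
        perms = [best_matching(global_neurons, model) for model in local_models]
        # (b) with the matchings fixed, each global neuron is the mean of its matched neurons
        global_neurons = [
            consensus_average([model[p[j]] for model, p in zip(local_models, perms)])
            for j in range(len(global_neurons))
        ]
    return global_neurons, perms

# Hypothetical local layers: client_b holds nearly the same neurons in swapped order.
client_a = [[1.0, 0.0], [0.0, 1.0]]
client_b = [[0.0, 1.1], [0.9, 0.0]]
global_model, matchings = matched_average([client_a, client_b])
```

Plain coordinate-wise averaging of `client_a` and `client_b` would blend unrelated neurons; matching first aligns the swapped neurons so that the average stays close to both local models.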
We have the following convergence results about the proposed algorithm.
Claim 1.
Suppose that for each , has Lipschitz continuous gradient w.r.t. and ; further assume that has Lipschitz gradient w.r.t. each of its arguments. Suppose that has a fixed dimension and its size is bounded. Then Algorithm 1 converges to a first-order stationary solution for problem (5).
This claim can be proved by observing that the proposed algorithm can be viewed as the classical block coordinate gradient descent (BCGD) algorithm, so classical results apply Zeng et al. (2019); Razaviyayn et al. (2013). It can also be extended to the case where the and steps are not solved to global optimality, but only to some approximate stationary solution of (6). Due to space limitations, we omit the detailed proofs.
Remark 1. We highlight the merits of the proposed approach: 1) Unlike the typical VFL formulations Liu et al. (2019); Chen et al. (2020), our approach keeps labels at the clients. Thus, the local problems are fully separable, and no sample synchronization is needed during local updates; 2) By utilizing the proposed merging technique, we can generate a global model at the server, which makes use of the full-featured data. This makes the testing stage flexible: the clients can process data with either partial features (by using their local parameters ), or the full data set (by requesting from the server).
Remark 2. Our formulation (5) and our proposed algorithm naturally reduce to existing FL models. For example, in the extreme case where all the local clients have all the features, the matching patterns are fixed, and the distance function is chosen as the squared norm (i.e., ), HyFEM becomes FedProx if the local problem is (6).
Remark 3. In practice, we perform inexact minimization for the local problem (6). As an alternative, locally we can ignore the regularizer and optimize the following local problem for a few iterations:
(9) 
We call this alternative HyFEM-Avg. Compared with HyFEM-Prox, its gradient estimation is easier, and it requires much less memory on the local clients.
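The difference between the two local update rules can be illustrated on a scalar toy problem. This is a sketch under strong simplifying assumptions rather than the paper’s local problems: the model is a single weight, the regularizer is a quadratic penalty pulling the local weight toward a fixed global value, and the names `prox_steps`, `avg_steps`, and `mu` are hypothetical.

```python
def local_grad(w, data):
    """Gradient of the mean least-squares loss 0.5 * (w * x - y) ** 2 over local data."""
    return sum((w * x - y) * x for x, y in data) / len(data)

def prox_steps(w_global, data, mu, lr=0.1, steps=20):
    """Prox-style local update: local loss plus (mu / 2) * (w - w_global) ** 2."""
    w = w_global
    for _ in range(steps):
        # the extra term mu * (w - w_global) requires keeping the global copy around
        w -= lr * (local_grad(w, data) + mu * (w - w_global))
    return w

def avg_steps(w_global, data, lr=0.1, steps=20):
    """Avg-style local update: same steps with the regularizer ignored."""
    return prox_steps(w_global, data, mu=0.0, lr=lr, steps=steps)

data = [(1.0, 2.0)]   # hypothetical local data whose best local weight is 2.0
w_avg = avg_steps(0.0, data)
w_prox = prox_steps(0.0, data, mu=1.0)
w_prox_strong = prox_steps(0.0, data, mu=9.0)
```

With the penalty switched off the local weight moves freely toward the local optimum, while a larger penalty weight keeps it closer to the global value, which mirrors the balance between local and global accuracy discussed for the penalty parameter.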
3.1 Experiment Settings
To evaluate the proposed algorithms, we conducted experiments using the ModelNet40 data set for multi-view object classification. It has a total of 40,000 samples from 40 classes. Each sample consists of 12 views from different angles, which are features of an object. We use clients during the experiment, and each client has data from only partial views in some of the classes. We use the convolutional layers of ResNet-34 for feature extraction for HyFEM and use an MLP with one hidden layer for local inference. For comparison, we also trained the model using the entire data set with all the features, which we label “centralized” in the figures.
Algorithm performance when training neural networks on the ModelNet40 data set with 4 views in total, where each client is trained with 3 views on 30,000 data points.
In the training phase, the local clients train with their partial data and partial features. In the testing phase, the local clients will test their local model with partial features on all the samples, and we average over all clients to obtain the averaged local accuracy. The global accuracy is computed using the matched global model on all samples with full features.
We uniformly set the total communication rounds and local update steps with HyFEM-Avg with mini-batch size on local clients during training. The initial learning rate is set to and decays by every 8 rounds of communication. In the case when , we select four views () as the full feature set. Each client has 30 classes of data and has views. Therefore, no client has all the samples or all the features, and the data distribution is heterogeneous.
Figure 3 shows the result when . In this setting, the local model and the global model trained with HyFEM perform well on the data set and even obtain higher accuracy than the centralized training. We believe this is due to the clients’ data heterogeneity, which helps prevent the model from overfitting.
In the more complicated case when , we set the total number of views to . Each client has only classes and views, and the way we divide the data and features is illustrated in Figure 2. From the figure, we can see that of the data are never used by any local client during training, and the local data sets are more heterogeneous than in the case when . In addition, we set the penalty weight and for HyFEM-Prox to understand how this parameter affects the global and the local accuracy. Figure 4 shows the testing loss and accuracy of the algorithms. The models trained in a federated manner perform worse than the centrally trained model, which is expected given the high data heterogeneity and the missing data. We can also observe a trade-off between the global and the local accuracy. When is small (), the local accuracy is high and the global accuracy is low; as becomes larger, the local accuracy drops while the global accuracy improves. This is intuitive, since a larger puts more emphasis on the global model integration.
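The learning-rate schedule can be written as a step decay. The sketch below assumes that “decays by” means multiplying the rate by a fixed factor once every 8 communication rounds; the function name and the numeric values in the usage line are placeholders, not the settings used in the experiments.

```python
def learning_rate(round_idx, initial_lr, decay_factor, period=8):
    """Step decay: multiply the rate by decay_factor once every `period` rounds."""
    return initial_lr * decay_factor ** (round_idx // period)

# Placeholder values for illustration only (not the paper's hyperparameters).
schedule = [learning_rate(r, initial_lr=0.1, decay_factor=0.5) for r in range(24)]
```

Rounds 0 to 7 use the initial rate, rounds 8 to 15 use it scaled once by the decay factor, and so on.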
References
Chen et al. (2020). VAFL: a method of vertical asynchronous federated learning. arXiv preprint arXiv:2007.06081.
Hardy et al. (2017). Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677.
Kairouz et al. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
Karimireddy et al. (2020). Mime: mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606.
Karimireddy et al. (2019). Scaffold: stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378.
Konečnỳ et al. (2016). Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
Li et al. (2020). Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60.
Liu et al. (2019). A communication efficient vertical federated learning framework. arXiv preprint arXiv:1912.11187.
Ma et al. (2019). Privacy-preserving tensor factorization for collaborative health data analysis. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1291–1300.
McMahan et al. (2017). Communication-efficient learning of deep networks from decentralized data. In Proc. Intl. Conf. Artificial Intell. and Stat., Fort Lauderdale, FL, pp. 1273–1282.
Razaviyayn et al. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization 23 (2), pp. 1126–1153.
Wang et al. (2020). Federated learning with matched averaging. In International Conference on Learning Representations (ICLR).
Yurochkin et al. (2019). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252–7261.
Zeng et al. (2019). Global convergence of block coordinate descent in deep learning. In International Conference on Machine Learning, pp. 7313–7323.