1 Introduction
Machine learning technology has been introduced into a variety of services and has become a cornerstone of industrial society. Machine learning models require a large amount of data in most cases, and sometimes these data contain private information. Federated Learning (FL) is a technique to lessen the risk of privacy leakage while preserving the benefits of distributed machine learning systems [Konecny16, McMahan2017, Bonawitz19], in which agents, such as smartphones and automobiles, cooperatively train a machine learning model by communicating with a central server without disclosing private information.
While FL has received huge attention in recent years and has already been deployed in several services [hard2018federated, wwdc2019, leroy2019federated], there are still many technical challenges [mcmahan2021advances]. One of the biggest is data heterogeneity among agents: the datasets in FL consist of the local data of a set of agents, whose distributions typically differ, i.e., they are non-Identically and Independently Distributed (non-IID). The Stochastic Gradient Descent (SGD) method [Robbins51, Kiefer52] has been central in scaling learning algorithms to large models such as Deep Neural Networks (DNNs). Although it is well known that FedAvg, the most basic federated SGD algorithm, converges under certain conditions, it is still difficult to guarantee accuracy and convergence in general [zhao2018federated, li2020convergence]. Various approaches, including clustering techniques [sattler2020clustered, ghosh2020efficient], have been proposed to tackle this problem. However, these methods do not fit the case where the data within each agent are themselves heterogeneously distributed. To illustrate, consider a specific use case: a smartphone decides whether to connect to a specific WiFi access point within signal range by predicting the connection quality. Even for the same smartphone, different WiFi access points yield different connection quality due to differences in the properties of the access points; that is, the connection quality depends not only on the device but also on the access point. In this example, each agent (smartphone) holds data on a set of entities (WiFi access points), and each entity holds a set of connection records. We call this kind of structure a hierarchical data structure. Hierarchical data structures arise in many contexts, such as patient-hospital data in medical sciences and customer-product data in recommendation systems. Developing a single model to perform predictions on a non-IID hierarchical dataset is complex; especially for neural networks, whose generalization properties are not deeply understood, there is a risk of overfitting [Neyshabur17, Nagarajan19].
Furthermore, it can lead to prohibitive computational costs on edge devices, as well as slow convergence, leading in turn to even higher costs in communication and edge computing. To handle non-IID hierarchical data efficiently, we built a pure Bayesian inference model called Hierarchical Latent Class Regression (HLCR), together with FedHLCR, its extension to FL, by combining linear regressors in a hierarchical mixture inspired by Latent Dirichlet Allocation (LDA), which works well for hierarchical data. In addition, we propose an optimization algorithm based on Collapsed Gibbs Sampling (CGS) that guarantees significant acceleration in the convergence of HLCR. The key idea is to cluster the data in each agent by entity and to train a simple model per cluster without disclosing any private information. In this paper, we focus on the mixture of linear models, since it effectively avoids the overfitting problem.
Our contributions: To the best of our knowledge, this is the first research on the regression problem for hierarchical data in a Federated Learning setting.

We establish a purely probabilistic mixture model, called Hierarchical Latent Class Regression (HLCR), which mixes linear regression models. Its hierarchical structure allows HLCR to handle hierarchical data very well.

We propose an efficient Collapsed Gibbs Sampling algorithm for inferring an HLCR model with fast convergence.

We extend the Collapsed Gibbs Sampling algorithm of HLCR to Federated Learning (FedHLCR), preventing each agent's sensitive information from being disclosed.
2 Related Work
Federated Learning: FL [Konecny16, McMahan2017] is a powerful method for training a machine learning model in a distributed setting; however, it faces a variety of technical challenges, among which non-IID data distribution across agents is a central problem. Many researchers have studied the performance of FedAvg, similar FL algorithms, and extensions of SGD in non-IID settings [Lietal20, zhao2018federated, li2020convergence, karimireddy2020scaffold, ahmed2019, wu2021node]. [zhao2018federated] shows that the accuracy and convergence speed of FedAvg are reduced on non-IID data compared with IID data, and [Lietal20] proposes FedProx, which adds a proximal term to the objective function to prevent each local model from overfitting the data of its agent. Several studies suggest using multiple models to address the non-IID problem [smith2017MOCHA, sattler2020clustered, zantedeschi2020fully, fallar2020, jiang2019improving, mansour2020three, ghosh2020efficient]; some adopt multi-task learning [smith2017MOCHA, sattler2020clustered, zantedeschi2020fully] or meta-learning [fallar2020, jiang2019improving] to deal with multiple targets. Another approach applies the latent class problem to FL for clustering agents [ghosh2020efficient], assuming that each agent has data from a single entity; our method is more general in that each agent may hold data from multiple entities.
Mixture Models: Probabilistic mixture models have been studied for more than 100 years [newcomb1886generalized, 10.2307/90707]. The classical mixture model is the Gaussian Mixture Model (GMM) [bishop2006pattern], which can be inferred with an expectation-maximization (EM) algorithm and has been widely used in clustering. Our proposal in this paper is inspired by two classes of mixture models. The first is Latent Class Regression (LCR) [wedel1994review, Grun2008], also known as Mixture of Regressions. The second is topic models, which generally handle document-word data, including Latent Dirichlet Allocation (LDA) [blei2003latent] and its variants [wallach2009rethinking, blei2006correlated, teh2006hierarchical].
3 Mixture Models
In statistics, a mixture model is a probabilistic model composed of $K$ simple models, which are also called clusters. Each of these models has the same probability distribution but with different parameters. The probability density function of the $k$th model is denoted as $f(x \mid \theta_k)$, where $\theta_k$ is the parameter of the $k$th model. Hence, the density of the mixture model can be written as $f(x) = \sum_{k=1}^{K} \pi_k f(x \mid \theta_k)$, where the mixture weights $\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)$ form a simplex, i.e., $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$. In a Bayesian setting, the mixture weights and parameters are regarded as unknown random variables and are inferred from observations. Each observation generated from the mixture model can equivalently be generated in the following two steps: 1) a latent cluster label $c$ is sampled from a Categorical distribution with parameter $\boldsymbol{\pi}$; then 2) an observation is sampled from the corresponding model $f(x \mid \theta_c)$.
Notation:
In this paper, we use lowercase letters for scalars, bold lowercase letters for vectors, and bold uppercase letters for matrices. The set $\{1, 2, \dots, N\}$ is denoted as $[N]$. We summarize all notations used in the paper in Table 1.
Notation | Description | Notation | Description

$d$ | integer, the dimension of features | $K$ | integer, the number of clusters
$\alpha$, $\beta$ | positive scalars, priors of Dirichlet distributions | $\boldsymbol{\pi}$ | simplex, global Categorical distribution
$\boldsymbol{w}_k$ | $d$-dimensional coefficient vector of cluster $k$ | $\boldsymbol{\theta}_j$ | simplex, Categorical distribution of agent $j$
$N$ | the number of data | $N_k$ | the number of data in cluster $k$
$(\boldsymbol{x}, y)$ | one observed record | $c_{j,i}$ | cluster label ($c_{j,i} \in [K]$)
$\boldsymbol{c}$ | cluster labels of all data | $\boldsymbol{c}^{-j,i}$ | cluster labels of data except those of $(j,i)$
$\boldsymbol{x}$ | feature (column) vector of a record | $y$ | target value of a record
$\boldsymbol{X}$ | feature matrix of all records | $\boldsymbol{y}$ | target vector of all records
$\boldsymbol{X}_{j,i}$ | feature matrix of the records of $(j,i)$ | $\boldsymbol{y}_{j,i}$ | target vector of the records of $(j,i)$
$\boldsymbol{X}^{-j,i}$ | feature matrix of records except those of $(j,i)$ | $\boldsymbol{y}^{-j,i}$ | target vector of records except those of $(j,i)$
$\boldsymbol{X}_k$ | feature matrix of records in cluster $k$ | $\boldsymbol{y}_k$ | target vector of records in cluster $k$
$\boldsymbol{A}_k$ | matrix $\boldsymbol{X}_k^\top \boldsymbol{X}_k + \boldsymbol{I}$ | $\boldsymbol{b}_k$ | vector $\boldsymbol{X}_k^\top \boldsymbol{y}_k$
Latent Class Regression (LCR): Latent Class Regression [wedel1994review, Grun2008] (Figure 2(a)) is a supervised learning model defined as a mixture of regression models. Unlike unsupervised techniques such as the GMM, each component model is a probability distribution on a target $y$ given an observation $\boldsymbol{x}$, where the component could be any regression function. If, for instance, each component is a linear regression model, the distribution of $y$ can be written as a normal distribution with mean $\boldsymbol{w}_k^\top \boldsymbol{x}$, where $\boldsymbol{w}_k$ is the coefficient vector of the $k$th linear model and the variance is that of the white noise. As with the GMM, LCR can be trained by the EM algorithm.
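The two-step generative view of a mixture of linear regressions can be sketched as follows; the dimensions, weights, and noise level below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 2, 1000                      # clusters, feature dim, records (assumed)
pi = rng.dirichlet(np.ones(K))            # mixture weights on the simplex
W = rng.normal(size=(K, d))               # one coefficient vector per cluster
sigma = 0.1                               # std of the white noise

z = rng.choice(K, size=n, p=pi)           # step 1: sample latent cluster labels
X = rng.normal(size=(n, d))               # observed features
# step 2: each target is the cluster's linear prediction plus Gaussian noise
y = np.einsum("ij,ij->i", X, W[z]) + rng.normal(scale=sigma, size=n)
```

Fitting the reverse direction (recovering `W` and `z` from `X` and `y`) is exactly what the EM algorithm for LCR does.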
Latent Dirichlet Allocation (LDA): One of the best-known topic models is Latent Dirichlet Allocation (LDA) [blei2003latent] (Figure 2(b)). A topic model generally handles a set of documents, each of which is composed of a set of words from a dictionary (the set of all possible words). Such document-word data can be regarded as a hierarchical structure with two layers. In LDA, each topic is defined as a Categorical distribution over the words in the dictionary. Although EM cannot be directly used for inferring LDA, various inference techniques have been proposed, such as Variational Bayesian (VB) inference, Collapsed Variational Bayesian (CVB) inference, and Collapsed Gibbs Sampling (CGS). LDA with an asymmetric Dirichlet prior (ALDA) [wallach2009rethinking] (Figure 2(c)) is similar to general LDA with a symmetric prior. ALDA assumes documents are generated in four steps: 1) an asymmetric Categorical distribution (prior) $\boldsymbol{\pi}$ is sampled from a Dirichlet distribution with a symmetric parameter; 2) for each document $j$, a Categorical distribution $\boldsymbol{\theta}_j$ is sampled from a Dirichlet distribution with a parameter proportional to $\boldsymbol{\pi}$; 3) for each word position $i$ in document $j$, a latent cluster label $c_{j,i}$ is sampled from $\boldsymbol{\theta}_j$; and 4) the word at position $i$ in document $j$ is then sampled from the $c_{j,i}$th topic. $\boldsymbol{\pi}$ is shared by all documents, so it can be regarded as a global distribution over topics. ALDA increases the robustness of topic models and can be efficiently inferred by Collapsed Gibbs Sampling.
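A single collapsed Gibbs step for a topic model can be sketched as below. This is the standard CGS update for LDA with a symmetric prior; ALDA replaces the flat document-level prior `alpha` with weights derived from the shared asymmetric prior. All counts and sizes here are toy values of our choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 4, 50                      # topics, vocabulary size (assumed)
alpha, beta = 0.1, 0.01           # symmetric Dirichlet priors (assumed)
n_dk = rng.integers(0, 5, size=K)       # topic counts in the current document
n_kw = rng.integers(0, 5, size=(K, V))  # word counts per topic (current word removed)
n_k = n_kw.sum(axis=1)                  # total words per topic
w = 7                                    # index of the word being resampled

# p(z = k | rest) is proportional to (n_dk + alpha) * (n_kw[k, w] + beta) / (n_k + V * beta)
p = (n_dk + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
p /= p.sum()
z_new = rng.choice(K, p=p)               # sample the new topic label
```

The same remove-counts, compute-probabilities, sample, add-counts pattern reappears in the HLCR sampler of Section 4.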
4 Hierarchical Latent Class Regression
Our goal is to propose a regression model for predicting data with a hierarchical structure. The hierarchical structure considered in this paper is similar to, but more general than, the document-word structure of topic models. In the WiFi connection example, smartphones (agents) and WiFi access points (entities) play roles analogous to documents and words in topic models, respectively. In other words, the data from each smartphone cover multiple WiFi access points; however, unlike in document-word structures, where each word is a single record, each smartphone usually accesses each WiFi access point more than once. We call each connection an event. In this paper, we call this three-layer hierarchical structure an agent-entity-event structure. Another difference is that ours is a regression problem, so each event is composed of a feature vector, describing the condition of the connection, and a target value, describing the quality of the event.
4.1 Model
In order to handle the regression problem on agent-entity-event hierarchical data, we propose a Hierarchical Latent Class Regression (HLCR) model by introducing the hierarchical-structure mechanism of ALDA into LCR. It is assumed that there are $J$ agents, each labeled by $j \in [J]$. Each agent contains entities, each labeled by $i$, and each entity contains a set of events. Moreover, each event is composed of a $d$-dimensional column feature vector and a target scalar. In the mixture models introduced in the previous section, each record corresponds to one latent variable denoting its cluster. In our HLCR, however, we assume that all events corresponding to the same agent-entity pair belong to the same cluster, since they are assumed to have similar behavior. More specifically, all feature vectors and targets for any particular agent $j$ and entity $i$ share the same cluster label $c_{j,i}$. We denote the set of all feature vectors of the pair $(j,i)$ as a matrix $\boldsymbol{X}_{j,i}$, each row of which is one event's feature vector, and the set of all targets as a column vector $\boldsymbol{y}_{j,i}$.
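The agent-entity-event structure can be pictured as a nested container, where each agent-entity pair carries a feature matrix (one row per event) and a target vector, and shares a single cluster label. All names and values below are hypothetical.

```python
import numpy as np

# Hypothetical agent-entity-event data: agent "phone_A" holds, for each WiFi
# access point (entity), a feature matrix X (one row per connection event)
# and a target vector y (connection quality per event).
data = {
    "phone_A": {                                        # agent
        "wifi_1": {                                     # entity
            "X": np.array([[0.2, 1.0], [0.5, 1.0]]),    # one row per event
            "y": np.array([0.9, 0.7]),                  # quality per event
        },
        "wifi_2": {"X": np.array([[0.1, 1.0]]), "y": np.array([0.4])},
    },
}

# In HLCR, all events of one agent-entity pair share a single cluster label.
labels = {("phone_A", "wifi_1"): 0, ("phone_A", "wifi_2"): 2}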
The graphical model of HLCR (Figure 2) assumes that data are generated in the following steps: 1) each topic $k$ samples a $d$-dimensional coefficient vector $\boldsymbol{w}_k$ from a normal distribution; 2) the global Categorical distribution $\boldsymbol{\pi}$ is sampled from a Dirichlet distribution with a symmetric parameter; 3) for each agent $j$, a Categorical distribution $\boldsymbol{\theta}_j$ is sampled from a Dirichlet distribution with a parameter proportional to $\boldsymbol{\pi}$; 4) for each entity $i$ in agent $j$, a latent cluster label $c_{j,i}$ is sampled from the Categorical distribution $\boldsymbol{\theta}_j$; and 5) for each event in entity $i$ of agent $j$, the target value is generated by adding Gaussian noise to the inner product of the feature vector with the coefficient vector $\boldsymbol{w}_{c_{j,i}}$ of the corresponding topic. The prior distribution of $c_{j,i}$ in HLCR is the same as that in ALDA, and the probability of the target in HLCR is similar to that in LCR, with two differences. First, all events of an agent-entity pair in HLCR share one cluster label, while each word in LCR owns its own label. Second, we add a prior on the coefficient vectors in HLCR in order to derive the Collapsed Gibbs Sampling algorithm, whereas in LCR no such prior is needed for the EM algorithm.
4.2 Inference
In each iteration of Collapsed Gibbs Sampling, a new cluster label $c_{j,i}$ for every particular agent $j$ and entity $i$ is sampled sequentially, with all other cluster labels, denoted $\boldsymbol{c}^{-j,i}$, fixed. Hence, we need to evaluate the conditional probability $p(c_{j,i}=k \mid \boldsymbol{c}^{-j,i}, \boldsymbol{X}, \boldsymbol{y})$, in which all other random variables, $\boldsymbol{\pi}$, $\boldsymbol{\theta}_j$, and $\boldsymbol{w}_k$, are integrated out. From Bayes' theorem, we have
(1)  $p(c_{j,i}=k \mid \boldsymbol{c}^{-j,i}, \boldsymbol{X}, \boldsymbol{y}) \propto p(c_{j,i}=k \mid \boldsymbol{c}^{-j,i})\; p(\boldsymbol{y}_{j,i} \mid \boldsymbol{X}, \boldsymbol{y}^{-j,i}, c_{j,i}=k, \boldsymbol{c}^{-j,i})$
where $\boldsymbol{y}^{-j,i}$ denotes the targets of all records except those of $(j,i)$.
The first part in (1) can be computed by integrating $\boldsymbol{\pi}$ and $\boldsymbol{\theta}_j$ out of the joint probability of the cluster labels. Since the generative process of the labels is completely the same as that in [wallach2009rethinking], we obtain the same result, as follows.
(2)  $p(c_{j,i}=k \mid \boldsymbol{c}^{-j,i}) \propto n_{j,k}^{-j,i} + \alpha\, \dfrac{N_k^{-j,i} + \beta/K}{N^{-j,i} + \beta}$
where $n_{j,k}^{-j,i}$ denotes the number of entities in agent $j$ whose cluster labels are equal to $k$ (excluding entity $i$), and $N_k^{-j,i}$ denotes that number over all agents, with $N^{-j,i} = \sum_k N_k^{-j,i}$.
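The count-based prior term can be sketched as follows. The probability that entity $i$ of agent $j$ joins cluster $k$ grows with that agent's own count of entities already in $k$, smoothed by the global counts, mirroring the collapsed ALDA form; the exact smoothing constants below are our assumptions.

```python
import numpy as np

def prior_term(n_jk, N_k, alpha=1.0, beta=1.0):
    """Unnormalized count-based prior over clusters for one entity.

    n_jk: entities of agent j per cluster (with the current entity removed).
    N_k:  entities of all agents per cluster. Smoothing constants assumed.
    """
    K = len(n_jk)
    global_part = (N_k + beta / K) / (N_k.sum() + beta)  # global cluster weights
    return n_jk + alpha * global_part

n_jk = np.array([3.0, 0.0, 1.0])     # toy per-agent counts
N_k = np.array([40.0, 10.0, 25.0])   # toy global counts
p = prior_term(n_jk, N_k)
p = p / p.sum()                      # normalize into a Categorical distribution
```

Even a cluster with no local entities (`n_jk[1] == 0`) keeps nonzero prior mass through the global term, which is what lets the sampler move entities to new clusters.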
The second part, on the other hand, is the marginal likelihood of $\boldsymbol{y}_{j,i}$ given that the pair $(j,i)$ joins cluster $k$. The following theorem can be used to compute this probability.
Theorem 1.
For any particular $j$ and $i$, let $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$ be the data corresponding to agent $j$ and entity $i$. For $k \in [K]$, let $\boldsymbol{X}_k^{-j,i}$ and $\boldsymbol{y}_k^{-j,i}$ be the data whose cluster labels are equal to $k$, except $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$. Then, the conditional probability of $\boldsymbol{y}_{j,i}$ given the new cluster label $c_{j,i}=k$ is
(3)  $p(\boldsymbol{y}_{j,i} \mid \boldsymbol{X}, \boldsymbol{y}^{-j,i}, c_{j,i}=k, \boldsymbol{c}^{-j,i}) = \prod_{t=1}^{T_{j,i}} p(y_{j,i,t} \mid y_{j,i,1}, \dots, y_{j,i,t-1})$
where each factor obeys a normal distribution as follows.
(4)  $p(y_{j,i,t} \mid y_{j,i,1}, \dots, y_{j,i,t-1}) = \mathcal{N}\big(y_{j,i,t} \mid \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t-1)})^{-1} \boldsymbol{b}^{(t-1)},\; \sigma^2 (1 + \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t-1)})^{-1} \boldsymbol{x}_{j,i,t})\big)$
(5)  $\boldsymbol{A}^{(0)} = \boldsymbol{X}_k^{-j,i\,\top} \boldsymbol{X}_k^{-j,i} + \boldsymbol{I}$
(6)  $\boldsymbol{b}^{(0)} = \boldsymbol{X}_k^{-j,i\,\top} \boldsymbol{y}_k^{-j,i}$
It is therefore straightforward that
(7)  $\boldsymbol{A}^{(t)} = \boldsymbol{A}^{(t-1)} + \boldsymbol{x}_{j,i,t} \boldsymbol{x}_{j,i,t}^\top, \qquad \boldsymbol{b}^{(t)} = \boldsymbol{b}^{(t-1)} + y_{j,i,t}\, \boldsymbol{x}_{j,i,t}$
Furthermore, in order to evaluate this normal distribution, we have to compute the inverse of the $d \times d$ matrix $\boldsymbol{A}^{(t)}$, whose computational cost is generally $O(d^3)$. By combining the Woodbury matrix identity with (7), we obtain the following recursive rules, which reduce the computational cost to $O(d^2)$.
(8)  $(\boldsymbol{A}^{(t)})^{-1} = (\boldsymbol{A}^{(t-1)})^{-1} - \dfrac{(\boldsymbol{A}^{(t-1)})^{-1} \boldsymbol{x}_{j,i,t} \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t-1)})^{-1}}{1 + \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t-1)})^{-1} \boldsymbol{x}_{j,i,t}}$
(9)  $(\boldsymbol{A}^{(t-1)})^{-1} = (\boldsymbol{A}^{(t)})^{-1} + \dfrac{(\boldsymbol{A}^{(t)})^{-1} \boldsymbol{x}_{j,i,t} \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t)})^{-1}}{1 - \boldsymbol{x}_{j,i,t}^\top (\boldsymbol{A}^{(t)})^{-1} \boldsymbol{x}_{j,i,t}}$
Let $\boldsymbol{A}_k = \boldsymbol{X}_k^\top \boldsymbol{X}_k + \boldsymbol{I}$ and $\boldsymbol{b}_k = \boldsymbol{X}_k^\top \boldsymbol{y}_k$, which can be computed from the data whose current labels are equal to $k$. Hence, for any particular $j$ and $i$, there are two cases: 1) if the current $\boldsymbol{X}_k$ and $\boldsymbol{y}_k$ do NOT contain $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$, then the statistics excluding $(j,i)$ are simply $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$; and 2) if the current $\boldsymbol{X}_k$ and $\boldsymbol{y}_k$ do contain $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$, then the corresponding rows must first be removed with (9) and (7). One can then recursively compute the inverse matrices and intermediate vectors with (8) and (7), and by substituting them into (4), we can compute the mean and variance of the normal distribution and evaluate the conditional probability of the targets. Substituting this result and the prior (2) into (1), we can compute the conditional probability of $c_{j,i}$. Algorithm 1 shows how to sample the new label for specific $j$ and $i$.
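The recursive rules above amount to rank-one (Sherman-Morrison) updates of the inverse statistics when a single record is added or removed; `add_row` and `remove_row` are our names, and the identity prior scaling is an assumption.

```python
import numpy as np

def add_row(A_inv, b, x, y):
    """Update (A_inv, b) when the record (x, y) joins: A <- A + x x^T, b <- b + y x."""
    Ax = A_inv @ x
    A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
    return A_inv, b + y * x

def remove_row(A_inv, b, x, y):
    """Update (A_inv, b) when the record (x, y) leaves: A <- A - x x^T, b <- b - y x."""
    Ax = A_inv @ x
    A_inv = A_inv + np.outer(Ax, Ax) / (1.0 - x @ Ax)
    return A_inv, b - y * x

rng = np.random.default_rng(2)
d = 4
X = rng.normal(size=(10, d))
y = rng.normal(size=10)
A_inv = np.linalg.inv(X.T @ X + np.eye(d))   # O(d^3) only once
b = X.T @ y
A_inv2, b2 = add_row(A_inv, b, X[0], y[0])   # each update is O(d^2)
A_inv3, b3 = remove_row(A_inv2, b2, X[0], y[0])
```

Adding and then removing the same record is an exact round trip, which is what lets the sampler reuse one set of cluster statistics across label moves.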
We compute the matrices $\boldsymbol{A}_k$ and vectors $\boldsymbol{b}_k$ and pass them to Algorithm 1. For each $k$, the weight is initialized to the prior (2) in Line 3. $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ are updated in Lines 6-7, and the probability of the targets is computed in Line 9. The resulting weights are then proportional to the conditional probability (1). Finally, a new label is sampled from a Categorical distribution and returned.
4.3 Training Algorithm
Using Algorithm 1, we can easily implement the training algorithm of HLCR, shown in Algorithm 2. The algorithm starts by randomly initializing all latent variables $c_{j,i}$ in Lines 1-3; then $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ are initialized based on these latent variables in Lines 5-6. In each iteration, for each agent $j$ and each entity $i$, we remove $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$ from the $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ corresponding to the current $c_{j,i}$ in Lines 10-12, then sample a new latent variable with function SampleLabel of Algorithm 1 in Line 13, and finally add $\boldsymbol{X}_{j,i}$ and $\boldsymbol{y}_{j,i}$ to the $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ corresponding to this new $c_{j,i}$ in Lines 14-16.
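The remove-sample-add loop can be sketched end to end on toy data as below. For brevity, the label probability uses a plug-in Gaussian likelihood around each cluster's ridge mean rather than the exact collapsed marginal, and all names, constants, and the toy dataset are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def train(data, K, d, n_iter=30, alpha=1.0, sigma=0.1):
    """Simplified centralized CGS loop: data maps entity key -> (X, y)."""
    keys = list(data)
    labels = {key: int(rng.integers(K)) for key in keys}   # random init
    A = np.stack([np.eye(d) for _ in range(K)])            # A_k = I + sum X^T X
    b = np.zeros((K, d))                                   # b_k = sum X^T y
    n_k = np.zeros(K)                                      # entities per cluster
    for key in keys:
        X, y = data[key]
        c = labels[key]
        A[c] += X.T @ X; b[c] += X.T @ y; n_k[c] += 1
    for _ in range(n_iter):
        for key in keys:
            X, y = data[key]
            c = labels[key]
            A[c] -= X.T @ X; b[c] -= X.T @ y; n_k[c] -= 1  # remove own stats
            logp = np.empty(K)
            for k in range(K):
                w = np.linalg.solve(A[k], b[k])            # ridge mean of cluster k
                logp[k] = np.log(n_k[k] + alpha) \
                    - ((y - X @ w) ** 2).sum() / (2 * sigma ** 2)
            p = np.exp(logp - logp.max()); p /= p.sum()
            c = int(rng.choice(K, p=p))                    # sample new label
            labels[key] = c
            A[c] += X.T @ X; b[c] += X.T @ y; n_k[c] += 1  # add stats back
    return labels

# Toy data: 20 entities whose true coefficient is +1 (first half) or -1.
data = {}
for i in range(20):
    X = rng.normal(size=(5, 1))
    w_true = 1.0 if i < 10 else -1.0
    data[i] = (X, w_true * X[:, 0] + rng.normal(scale=0.1, size=5))
labels = train(data, K=2, d=1)
```

On this well-separated toy problem, the sampler groups the entities by their underlying coefficient within a few sweeps.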
4.4 Prediction
After training with Algorithm 2, we obtain cluster labels $c_{j,i}$ for all agents $j$ and entities $i$, as well as $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ for all $k \in [K]$. Suppose there are data for a particular agent $j$ and entity $i$. Given a new feature $\boldsymbol{x}$, its target can be predicted with HLCR from the variables obtained by the algorithm, in two steps. 1) Select a proper cluster label for the pair $(j,i)$ based on its previous data with Algorithm 1; in fact, if there are previous data of $j$ and $i$ in the training set, we can directly use the corresponding latent variable without sampling it again. Then 2) using the variables corresponding to that label, i.e., $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$, we can predict the target value as follows.
Theorem 2.
For any particular $j$ and $i$, let $\boldsymbol{X}$ and $\boldsymbol{y}$ be all data, $c_{j,i}=k$ be the cluster label corresponding to $j$ and $i$, and $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ be the variables trained by Algorithm 2. Then, given a new feature $\boldsymbol{x}$, the conditional expected value of the target is
(10)  $\mathbb{E}[y \mid \boldsymbol{x}] = \boldsymbol{x}^\top \boldsymbol{A}_k^{-1} \boldsymbol{b}_k$
This theorem indicates that the prediction result is the same as the solution of Ridge regression.
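Concretely, prediction with the selected cluster reduces to a ridge solve on that cluster's accumulated statistics; the unit prior scale below is an assumption, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
X_c = rng.normal(size=(50, d))               # training records of the cluster
beta_true = np.array([1.0, -2.0, 0.5])       # ground-truth coefficients (toy)
y_c = X_c @ beta_true + rng.normal(scale=0.05, size=50)

A = X_c.T @ X_c + np.eye(d)                  # A_k = X_k^T X_k + I (assumed scale)
b = X_c.T @ y_c                              # b_k = X_k^T y_k
beta_hat = np.linalg.solve(A, b)             # ridge solution A_k^{-1} b_k
x_new = np.array([0.3, -0.1, 0.7])
y_pred = x_new @ beta_hat                    # predicted target for x_new
```

With enough data per cluster, the ridge shrinkage is mild and the prediction tracks the true linear response closely.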
Discussion
HLCR is a mixture of linear regressions. Since HLCR is more expressive than a single linear model, it can be used to efficiently predict noisy data with various structures. The mechanism of HLCR differs from that of general regression models, such as Deep Neural Networks (DNNs). A regression model is generally a function mapping a feature to a target, so, given the same feature, the predicted target value is unique. In HLCR, however, different predictions can be made for identical input features if they originate from agent-entity pairs that belong to different clusters (models). This property makes HLCR fit hierarchical data better than general regression models do.
5 Federated Hierarchical Latent Class Regression
Federated Hierarchical Latent Class Regression (FedHLCR) is shown in Algorithm 3. At the beginning, the server initializes $\boldsymbol{A}_k$ and $\boldsymbol{b}_k$ without any data in Line 2. At the beginning of each iteration, each agent receives the global model trained in the previous iteration from the server (Line 6), trains on its local data in parallel based on the global model (Lines 7-10), and sends the training results to the server (Line 11); finally, the server accumulates the local training results received from the agents into the global model at the end of the iteration (Line 12). In Federated Learning, not all agents participate in the training process in each iteration, and no one can guarantee that the set of agents does not change between iterations (Line 5). In order to ensure the convergence of the algorithm, we smoothly update the model using a learning rate $\eta$ in Line 16.
There are several differences between FedHLCR in Algorithm 3 and centralized HLCR in Algorithm 2. First, HLCR updates the intermediate statistics whenever a label is sampled, while each agent in FedHLCR trains on its data independently and the server updates the global statistics only once per iteration. This saves communication costs and efficiently protects privacy. Second, the data of each entity in each agent in HLCR are trained once per iteration and the statistics are updated by summation, while in FedHLCR only a subset of agents participates in each iteration and the server updates the model smoothly (Line 16). This makes FedHLCR converge smoothly. Finally, HLCR removes an entity's data from the statistics before resampling its label, while FedHLCR does not remove them before sampling, since the weight of each record decays after several update steps (Line 16); this simplifies the algorithm.
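The server-side smoothed update described above can be sketched as follows; the exact update rule and all names are our assumptions about "smoothly update the model by using a learning rate" in Line 16.

```python
import numpy as np

def server_update(A_global, b_global, local_stats, eta=0.25):
    """Move the global statistics toward the sum of the participating
    agents' local statistics with learning rate eta (assumed rule)."""
    A_sum = sum(A for A, _ in local_stats)
    b_sum = sum(b for _, b in local_stats)
    A_global = (1.0 - eta) * A_global + eta * A_sum
    b_global = (1.0 - eta) * b_global + eta * b_sum
    return A_global, b_global

d = 2
A_g, b_g = np.eye(d), np.zeros(d)                       # server initialization
locals_ = [(np.eye(d) * 3.0, np.ones(d)),               # results from agent 1
           (np.eye(d), np.zeros(d))]                    # results from agent 2
A_g, b_g = server_update(A_g, b_g, locals_)
```

A smaller `eta` damps the iteration-to-iteration variation caused by the changing set of participating agents, at the cost of slower convergence.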
6 Experiments
We use synthetic and real data in order to systematically test the performance of the FedHLCR algorithm on heterogeneous datasets. All experiments are performed on a commodity machine with two Intel® Xeon® E5-2690 v3 CPUs @ 2.60GHz and 64GB of memory.
Synthetic data
The synthetic dataset SynthHLCR is generated according to the model in Figure 2. We control the average number of entities per agent and the average number of events per entity. We generate a series of such datasets with different noise levels, cluster numbers, and average numbers of categories per agent; each dataset is generated using 128 agents and 128 entities for various choices of these parameters. We then perform a 5-fold cross-validation on the generated data and train the FedHLCR algorithm with the same parameter choice as for the generation process. We observe that FedHLCR converges in fewer than 10 iterations. The results reported in Figure 3 (a) show that the converged model at high noise is optimal when agents have 20 datapoints on average. For small datasets at high noise, the signal-to-noise ratio is too low to correctly characterize the model, leading to agent-entity misclustering. When the dataset is small, we also observe a deviation from the optimal solution, caused by the relatively large prior.
Federated simulation
One aspect of Federated Learning is agent failure to report on the one hand, and deliberate agent sampling by the orchestrator on the other. We simulate this by randomly sampling a certain ratio of agents among all agents in each iteration. In Figure 3 (b) and (c), we show the accuracy (MSE) of FedHLCR for various fractions of selected agents per iteration and choices of the learning rate $\eta$. Smaller numbers of selected agents cause larger variations in the training data distribution of each iteration, thereby causing greater instability and worse performance; choosing a proper $\eta$ is therefore necessary to ensure convergence. Figure 3 (b) shows the convergence process for different values of $\eta$. We observe smooth convergence of the algorithm for small $\eta$, and faster convergence as $\eta$ grows. If $\eta$ is increased to 0.5, the accuracy increases faster at the beginning but convergence is no longer guaranteed, and if $\eta$ becomes even larger (0.75), the accuracy is highly unstable. Figure 3 (c) illustrates the relationship between $\eta$ and accuracy for different fractions of selected agents. It shows that, for each fraction, there exists an optimal value of $\eta$ for which good performance can be achieved in a limited number of iterations.
Real data
We set up a character-recognition experiment on the FEMNIST dataset [Caldasetal18, Cohenetal17] as packaged in Federated TensorFlow [tff]. This dataset stores images of handwritten characters together with an identifier of the writer (agent). Let us consider building a character-recognition system using only a fraction of the writers in the FEMNIST dataset. We use a Convolutional Neural Network (CNN) taken from [McMahan2017]¹, which we train on this subset of writers. The problem we try to solve is to use the pretrained CNN to separate digits from letters in the whole FEMNIST dataset. Binary classification can in practice be tackled using regression on a {0, 1} target. First, for each subset with a different fraction of writers, we train a CNN model on the subset. Then, we consider the following cases: 1) Baseline (CNN-only): the argmax of the vector output by the CNN is used to predict the character and decides the task (letter vs. digit); 2) LR: a Linear Regression is added after the CNN to perform the binary classification task; and 3) HLCR: an HLCR is added after the CNN to perform the binary classification task. Here, we run HLCR with writers as agents, using two different cluster numbers and a single entity per agent. Finally, we compare the results of the two HLCR variants to those of "Baseline" and "LR". The results are reported in Table 2. We see that the CNN training time on our CPU-only commodity machine increases linearly with the number of selected agents (around 30 minutes for 1% of agents over 100 iterations), while training the HLCR model after the CNN takes less than six minutes on the whole dataset.

¹The CNN has two 5x5 convolution layers of 32 and 64 channels, each followed by 2x2 max pooling, a fully connected ReLU layer of size 512, and a final softmax output layer of size 10, 36, or 62 for FEMNIST-digits, FEMNIST-nocaps, and FEMNIST-full, respectively.

AUC (%)  Computation Time (min.)

fraction (%) | Baseline | LR | HLCR₁ | HLCR₂ | Baseline | LR | HLCR₁ | HLCR₂
1  | 93.55 | 93.60 | 96.47 | 96.51 | 30  | 30.5  | 32.1  | 35.1
5  | 94.55 | 94.65 | 97.16 | 97.16 | 168 | 168.5 | 170.7 | 173.0
10 | 94.70 | 94.77 | 97.21 | 97.25 | 334 | 334.5 | 336.7 | 339.9
20 | 94.74 | 94.77 | 97.17 | 97.20 | 773 | 773.5 | 776.0 | 778.0
fraction (%) denotes the fraction of data used in the training of the CNN. The Computation Time is the elapsed time on our CPU-only commodity machine. CNN training is limited to 100 FL rounds using FedAvg and 5 local epochs.
7 Conclusion
In this paper, we proposed HLCR, a novel probabilistic model for dealing with noisy, hierarchical, tabular datasets. By applying the Collapsed Gibbs Sampling technique, we efficiently inferred HLCR models and theoretically guaranteed their convergence. Furthermore, we provided FedHLCR, an HLCR algorithm for Federated Learning that preserves the privacy of agents. Finally, the experimental results showed that the algorithm offers both fast convergence and good robustness to overfitting, even on non-IID datasets. Immediate future work is to extend the per-cluster model in FedHLCR to a nonlinear model.
Broader Impact
We consider the regression problem for the data held by each agent in Federated Learning. We assume that each agent contains data (events) belonging to different entities; such agent-entity-event hierarchical data widely exist in realistic application scenarios, such as smartphones, IoT devices, and medical data. In order to analyze and make predictions on such hierarchical data, we propose the Federated Hierarchical Latent Class Regression (FedHLCR) model, which is a mixture of linear regressions and therefore has richer expressiveness than a simple linear regression. With its hierarchical mixture approach, FedHLCR can handle hierarchical data fairly well, and, despite offering expressiveness comparable to complex models such as Deep Neural Networks, it can be more efficiently trained by Collapsed Gibbs Sampling. We expect that it will be widely used in various applications. On the other hand, each agent in FedHLCR sends its local training result to the server. This information may pose a privacy risk when an agent contains very few data. We suggest that FedHLCR deployments protect personal information by introducing other privacy-enhancing technologies, such as differential privacy, into FedHLCR.
References
8 Appendix
8.1 Proofs
Proof (of Theorem 4.2)
Although the targets here are NOT independent, in general we have
(11)  $p(\boldsymbol{y}_{j,i} \mid \cdot) = \prod_{t=1}^{T_{j,i}} p(y_{j,i,t} \mid y_{j,i,1}, \dots, y_{j,i,t-1}, \cdot)$
Let $\boldsymbol{A}$ and $\boldsymbol{b}$ denote the matrix $\boldsymbol{X}^\top \boldsymbol{X} + \boldsymbol{I}$ and the vector $\boldsymbol{X}^\top \boldsymbol{y}$, respectively. With Bayes' theorem,
(12)  
(13)  
(14)  
(15)  
(17)  
(18) 
where
(19)  
(20) 
By marginalizing the coefficient vector out, we can get
(21)  
(22)  