
Federated Latent Class Regression for Hierarchical Data

by   Bin Yang, et al.

Federated Learning (FL) allows a number of agents to participate in training a global machine learning model without disclosing locally stored data. Compared to traditional distributed learning, the heterogeneity (non-IID) of the agents slows down the convergence in FL. Furthermore, many datasets, being too noisy or too small, are easily overfitted by complex models, such as deep neural networks. Here, we consider the problem of using FL regression on noisy, hierarchical and tabular datasets in which user distributions are significantly different. Inspired by Latent Class Regression (LCR), we propose a novel probabilistic model, Hierarchical Latent Class Regression (HLCR), and its extension to Federated Learning, FedHLCR. FedHLCR consists of a mixture of linear regression models, allowing better accuracy than simple linear regression, while at the same time maintaining its analytical properties and avoiding overfitting. Our inference algorithm, being derived from Bayesian theory, provides strong convergence guarantees and good robustness to overfitting. Experimental results show that FedHLCR offers fast convergence even in non-IID datasets.



1 Introduction

Machine learning technology has been introduced into a variety of services and has become a cornerstone of industrial society. Machine learning models require a large amount of data in most cases, and sometimes these data contain private information. Federated Learning (FL) is a technique to lessen the risk of privacy leakage while preserving the benefits of distributed machine learning systems [Konecny16, McMahan2017, Bonawitz19], in which agents, such as smartphones and automobiles, cooperatively train a machine learning model by communicating with a central server without disclosing private information.

While FL has received huge attention in recent years and has already been deployed in several services [hard2018federated, wwdc2019, leroy2019federated], there are still many technical challenges [mcmahan2021advances]. One of the biggest challenges is data heterogeneity among agents. The datasets of FL are constituted of the local data of a set of agents, and their distributions are typically different, i.e., non-Independent and Identically Distributed (non-IID). The Stochastic Gradient Descent (SGD) method [Robbins51, Kiefer52] has been central in scaling learning algorithms to large models, such as Deep Neural Networks (DNN). Although it is well known that FedAvg, the most basic federated SGD algorithm, converges under certain conditions, it is still difficult to guarantee accuracy and convergence in general [zhao2018federated, li2020convergence]. Various approaches, including some clustering techniques [sattler2020clustered, ghosh2020efficient], have already been proposed to tackle this problem. However, those methods do not fit the case where each agent has data which is heterogeneously distributed.

To explain this aspect, let us introduce a specific use case. A smartphone system decides whether to connect to a specific Wi-Fi access point within the signal range by predicting the connection quality. Even for the same smartphone, different Wi-Fi access points result in different connection quality due to differences in the properties of access points. In other words, the connection quality depends not only on the devices but also on the access points. In the above example, each agent (smartphone) contains data in a set of entities (Wi-Fi access points) and each entity contains a set of connection records. We call this kind of structure a hierarchical data structure. A hierarchical data structure is encountered in many contexts, such as patient-hospital data in medical sciences and customer-product data in recommendation systems.

Developing a single model to perform predictions on a non-IID hierarchical dataset is complex. Especially in the case of neural networks, for which the generalization properties are not deeply understood, there is a risk of overfitting [Neyshabur17, Nagarajan19]. Furthermore, it can lead to prohibitive computational costs on edge devices, as well as slow convergence, leading in turn to even higher costs in communication and edge computing.

To handle non-IID hierarchical data efficiently, we built a pure Bayesian inference model called Hierarchical Latent Class Regression (HLCR) and FedHLCR, which is an extension of HLCR to FL, by combining linear regressors in a hierarchical mixture manner inspired by Latent Dirichlet Allocation (LDA), which works well for hierarchical data. In addition, we propose an optimization algorithm applying Collapsed Gibbs Sampling (CGS) and guarantee significant acceleration in the convergence of HLCR. The key point is to cluster data in each agent based on the entity and to train a simple model per cluster without disclosing any private information. In this paper, we only focus on the mixture of linear models since it effectively avoids the overfitting problem.

Our contributions: To the best of our knowledge, this is the first research on the regression problem for hierarchical data in a Federated Learning setting.

  • We establish a purely probabilistic mixture model, called Hierarchical Latent Class Regression (HLCR), whose components are linear regression models. Its hierarchical structure allows HLCR to handle hierarchical data well.

  • We propose an efficient Collapsed Gibbs Sampling algorithm for inferring an HLCR model with fast convergence.

  • We extend the Collapsed Gibbs Sampling algorithm of HLCR to Federated Learning (FedHLCR), preventing the sensitive information of each agent from being disclosed.

2 Related Works

Federated Learning: FL [Konecny16, McMahan2017] is a powerful method to train a machine learning model in the distributed setting; however, it has a variety of technical challenges, and non-IID data distribution among agents is a central problem. Many researchers have studied the performance of FedAvg and similar FL algorithms, and extensions of SGD, on non-IID settings [Lietal20, zhao2018federated, li2020convergence, karimireddy2020scaffold, ahmed2019, wu2021node]. [zhao2018federated] shows that the accuracy and convergence speed of FedAvg are reduced on non-IID data compared with IID data, and [Lietal20] proposes a method, called FedProx, which adds a proximal term to the objective function to prevent each local model from overfitting data in each agent. Several studies suggest using multiple models to address the non-IID problem [smith2017MOCHA, sattler2020clustered, zantedeschi2020fully, fallar2020, jiang2019improving, mansour2020three, ghosh2020efficient], and some researchers adopt multitask learning [smith2017MOCHA, sattler2020clustered, zantedeschi2020fully] or meta learning [fallar2020, jiang2019improving] to deal with multiple targets. Another approach applies the latent class problem to FL for clustering agents [ghosh2020efficient], but it assumes that each agent has data in one entity, whereas our method is more general in that each agent is considered to have data in multiple entities.

Mixture Models: Probabilistic Mixture Models have been studied for more than 100 years [newcomb1886generalized, 10.2307/90707]. The classical Mixture Model is the Gaussian Mixture Model (GMM), which can be inferred with an expectation-maximization (EM) algorithm and has been widely used in clustering. Our proposal in this paper is inspired by two classes of Mixture Models. The first is Latent Class Regression (LCR) [wedel1994review, Grun2008], or Mixture of Regression. The second class is topic models which generally handle document-word data, including Latent Dirichlet Allocation (LDA) [blei2003latent] and its variants [wallach2009rethinking, blei2006correlated, teh2006hierarchical].

3 Mixture Models

In statistics, a mixture model is a probabilistic model composed of $K$ simple models, which are also called clusters. Each of these models has the same probability distribution but with different parameters. The probability density function of the $k$-th model is denoted as $p(x \mid \phi_k)$, where $\phi_k$ is the parameter of the $k$-th model. Hence, the density of the mixture model can be denoted as $p(x) = \sum_{k=1}^{K} \theta_k \, p(x \mid \phi_k)$, where the mixture weight $\boldsymbol{\theta}$ lies on the probability simplex; i.e., $\theta_k \ge 0$, $\sum_{k=1}^{K} \theta_k = 1$. In a Bayesian setting, the mixture weights and parameters are regarded as unknown random variables and will be inferred from observations. Each observation generated from the mixture model $p(x)$ can be equivalently generated by the following two steps: 1) a latent cluster label $z$ is sampled from the categorical distribution with parameter $\boldsymbol{\theta}$; then 2) an observation $x$ is sampled from the corresponding model $p(x \mid \phi_z)$.
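The two-step generative view can be illustrated with a minimal sketch. It assumes one-dimensional Gaussian components; the function name `sample_mixture` and all parameter values are illustrative and do not come from the paper.

```python
import numpy as np

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n observations from a 1-D Gaussian mixture (assumed component family)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Step 1: sample a latent cluster label for every observation.
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: sample each observation from the component chosen by its label.
    x = rng.normal(np.asarray(means)[z], np.asarray(stds)[z])
    return z, x

z, x = sample_mixture(1000, weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[0.5, 1.0])
```

The labels `z` are returned only for inspection; in inference they are the latent variables to be recovered.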


In this paper, we use lowercase letters for scalars, bold lowercase letters for vectors, and bold uppercase letters for matrices. The set $\{1, \dots, N\}$ is denoted as $[N]$. We summarize all notations used in the paper in Table 1.

Notation Description Notation Description
integer, the dimension of features integer, the number of clusters
positive scalar, prior of distribution , , -simplex, categorical distribution
-dimensional coefficient vector simplex, categorical distribution
the number of data the number of data in cluster
one observed record cluster label ()
cluster labels of all data cluster labels of data except
, feature (column) vector of a record , target value of a record
feature matrix of all records target vector of all records
feature matrix of records ; target vector of records ;
i.e., i.e.,
feature matrix of records except target vector of records except
feature matrix of records in cluster target vector of records in cluster
-matrix -vector
Table 1: Notations
Figure 1: Mixture Models
Figure 2: Hierarchical Latent Class Regression

Latent Class Regression (LCR): Latent Class Regression [wedel1994review, Grun2008] (Figure 2(a)) is a supervised learning model that can be defined as a mixture of regression models $p(y \mid \mathbf{x}) = \sum_{k=1}^{K} \theta_k \, p(y \mid \mathbf{x}, \phi_k)$, in which $\phi_k$ is the parameter of the $k$-th model ($k = 1, \dots, K$). Unlike unsupervised learning techniques such as the GMM, each model is a probability on a target $y$ given an observation $\mathbf{x}$, where the model could be any regression function. If, for instance, $p(y \mid \mathbf{x}, \phi_k)$ is a linear regression model, the distribution of $y$ can be denoted as $\mathcal{N}(y \mid \mathbf{w}_k^\top \mathbf{x}, \sigma^2)$, where $\mathbf{w}_k$ is the coefficient of the $k$-th linear model and $\sigma^2$ is the variance of the white noise. As with the GMM, LCR can be trained by the EM algorithm.
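As an illustration of EM training for a mixture of linear regressions, here is a minimal sketch. It assumes a noise variance shared by all clusters and a tiny ridge term for numerical stability; the function name `lcr_em` and these choices are assumptions for the example, not the cited implementations.

```python
import numpy as np

def lcr_em(X, y, K, n_iter=50, seed=0):
    """EM for a K-component mixture of linear regressions (shared noise variance)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    W = rng.normal(size=(K, M))          # one coefficient vector per cluster
    theta = np.full(K, 1.0 / K)          # mixture weights
    sigma2 = float(np.var(y)) + 1e-8     # initial noise variance
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to theta_k * N(y_n | w_k^T x_n, sigma2)
        resid = y[:, None] - X @ W.T
        log_r = np.log(theta) - 0.5 * resid**2 / sigma2
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per cluster (small ridge term for stability)
        for k in range(K):
            Xw = X * r[:, k:k + 1]
            W[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(M), Xw.T @ y)
        theta = np.clip(r.mean(axis=0), 1e-12, None)
        theta /= theta.sum()
        resid = y[:, None] - X @ W.T
        sigma2 = max(float((r * resid**2).sum() / N), 1e-8)
    return W, theta, sigma2, r
```

EM only finds a local optimum, so in practice several random restarts (different `seed` values) are usually compared.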

Latent Dirichlet Allocation (LDA) One of the most famous topic models is Latent Dirichlet Allocation (LDA) [blei2003latent] (Figure 2(b)). A topic model generally handles a set of documents, each of which is composed of a set of words from a dictionary (the set of all possible words). This kind of document-word data can be regarded as a hierarchical structure with two layers. In LDA, each topic is defined as a Categorical distribution over words in a dictionary with probability . Although EM cannot be directly used for inferring LDA, various inference techniques have been proposed, such as Variational Bayesian (VB) inference, Collapsed Variational Bayesian (CVB) inference, and Collapsed Gibbs Sampling (CGS). LDA with an asymmetric Dirichlet prior (ALDA) [wallach2009rethinking] (Figure 2(c)) is similar to general LDA with a symmetric prior. ALDA assumes documents are generated in four steps: 1) an asymmetric Categorical distribution (prior) is sampled from a Dirichlet distribution with a symmetric parameter ; 2) for document , a Categorical distribution is sampled from a Dirichlet distribution with a parameter ; 3) for word in document , a latent cluster label is sampled from ; and 4) the word in document , denoted as , is then sampled from the -th topic, . is the prior of all documents, so it can be regarded as a global distribution of topics. ALDA increases the robustness of topic models and can be efficiently inferred by Collapsed Gibbs Sampling.

4 Hierarchical Latent Class Regression

Our goal is to propose a regression model for predicting data with a hierarchical structure. The hierarchical structure considered in this paper is similar to, but more general than, the document-word structure in the topic model. In the Wi-Fi connection example, smartphones (agent) and Wi-Fi access points (entity) are similar to documents and words in topic models, respectively. In other words, the data from each smartphone contains multiple Wi-Fi access points; however, unlike in the document-word structures in which each word is a simple record, each smartphone usually accesses each Wi-Fi access point more than once. We call each connection an event. In this paper, we generally call this three-layer hierarchical structure an agent-entity-event structure. Another difference is that our problem is a regression problem, so each event is composed of a feature vector, denoting the condition of the connection, and a target value, denoting the quality of the event.

4.1 Model

In order to handle the regression problem on the agent-entity-event hierarchical data, we propose a Hierarchical Latent Class Regression (HLCR) model by introducing the mechanism of hierarchical structure in ALDA to LCR. It is assumed that there are agents, each of which is labeled by . Each agent contains entities, each of which is labeled by and contains events. Moreover, each of the events is composed of an -dimensional column vector and a target scalar (). In the mixture models introduced in the previous section, each record corresponds to a latent variable , denoting its cluster. In our HLCR, however, we assume that all events corresponding to the same agent-entity pair belong to the same cluster, since they are assumed to have similar behavior. More specifically, all s and s () for any particular and share the same cluster label . We denote the set of all and the set of all () as an -matrix , each row of which is , and -dimensional column vector .

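The graphical model of HLCR (Figure 2) assumes that data are generated in the following steps: 1) each topic $k$ samples an $M$-dimensional vector $\mathbf{w}_k$ from a normal distribution, $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \sigma_0^2 \mathbf{I})$; 2) the global Categorical distribution $\boldsymbol{\theta}$ is sampled from a Dirichlet distribution with a symmetric parameter, $\boldsymbol{\theta} \sim \mathrm{Dir}(\beta)$; 3) for agent $i$, a Categorical distribution $\boldsymbol{\theta}_i$ is sampled from a Dirichlet distribution, $\boldsymbol{\theta}_i \sim \mathrm{Dir}(\alpha \boldsymbol{\theta})$; 4) for entity $j$ in agent $i$, a latent cluster label $z_{ij}$ is sampled from a Categorical distribution, $z_{ij} \sim \mathrm{Cat}(\boldsymbol{\theta}_i)$; and 5) for event $l$ in entity $j$ in agent $i$, the target value $y_{ijl}$ is then generated by adding a Gaussian noise to $\mathbf{w}_{z_{ij}}^\top \mathbf{x}_{ijl}$ corresponding to the $z_{ij}$-th topic, $y_{ijl} \sim \mathcal{N}(\mathbf{w}_{z_{ij}}^\top \mathbf{x}_{ijl}, \sigma^2)$. The prior distribution of $z_{ij}$ in HLCR is the same as that in ALDA, and the probability of $y_{ijl}$ in HLCR is similar to that in LCR except for two differences. First, all events of one agent-entity pair in HLCR share one $z_{ij}$, while each record in LCR owns its own label. Second, we add a prior for $\mathbf{w}_k$ in HLCR for deriving the Collapsed Gibbs Sampling algorithm, while $\mathbf{w}_k$ in LCR has no prior since it is not necessary for an EM algorithm.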
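The five generative steps can be sketched as a small data generator. Several details are assumptions made for this example rather than facts from the paper: the prior on the coefficients is taken as zero-mean and isotropic, features are standard normal, every agent has the same number of entities and events, and the names `generate_hlcr`, `alpha`, `beta`, `sigma0` and `sigma` are illustrative.

```python
import numpy as np

def generate_hlcr(n_agents, n_entities, n_events, K, M,
                  alpha=5.0, beta=1.0, sigma0=1.0, sigma=0.1, seed=0):
    """Generate agent-entity-event data following the five steps (assumed priors)."""
    rng = np.random.default_rng(seed)
    # 1) one coefficient vector per cluster from a zero-mean normal prior
    W = rng.normal(0.0, sigma0, size=(K, M))
    # 2) global categorical distribution from a symmetric Dirichlet
    theta = rng.dirichlet(np.full(K, beta))
    data = []
    for i in range(n_agents):
        # 3) agent-level distribution centred on the global one
        theta_i = rng.dirichlet(alpha * theta + 1e-12)
        for j in range(n_entities):
            # 4) a single cluster label shared by all events of the pair (i, j)
            z = rng.choice(K, p=theta_i)
            # 5) targets: linear prediction of cluster z plus Gaussian noise
            X = rng.normal(size=(n_events, M))
            y = X @ W[z] + rng.normal(0.0, sigma, size=n_events)
            data.append((i, j, z, X, y))
    return W, theta, data
```

A call such as `generate_hlcr(4, 3, 5, K=2, M=3)` returns the true coefficients, the global cluster distribution, and one `(agent, entity, label, X, y)` tuple per agent-entity pair.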

4.2 Inference

In each iteration of Collapsed Gibbs Sampling, a new cluster label for every particular and is sampled sequentially, with all other cluster labels, denoted as , being fixed. Hence, we need to evaluate the conditional probability , in which all other random variables, , and

, are integrated out. From Bayes’ theorem, we have


where denotes .

The first part in (1) can be computed by integrating out and from , which is equal to . Since the generative process of is completely the same as that in [wallach2009rethinking], we obtain the same result as follows.


where denotes the number of entities in agent , whose cluster labels are equal to , i.e., , and denotes that number in all agents, i.e., .
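A minimal sketch of how such a count-based prior can be evaluated is given below. The closed form used here is a common expression for a two-level Dirichlet prior built from agent-level and global counts; it is an assumption of this example and may differ in detail from the expression referred to above. The names `label_prior`, `n_agent`, `n_global`, `alpha` and `beta` are illustrative.

```python
import numpy as np

def label_prior(n_agent, n_global, alpha, beta):
    """Prior probability of each cluster label for one agent-entity pair.

    n_agent[k]:  entities of this agent currently in cluster k (pair itself excluded)
    n_global[k]: the same count over all agents
    The functional form below is an assumed two-level Dirichlet smoothing.
    """
    n_agent = np.asarray(n_agent, dtype=float)
    n_global = np.asarray(n_global, dtype=float)
    K = len(n_agent)
    # global level: counts over all agents smoothed by a symmetric prior
    global_part = (n_global + beta / K) / (n_global.sum() + beta)
    # agent level: local counts smoothed towards the global distribution
    return (n_agent + alpha * global_part) / (n_agent.sum() + alpha)

p = label_prior([2, 0, 1], [10, 5, 5], alpha=1.0, beta=1.0)
```

The result is a normalized vector, and clusters that are frequent either in the agent or globally receive more prior mass.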

The second part , on the other hand, is equal to . The following theorem can be used to compute this probability.


For any particular and , let and be the data corresponding to and . For , let and be the data whose cluster labels are equal to except and . Then, the conditional probability of given the new cluster label is


where obeys a normal distribution as the following.


It is therefore straightforward that


Furthermore, in order to evaluate this normal distribution, we have to compute the inverse of ()-matrix , where the computational cost is generally . By combining the Woodbury matrix identity with (7), we obtain the following recursive rules that reduce the computational cost to .


Let , and , which can be computed from the data whose current labels are equal to . Hence, for any particular and , there are two cases: 1) if the current , and do NOT contain and , then and ; and 2) if the current , and contain and , then and . Then, one can recursively compute and with (9) and (7). By obtaining and , we can recursively compute and () with (8) and (7), and by substituting them into (4), we can compute the mean and variance of the normal distribution, and then evaluate the conditional probability of (4). Substituting this result and the prior of (2) into (1), we can compute the conditional probability of (). Algorithm 1 shows how to sample the new label for specific and .
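The idea behind the recursive rules can be sketched with the Woodbury matrix identity. The sketch assumes that each cluster keeps the inverse of a symmetric matrix of the form $\lambda \mathbf{I} + \mathbf{X}^\top \mathbf{X}$ and that a block of $n$ rows is added to or removed from the cluster; the function names are hypothetical and the exact recursions of the paper may be organized differently.

```python
import numpy as np

def woodbury_add(A_inv, X):
    """Return (A + X^T X)^{-1} from A^{-1}, where X is an (n x M) block of rows."""
    n = X.shape[0]
    AX = A_inv @ X.T                     # M x n
    S = np.eye(n) + X @ AX               # n x n system instead of an M x M inverse
    return A_inv - AX @ np.linalg.solve(S, AX.T)

def woodbury_remove(A_inv, X):
    """Return (A - X^T X)^{-1} from A^{-1} (rows of X must already be part of A)."""
    n = X.shape[0]
    AX = A_inv @ X.T
    S = np.eye(n) - X @ AX
    return A_inv + AX @ np.linalg.solve(S, AX.T)
```

Each call costs on the order of $M^2 n + n^3$ operations, which is cheaper than a fresh $M \times M$ inversion when the block is small compared with the feature dimension.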

[1] : -matrix, : -vector, : a set of -matrices, : a set of -vectors, , : sampled new label SampleLabel to compute Refer to (2) , to Refer to (8) compute Refer to (4)

Algorithm 1 Label Sampling

We compute matrices and vectors and pass the result to Algorithm 1. For each , is initialized to the prior of in Line 3. and are updated in Lines 6-7, then the probability of is computed in Line 9. Next, becomes the conditional probability of (1). Finally, a is sampled from a categorical distribution and is returned.
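The label-sampling step can be sketched as follows, under the standard Bayesian linear regression assumption that the predictive distribution of a block of targets is Gaussian with mean $\mathbf{X}\mathbf{A}^{-1}\mathbf{b}$ and covariance $\sigma^2(\mathbf{I} + \mathbf{X}\mathbf{A}^{-1}\mathbf{X}^\top)$. The names `log_predictive` and `sample_label`, and the argument layout, are illustrative and not the paper's Algorithm 1 line by line.

```python
import numpy as np

def log_predictive(A_inv, b, X, y, sigma2):
    """Log density of targets y for features X under one cluster's posterior."""
    mean = X @ (A_inv @ b)
    cov = sigma2 * (np.eye(len(y)) + X @ A_inv @ X.T)
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet
                   + len(y) * np.log(2 * np.pi))

def sample_label(X, y, A_invs, bs, prior, sigma2, rng):
    """Combine prior and predictive likelihood, then draw a cluster label."""
    log_p = np.log(prior) + np.array(
        [log_predictive(A_inv, b, X, y, sigma2) for A_inv, b in zip(A_invs, bs)])
    p = np.exp(log_p - log_p.max())      # subtract the max for numerical stability
    p /= p.sum()
    return rng.choice(len(p), p=p), p
```

Working in log space avoids underflow when an agent-entity pair has many events, since the likelihoods of poorly matching clusters become extremely small.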

4.3 Training Algorithm

Using Algorithm 1, we can easily implement the training algorithm of HLCR in Algorithm 2. The algorithm is started by randomly initializing all latent variables s ( and ) in Lines 1-3, then and are initialized based on these latent variables in Lines 5-6. In each iteration , for each agent and each entity , we remove and from and corresponding to the current in Lines 10-12, then sample a new latent variables with function SampleLabel in Algorithm 1 in Line 13, and finally add and to and corresponding to this new in Lines 14-16.

[1] : -matrix, : -vector, , HLCR, , , , , , to to randomly initialize to , to to to to Remove data from current cluster (9) SampleLabel, , , , , , , , Algorithm 1 to Add data to new cluster (8)

Algorithm 2 Hierarchical Latent Class Regression (HLCR)
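The overall remove, sample, add loop can be sketched compactly. This is an illustrative simplification, not the paper's Algorithm 2: it inverts each cluster matrix directly instead of using the recursive updates, fixes the noise variance, assumes a count-based two-level Dirichlet prior, and uses hypothetical names (`train_hlcr`, `lam`, `sigma2`).

```python
import numpy as np

def log_pred(A, b, X, y, sigma2):
    """Gaussian predictive log density up to a constant shared by all clusters."""
    A_inv = np.linalg.inv(A)
    d = y - X @ (A_inv @ b)
    cov = sigma2 * (np.eye(len(y)) + X @ A_inv @ X.T)
    return -0.5 * (d @ np.linalg.solve(cov, d) + np.linalg.slogdet(cov)[1])

def train_hlcr(data, K, M, alpha=1.0, beta=1.0, lam=1.0, sigma2=1.0,
               n_iter=20, seed=0):
    """data: dict mapping (agent, entity) -> (X, y). Returns labels and statistics."""
    rng = np.random.default_rng(seed)
    A = np.stack([lam * np.eye(M) for _ in range(K)])   # per-cluster lam*I + X^T X
    b = np.zeros((K, M))                                # per-cluster X^T y
    n_ik = {i: np.zeros(K) for i in sorted({i for i, _ in data})}
    z = {}
    for (i, j), (X, y) in data.items():                 # random initialisation
        k = rng.integers(K)
        z[i, j] = k
        A[k] += X.T @ X; b[k] += X.T @ y; n_ik[i][k] += 1
    for _ in range(n_iter):
        for (i, j), (X, y) in data.items():
            k = z[i, j]                                 # remove pair from its cluster
            A[k] -= X.T @ X; b[k] -= X.T @ y; n_ik[i][k] -= 1
            n_k = sum(n_ik.values())                    # global counts per cluster
            g = (n_k + beta / K) / (n_k.sum() + beta)
            log_p = np.log(n_ik[i] + alpha * g) + np.array(
                [log_pred(A[c], b[c], X, y, sigma2) for c in range(K)])
            p = np.exp(log_p - log_p.max()); p /= p.sum()
            k = rng.choice(K, p=p)                      # sample the new label
            z[i, j] = k                                 # add pair to the new cluster
            A[k] += X.T @ X; b[k] += X.T @ y; n_ik[i][k] += 1
    return z, A, b
```

Because all events of an agent-entity pair move together, one sweep touches each pair once regardless of how many events it holds.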

4.4 Prediction

After training the data with Algorithm 2, we obtain cluster label s for all agents and entities , , and for all . Suppose there are data for a particular agent and entity . If we get a new feature , its target can be predicted with HLCR from the variables obtained from the algorithm. HLCR predicts the target value in two steps. 1) Select a proper cluster label for based on previous data and with Algorithm 1. In fact, if there have been previous data of and in the training set, we can directly use the corresponding latent variable without sampling it again. Then 2) using the variables corresponding to , i.e., , and , we can predict the target value as follows.


For any particular and , let and be all data, be the cluster label corresponding to and , and , and be the variables trained by Algorithm 2. Then, given a new feature , the conditional expected value of the target is


This theorem indicates that the prediction result is the same as the solution of the Ridge regression.
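The connection to Ridge regression can be checked numerically with a small sketch. It assumes that the cluster statistics are $\mathbf{A} = \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ and $\mathbf{b} = \mathbf{X}^\top\mathbf{y}$, with $\lambda$ standing for the ratio of noise variance to prior variance; the names `predict` and `ridge_via_augmentation` are illustrative.

```python
import numpy as np

def predict(A_inv, b, x_new):
    """Posterior-mean prediction for one new feature vector."""
    return x_new @ (A_inv @ b)

def ridge_via_augmentation(X, y, lam):
    """Ridge solution obtained independently, as least squares on augmented data."""
    M = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(M)])
    y_aug = np.concatenate([y, np.zeros(M)])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
```

Comparing `predict(A_inv, b, x_new)` with `x_new @ ridge_via_augmentation(X, y, lam)` on the data of a single cluster should give the same value up to floating-point error.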


HLCR is a mixture of linear regressions. Since it is more expressive than a single linear regression, it can be used to efficiently predict noisy data with various structures. The mechanism of HLCR is different from that of general regression models, such as Deep Neural Networks (DNN). A regression model is generally a function that maps a feature to a target, and hence, given the same feature, the predicted target value is unique. However, in HLCR, different predictions can be made for identical input features if they originate from agent-entity pairs that belong to different clusters (models). This property makes HLCR fit hierarchical data better than general regression models do.

[1] : -matrix, : -vector, , , : learning rate FedHLCR server initialize and to server initialize and agent in parallel receive and from server initialize and to SampleLabel, , , , , , , , Algorithm 1 , send and to server server , server , server , , for

Algorithm 3 Federated Hierarchical Latent Class Regression (FedHLCR)

5 Federated Hierarchical Latent Class Regression

Federated Hierarchical Latent Class Regression (FedHLCR) is shown in Algorithm 3. In the beginning, the server initializes and without any data in Line 2. At the beginning of each iteration, each agent receives the global model and trained in the previous iteration from the server (Line 6), trains on its local data in parallel based on the global model (Lines 7-10), and then sends the training results to the server (Line 11). Finally, the server accumulates the local training results received from agents into the global model, and , at the end of the iteration (Line 12). In Federated Learning, not all agents participate in the training process in each iteration, and there is no guarantee that the agent set stays the same from one iteration to the next (Line 5). In order to ensure the convergence of the algorithm, we smoothly update the model by using a learning rate in Line 16.

There are several differences between FedHLCR in Algorithm 3 and centralized HLCR in Algorithm 2. First, HLCR updates intermediate data and whenever is sampled, while each agent in FedHLCR trains its data and () independently and the server only updates and once in each iteration. This process saves communication costs and efficiently protects privacy. Second, the data for each entity in each agent in HLCR are trained once in each iteration, and and are updated in summation, while only a subset of agents in FedHLCR participate in the training process and the server updates the model in a smooth way (Line 16). This approach makes FedHLCR converge smoothly. Finally, HLCR removes and from and before sampling for and , while FedHLCR does not remove them before sampling, since the weight of each record decreases after several update steps (Line 16), which simplifies our algorithm.
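One plausible reading of the smooth server update is an exponential moving average of the cluster statistics. The sketch below makes that assumption explicitly: the blending rule, the function name `server_update`, and the representation of local results as `(A, b)` pairs are hypothetical and are not taken from Algorithm 3.

```python
import numpy as np

def server_update(A, b, local_stats, eta):
    """Blend aggregated local statistics into the global model with rate eta.

    A, b:        current global statistics of one cluster
    local_stats: list of (A_local, b_local) pairs reported by the sampled agents
    eta:         learning rate in [0, 1]; eta = 0 keeps the old model unchanged
    """
    A_sum = sum(s[0] for s in local_stats)
    b_sum = sum(s[1] for s in local_stats)
    # assumed moving-average rule: old contributions decay geometrically over rounds
    return (1 - eta) * A + eta * A_sum, (1 - eta) * b + eta * b_sum
```

Under this rule the weight of a record reported several rounds ago shrinks by a factor of `1 - eta` per round, which is consistent with the remark that old data need not be removed explicitly.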

6 Experiments

We use synthetic and real data in order to systematically test the FedHLCR algorithm performance on heterogeneous datasets. All experiments are performed on a commodity machine with two Intel®Xeon®CPUs E5-2690 v3 @ 2.60GHz and 64GB of memory.

Synthetic data

The synthetic dataset SynthHLCR is generated according to the model in Figure 2. We control the average number of entities per agent by and the average number of events per entity by . We generate a series of such datasets with different noise level , cluster number , and average category number per agent , and the dataset is generated using 128 agents and 128 entities for various choices of and . We then perform a 5-fold cross-validation on the generated data and train the FedHLCR algorithm with the same parameter choice as for the generation process. We observe that FedHLCR converges in fewer than 10 iterations. The results reported in Figure 3 (a) show that the converged model at high noise is optimal when agents have 20 datapoints on average. For , , , and high , the dataset has a signal-to-noise ratio too low to correctly characterize the model, leading to agent-entity misclustering. When is small, we observe a deviation from the optimal solution, which is caused by the relatively large prior .

Federated simulation

One aspect of Federated Learning is the failure of agents to report on the one hand, and the deliberate sampling of agents by the orchestrator on the other. We simulate this by randomly sampling a certain ratio of agents among all agents in each iteration. In Figure 3 (b) and (c), we show the accuracy (MSE) of FedHLCR for various fractions of selected agents per iteration and choices of learning rate . Smaller numbers of selected agents cause larger variations in training data distribution in each iteration, thereby causing greater instability and worse performance; therefore, choosing a proper is necessary to ensure convergence. Figure 3 (b) shows the convergence processes with respect to different s. Here, we observe smooth convergence of the algorithm when . When , we obtain a faster convergence. If is increased to 0.5, the accuracy increases faster in the beginning, but convergence is no longer guaranteed, and if becomes even bigger (0.75), the accuracy is highly unstable. Figure 3 (c) illustrates the relationship between and accuracy with respect to different fractions of selected agents. It shows that, for each fraction, there exists an optimal value of for which a good performance can be achieved in a limited number of iterations.
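The per-iteration agent sampling used in such a simulation can be sketched as follows. The function name `sample_agents` and the rounding rule are assumptions of this example.

```python
import numpy as np

def sample_agents(n_agents, fraction, rng):
    """Pick a random subset of agent indices for one federated iteration."""
    n_sel = max(1, int(round(fraction * n_agents)))   # at least one agent per round
    return rng.choice(n_agents, size=n_sel, replace=False)

selected = sample_agents(128, 0.15, np.random.default_rng(0))
```

Drawing a fresh subset in every iteration means that the agent set changes between rounds, which is the source of the instability that the learning rate has to damp.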

Figure 3: Accuracy (MSE) of FedHLCR on the SynthHLCR dataset. (a) MSE w.r.t. different data size and noise. (b) MSE with 15% of agents sampled at each iteration. (c) MSE w.r.t. different .

Real data

We set up a character-recognition experiment on the FEMNIST dataset [Caldasetal18, Cohenetal17] as packaged in TensorFlow Federated [tff]. This dataset stores images of handwritten characters together with an identifier of the writer (agent). Let us consider building a character recognition system using only a fraction of writers among the FEMNIST dataset. We use a Convolutional Neural Network (CNN) taken from [McMahan2017], which we train on this subset of writers. (The CNN has two 5x5 convolution layers of 32 and 64 channels, each followed by 2x2 max pooling, a fully connected ReLU layer with a size of 512, and a final softmax output layer with a size of either 10, 36 or 62 for FEMNISTdigits, FEMNISTnocaps, and FEMNISTfull, respectively.) The problem we try to solve is to use the pre-trained CNN for separating digits from letters in the whole FEMNIST dataset. Binary classification can in practice be tackled using regression on a {0, 1} target. First, for each subset with a different fraction ($F$), we train a CNN model on the subset. Then, we consider the following cases: 1) Baseline (CNN-only): The argmax of the vector output by the CNN is used to predict the character and decides on the task (letter vs. digit); 2) LR: A Linear Regression is added after the CNN to perform the binary classification task; and 3) HLCR: An HLCR with a given cluster number is added after the CNN to perform the binary classification task. Here, we run HLCR with writers as agents, using two different cluster numbers and a single entity per agent. Finally, we compare the results of the two HLCR settings to those of “Baseline” and “LR”. The results are reported in Table 2. We see that the CNN training time on our CPU-only commodity machine increases linearly with the number of selected agents (around 30 minutes for 1% of agents for 100 iterations), while training the HLCR model after the CNN takes less than six minutes on the whole dataset.
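Since the regression output on a {0, 1} target is a real-valued score, the AUC can be computed directly from its ranking. The sketch below uses the pairwise (Mann-Whitney) definition; the function name `auc` is illustrative and the quadratic-memory implementation is meant for small examples only.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve from real-valued scores and {0, 1} labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # fraction of (positive, negative) pairs ranked correctly; ties count one half
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

value = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this small example three of the four positive-negative pairs are ordered correctly, so `value` is 0.75.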

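| F (%) | AUC (%): Baseline | AUC (%): LR | AUC (%): HLCR (setting 1) | AUC (%): HLCR (setting 2) | Time (min.): Baseline | Time (min.): LR | Time (min.): HLCR (setting 1) | Time (min.): HLCR (setting 2) |
|---|---|---|---|---|---|---|---|---|
| 1 | 93.55 | 93.60 | 96.47 | 96.51 | 30 | 30.5 | 32.1 | 35.1 |
| 5 | 94.55 | 94.65 | 97.16 | 97.16 | 168 | 168.5 | 170.7 | 173.0 |
| 10 | 94.70 | 94.77 | 97.21 | 97.25 | 334 | 334.5 | 336.7 | 339.9 |
| 20 | 94.74 | 94.77 | 97.17 | 97.20 | 773 | 773.5 | 776.0 | 778.0 |

Table 2: AUC on Classification Tasks on FEMNIST. F (%) denotes the fraction of data used in the training of the CNN. The Computation Time is the elapsed time on our CPU-only commodity machine. CNN training is limited to 100 FL rounds using FedAvg and 5 local epochs.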

7 Conclusion

In this paper, we proposed a novel probabilistic model to deal with noisy, hierarchical and tabular datasets, named HLCR. By applying the Collapsed Gibbs Sampling technique, we efficiently inferred an HLCR model and theoretically guaranteed its convergence. Furthermore, we provided an HLCR algorithm in Federated Learning, called FedHLCR, for preserving the privacy of agents. Finally, the experimental results showed that the algorithm offers both fast convergence and good robustness to overfitting even in non-IID datasets. The immediate future work is to extend the model of each cluster in FedHLCR to a nonlinear model.

Broader Impact

We consider the regression problem for the data held by each agent in Federated Learning. We assume that each agent contains data (events) belonging to different entities, and such agent-entity-event hierarchical data widely exist in various realistic application scenarios, such as smartphones, IoT devices and medical data. In order to analyze and make predictions using such hierarchical data, we propose a Federated Hierarchical Latent Class Regression (FedHLCR) model, which is a mixture of linear regressions and therefore has richer expressiveness than simple linear regression. With its hierarchical mixture approach, FedHLCR can handle hierarchical data fairly well and, given its similar complexity compared to complex models, such as Deep Neural Networks, it can be more efficiently trained by Collapsed Gibbs Sampling. We expect that it will be widely used in various applications. On the other hand, each agent in FedHLCR sends its local training result to the server. This information may cause a privacy risk when an agent contains only very few data. We suggest that FedHLCR applications protect personal information by introducing other privacy-enhancing technologies, such as differential privacy, into FedHLCR.


8 Appendix

8.1 Proofs

Proof (Proof of Theorem 4.2)

Although s here are NOT independent, generally we have


Let and denote -matrix and -vector . Then, we have and . With Bayes’ theorem,




By marginalizing out, we can get