CoFED: Cross-silo Heterogeneous Federated Multi-task Learning via Co-training

by Xingjian Cao, et al., NetEase, Inc.

Federated Learning (FL) is a machine learning technique that enables participants to train high-quality models collaboratively without exchanging their private data. Participants in cross-silo FL settings are independent organizations with different task needs; they are concerned not only with data privacy but also with independently training their unique models, owing to intellectual property considerations. Most existing FL schemes are incapable of handling such scenarios. In this paper, we propose CoFED, a communication-efficient FL scheme based on pseudo-labeling unlabeled data in the manner of co-training. To the best of our knowledge, it is the first FL scheme simultaneously compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training algorithms. Experimental results show that CoFED achieves better performance at a lower communication cost. In particular, for non-IID settings and heterogeneous models, the proposed method improves the performance by 35%.



1 Introduction

Federated Learning (FL) enables multiple participants to collaborate in solving a machine learning problem, under the coordination of a center, without sharing the participants' private data[14]. The main purpose of FL is to improve model quality by leveraging multi-party knowledge from the participants' private data without disclosing the data itself.

FL was originally proposed for training a machine learning model across a large number of users' mobile devices without collecting their private data in a data center[17], [18]. In this setting, a center orchestrates edge devices to train a global model that serves a global task. However, new FL settings have emerged in many fields, including medicine[35], [12], [31], [5], finance[29], [15], [42], and network security[28], [30], where the participants are likely to be companies or organizations. Generally, the terms Cross-Device Federated Learning (CD-FL) and Cross-Silo Federated Learning (CS-FL) refer to the above two FL settings[14]. However, most of these works contribute to training reward mechanisms[34], topology design[24], data protection optimization[44], etc., while for the core CS-FL algorithm they simply follow the gradient-aggregation idea used in CD-FL. They ignore the fact that organizations or companies as participants may be more heterogeneous than device participants.

One of the most important heterogeneities to address in the CS-FL setting is model heterogeneity; that is, models may have different architectures. Different devices in CD-FL typically share the same model architecture, which is given by the center. As a result, the models obtained through local training on different devices differ only in their parameters, allowing the center to directly aggregate the gradients or model parameters uploaded by the devices. However, participants in the CS-FL setting are usually independent companies or organizations that are capable of designing unique models. They prefer to use their own independently designed model architectures rather than sharing the same architecture with others. In this case, the strategy used in CD-FL cannot be applied to CS-FL models with different architectures. Furthermore, model architectures may themselves be intellectual property that needs to be protected, and the companies or organizations that own them do not want them exposed to anyone else, which makes sharing model architectures hard for CS-FL participants to accept. An ideal CS-FL scheme should treat each participant's model as a black box, requiring neither its parameters nor its architecture.

Model heterogeneity, in turn, creates the need for training heterogeneity. In the CD-FL setting, participants usually train their models according to the configuration of a center, which may include the training algorithm (such as SGD) and parameter settings (such as the learning rate and mini-batch size). However, when participants such as companies or organizations use independently designed model architectures in CS-FL, they need to choose local training algorithms suitable for their models and retain full control over their local training processes. Decoupling the local training of different participants not only enables each of them to choose a suitable algorithm for its model but also avoids leaking their training strategies, which may also be intellectual property.

In addition, CS-FL is more likely than CD-FL to face task heterogeneity. In CD-FL settings, all devices usually share the same target task; in terms of classification, the task output categories of all devices are exactly the same. In the CS-FL setting, because the participating companies or organizations are independent of each other and have different business needs, their tasks may differ, although we still assume that the tasks share some similarity. In terms of classification, the task output categories of different CS-FL participants may differ. For example, an autonomous driving technology company and a home video surveillance system company both need to complete their own machine learning classification tasks. Although both need to recognize pedestrians, the former must also recognize vehicles, whereas the latter must recognize indoor fires. Therefore, the output categories of the autonomous driving company include pedestrians and vehicles but not indoor fires, whereas those of the home video surveillance company include pedestrians and indoor fires but not vehicles. In contrast to the complete consistency of tasks in CD-FL, CS-FL with independent companies or organizations is thus more likely to involve heterogeneous tasks across participants.

(a) Cross-silo FL setting without heterogeneity
(b) Cross-silo FL setting with heterogeneity (HFMTL)
Figure 1: (b) is an example of the HFMTL setting with 3 participants. A central server coordinates the 3 participants to solve their classification tasks via federated training. Compared with (a), which has no heterogeneity, the participants in (b) also have different machine learning model architectures and training optimization algorithms, and their tasks differ from each other (i.e., they have different label spaces).

Although the number of participants in CS-FL is generally much smaller than in CD-FL, the heterogeneity of participants, including model heterogeneity, training heterogeneity, and task heterogeneity, may bring more challenges. Overall, we use the term Heterogeneous Federated Multi-task Learning (HFMTL) to refer to FL settings that contain the above three heterogeneity requirements, and Fig. 1(b) shows an example of the HFMTL setting with three participants. Different participants may have different label spaces: for example, Participant 1 may have a label space $Y_1$ that differs from that of Participant 2, $Y_2$. The classification task of Participant 1 is to classify inputs into the labels in $Y_1$, while that of Participant 2 is to classify inputs into $Y_2$. So they have different tasks and usually have different categories of local training data.

Our main motivation in this paper is to address the needs of model heterogeneity, training heterogeneity, and task heterogeneity in CS-FL settings. Aiming at the HFMTL scenario with these heterogeneities, we propose a novel federated learning scheme. The main contributions of this paper are as follows.

  • We propose an FL scheme simultaneously compatible with heterogeneous models, heterogeneous training, and heterogeneous tasks, which significantly improves the model performance of each participant.

  • Compared with existing FL schemes, the proposed scheme protects not only the private local data but also the architecture of the private model and the private local training method.

  • Compared with existing FL schemes, the proposed scheme achieves a larger performance improvement in non-IID data settings, and it can be completed in a single communication round.

  • We conduct extensive experiments to confirm the theoretical analysis and to study the impact of different settings on the proposed scheme.

The remainder of this paper is organized as follows. We briefly review related work in Section 2 and present the preliminaries in Section 3. In Section 4, we describe the proposed scheme in detail. Section 5 presents the experimental results, and Section 6 summarizes the paper.

2 Related Work

In 2017, McMahan et al.[25] proposed the Federated Averaging (FedAvg) algorithm to solve the federated optimization problem on edge devices. In the FedAvg scheme, a central server orchestrates the training process by repeating a broadcasting phase, a local training phase, and a global aggregation phase until training stops.

  1. Broadcasting: The central server broadcasts the global model to the selected participants (edge devices).

  2. Local training: Each selected participant locally updates the model by running stochastic gradient descent (SGD) on its local data.

  3. Global aggregation: The central server collects the participant updates and updates the global model by averaging them.
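The aggregation step above can be sketched in a few lines; the following is our own minimal illustration (not the paper's code), assuming each participant's update is a NumPy parameter vector and participants are weighted by local dataset size, as in FedAvg:

```python
import numpy as np

def fedavg_aggregate(updates, sizes):
    """Size-weighted average of participant parameter vectors (FedAvg-style).

    updates: list of 1-D NumPy arrays, one per participant.
    sizes:   list of local dataset sizes, used as aggregation weights.
    """
    total = float(sum(sizes))
    weights = [n / total for n in sizes]
    # Global update = size-weighted average of local updates.
    return sum(w * u for w, u in zip(weights, updates))

# Toy example: two participants with unequal data sizes.
u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])
print(fedavg_aggregate([u1, u2], sizes=[30, 10]))  # [0.75 0.25]
```

Note that this averaging only makes sense when all participants share one model architecture, which is exactly the assumption CS-FL breaks.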

This process template, which encompasses FedAvg and many of its variants by Wang et al.[37] and Li et al.[22], works well for many CD-FL settings, where all participants usually serve a unified task and model designed by the center, and the center finely controls the local training options (e.g., the learning rate, number of epochs, and mini-batch size). However, these schemes are not compatible with the heterogeneous tasks, models, or training of different participants. For example, FedProx by Li et al.[22] introduces proximal terms to improve FedAvg in the face of system heterogeneity (i.e., many stragglers) and statistical heterogeneity, but it still inherits FedAvg's parameter aggregation manner and is therefore incompatible with heterogeneous models, especially models with different architectures.

In recent years, many efforts have been made to tackle the personalized tasks of participants in the FL setting. Wang et al.[36] and Jiang et al.[13] use the FL-trained model as a pretrained or meta-learning model and fine-tune it to learn the local task of each participant. Arivazhagan et al.[1] use personalized layers to specialize the FL-trained global model to the personalized tasks of participants. Recently, Hanzely et al.[10] and Deng et al.[7] modified the original FedAvg: instead of aggressively averaging all model parameters, they find that merely steering the local parameters of all participants toward their average helps each participant train its personalized model. Besides, Smith et al.[32] propose a federated multi-task learning framework, MOCHA, which clusters tasks based on their relationships via an estimated matrix.

It should be pointed out that personalized tasks differ from heterogeneous tasks: tasks in personalized settings always have the same label space, whereas they have different label spaces in heterogeneous settings. A typical example of personalized tasks, given in [32], is classifying users' activities from their mobile phones' accelerometer and gyroscope data. For each user, the model outputs from the same range of activity categories, but classification for each user is regarded as a personalized task owing to differences in heights, weights, and personal habits. Despite this difference, from the perspective of data distribution, the data of heterogeneous classification tasks can be regarded as non-independent and identically distributed (non-IID) samples from an input space $X$ containing all instances of all labels in a label space $Y$, where $Y$ is the union of the label spaces of all participants. Each participant has only instances of some categories in $Y$, since its label space is a subset of $Y$. Therefore, an FL scheme compatible with personalized tasks can also be used for heterogeneous tasks.

However, all these schemes require participants to upload their model parameters for global aggregation, which may leak participant models. Recently, several secure computing techniques have been introduced to protect data and model parameters in FL, including differential privacy[40], [11], [41], secure multi-party computation[43], [23], [47], homomorphic encryption[8], [33], and trusted execution environments[26], [4], but these schemes still have disadvantages, such as significant communication or computational costs or reliance on specific hardware.

Another limitation of these FL schemes for personalized tasks is that the architecture of each participant's model must be consistent, because model parameters usually need to be aggregated or aligned during the FL training process. This prevents participants from independently designing unique model architectures. To address this issue, FedMD[20], proposed by Li et al., leverages knowledge distillation to transfer the knowledge of each participant's local data by aligning the logits of different neural networks on a public dataset. The participant models in FedMD can be neural networks with different architectures, except that their logit output layers must be consistent. However, FedMD requires a large amount of labeled data as a public dataset for model alignment. Collecting labeled data is often much more difficult than collecting unlabeled data, and open large-scale labeled datasets are rare, which limits the application scenarios of FedMD. Similarly, Guha et al.[9] use distillation to generate a global ensemble model, which requires sharing the participants' local models or their distilled versions. This exposes the participants' sensitive data or local model information to the central server, making the method unsuitable for participants who want to avoid leaking their own models in CS-FL.

3 Preliminaries

We first formulate the HFMTL problem and then introduce the co-training method that inspired the proposed CoFED scheme.

3.1 HFMTL

Consider an FL setting with $N$ participants, each of which has its own classification task $T_i$, input space $X_i$, output space $Y_i$, and data distribution $\mathcal{D}_i$ consisting of all valid pairs $(x, y)$, where $x \in X_i$ and $y \in Y_i$. Since $T_i$ is a classification task, $Y_i$ is the label space of $T_i$. Each participant expects to use supervised learning to train a machine learning model $M_i$ for its classification task, so each participant has its own local labeled dataset $D_i$.
Since the HFMTL setting contains heterogeneous tasks and heterogeneous models, we can assume that for general $i \neq j$:

$$Y_i \neq Y_j, \qquad M_i \neq M_j.$$
On the other hand, the classification task of each participant should have commonality with the tasks of other participants. In our setting, this commonality manifests as the overlap of label spaces; that is:

$$\forall i, \ \exists j \neq i: \quad Y_i \cap Y_j \neq \emptyset.$$
Assuming that a model $M_i$ has been trained for $T_i$, its generalization accuracy can be defined as:

$$\mathrm{acc}(M_i) = \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\big[\mathbb{I}(M_i(x) = y)\big],$$

where $\mathbb{E}$ is the expectation and $\mathbb{I}$ is the indicator function.

The goal of this paper is to propose an HFMTL scheme compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training. The scheme should help improve the model performance of each participant without sharing its local private dataset $D_i$ or private model $M_i$:

$$\max \ \mathrm{acc}(M_i), \quad i = 1, \ldots, N.$$
Besides, assuming that the model locally trained by participant $i$ alone is $M_i^{\mathrm{local}}$, we expect the model $M_i^{\mathrm{fed}}$ trained with our FL scheme to perform better for every participant:

$$\mathrm{acc}(M_i^{\mathrm{fed}}) > \mathrm{acc}(M_i^{\mathrm{local}}), \quad \forall i.$$
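In practice, the generalization accuracy defined above is estimated empirically on held-out data; the following minimal sketch (function and variable names are ours) makes the expectation concrete as a sample mean of the indicator function:

```python
import numpy as np

def empirical_accuracy(model, X_test, y_test):
    """Monte-Carlo estimate of generalization accuracy:
    the fraction of held-out pairs (x, y) with model(x) == y."""
    predictions = np.array([model(x) for x in X_test])
    return float(np.mean(predictions == np.array(y_test)))

# Toy example: a constant classifier on a 4-sample test set.
always_zero = lambda x: 0
print(empirical_accuracy(always_zero, [1, 2, 3, 4], [0, 0, 1, 0]))  # 0.75
```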
3.2 Co-training

In HFMTL settings, many tricky problems stem from the differences between participants' models and tasks. However, co-training is an effective method for training different models, so ideas similar to co-training can be used to solve problems in HFMTL settings.

Co-training, originally proposed by Blum et al.[2], is a semi-supervised learning technique that exploits unlabeled data with multiple views, where a view refers to a subset of the attributes of an instance. It trains two different classifiers on two views; both classifiers then pseudo-label some unlabeled instances for each other, and each classifier is retrained on its original training set plus the set pseudo-labeled by the other. Co-training repeats these steps to improve the classifiers until it stops. Blum et al. prove that when the two views are sufficient (i.e., each view contains enough information for an optimal classifier), redundant, and conditionally independent, co-training can boost weak classifiers to arbitrarily high accuracy by exploiting unlabeled data. Wang et al.[39] prove that if the two views have large diversity, co-training suffers little from label noise and sampling bias, and can output an approximation of the optimal classifier by exploiting unlabeled data even with insufficient views.

For the single-view setting, Zhou et al.[46] propose Tri-training, which uses three classifiers with large diversity to vote for the most confident pseudo-labels of unlabeled data. Furthermore, Wang et al.[38] prove that classifiers with large diversity can also improve model performance in a single-view setting. They also point out why the performance of classifiers in co-training saturates after some rounds: the classifiers learn from each other and gradually become similar, until their differences are insufficient to support further improvement.

In HFMTL settings, different participants usually have insufficient local data that are one-sided with respect to the overall distribution, especially under non-IID data settings. Therefore, the models trained on the local data of different participants may have large diversity, which is what drives co-training.

4 Methodology

4.1 The Overall Steps of CoFED

The main incentive of FL is improving the performance of participant models. Poor performance of a locally trained model usually stems from insufficient local data: the model fails to learn enough task knowledge from the local dataset alone. Hence, to improve the participant models, they must be able to acquire knowledge from other participants. The most common and direct ways to share knowledge are sharing data or models, but both are prohibited in our FL setting, so we need another way.

Research related to co-training shows that improving classification models requires sufficiently large diversity between them[38]. Generally, it is tricky to generate models with large divergence in single-view settings. In FL settings, however, participants' models may have very different architectures and are trained on unique local datasets that are very likely distributed differently, all of which creates diversity between the models of different participants. Therefore, given enough unlabeled data, we can regard the federated classification problem as a semi-supervised learning problem that is well suited to co-training-like techniques, thanks to the large diversity between participant models. The overall steps of CoFED are as follows.

  1. Local training: Each participant independently trains its model on its local dataset.

  2. Pseudo-labeling: Each participant pseudo-labels an unlabeled dataset, which is public to all participants, with its locally trained model.

  3. Pseudo-label aggregating: Each participant uploads its pseudo-labeling results to a central server, and the center votes for high-confidence pseudo-labels for each category, based on the category overlap between participants and the uploaded results. The center then sends the relevant pseudo-labels to each participant.

  4. Update training: Each participant updates its local model by training it on the dataset merged from its local data and the received pseudo-labeled data.
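The four steps above can be sketched end to end as follows. This is our own schematic rendering, not CoFED's reference implementation: each local model is treated as an opaque object with hypothetical `fit`/`predict` methods (matching the black-box requirement), and the center's voting rule is passed in as an `aggregate` callable:

```python
def cofed_round(participants, public_unlabeled, aggregate):
    """One CoFED round. Each participant is assumed to expose:
       - fit(X, y): train on labeled data with any local algorithm
       - predict(X): label instances with the current model
       plus attributes X_local, y_local holding its private data."""
    # 1. Local training on private data only.
    for p in participants:
        p.fit(p.X_local, p.y_local)
    # 2. Pseudo-labeling the shared public unlabeled set.
    all_pseudo = [p.predict(public_unlabeled) for p in participants]
    # 3. Pseudo-label aggregating at the center (e.g. confidence voting);
    #    `aggregate` returns {public_index: agreed_label}.
    agreed = aggregate(all_pseudo)
    idx = list(agreed)
    X_new = [public_unlabeled[j] for j in idx]
    y_new = [agreed[j] for j in idx]
    # 4. Update training on local data merged with pseudo-labeled data.
    for p in participants:
        p.fit(list(p.X_local) + X_new, list(p.y_local) + y_new)
```

Because the center only ever sees pseudo-labels, each participant's architecture and training algorithm stay private, and the whole exchange fits in a single round.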

There are some differences between CoFED and single-view co-training. First, in co-training the training sets used by different classifiers are the same, while in CoFED they come from different participants and usually differ. Second, the target tasks of different classifiers in co-training are the same, that is, they share a label space; in CoFED, the label spaces of different classifiers differ, and unlabeled samples must be pseudo-labeled according to the aggregation results over overlapping categories. Furthermore, co-training completes training by repeating the above process, whereas CoFED performs it only once.

4.2 Analysis

Given two binary classification models $h_1$ and $h_2$ from a hypothesis space $H$, and the oracle model $h^*$ whose generalization error is zero, we can define the generalization disagreement between $h_1$ and $h_2$ as:

$$d(h_1, h_2) = \Pr_{x \sim \mathcal{D}}\big[h_1(x) \neq h_2(x)\big].$$

The generalization errors of $h_1$ and $h_2$ can then be computed as $d(h_1, h^*)$ and $d(h_2, h^*)$, respectively. Let $\epsilon$ bound the generalization error of a model and $\delta$ bound the failure probability; a learning process generates an approximate model $h$ of $h^*$ with respect to $\epsilon$ and $\delta$ if and only if:

$$\Pr\big[d(h, h^*) \geq \epsilon\big] \leq \delta.$$

Since we usually have only a training dataset $D$ containing finitely many samples, the training process minimizes the empirical disagreement over $D$:

$$\hat{d}_D(h, h^*) = \frac{1}{|D|} \sum_{(x, y) \in D} \mathbb{I}\big(h(x) \neq y\big).$$
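The disagreement quantity above is straightforward to estimate on a finite sample; a minimal sketch with our own names:

```python
def empirical_disagreement(h1, h2, X):
    """Fraction of inputs on which two classifiers disagree.
    With h2 set to the zero-error oracle, this is h1's empirical error."""
    return sum(h1(x) != h2(x) for x in X) / len(X)

# Toy example: a parity classifier vs. a constant classifier.
parity = lambda x: x % 2
always_one = lambda x: 1
print(empirical_disagreement(parity, always_one, [0, 1, 2, 3]))  # 0.5
```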
Theorem 1. Suppose $h_1$ is a Probably Approximately Correct (PAC) learnable model trained from a dataset $D_1$, and $h_2$ is a PAC model trained from a dataset $D_2$, such that $h_1$ and $h_2$ satisfy:


If we use $h_1$ to pseudo-label an unlabeled dataset $U$, generating a pseudo-labeled dataset:


and combine $D_2$ with this pseudo-labeled dataset into a new training dataset $D_2'$, then $h_2'$ is trained from $D_2'$ by minimizing the empirical risk. Moreover, if


where $e$ is the base of the natural logarithm, then


Theorem 1 is proven in [38]. Assume $h_1$ and $h_2$ are two models from different participants satisfying (9) and (10). The right side of (12) is monotonically increasing in the size of the unlabeled dataset, which indicates that a larger pseudo-labeled dataset admits a bigger upper bound. That is, if a model is trained on a larger training dataset and has higher generalization accuracy, a larger unlabeled dataset is required to further improve its generalization accuracy.

From (13) and (14), we find that when the disagreement term is bigger, the bound on the generalization error of the updated model is lower at the same confidence level. Since the updated model is trained on its original dataset together with the data pseudo-labeled by the other model, this term mainly depends on how much the two models diverge, or how different their training datasets are. Large diversity between training datasets is very common in FL settings, so the condition for performance improvement is usually met. By symmetry, swapping $h_1$ and $h_2$, the same conclusion applies to the improved version of the other model.

On the other hand, if the unlabeled dataset is big enough, the model trained on the pseudo-labeled data can be treated as an approximation of the model that generated the pseudo-labels. That is, if we repeated the above process, the two models might become too similar to improve further. Therefore, we use a large unlabeled dataset and perform the process only once instead of iterating over multiple rounds, which also avoids the increase in computational cost caused by repeated training.

An intuitive explanation of the CoFED scheme is that when the models of different participants have great diversity, the difference in the knowledge they possess is greater, so the knowledge each participant model acquires from the others contains more knowledge that is new to it, bringing more significant performance improvements. Conversely, once the mutual learning between models is sufficient, the diversity in their knowledge almost disappears, and mutual learning can hardly provide new knowledge to any participant.

4.3 Pseudo-Label Aggregating

After each participant uploads its pseudo-labeling results on the public unlabeled dataset, the center exploits these outputs to pseudo-label the dataset. In this subsection, we explain the implementation details of this step.

Assume the union of the label spaces of all participants' tasks is

$$Y = \bigcup_{i=1}^{N} Y_i = \{c_1, c_2, \ldots, c_K\},$$

where $K$ is the number of elements in the whole label space $Y$, and each category $c_k$ exists in the label spaces of one or multiple participants:

$$P_k = \{\, i : c_k \in Y_i \,\}, \qquad n_k = |P_k|,$$

where $n_k$ is the number of participants whose label space contains category $c_k$. At the same time, we define a pseudo-labeled dataset $S_k$ for each category $c_k$; $S_k$ will store the indices of the instances in the public dataset assigned to category $c_k$ after pseudo-label aggregation.

For each category $c_k$ in its label space $Y_i$, model $M_i$ classifies the instances in the public unlabeled dataset $D_{\mathrm{pub}}$, and the set of instances it assigns to category $c_k$ can be defined as

$$U_{i,k} = \{\, x \in D_{\mathrm{pub}} : M_i(x) = c_k \,\}.$$
For an instance $x \in D_{\mathrm{pub}}$, we regard the outputs of the different participant models on $x$ for category $c_k$ as the output of a two-class (belongs to category $c_k$ or not) ensemble classifier. Since (13) suggests that a lower generalization error bound is helpful, we can set a hyperparameter $t \in [0, 1]$ to make the results more reliable. That is, $x$ is pseudo-labeled as category $c_k$ if

$$\frac{\big|\{\, i \in P_k : x \in U_{i,k} \,\}\big|}{n_k} > t.$$

A $t$ value of 0 means that marking $x$ as category $c_k$ requires only one participant to agree, while a $t$ of 1 means that all relevant participants' consent is required. After that, we put the index of $x$ into $S_k$. After all the $S_k$ are generated, the central server sends the corresponding pseudo-labeled datasets to each participant according to its label space. For a participant whose label space is $Y_i$, the pseudo-labeled data received from the center is:

$$S^{(i)} = \bigcup_{c_k \in Y_i} S_k.$$
Assuming that $L_i$ stores the pseudo-labeling results of participant $i$ on the public dataset, Algorithm 1 describes the above process.

Input: pseudo-labeling results $L_1, \ldots, L_N$; label spaces $Y_1, \ldots, Y_N$; public dataset size $m$; threshold $t$
1  function Aggregating($L_1, \ldots, L_N$, $Y_1, \ldots, Y_N$, $m$, $t$)
       // Initialization
2      forall $c_k \in Y = Y_1 \cup \ldots \cup Y_N$ do
3          $S_k$ = empty set
4          $n_k$ = number of participants $i$ with $c_k \in Y_i$
       // Label aggregating
5      forall $j$ = 1 to $m$ do
6          votes = empty dict with default value 0
7          forall $i$ = 1 to $N$ do
8              votes[$L_i[j]$] = votes[$L_i[j]$] + 1
9          forall $c_k \in Y$ do
10             if votes[$c_k$] / $n_k$ > $t$ then
11                 add $j$ to $S_k$
12     return all $S_k$
Algorithm 1 Pseudo-label aggregating

It should be pointed out that some indices of the public dataset may exist in multiple $S_k$, because the pseudo-labeling results of different participants differ. Therefore, the different $S_k$ in $S^{(i)}$ may overlap, resulting in contradictory labels. To build a consistent pseudo-labeled dataset, those indices contained in more than one $S_k$ should be removed.
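Putting Algorithm 1 and the conflict-removal step together, a compact Python sketch might look like the following. The naming, equal participant weights, and threshold variable `t` are our assumptions, not the paper's code:

```python
from collections import Counter

def aggregate_pseudo_labels(results, label_spaces, num_public, t):
    """results[i][j]: label assigned by participant i to public instance j.
    label_spaces[i]: set of categories in participant i's label space.
    Returns {category: set of public-instance indices}, with instances
    voted into more than one category removed as contradictory."""
    union = set().union(*label_spaces)
    # n_k: how many participants have category k in their label space.
    n = {k: sum(k in ls for ls in label_spaces) for k in union}
    S = {k: set() for k in union}
    for j in range(num_public):
        votes = Counter(results[i][j] for i in range(len(results)))
        for k in union:
            # Keep the pseudo-label if the agreement ratio exceeds t.
            if votes[k] / n[k] > t:
                S[k].add(j)
    # Remove indices claimed by multiple categories (contradictory labels).
    claimed = Counter(j for idx in S.values() for j in idx)
    for k in S:
        S[k] -= {j for j, c in claimed.items() if c > 1}
    return S
```

For example, with two participants whose label spaces are `{'a','b'}` and `{'a','c'}`, an instance both label as `'a'` survives, while an instance one labels `'b'` and the other `'c'` is claimed by both categories and dropped.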

4.4 Unlabeled Dataset

To perform the CoFED scheme, we need to build a public unlabeled dataset for all participants. Although an unlabeled dataset highly relevant to the classification tasks is preferred, we find that even less relevant datasets yield good results in our experiments. For the image classification tasks we focus on, the almost unlimited image resources publicly available on the Internet can be collected into an unlabeled dataset. Another benefit of using resources that each participant can obtain independently is that it avoids distribution of the unlabeled dataset by a central server, saving the limited communication resources in an FL scenario.

The reason why less relevant datasets work is that even different objects may share commonalities, which can serve as stand-ins for the targets of different tasks. For example, if someone asks you what an apple looks like when you have nothing but a pear, you might say it looks similar to a pear. This may give the person some wrong perceptions about apples, but it is better than nothing, and he or she is at least less likely to mistake a banana for an apple. That is, even though you could not convey exactly what an apple looks like, you still improved their ability to recognize apples. For a similar reason, even a less relevant unlabeled dataset used to transfer knowledge can improve model performance in FL settings.

4.5 Training Process

In the CoFED scheme, each participant needs to train its model twice, i.e., the local training and the update training. Both training processes are performed locally, with no need to exchange data with other participants or the central server. The benefits of this are as follows.

  1. It avoids leaking the participants' data, including the local data and the models.

  2. It avoids the stability and performance costs of communication.

  3. It decouples the training processes of different participants so that each can independently choose the training algorithm and configuration best suited to its model.

4.6 Different Credibility of Participants

In practical applications, participants may have different credibility, resulting in uneven pseudo-label quality across participants. Credibility weights can therefore be assigned to the pseudo-labels provided by different participants. Accordingly, Algorithm 1 can be modified to accumulate each participant's weight instead of counting unit votes; the unmodified version of Algorithm 1 is equivalent to giving every participant a weight of 1.

Different bases can be used for setting the weights. For example, since models trained on larger training sets are generally of higher quality, weights based on local dataset size may help with the unbalanced data problem. Test accuracy can also be used as a basis for weights, though participants may be reluctant to provide it. The choice can thus be made according to the actual scenario.
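The weighted variant described above amounts to replacing each participant's unit vote with its credibility weight; a small sketch of the modified agreement ratio (names are ours; equal weights recover the original rule):

```python
def weighted_vote_ratio(votes_for, voters, weights):
    """Weighted agreement ratio for one category on one instance.
    votes_for: indices of participants that voted for the category.
    voters:    indices of participants that have the category at all.
    weights:   credibility weight per participant (e.g. local data size)."""
    total = sum(weights[i] for i in voters)
    agreed = sum(weights[i] for i in votes_for)
    return agreed / total

# With equal weights, this reduces to the unweighted ratio in Algorithm 1.
print(weighted_vote_ratio([0], [0, 1], {0: 1.0, 1: 1.0}))  # 0.5
# A participant with a larger local dataset counts for more.
print(weighted_vote_ratio([0], [0, 1], {0: 3.0, 1: 1.0}))  # 0.75
```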

4.7 Non-IID Data Settings for Heterogeneous Tasks

Data can be non-IID in different ways. We have pointed out that the heterogeneous task setting itself is an extreme non-IID case of the personalized task setting. However, within the heterogeneous task setting, the instance distribution of a single category across the local datasets of different participants can still be IID or non-IID. In the IID case, the instances of a category owned by different participants follow the same probability distribution; in an extreme non-IID case, each participant may only have instances of one subclass of the category, with different participants holding different subclasses. For example, two participants may both have pets in their label spaces, but all pet instances in one participant's local training set are dogs, while the other has only images of cats.

The existing FL schemes based on parameter aggregation usually work well on IID data but suffer on non-IID data. Zhao et al.[45] show that the accuracy of convolutional neural networks (CNNs) trained with the FedAvg algorithm can drop significantly, by up to 55%, on highly skewed non-IID data. Since non-IID data cannot be avoided in practice, it is regarded as an open challenge in FL[14], [21]. Fortunately, in the CoFED scheme, where model diversity helps performance, a non-IID data setting is usually beneficial, because models trained on non-IID data generally diverge more than those trained on IID data.

5 Experiments

In this section, we perform the CoFED scheme in different FL settings to explore the impact of different conditions and compare it with existing FL schemes.

5.1 Datasets

CIFAR10 and CIFAR100 [19]: The CIFAR10 dataset consists of 32x32 RGB images in 10 classes, with 6,000 images per class: 5,000 training images and 1,000 test images. CIFAR100 is just like CIFAR10, except that it has 100 classes containing 600 images each: 500 training images and 100 test images per class. The 100 classes of CIFAR100 are grouped into 20 superclasses of 5 classes each, so there are 2,500 training images and 500 test images per superclass. The classes in CIFAR10 are completely mutually exclusive, as are those in CIFAR100.


Downsampled ImageNet [27]: ImageNet is a large-scale hierarchical image dataset consisting of 14,197,122 images in 21,841 synsets. In our experiment, we only use the validation set of the downsampled ImageNet, which consists of 49,999 32x32 RGB images randomly selected from the ImageNet dataset, without their labels.

FEMNIST [3]: LEAF is a modular benchmark framework for learning in federated settings, and FEMNIST is its image classification dataset. FEMNIST consists of 805,263 samples of handwritten characters (62 classes: 10 digits and 26 uppercase and 26 lowercase English letters) from 3,550 users.

Chars74K-Kannada [6]: Chars74K contains character images of English and Kannada. We only use the Kannada handwritten character images of Chars74K as the unlabeled public dataset; it consists of 16,425 images, many of which are compound symbols combining a consonant and a vowel of Kannada.

Adult [16]: Adult is a census income dataset donated to the UCI Machine Learning Repository by R. Kohavi and B. Becker in 1994. The task is to classify whether a person's yearly income exceeds $50K based on demographic data. There are 14 predictors (6 numeric and 8 categorical), 32,561 training samples, and 16,281 test samples.

5.2 Model Settings

The CoFED scheme enables participants to design their models independently; for example, some participants may use CNN classifiers while others use SVM classifiers. For the image classification tasks, we use a CNN with a different architecture for each participant to illustrate that CoFED is compatible with heterogeneous models. Each participant's model is a randomly generated 2-layer or 3-layer convolutional neural network using ReLU activations, with each convolutional layer followed by a 2x2 max pooling layer. The number of filters in each convolutional layer is randomly selected from [20, 24, 32, 40, 48, 56, 80, 96] in ascending order. A global average pooling layer and a fully connected (dense) layer are inserted before the softmax layer of each network. 10 of the 100 model architectures we used are shown in Table 1.


     conv layer 1     conv layer 2     conv layer 3
 1   24 3x3 filters   40 3x3 filters   none
 2   24 3x3 filters   32 3x3 filters   56 3x3 filters
 3   20 3x3 filters   32 3x3 filters   none
 4   24 3x3 filters   40 3x3 filters   56 3x3 filters
 5   20 3x3 filters   32 3x3 filters   80 3x3 filters
 6   24 3x3 filters   32 3x3 filters   80 3x3 filters
 7   32 3x3 filters   32 3x3 filters   none
 8   40 3x3 filters   56 3x3 filters   none
 9   32 3x3 filters   48 3x3 filters   none
10   48 3x3 filters   56 3x3 filters   96 3x3 filters

Table 1: Network Architectures
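A generator in the style of Table 1 can be sketched as below. This is an illustrative reconstruction under the constraints stated in the text (2 or 3 conv layers, filter counts drawn from the given list in ascending order); the function name and the spec-dictionary format are assumptions, and building the actual network from the spec is omitted.

```python
import random

# Filter counts named in the text; duplicates are allowed (see row 7 of Table 1).
FILTER_CHOICES = [20, 24, 32, 40, 48, 56, 80, 96]

def random_architecture(rng=None):
    """Sample one participant's CNN spec: 2 or 3 conv layers with 3x3
    kernels, filter counts in ascending order, each conv layer followed
    by ReLU and 2x2 max pooling (pooling recorded in the spec; the
    global-average-pool + dense + softmax head is implied)."""
    rng = rng or random.Random()
    depth = rng.choice([2, 3])
    filters = sorted(rng.choices(FILTER_CHOICES, k=depth))
    return {"conv_filters": filters, "kernel": (3, 3), "pool": (2, 2)}
```

Sampling with replacement (`choices` rather than `sample`) matters here, since Table 1 contains architectures with repeated filter counts.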

To demonstrate that CoFED is applicable to a broader range of models and task types, we chose four classifier types as the participant models in the Adult dataset experiment: decision trees, SVMs, generalized additive models, and shallow neural networks.

5.3 CIFAR Dataset Experiment

We set the number of participants to 100 in our CIFAR experiments. For each participant, we randomly select a few of the 20 superclasses of CIFAR100 as its label space. Since we aim to study the effect of the proposed scheme on heterogeneous tasks, the label spaces of different participants are generally different but may overlap. The local dataset of each participant consists of samples from its label space.

For each superclass, the distribution of its samples among the participants that own it falls into two situations. In the first, the samples of each participant holding the superclass are uniformly and randomly drawn from all samples of that superclass in CIFAR100: the IID setting. In this case, each participant usually has samples of all 5 subclasses of each superclass in its label space. In the second, each participant holding the superclass only has samples from some of its subclasses (1 or 2 subclasses in our experiment): the non-IID setting. The details of the experimental data settings are as follows.

IID Data Setting: Each participant is randomly assigned 6 to 8 superclasses in CIFAR100 as its label space. For the local training sets, each participant has 50 instances for each superclass from the CIFAR100 training set, and these 50 samples are evenly sampled from the samples of this superclass in the training set of CIFAR100. There is no overlap between the training sets of any participants. For the test sets, all instances in the test set of CIFAR100 are used, whereby each participant’s test set has 500 instances for each superclass.

Non-IID Data Setting: Almost the same as the IID setting, except that each participant's samples of a superclass are drawn from only 1 or 2 randomly chosen subclasses of that superclass in the CIFAR100 training set. The test set configuration is exactly the same as in the IID setting. The non-IID setting is generally considered more challenging, since a model that sees only 1 or 2 subclasses of a superclass during training is required to recognize all 5 subclasses in the test set.
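The two data settings above can be sketched as one sampling routine. This is a toy sketch, not the authors' data pipeline; the `subclass_index` structure (a pre-built map from superclass to subclass to sample indices) and all names are assumptions for illustration.

```python
import random

def build_local_dataset(superclasses, subclass_index, n_per_super=50,
                        iid=True, rng=random):
    """Draw one participant's training indices, n_per_super per assigned
    superclass.

    subclass_index: {superclass: {subclass: [sample indices]}} over the
    CIFAR100 training set (hypothetical pre-built index).
    IID: sample uniformly across all subclasses of the superclass.
    Non-IID: restrict sampling to 1 or 2 randomly chosen subclasses.
    """
    local = {}
    for sc in superclasses:
        pools = subclass_index[sc]
        if iid:
            candidates = [i for idx in pools.values() for i in idx]
        else:
            chosen = rng.sample(list(pools), rng.choice([1, 2]))
            candidates = [i for s in chosen for i in pools[s]]
        local[sc] = rng.sample(candidates, n_per_super)
    return local
```

In the non-IID branch the candidate pool collapses to 1 or 2 subclasses, while the test set (not shown) still spans all 5, which is what makes this setting harder.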

In this experiment, the public unlabeled dataset used in CoFED is the training set of CIFAR10. Since CoFED is compatible with heterogeneous training processes, we grid-search the optimal training parameters for each participant's task. The training configuration optimized for each participant (including the trainer, learning rate, and learning rate decay) is used in the initial local training phase, and the update training settings in the final step are adjusted based on it. We always use a mini-batch size of 50 for local training and 1,000 for update training.

(a) IID setting
(b) Non-IID setting
Figure 2: Results of CIFAR experiment. The X-Axis value is the relative test accuracy, which is the ratio of CoFED test accuracy and local test accuracy.

We test the proposed CoFED scheme under the IID and non-IID settings separately, with the hyper-parameter set to 0.3 in both. We compare the test classification accuracy of models trained by CoFED with that of local training; the results are shown in Fig. 2(a) and Fig. 2(b). CoFED improves the test accuracy of each participant's model by 10%-32% in the IID setting, with an average of 17.3%, showing that CoFED brings significant gains even when the model differences are small (as in the IID setting). In the non-IID setting the improvement is larger, owing to the greater diversity among participant models: samples of the same superclass held by different participants may come from completely different subclasses. Here the CoFED scheme achieves a relative improvement in average test accuracy of 35.6%, with per-participant gains ranging from 14% to 67%. This result is significantly higher than in the IID setting, indicating that CoFED suffers less from statistical heterogeneity.

5.4 FEMNIST Dataset Experiment

Unlike the CIFAR datasets, which are standard benchmarks in the general machine learning domain, the FEMNIST dataset is tailored to federated learning settings. We selected the 100 users with the largest number of samples among the 3,550 users of FEMNIST as participants. 40% of each participant's samples are used as training data and the rest as test data. Each participant trains a model to recognize its user's handwritten characters across the 62 classes.

The model architectures of the participants are the same as in the CIFAR experiment, and the local training parameters are tuned in a similar way. In this experiment, we use random crops of images from the Chars74K-Kannada dataset to construct an unlabeled public dataset of close to 50,000 samples. The hyper-parameter is set to 0.01, and the results are shown in Fig. 3. CoFED improves the test accuracy of almost every participant's model, by 15.6% on average.

Figure 3: Results of FEMNIST experiment.

5.5 Public Unlabeled Dataset Availability

One of the major concerns about the CoFED scheme is whether an appropriate public dataset is available. The role of the public dataset is to express and share the knowledge that the participant models need to learn. Generally, participants in federated learning cannot learn from samples generated entirely from random values, but this does not mean the public dataset must be built from samples highly related to the participants' local data.

For example, in the CIFAR experiment we construct the participants' local data from CIFAR100 samples but use CIFAR10 as the public dataset. The categories in CIFAR10 and CIFAR100 overlap only slightly, so CIFAR10 can be regarded as a dataset of pictures collected at random from the Internet, without regard to how similar its content is to the participants' samples (from CIFAR100). Likewise, in the FEMNIST experiment, English and Kannada characters differ considerably in morphology. Yet CoFED effectively improves the model performance of almost all participants in both experiments, which raises the question of how the participant models use public datasets irrelevant to their tasks to share knowledge. We therefore review the pseudo-label aggregation results and check how the models classify these irrelevant images. Fig. 4 shows a partial example of the pseudo-label aggregation results of 10 of the 100 participant models.

Figure 4: The results of pseudo-label aggregation. The images from the training set of CIFAR10 are scattered across the superclasses of CIFAR100; 10 images per superclass are randomly selected.

First, we notice that some trucks and automobiles are correctly classified into vehicles_1, indicating that unlabeled instances whose categories appear in the label spaces of participant models are more likely to be assigned to the correct category. There are exceptions: some automobiles are assigned to flowers or fruit and vegetables. A common feature of these automobile images is a large red region, which the classifiers may treat as a distinctive feature of flowers or fruit and vegetables. Moreover, unlabeled samples whose categories are not included in the label space of any classifier are assigned to categories matching their visual characteristics. For example, the instances assigned to aquatic_mammals and fish usually have a blue, water-like background. Another interesting example is the people category: although there are almost no human instances in the training set of CIFAR10, the closest instances are still found, including a person on a horse.

Moreover, we also replace the public dataset in the CIFAR experiment with the ImageNet dataset of almost the same size as CIFAR10. As shown in Fig. 5(a) and Fig. 5(b), the CoFED scheme still achieves considerable performance improvements.

(a) IID setting
(b) Non-IID setting
Figure 5: Results of CIFAR experiment using ImageNet dataset as unlabeled public dataset.

5.6 Unlabeled Dataset Size

According to (12), (13), and (14), when an existing classifier has a higher generalization accuracy and a larger labeled training set, a larger unlabeled dataset is needed to improve its accuracy. This suggests that, in the CoFED scheme, a larger unlabeled dataset can bring a more pronounced performance improvement for a given group of participant models. We redo the CIFAR experiment with 10 participants, varying the size of the unlabeled dataset drawn from the CIFAR10 training set from 500 to 50,000 samples. The results are shown in Fig. 6.

Figure 6: Size of unlabeled public dataset v.s. mean relative test accuracy.

The experimental results are consistent with the theoretical analysis: increasing the size of the unlabeled dataset boosts the performance improvement of all participant models. Considering that collecting unlabeled data is generally much easier than collecting labeled data, this strategy is feasible in many practical application scenarios.

5.7 Hyper-parameter

The theoretical analysis suggests that increasing the reliability of pseudo-labeling is likely to bring more significant performance improvements, which is also intuitive. Therefore, we use a hyper-parameter to control the reliability of pseudo-label aggregation. A larger value requires a higher percentage of participants to agree, increasing the reliability of a pseudo-label, but it may also reduce the number of pseudo-labeled instances available. In particular, when participants disagree strongly, demanding too much consistency among the participant models may stop the spread of multi-party knowledge.
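The trade-off above can be made concrete with a toy filter: raising the agreement threshold keeps fewer, more reliable pseudo-labels. This is an illustration only, not the paper's Algorithm 1 (which also accounts for per-participant label spaces); here every participant is assumed to vote on every sample, and all names are assumptions.

```python
import numpy as np

def filter_pseudo_labels(pred_matrix, threshold):
    """pred_matrix: (n_participants, n_samples) array of integer class
    votes. A sample is kept when the vote share of its modal label
    reaches the threshold. Returns (kept sample indices, their labels).
    """
    n_participants, n_samples = pred_matrix.shape
    keep, labels = [], []
    for j in range(n_samples):
        counts = np.bincount(pred_matrix[:, j])
        top = int(counts.argmax())
        if counts[top] / n_participants >= threshold:
            keep.append(j)
            labels.append(top)
    return keep, labels
```

Sweeping `threshold` over a grid reproduces the qualitative shape of Fig. 7: the retained-sample count falls monotonically, while accuracy (Fig. 8) peaks at an intermediate value.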

In this section, we repeat the 10-participant CIFAR experiment with different values of the hyper-parameter. Fig. 7 records how the total number of samples produced by pseudo-label aggregation changes with the hyper-parameter, and Fig. 8 shows the corresponding changes in the test accuracy of all participant models in CoFED.

Figure 7: Hyper-parameter value v.s. size of the pseudo-label aggregation result. At the largest values, the totals for both IID and non-IID are 0, which cannot be shown on the log scale.
Figure 8: Hyper-parameter value v.s. mean relative test accuracy.

The results show that the impact of the hyper-parameter on CoFED is not monotonic. Although an excessively large value can increase the reliability of pseudo-labels, it also greatly reduces the number of samples in the pseudo-label aggregation result, which may degrade training. In the FEMNIST experiment, we found that a large value sharply reduces the samples in the aggregation result, so we set the value to 0.01.

At the same time, an excessively large value makes pseudo-label aggregation favor the samples that most participants already agree on; since such samples are approved by most participants, they can bring almost no new knowledge to them. We also find that a large value hurts the non-IID data setting more severely than the IID one. This is because the models trained on non-IID data differ more, so fewer samples command broad agreement: as the hyper-parameter increases, the available samples shrink faster than in the IID case, as shown in Fig. 7, causing greater performance degradation in non-IID settings.

On the other hand, too small a value lowers the reliability of the pseudo-label aggregation results, which may introduce more mislabeled samples and make CoFED less effective. In addition, too small a value may greatly increase the number of aggregated samples, raising computation and communication overhead.

In summary, the effectiveness of the CoFED scheme depends on the value of the hyper-parameter, and choosing it appropriately achieves greater performance improvements while avoiding unnecessary computation and communication costs.

5.8 Adult Dataset Experiment

We set the number of participants to 100 in our Adult experiments. For each participant, we randomly select 200 samples from the Adult training set without replacement. Since we aim to study the effect of the proposed scheme on heterogeneous models, 4 types of classifiers are used: decision trees, SVMs, generalized additive models, and shallow neural networks, with 25 participants of each type. The unlabeled public dataset used in this experiment contains 5,000 samples, each randomly generated with valid values for the input properties used by the classifiers. The Adult test set is used to measure the performance of every participant's classifier.
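Generating such a synthetic public dataset can be sketched as below. The schema shown is hypothetical (only a few of the 14 Adult predictors are sketched, with made-up value ranges); the point is only that every generated record carries a valid value for each input property.

```python
import random

# Hypothetical per-feature valid-value schema for a few Adult-style
# predictors; the real dataset has 6 numeric and 8 categorical ones.
SCHEMA = {
    "age": ("numeric", 17, 90),
    "hours_per_week": ("numeric", 1, 99),
    "sex": ("categorical", ["Male", "Female"]),
    "workclass": ("categorical", ["Private", "Self-emp", "Gov", "Other"]),
}

def random_unlabeled_sample(schema=SCHEMA, rng=random):
    """Generate one synthetic unlabeled record with a valid value for
    every input property in the schema."""
    row = {}
    for name, spec in schema.items():
        if spec[0] == "numeric":
            row[name] = rng.randint(spec[1], spec[2])
        else:
            row[name] = rng.choice(spec[1])
    return row

# A 5,000-sample public set, as in the Adult experiment.
public_set = [random_unlabeled_sample() for _ in range(5000)]
```

Because the records are random, they carry no private information; they serve only as common inputs on which the heterogeneous classifiers can exchange pseudo-labels.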

The results are shown in Fig. 9. The proposed scheme improves the test accuracy of most classifiers, by 8.5% on average, and the classifier that benefits most gains more than 25%. This demonstrates that the proposed method applies not only to image classification tasks with CNN models but also to non-image-classification tasks with traditional machine learning models.

Figure 9: Results of Adult experiment.

5.9 Compare with Other FL Schemes

To the best of our knowledge, CoFED is the first FL scheme designed to be compatible with heterogeneous models, heterogeneous tasks, and heterogeneous training simultaneously, so few FL schemes can be compared with it under HFMTL settings. We use two well-known methods for comparison: the personalized FedAvg method [7], representing the classic parameter-aggregation FL strategy, which is incompatible with models of different architectures; and the FedMD method [20], which supports different neural network architectures, to compare the performance of CoFED with heterogeneous models.

To enable personalized FedAvg and FedMD to handle heterogeneous classification tasks, we treat each participant's local label space as their union, as in (15). In this way, the ideas of personalized FedAvg and FedMD can handle heterogeneous tasks. The main steps are as follows.

  1. Use the FedAvg or FedMD algorithm to train a global model whose label space is the union of the label spaces of all participants.

  2. Fine-tune the global model on the local dataset of each participant to get a personalized model for its local tasks.
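The two-step adaptation above can be sketched as a thin wrapper. This is a structural sketch only: `global_train_fn` and `fine_tune_fn` are placeholders for the real FedAvg/FedMD trainer and local fine-tuning routine, and the participant-record format is an assumption.

```python
def personalized_baseline(global_train_fn, participants, fine_tune_fn):
    """Adapt FedAvg/FedMD to heterogeneous tasks: train one global
    model over the union of all label spaces, then fine-tune a copy
    per participant on its local data.

    global_train_fn(participants, union_labels) -> global model
    fine_tune_fn(global_model, local_data) -> personalized model
    """
    union_labels = set()
    for p in participants:
        union_labels |= set(p["label_space"])
    global_model = global_train_fn(participants, sorted(union_labels))
    return {p["id"]: fine_tune_fn(global_model, p["local_data"])
            for p in participants}
```

Note how this baseline couples all participants through one shared global model, whereas CoFED never requires a common model at all.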

In this comparison experiment, we use the data configuration of Section 4.3. Since the personalized FedAvg scheme does not support models with different architectures, we compare against it using 100 participants whose neural networks share the same architecture but have different parameter values. In the comparison with FedMD, we adopt the same experimental setup as in Section 4.3, i.e., 100 participants with different neural network architectures.

We tried a variety of hyperparameter settings for better performance in the personalized FedAvg experiment, tuning the number of local training rounds per communication round and the local mini-batch size used for the local updates; participants also perform local transfer learning to train the personalized model in each communication round. In the FedMD experiment, we use the training settings suggested in [20], since its experimental conditions are similar to ours, i.e., using the training set of CIFAR10 as the public dataset and 5,000 samples per round for model alignment. Note that FedMD uses the labels of the public dataset, unlike the unlabeled public dataset used in CoFED.

                     FedAvg   FedMD   CoFED
IID      Accuracy    1.06     0.87    1.00*
         Rounds†     100      8       1
Non-IID  Accuracy    0.94     0.94    1.00*
         Rounds†     150      17      1

  * Reference value.
  † The round of the best performance.

Table 2: Comparison Results
(a) IID setting
(b) Non-IID setting
Figure 10: Personalized FedAvg v.s. CoFED. To facilitate comparison, the test accuracy of the CoFED model is used as the reference, and the Y-axis value is the ratio of the comparison algorithm's test accuracy to the reference accuracy. Since each participant has its own ratio, the blue line shows the average ratio over all participants at each communication round.
(a) IID setting
(b) Non-IID setting
Figure 11: FedMD v.s. CoFED.

The results of comparing FedAvg with CoFED are shown in Fig. 10. The relative test accuracy is the average, over all participants, of the ratio to the CoFED test accuracy. In the IID setting, FedAvg reaches the performance of CoFED after 42 communication rounds and finishes 6% ahead of it (Fig. 10(a)). In the non-IID setting, FedAvg stabilizes after 150 communication rounds, at which point CoFED still leads FedAvg by 6% (Fig. 10(b)). The results of comparing FedMD with CoFED are shown in Fig. 11: in both IID and non-IID settings, FedMD does not reach the performance of CoFED, which leads by 14% in the IID setting and 35% in the non-IID setting.

In terms of communication cost, if we exclude the overhead of initially constructing the public dataset, CoFED achieves better performance with lower communication overhead in all cases, because CoFED only transmits label data (not the samples themselves) during training and needs no iteration over multiple rounds. This assumption is not unrealistic: the public dataset need not be distributed by the central server, but can instead be obtained by the participants from a third party, incurring no communication cost between the participants and the central server. Even if we include that cost, CoFED still achieves better performance at lower communication cost, except in the IID setting with identical model architectures. In fact, that setting is not the main concern of CoFED, since with identical architectures the model can simply be shared.

6 Conclusion

In this paper, we proposed CoFED, a novel federated learning scheme compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training simultaneously. Compared with traditional methods, CoFED is better suited to cross-silo FL settings with fewer participants but higher heterogeneity. CoFED decouples the models and training processes of different participants, enabling each participant to train its independently designed model for its unique task with its optimal training methods. Besides, CoFED protects not only the private data but also the private models and training strategies of all participants. CoFED enables participants to share multi-party knowledge to improve their local models, and it improves the performance of federated learning in most cases. It gives encouraging results on non-IID data settings with heterogeneous model architectures, which are more practical but usually difficult to handle in existing FL schemes. Moreover, the CoFED scheme is communication-efficient, since training can be completed in a single communication round.

The CoFED scheme may be limited by the availability of public unlabeled datasets. Although we conducted numerous experiments demonstrating that CoFED has low requirements on the public dataset and remains effective with irrelevant or randomly generated data, there may be failure scenarios, which we hope to address in future work.


Acknowledgments

This research was partially supported by the National Key Research and Development Program of China (2019YFB1802800) and PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (PCL2018KP001).


  • [1] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary (2019) Federated learning with personalization layers. arXiv preprint arXiv:1912.00818. Cited by: §2.
  • [2] A. Blum and T. Mitchell (1998) Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. Cited by: §3.2.
  • [3] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: §5.1.
  • [4] Y. Chen, F. Luo, T. Li, T. Xiang, Z. Liu, and J. Li (2020) A training-integrity privacy-preserving federated learning scheme with trusted execution environment. Information Sciences 522, pp. 69–79. Cited by: §2.
  • [5] P. Courtiol, C. Maussion, M. Moarii, E. Pronier, S. Pilcer, M. Sefta, P. Manceron, S. Toldo, M. Zaslavskiy, N. Le Stang, et al. (2019) Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nature medicine 25 (10), pp. 1519–1525. Cited by: §1.
  • [6] T. E. De Campos, B. R. Babu, M. Varma, et al. (2009) Character recognition in natural images.. VISAPP (2) 7. Cited by: §5.1.
  • [7] Y. Deng, M. M. Kamani, and M. Mahdavi (2020) Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461. Cited by: §2, §5.9.
  • [8] D. Gao, Y. Liu, A. Huang, C. Ju, H. Yu, and Q. Yang (2019) Privacy-preserving heterogeneous federated transfer learning. In 2019 IEEE International Conference on Big Data (Big Data), pp. 2552–2559. Cited by: §2.
  • [9] N. Guha, A. Talwalkar, and V. Smith (2019) One-shot federated learning. arXiv preprint arXiv:1902.11175. Cited by: §2.
  • [10] F. Hanzely and P. Richtárik (2020) Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516. Cited by: §2.
  • [11] R. Hu, Y. Guo, H. Li, Q. Pei, and Y. Gong (2020) Personalized federated learning with differential privacy. IEEE Internet of Things Journal. Cited by: §2.
  • [12] L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu (2020) LoAdaBoost: loss-based adaboost federated machine learning with reduced computational complexity on iid and non-iid intensive care data. Plos one 15 (4), pp. e0230706. Cited by: §1.
  • [13] Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan (2019) Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488. Cited by: §2.
  • [14] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §1, §1, §4.7.
  • [15] D. Kawa, S. Punyani, P. Nayak, A. Karkera, and V. Jyotinagar (2019) Credit risk assessment from combined bank records using federated learning. International Research Journal of Engineering and Technology 06. Cited by: §1.
  • [16] R. Kohavi et al. (1996) Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96, pp. 202–207. Cited by: §5.1.
  • [17] J. Konečnỳ, B. McMahan, and D. Ramage (2015) Federated optimization: distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575. Cited by: §1.
  • [18] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik (2016) Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527. Cited by: §1.
  • [19] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
  • [20] D. Li and J. Wang (2019) FedMD: heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581. Cited by: §2, §5.9, §5.9.
  • [21] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60. Cited by: §4.7.
  • [22] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2018) Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127. Cited by: §2.
  • [23] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang (2020) A secure federated transfer learning framework. IEEE Intelligent Systems. Cited by: §2.
  • [24] O. Marfoq, C. Xu, G. Neglia, and R. Vidal (2020) Throughput-optimal topology design for cross-silo federated learning. arXiv preprint arXiv:2010.12229. Cited by: §1.
  • [25] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §2.
  • [26] F. Mo and H. Haddadi (2019) Efficient and private federated learning using tee. In EuroSys, Cited by: §2.
  • [27] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §5.1.
  • [28] D. Preuveneers, V. Rimmer, I. Tsingenopoulos, J. Spooren, W. Joosen, and E. Ilie-Zudor (2018) Chained anomaly detection models for federated learning: an intrusion detection case study. Applied Sciences 8 (12), pp. 2663. Cited by: §1.
  • [29] PRNewswire (2018) WeBank and swiss re signed cooperation mou. Technical report WeBank. Cited by: §1.
  • [30] W. Schneble (2018) Federated learning for intrusion detection systems in medical cyber-physical systems. Ph.D. Thesis, University of Washington. Cited by: §1.
  • [31] M. J. Sheller, G. A. Reina, B. Edwards, J. Martin, and S. Bakas (2018) Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. In International MICCAI Brainlesion Workshop, pp. 92–104. Cited by: §1.
  • [32] V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. Cited by: §2, §2.
  • [33] M. N. Soe (2020) Homomorphic encryption (he) enabled federated learning. Cited by: §2.
  • [34] M. Tang and V. W. Wong (2021) An incentive mechanism for cross-silo federated learning: a public goods perspective. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications, pp. 1–10. Cited by: §1.
  • [35] R. Tony, S. Micah, E. Brandon, M. Jason, and B. Spyridon (2019) Federated learning for medical imaging. Technical report Intel. Cited by: §1.
  • [36] K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage (2019) Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252. Cited by: §2.
  • [37] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2019) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221. Cited by: §2.
  • [38] W. Wang and Z. Zhou (2007) Analyzing co-training style algorithms. In European conference on machine learning, pp. 454–465. Cited by: §3.2, §4.1, §4.2.
  • [39] W. Wang and Z. Zhou (2013) Co-training with insufficient views. In Asian conference on machine learning, pp. 467–482. Cited by: §3.2.
  • [40] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor (2020) Federated learning with differential privacy: algorithms and performance analysis. IEEE Transactions on Information Forensics and Security. Cited by: §2.
  • [41] B. Xin, W. Yang, Y. Geng, S. Chen, S. Wang, and L. Huang (2020) Private fl-gan: differential privacy synthetic data generation based on federated learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2927–2931. Cited by: §2.
  • [42] W. Yang, Y. Zhang, K. Ye, L. Li, and C. Xu (2019) FFD: a federated learning based method for credit card fraud detection. In International Conference on Big Data, pp. 18–32. Cited by: §1.
  • [43] B. Yin, H. Yin, Y. Wu, and Z. Jiang (2020) FDC: a secure federated deep learning mechanism for data collaborations in the internet of things. IEEE Internet of Things Journal. Cited by: §2.
  • [44] C. Zhang, S. Li, J. Xia, W. Wang, F. Yan, and Y. Liu (2020) Batchcrypt: efficient homomorphic encryption for cross-silo federated learning. In 2020 USENIX Annual Technical Conference (USENIXATC 20), pp. 493–506. Cited by: §1.
  • [45] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §4.7.
  • [46] Z. Zhou and M. Li (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and Data Engineering 17 (11), pp. 1529–1541. Cited by: §3.2.
  • [47] H. Zhu, Z. Li, M. Cheah, and R. S. M. Goh (2020) Privacy-preserving weighted federated learning within oracle-aided mpc framework. arXiv preprint arXiv:2003.07630. Cited by: §2.