
Compare Where It Matters: Using Layer-Wise Regularization To Improve Federated Learning on Heterogeneous Data

12/01/2021
by   Ha Min Son, et al.

Federated Learning is a widely adopted method to train neural networks over distributed data. One main limitation is the performance degradation that occurs when data is heterogeneously distributed. While many works have attempted to address this problem, these methods under-perform because they are founded on a limited understanding of neural networks. In this work, we verify that only certain important layers in a neural network require regularization for effective training. We additionally verify that Centered Kernel Alignment (CKA) most accurately calculates similarity between layers of neural networks trained on different data. By applying CKA-based regularization to important layers during training, we significantly improve performance in heterogeneous settings. We present FedCKA: a simple framework that out-performs previous state-of-the-art methods on various deep learning tasks while also improving efficiency and scalability.


1 Introduction

The success of deep learning across a plethora of fields has led to a vast body of research that leverages its strengths Lecun et al. (2015). One main outcome of this success is the mass collection of data Sejnowski (2018). As data collection grows at a rate much faster than the computing performance and storage capacity of consumer products, it is becoming progressively difficult to deploy trained state-of-the-art models within a reasonable budget.

Federated Learning (FL) McMahan et al. (2017) was introduced as a method to train a neural network over massively distributed data. The most widely used and accepted approach for the training and aggregation process is FedAvg McMahan et al. (2017). FedAvg is appealing for many reasons, such as avoiding the cost of collecting data in a centralized location and enabling effective parallelization across computing units Verbraeken et al. (2019). It has thus been applied to a wide range of research areas, including distributed learning frameworks for vehicular networks Samarakoon et al. (2020) and IoT devices Yang et al. (2020), and even as a privacy-preserving method for medical records Brisimi et al. (2018).

One major issue with the application of FL is the performance degradation that occurs with heterogeneous data, i.e., settings in which data is not independent and identically distributed (non-IID) across clients. The drop in performance is attributed to a disagreement in local optima: because each client trains its copy of the neural network on its own local data, the resulting average can stray from the true optimum. Unfortunately, non-IID data is realistic to expect in many real-world applications Kairouz et al. (2021); Hsu et al. (2019b). In light of this, many works have attempted to address the problem by regularizing the entire model during the training process Li et al. (2020); Karimireddy et al. (2020); Li et al. (2021). However, we argue that these works are based on a limited understanding of neural networks.

In this work, we present FedCKA to address these limitations. First, we show that regularizing the first two naturally similar layers is most important for improving performance in non-IID settings. Previous works regularized each individual layer. Not only is this ineffective for training, it also limits scalability as the number of layers in a model increases. By regularizing only these important layers, performance improves beyond previous works. Efficiency and scalability are also improved, as we do not need to calculate regularization terms for every layer. Second, we show that Centered Kernel Alignment (CKA) is most accurate when comparing the representational similarity between layers of neural networks. Previous works added a regularization term that compares the representations of neural networks with simple inner products, such as the l2-distance (FedProx) or cosine similarity (MOON). By using CKA to more accurately compare and regularize local updates, we improve performance; hence the name FedCKA. Our contributions are summarized as follows:

  • We improve performance in heterogeneous settings. By building on the most up-to-date understanding of neural networks, we apply layer-wise regularization to only important layers.

  • We improve the efficiency and scalability of regularization. By regularizing only important layers, FedCKA is the only regularization method whose training times remain comparable to FedAvg.

Figure 1: The typical steps of Federated Learning

2 Related Works

2.1 Layers in Neural Networks

Understanding the function of layers in a neural network is an under-researched field of deep learning. It is, however, an important prerequisite for the application of layer-wise regularization. We build our work on the findings of two relevant papers.

The first work Zhang et al. (2019) showed that there are certain 'critical' layers that define a model's performance. In particular, when layers were re-initialized back to their original weights, re-initializing 'critical' layers heavily decreased performance, while re-initializing 'robust' layers had minimal impact. This work drew several relevant conclusions. First, the very first layer of a neural network is most sensitive to re-initialization. Second, robustness is not correlated with the l2-norm or l∞-norm between initial and trained weights. Considering these conclusions, we understand that certain layers are not important in defining performance. Regularizing these non-important layers would be ineffective, and may even hurt performance.

The second work Kornblith et al. (2019) introduced Centered Kernel Alignment (CKA) as a metric for measuring the similarity between layers of neural networks. In particular, the work showed that metrics for comparing the representations of neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. This work drew one very relevant conclusion: for neural networks trained on different datasets, early layers, but not late layers, learn similar representations. Considering this conclusion, if we are to properly regularize neural networks trained on different datasets, we should focus on layers that are naturally similar, and not on those that are naturally different.

2.2 Federated Learning with Non-IID Data

Federated Learning typically progresses by repeating four steps, as shown in Figure 1: 1) a centralized or de-centralized server broadcasts a model (the global model) to each of its clients; 2) each client trains its copy of the model (the local model) on its local data; 3) each client uploads its trained model to the server; 4) the server aggregates the trained models into a single model and prepares it to be broadcast in the next round. These steps are repeated until convergence or other criteria are met.
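The four steps above can be sketched as a minimal, framework-agnostic round. This is a toy sketch: models are flat lists of floats, `local_train` is a stub standing in for step 2's gradient descent, and all function names here are illustrative, not from the paper.

```python
# Minimal sketch of one Federated Learning round (FedAvg-style aggregation).
# Models are flat lists of floats; local training is stubbed out.

def local_train(global_weights, client_data):
    # Placeholder for step 2: a real client would run SGD on its local data.
    # Here we just shift the weights by the client's mean data value.
    shift = sum(client_data) / len(client_data)
    return [w + 0.01 * shift for w in global_weights]

def weighted_avg(client_weights, client_sizes):
    # Step 4: average client models, weighted by local dataset size.
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(cw[j] * n for cw, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

def fl_round(global_weights, clients):
    # Steps 1-3: broadcast, local training, upload.
    trained = [local_train(global_weights, data) for data in clients]
    sizes = [len(data) for data in clients]
    # Step 4: aggregate into the next global model.
    return weighted_avg(trained, sizes)

clients = [[1.0, 2.0], [3.0, 3.0, 3.0]]   # two clients with different data sizes
global_weights = [0.0, 0.0]
new_global = fl_round(global_weights, clients)
```

Repeating `fl_round` until convergence completes the loop; the regularization methods discussed next all modify the `local_train` step.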

Works that improve performance on non-IID data generally fall into two categories: the first regularizes or modifies the client training process (step 2); the second modifies the aggregation process (step 4). Here, we focus on the former, as it is more closely related to our work. Namely, we focus on FedProx Li et al. (2020), SCAFFOLD Karimireddy et al. (2020), and MOON Li et al. (2021), all of which add a regularization term to the default FedAvg McMahan et al. (2017) training process.

FedAvg was the first work to introduce Federated Learning. Each client trains a model using a gradient descent loss function, and the server averages the trained models based on the number of data samples each client holds. However, due to the performance degradation in non-IID settings, many works have added a regularization term to the default FedAvg training process. The objective of these methods is to decrease the disagreement in local optima by penalizing local updates that stray too far from the global model. FedProx adds a proximal regularization term that calculates the l2-distance between the local and global models. SCAFFOLD adds a control variate regularization term that induces variance reduction on local updates based on the updates of other clients. Most recent, and most similar to our work, is MOON. MOON adds a contrastive regularization term that calculates the cosine similarity between the MLP projections of the local and global models. The work takes inspiration from contrastive learning, in particular SimCLR Chen et al. (2020). The intuition is that the global model is less biased than local models, so local updates should be more similar to the global model than to past local models. One difference to note is that while contrastive learning trains a model using the projections of one model on many different images (i.e., one model, different data), MOON regularizes a model using the projections of different models on the same images (i.e., three models, same data).

Overall, these works add a regularization term that compares all layers of the neural network. However, we argue that only important layers should be regularized. Late layers are naturally dissimilar when trained on different datasets, so regularizing a model based on these naturally dissimilar late layers would be ineffective. Rather, it is beneficial to focus only on the earlier layers of the model.

3 FedCKA

3.1 Regularizing Naturally Similar Layers

FedCKA is designed on the principle that naturally similar, but not naturally dissimilar, layers should be regularized. This is based on the premise that early layers, but not late layers, develop similar representations when trained on different datasets Kornblith et al. (2019). We verify this in a Federated Learning environment. Using a small Convolutional Neural Network, we trained 10 clients for 20 communication rounds on independently and identically distributed (IID) subsets of the CIFAR-10 Krizhevsky (2009) dataset. After training, we measured the similarity between each layer of the local and global models with Centered Kernel Alignment Kornblith et al. (2019) on the CIFAR-10 test set. The similarity of each layer between local and global models is shown in Figure 2. We verify that early layers, but not late layers, develop similar representations even in the most optimal Federated Learning setting, where the distribution of data across clients is IID.

The objective of regularizing local updates is to penalize updates that stray from the global model. However, late layers are naturally dissimilar even in optimal Federated Learning settings; regularizing them would penalize updates that may have been beneficial to training. Thus, FedCKA regularizes only the first two naturally similar layers. For convolutional neural networks without residual blocks, these are the two layers closest to the input. For ResNets He et al. (2016), they are the initial convolutional layer and the first post-residual block; as noted in Kornblith et al. (2019), post-residual layers, but not layers within residual blocks, develop similar representations. This distinguishes FedCKA from previous works, which regularized local updates based on all layers, and makes FedCKA much more scalable than other methods. The computational overhead of previous works increases rapidly in proportion to the number of parameters, because all layers are regularized. FedCKA keeps the overhead nearly constant, as we regularize only two layers close to the input.

3.2 Measuring Layer-wise Similarity

FedCKA is designed to regularize dissimilar updates in layers that should naturally be similar. However, there is currently no standard for measuring the similarity of layers between neural networks. While there are classical univariate and multivariate methods for comparing matrices, these are not suitable for comparing the layers and representations of different neural networks Kornblith et al. (2019). As for norms, Zhang et al. (2019) concluded that a layer's robustness to re-initialization is not correlated with the l2-norm or l∞-norm. This suggests that using these norms to regularize dissimilar updates, as in previous works, may be inaccurate.

Kornblith et al. (2019) concluded that similarity metrics for comparing the representations of different neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. The work introduced Centered Kernel Alignment (CKA), and showed that the metric is most consistent in measuring the similarity between representations of neural networks. Thus, FedCKA regularizes local updates using CKA as its similarity measure.

Figure 2: CKA similarity comparison between each client and the global model (Refer to Experimental Setup for more information on setup)

3.3 Modifications to FedAvg

FedCKA adds a regularization term to the local training process of the default FedAvg algorithm, keeping the entire framework simple. Alg 1 and Fig 3 show the FedCKA framework in algorithm and figure form, respectively. More formally, we add ℓ_con as a regularization term to the FedAvg training algorithm. The local loss function is shown in Eq 1.

ℓ = ℓ_sup + μ · ℓ_con    (1)

Here, ℓ_sup is the cross entropy loss, and μ is a hyper-parameter to control the strength of the regularization term, ℓ_con, in proportion to ℓ_sup. ℓ_con is shown in more detail in Eq 2.

The formula of ℓ_con is a slight modification of the contrastive loss used in SimCLR Chen et al. (2020). There are four main differences. First, SimCLR uses the representations of one model on different samples in a batch to calculate contrastive loss; FedCKA uses the representations of three models on the same samples in a batch to calculate ℓ_con. Here, z, z_prev, and z_glob are the representations of client i's current local model, client i's previous-round local model, and the current global model, respectively. Second, SimCLR uses a temperature parameter τ to increase performance on difficult samples; FedCKA excludes τ, as it was not seen to help performance. Third, SimCLR uses cosine similarity to measure the similarity between representations; FedCKA uses CKA as its measure of similarity. Fourth, SimCLR calculates the contrastive loss once per batch, using the representations of the projection head; FedCKA calculates ℓ_con K times per batch, using the representations of the first K naturally similar layers, indexed by k, and averages the loss over the number of layers to regularize. K is set to two by default unless otherwise stated.
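The layer-averaged contrastive term can be sketched as follows. This is a minimal sketch, assuming precomputed similarity scores per layer (e.g. CKA values); the function names are illustrative, and a real implementation would compute the similarities from model activations.

```python
import math

def contrastive_term(sim_pos, sim_neg):
    # One layer's contrastive loss: pull the local representation toward the
    # global model (positive pair) and away from the previous-round local
    # model (negative pair). No temperature parameter, per FedCKA.
    pos = math.exp(sim_pos)
    neg = math.exp(sim_neg)
    return -math.log(pos / (pos + neg))

def l_con(sims_to_global, sims_to_previous):
    # Average the contrastive term over the K regularized layers (K = 2 by default).
    terms = [
        contrastive_term(p, n)
        for p, n in zip(sims_to_global, sims_to_previous)
    ]
    return sum(terms) / len(terms)

# Hypothetical similarity values (e.g. CKA scores in [0, 1]) for two layers:
loss = l_con([0.9, 0.8], [0.4, 0.3])
```

The loss shrinks as the local layers move closer to the global model than to the previous local model, which is the behavior Eq 2 rewards.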

Figure 3: Training Process of FedCKA

Input: number of communication rounds R, number of clients C, number of local epochs E, loss weighting variable μ, learning rate η
Output: the trained global model w_glob

1:  Initialize w_glob
2:  for each round r = 1, 2, ..., R do
3:     for each client i = 1, 2, ..., C do
4:        w_i ← LocalUpdate(i, w_glob)
5:     end for
6:     collect w_1, w_2, ..., w_C
7:     w_glob ← WeightedAvg(w_1, w_2, ..., w_C)
8:  end for
9:  return w_glob
10:  LocalUpdate(i, w_glob):
11:  w_i ← w_glob
12:  for each epoch e = 1, 2, ..., E do
13:     for each batch b of client i's local data do
14:        ℓ_sup ← CrossEntropyLoss(w_i; b)
15:        ℓ_con ← contrastive regularization term (Eq 2)
16:        ℓ ← ℓ_sup + μ · ℓ_con
17:        w_i ← w_i − η ∇ℓ
18:     end for
19:  end for
20:  return solution w_i
21:  WeightedAvg(w_1, w_2, ..., w_C):
22:  Initialize w_glob
23:  for each client i do
24:     w_glob ← w_glob + (n_i / n) · w_i
25:  end for
26:  return w_glob
Algorithm 1 FedCKA
ℓ_con = −(1/K) Σ_{k=1..K} log [ exp(CKA(z_k, z_glob,k)) / ( exp(CKA(z_k, z_glob,k)) + exp(CKA(z_k, z_prev,k)) ) ]    (2)

As per Kornblith et al. (2019), linear CKA is shown in Eq 3, where X and Y are the centered representation matrices of the two layers being compared.

CKA(X, Y) = ||Yᵀ X||²_F / ( ||Xᵀ X||_F · ||Yᵀ Y||_F )    (3)

While Kornblith et al. (2019) also presented a method to use kernels with CKA, we use the linear variant, as it is more computationally efficient while having minimal impact on accuracy.
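Linear CKA (Eq 3) is straightforward to implement; the following is a minimal NumPy sketch (shapes and variable names are illustrative), which also demonstrates the invariance properties mentioned above.

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, n_features) representation matrices of two layers.
    # Center each feature (column) before comparing, as CKA requires.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))

# A representation compared with itself is maximally similar ...
same = linear_cka(X, X)
# ... and CKA is invariant to isotropic scaling.
scaled = linear_cka(X, 3.0 * X)
```

Both `same` and `scaled` evaluate to 1.0 (up to floating-point error), illustrating the scaling invariance that makes CKA suitable for comparing layers of independently trained models.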

Figure 4: Distribution of the CIFAR-10 dataset across 10 clients according to the Dirichlet distribution, for three values of the concentration parameter β (panels (a)-(c)). The x-axis shows the index of the client, and the y-axis shows the index of the class (label). As β approaches 0, the heterogeneity of the class distribution increases.

4 Experimental Results and Analysis

4.1 Experiment Setup

We compare FedCKA with the current state-of-the-art, MOON Li et al. (2021), as well as FedAvg McMahan et al. (2017), FedProx Li et al. (2020), and SCAFFOLD Karimireddy et al. (2020). We purposefully use an experimental setup similar to MOON's, both because it is the most recent work and because it reports the highest performance. In particular, the CIFAR-10, CIFAR-100 Krizhevsky (2009), and Tiny ImageNet Li et al. (2014) datasets are used to test the performance of all methods.

For CIFAR-10, we use a small Convolutional Neural Network. The base encoder is two 5x5 convolutional layers with 16 and 32 channels respectively, each followed by a 2x2 max-pooling layer. A projection head of four fully connected layers, with 120, 84, 84, and 256 neurons, follows the encoder. The final layer is the output layer with one neuron per class. Although FedCKA and other works can perform without this projection head, we include it because MOON shows a high discrepancy in performance without it. For CIFAR-100 and Tiny ImageNet, we use ResNet-50 He et al. (2016), and also add the projection head before the output layer, as per MOON.

We use the cross entropy loss, and SGD as our optimizer with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.00001. Local epochs are set to 10. These are also the parameters used in MOON. Small changes we made concern the batch size and the number of communication rounds: we use a constant batch size of 128, and train for 100 communication rounds on CIFAR-10, 40 on CIFAR-100, and 20 on Tiny ImageNet. We use fewer communication rounds for the latter two datasets because the ResNet-50 model over-fit quite quickly.

As in many previous works, we use the Dirichlet distribution to simulate heterogeneous settings Hsu et al. (2019a); Lin et al. (2021); Li et al. (2021). The concentration parameter β controls the strength of heterogeneity: as β approaches 0, the distribution becomes most heterogeneous, and as β grows large, it becomes non-heterogeneous. We report results for β = 5.0, 0.5, and 0.1, similar to MOON. Figure 4 shows the distribution of data across clients on the CIFAR-10 dataset for the different values of β. All experiments were conducted using the PyTorch Paszke et al. (2019) library on a single GTX Titan V and four Intel Xeon Gold 5115 processors.
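A common way to realize this Dirichlet-based split is sketched below (a minimal NumPy sketch of the standard technique, not the paper's exact code; the function name and seed are illustrative): for each class, the fraction of its samples given to each client is drawn from Dirichlet(β).

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta, seed=0):
    # Split sample indices across clients; for each class, the proportion
    # assigned to each client is drawn from a Dirichlet(beta) distribution.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        proportions = rng.dirichlet(np.full(n_clients, beta))
        # Convert proportions to split points within this class's samples.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, part in enumerate(np.split(cls_idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy example: 1000 samples, 10 classes, 10 clients, moderately non-IID.
labels = np.arange(1000) % 10
parts = dirichlet_partition(labels, n_clients=10, beta=0.5)
```

Smaller β concentrates each class on a few clients, reproducing the skew visible in Figure 4.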

4.2 Accuracy

FedCKA adds a hyperparameter μ to control the strength of ℓ_con. We tune μ from [3, 5, 10], and report the best results. MOON and FedProx also have a μ term, which we likewise tune: for MOON from [0.1, 1, 5, 10] and for FedProx from [0.001, 0.01, 0.1, 1], as used in each work. In addition, for MOON, we use the temperature parameter τ as reported in their work.

Method      CIFAR-10   CIFAR-100   Tiny ImageNet
FedAvg      64.37%     37.41%      19.49%
FedProx     64.58%     37.81%      20.93%
SCAFFOLD    64.33%     39.16%      21.18%
MOON        65.25%     38.37%      21.29%
FedCKA      67.86%     40.07%      21.46%
Table 1: Accuracy across Datasets (β = 5.0)

Table 1 shows the performance across CIFAR-10, CIFAR-100, and Tiny ImageNet with β = 5.0. For FedProx, MOON, and FedCKA, we report performance with the best μ. For FedCKA, the best μ is 3, 10, and 3 for CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. For MOON, the best μ is 10, 5, and 0.1. For FedProx, the best μ is 0.001, 0.1, and 0.1. Table 2 shows the performance on the CIFAR-10 dataset across increasing heterogeneity, β = 5.0, 0.5, and 0.1. For FedCKA, the best μ is 5, 3, and 3 for each β, respectively. For MOON, the best μ is 0.1, 10, and 10. For FedProx, the best μ is 0.001, 0.1, and 0.001.

Method      β = 5.0   β = 0.5   β = 0.1
FedAvg      64.37%    59.81%    50.43%
FedProx     64.58%    59.98%    51.07%
SCAFFOLD    64.33%    59.47%    40.53%
MOON        65.25%    60.65%    51.63%
FedCKA      67.86%    61.13%    52.35%
Table 2: Accuracy across β (CIFAR-10)

We observe that FedCKA consistently outperforms previous methods across different datasets and different values of β. FedCKA improves performance in heterogeneous settings by regularizing layers that are naturally similar, and not layers that are naturally dissimilar. It is also interesting that FedCKA performs better by a larger margin when β is larger. This is likely because the global model is less biased as the data distribution approaches the IID setting, and can thus more effectively regularize updates. We also observe that the other works consistently improve performance, albeit by a smaller margin than FedCKA. FedProx and SCAFFOLD improve performance likely owing to their inclusion of naturally similar layers in regularization; the gain is lower because they also include naturally dissimilar layers. MOON improves performance over FedProx and SCAFFOLD likely owing to its use of a contrastive loss. That is, MOON shows that neural networks should be trained to be more similar to the global model than to past local models, rather than only blindly similar to the global model. By regularizing naturally similar layers using a contrastive loss based on CKA, FedCKA outperforms all methods.

Note that across most methods and settings, there are discrepancies with the accuracy reported by MOON Li et al. (2021). In particular, MOON reports higher accuracy across all methods, although the model architectures are similar, if not equivalent. We suspect that data augmentation was used to increase accuracy; we could not test these settings, as MOON did not report the corresponding parameters. We thus report results without data augmentation.

4.3 Regularizing Only Important Layers

We study the effects of regularizing different numbers of layers. Using the CIFAR-10 dataset with β = 5.0, we vary the number of layers to regularize by changing K in Eq 2, and report the accuracy in Figure 5. Accuracy is highest when only the first two layers are regularized. This verifies our claim that only naturally similar, but not naturally dissimilar, layers should be regularized (Figure 2). In addition, note the dotted line representing the upper bound for Federated Learning: when the same model is trained on a centralized server with the whole CIFAR-10 dataset, accuracy is 70%. FedCKA with regularization on the first two naturally similar layers nearly reaches this upper bound.

Figure 5: Accuracy with respect to the number of layers regularized on CIFAR-10 with β = 5.0

Similarity Metric   Accuracy   Training Duration (s)
None (FedAvg)       64.37%     54.82
Frobenius Norm      65.54%     64.73
Vectorized Cosine   66.67%     65.75
Kernel CKA          67.93%     122.41
Linear CKA          67.86%     104.17
Table 3: Accuracy and training duration of FedCKA with different similarity metrics (CIFAR-10)

4.4 Using the Best Similarity Metric

We study the effects of regularizing the first two naturally similar layers with different similarity metrics. Using the CIFAR-10 dataset with β = 5.0, we replace the CKA measure in Eq 2 with three alternatives: kernel CKA, introduced in Kornblith et al. (2019); the squared Frobenius norm; and the vectorized cosine similarity. We compare the results with these metrics as well as the baseline, FedAvg, in Table 3.

We observe that performance is highest when CKA is used, likely owing to its accuracy in measuring similarity: only truly dissimilar updates are penalized, which improves performance. While kernel CKA slightly outperforms linear CKA, considering the computational overhead, we opt for linear CKA. We also observe that the squared Frobenius norm and vectorized cosine similarity decrease performance only slightly, and still outperform most previous works. This verifies that while it is important to use an accurate similarity measure, it is more important to regularize only the naturally similar layers.
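For reference, the two alternative metrics in Table 3 can be sketched as follows (a toy NumPy sketch on random matrices; shapes and function names are illustrative assumptions, not the paper's code).

```python
import numpy as np

def squared_frobenius(X, Y):
    # Squared Frobenius norm of the difference between two representation
    # matrices (a distance: lower means more similar).
    return float(np.sum((X - Y) ** 2))

def vectorized_cosine(X, Y):
    # Flatten each representation matrix and take the cosine of the two
    # resulting vectors (a similarity: higher means more similar).
    x, y = X.ravel(), Y.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))

identical = vectorized_cosine(X, X)   # maximal similarity
no_gap = squared_frobenius(X, X)      # zero distance
```

Unlike CKA, both measures change under orthogonal transformations and rescaling of the representations, which is one plausible reason they regularize less accurately.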

Method      7 Layers   Time Extended   50 Layers   Time Extended
FedAvg      54.82      -               638.79      -
SCAFFOLD    57.19      2.37            967.04      328.25
FedProx     57.20      2.38            862.12      223.33
MOON        97.58      42.76           1689.28     1050.49
FedCKA      104.17     49.35           750.97      112.18
Table 4: Average Training Duration Per Communication Round (in seconds)

4.5 Efficiency and Scalability

Efficient and scalable local training is an important engineering principle of Federated Learning. That is, for Federated Learning to be applied to real-world applications, we must assume that clients have limited computing resources. Thus, we analyze the local training time of all methods, as shown in Table 4. Note that FedAvg is the lower bound for training time, since all other methods add a regularization term.

For a 7-layer CNN trained on CIFAR-10, the training times of all methods are fairly similar. FedCKA extends training the most, as the matrix multiplication required to calculate CKA similarity is expensive relative to the forward and backward passes of this small model. For ResNet-50 trained on Tiny ImageNet, however, the training times of FedProx, SCAFFOLD, and MOON increase sharply; only FedCKA remains comparable to FedAvg. This is because FedProx and SCAFFOLD perform expensive operations on the weights of every layer, and MOON performs forward propagation on three models up to the penultimate layer. All of these operations grow rapidly as the number of layers increases. While FedCKA also performs forward propagation on three models, the number of regularized layers remains fixed, making it the most efficient on medium-sized models.

We emphasize that regularization must remain scalable for Federated Learning to be applied to state-of-the-art models. Even on ResNet-50, which is no longer considered a large model, other Federated Learning regularization methods lack scalability. This makes it difficult to test these methods on current state-of-the-art models such as ViT Dosovitskiy et al. (2021), with 1.843 billion parameters, or on slightly older models such as EfficientNet-B7 Tan and Le (2019), with 813 layers.

5 Conclusion and Future Work

Improving the performance of Federated Learning on heterogeneous data is a widely researched topic. However, many previous works have incorrectly assumed that regularizing every layer of a neural network during local training is the best method to increase performance. We propose FedCKA, built on the most up-to-date understanding of neural networks. By regularizing naturally similar, but not naturally dissimilar, layers during local training, performance improves beyond previous works. We also show that FedCKA is the only existing regularization method with adequate scalability when trained with a moderate-sized model.

FedCKA shows that properly regularizing important layers improves the performance of Federated Learning on heterogeneous data. However, standardizing the comparison of neural networks remains an important step toward a deeper understanding of neural networks, and there are open questions about the accuracy of CKA in measuring similarity in models such as Transformers or Graph Neural Networks. We leave these topics for future work.

References

  • T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. Ch. Paschalidis, and W. Shi (2018) Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics 112, pp. 59–67. External Links: ISSN 1386-5056, Document, Link Cited by: §1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. External Links: 2002.05709 Cited by: §2.2, §3.3.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, External Links: Link Cited by: §4.5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 770–778. Cited by: §3.1, §4.1.
  • T. H. Hsu, H. Qi, and M. Brown (2019a) Measuring the effects of non-identical data distribution for federated visual classification. External Links: 1909.06335 Cited by: §4.1.
  • T. H. Hsu, H. Qi, and M. Brown (2019b) Measuring the effects of non-identical data distribution for federated visual classification. External Links: 1909.06335 Cited by: §1.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao (2021) Advances and open problems in federated learning. External Links: 1912.04977 Cited by: §1.
  • S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh (2020) SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5132–5143. External Links: Link Cited by: §1, §2.2, §4.1.
  • S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 3519–3529. External Links: Link Cited by: §2.1, §3.1, §3.1, §3.2, §3.2, §3.3, §3.3, §4.4.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. External Links: Link Cited by: §3.1, §4.1.
  • Y. Lecun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. External Links: Document, Link Cited by: §1.
  • F. Li, A. Karpathy, and J. Johnson (2014) Tiny imagenet. External Links: Link Cited by: §4.1.
  • Q. Li, B. He, and D. Song (2021) Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), pp. 10713–10722. Cited by: §1, §2.2, §4.1, §4.1, §4.2.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. External Links: 1812.06127 Cited by: §1, §2.2, §4.1.
  • T. Lin, L. Kong, S. U. Stich, and M. Jaggi (2021) Ensemble distillation for robust model fusion in federated learning. External Links: 2006.07242 Cited by: §4.1.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In AISTATS, Cited by: §1, §2.2, §4.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.1.
  • S. Samarakoon, M. Bennis, W. Saad, and M. Debbah (2020) Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications 68 (2), pp. 1146–1159. External Links: Document Cited by: §1.
  • T. J. Sejnowski (Ed.) (2018) The deep learning revolution. The MIT Press, Cambridge, Massachusetts. Cited by: §1.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. External Links: Link Cited by: §4.5.
  • J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer (2019) A survey on distributed machine learning. External Links: 1912.09789 Cited by: §1.
  • K. Yang, T. Jiang, Y. Shi, and Z. Ding (2020) Federated learning via over-the-air computation. IEEE Transactions on Wireless Communications 19 (3), pp. 2022–2035. External Links: Document Cited by: §1.
  • C. Zhang, S. Bengio, and Y. Singer (2019) Are all layers created equal?. In ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena, Long Beach, California, United States. Cited by: §2.1, §3.2.