The success of deep learning in a plethora of fields has led to a vast body of research that leverages its strengths LeCun et al. (2015). One main outcome of this success is the mass collection of data Sejnowski (2018). As data collection grows at a rate much faster than the computing performance and storage capacity of consumer products, it is becoming progressively difficult to deploy trained state-of-the-art models within a reasonable budget.
Federated Learning (FL) McMahan et al. (2017) has been introduced as a method to train a neural network with massively distributed data. The most widely used and accepted approach for the training and aggregation process is FedAvg McMahan et al. (2017). FedAvg is appealing for many reasons, such as avoiding the cost of collecting data into a centralized location and enabling effective parallelization across computing units Verbraeken et al. (2019). It has thus been applied to a wide range of research, including distributed learning frameworks for vehicular networks Samarakoon et al. (2020) and IoT devices Yang et al. (2020), and even privacy-preserving learning over medical records Brisimi et al. (2018).
One major issue with the application of FL is the performance degradation that occurs with heterogeneous data. This refers to settings in which data is not independent and identically distributed (non-IID) across clients. The drop in performance appears to be caused by a disagreement in local optima: because each client trains its copy of the neural network on its own local data, the resulting average can stray from the true optimum. Unfortunately, it is realistic to expect non-IID data in many real-world applications Kairouz et al. (2021); Hsu et al. (2019b). In light of this, many works have attempted to address this problem by regularizing the entire model during the training process Li et al. (2020); Karimireddy et al. (2020); Li et al. (2021). However, we argue that these works are based on a limited understanding of neural networks.
In this work, we present FedCKA to address these limitations. First, we show that regularizing the first two naturally similar layers is most important for improving performance in non-IID settings. Previous works regularized each individual layer. Not only is this ineffective for training, it also limits scalability as the number of layers in a model increases. By regularizing only these important layers, performance improves beyond previous works. Efficiency and scalability are also improved, as we do not need to calculate regularization terms for every layer. Second, we show that Centered Kernel Alignment (CKA) is the most accurate way to compare the representational similarity between layers of neural networks. Previous works added a regularization term by comparing the representations of neural networks with simple inner products such as the ℓ2-distance (FedProx) or cosine similarity (MOON). By using CKA to more accurately compare and regularize local updates, we improve performance; hence the name FedCKA. Our contributions are summarized as follows:
We improve performance in heterogeneous settings. By building on the most up-to-date understanding of neural networks, we apply layer-wise regularization to only important layers.
We improve the efficiency and scalability of regularization. By regularizing only important layers, FedCKA is the only method whose training times remain comparable to FedAvg.
2 Related Works
2.1 Layers in Neural Networks
Understanding the function of layers in a neural network is an under-researched area of deep learning. It is, however, an important prerequisite for the application of layer-wise regularization. We build on the findings of two relevant papers.
The first work Zhang et al. (2019) showed that there are certain 'critical' layers that define a model's performance. In particular, re-initializing 'critical' layers back to their original weights heavily decreased performance, while re-initializing 'robust' layers had minimal impact. This work drew two relevant conclusions. First, the very first layer of a neural network is the most sensitive to re-initialization. Second, robustness is not correlated with the ℓ2-norm or ℓ∞-norm of the difference between initial and trained weights. From these conclusions, we understand that certain layers are not important in defining performance. Regularizing these unimportant layers would be ineffective, and may even hurt performance.
The second work Kornblith et al. (2019) introduced Centered Kernel Alignment (CKA) as a metric for measuring the similarity between layers of neural networks. In particular, the work showed that metrics comparing the representations of neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. This work drew one very relevant conclusion: for neural networks trained on different datasets, early layers, but not late layers, learn similar representations. Considering this conclusion, if we are to properly regularize neural networks trained on different datasets, we should focus on layers that are naturally similar, and not on those that are naturally different.
2.2 Federated Learning with Non-IID Data
Federated Learning typically progresses by repeating four steps, as shown in Figure 1: 1) a centralized or decentralized server broadcasts a model (the global model) to each of its clients; 2) each client trains its copy of the model (the local model) on its local data; 3) each client uploads its trained model to the server; 4) the server aggregates the trained models into a single model and prepares it to be broadcast in the next round. These steps are repeated until convergence or other criteria are met.
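The four steps above can be sketched in a few lines. This is a minimal illustration with plain numpy arrays standing in for model parameters; the function names and the list-of-arrays representation are our own, not the paper's implementation.

```python
import numpy as np

def fedavg_round(global_weights, clients, local_train):
    """One FedAvg communication round (steps 1-4), with numpy arrays
    standing in for model parameters and `local_train` as a stand-in
    for each client's local optimization procedure."""
    updates, sizes = [], []
    for data in clients:
        # step 1: broadcast a copy of the global model to the client
        local = [w.copy() for w in global_weights]
        # step 2: client trains its local copy on its own data
        local = local_train(local, data)
        # step 3: client uploads its trained model
        updates.append(local)
        sizes.append(len(data))
    total = float(sum(sizes))
    # step 4: aggregate, weighting each client by its sample count
    return [sum(n / total * upd[i] for n, upd in zip(sizes, updates))
            for i in range(len(global_weights))]
```

In a real deployment `local_train` would run several epochs of SGD; here it is deliberately abstract so the aggregation logic (the sample-count weighting of FedAvg) stands out.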
Works that improve performance on non-IID data generally fall into two categories. The first focuses on regularizing or modifying the client training process (step 2). The second focuses on modifying the aggregation process (step 4). Here, we focus on the former, as it is more closely related to our work. Namely, we focus on FedProx Li et al. (2020), SCAFFOLD Karimireddy et al. (2020), and MOON Li et al. (2021), all of which add a regularization term to the default FedAvg McMahan et al. (2017) training process.
FedAvg was the first work to introduce Federated Learning. Each client trains a model using a gradient descent loss function, and the server averages the trained models, weighted by the number of data samples each client holds. However, due to the performance degradation in non-IID settings, many works have added a regularization term to the default FedAvg training process. The objective of these methods is to decrease the disagreement in local optima by limiting local updates that stray too far from the global model. FedProx adds a proximal regularization term that calculates the ℓ2-distance between the local and global models. SCAFFOLD adds a control-variate regularization term that induces variance reduction on local updates based on the updates of other clients. Most recent, and most similar to our work, is MOON. MOON adds a contrastive regularization term that calculates the cosine similarity between the MLP projections of the local and global models. The work takes inspiration from contrastive learning, in particular SimCLR Chen et al. (2020). The intuition is that the global model is less biased than local models, so local updates should be more similar to the global model than to past local models. One difference to note is that while contrastive learning trains a model using the projections of one model on many different images (i.e., one model, different data), MOON regularizes a model using the projections of different models on the same images (i.e., three models, same data).
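Of these regularizers, FedProx's proximal term is the simplest to illustrate. The following is a minimal numpy sketch, not the reference implementation; the default μ value is arbitrary and the list-of-arrays parameter representation is our own convention.

```python
import numpy as np

def fedprox_penalty(local_weights, global_weights, mu=0.01):
    """FedProx proximal term: (mu / 2) * ||w_local - w_global||^2,
    summed over every layer (note that all layers are regularized,
    similar and dissimilar alike)."""
    return 0.5 * mu * sum(
        np.sum((w - g) ** 2)
        for w, g in zip(local_weights, global_weights))
```

During client training this penalty is simply added to the task loss, pulling local updates back toward the broadcast global weights.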
Overall, these works add a regularization term by comparing all layers of the neural network. However, we argue that only important layers should be regularized. Late layers are naturally dissimilar when trained on different datasets. Regularizing a model based on these naturally dissimilar late layers would be ineffective. Rather, it may be beneficial to focus only on the earlier layers of the model.
3.1 Regularizing Naturally Similar Layers
FedCKA is designed on the principle that naturally similar, but not naturally dissimilar, layers should be regularized. This is based on the premise that early layers, but not late layers, develop similar representations when trained on different datasets Kornblith et al. (2019). We verify this in a Federated Learning environment. Using a small convolutional neural network, we trained 10 clients for 20 communication rounds on independent and identically distributed (IID) subsets of the CIFAR-10 Krizhevsky (2009) dataset. After training, we measured the similarity between each layer of the local and global models with Centered Kernel Alignment Kornblith et al. (2019) on the CIFAR-10 test set. The per-layer similarities between local and global models are shown in Figure 2. We verify that early layers, but not late layers, develop similar representations even in the most favorable Federated Learning setting, where the data distribution across clients is IID.
The objective of regularizing local updates is to penalize updates that stray from the global model. However, late layers are naturally dissimilar even in optimal Federated Learning settings; regularizing them would penalize updates that may have been beneficial to training. Thus, FedCKA regularizes only the first two naturally similar layers. For convolutional neural networks without residual blocks, these are the two layers closest to the input. For ResNets He et al. (2016), they are the initial convolutional layer and the first post-residual block; as noted in Kornblith et al. (2019), post-residual layers, but not layers within residual blocks, develop similar representations. This differs from previous works, which regularized local updates based on all layers, and it makes FedCKA much more scalable than other methods. The computational overhead of previous works grows rapidly with the number of parameters, because all layers are regularized; FedCKA keeps the overhead nearly constant, as only two layers close to the input are regularized.
3.2 Measuring Layer-wise Similarity
FedCKA is designed to regularize dissimilar updates in layers that should naturally be similar. However, there is currently no standard for measuring the similarity of layers between neural networks. While there are classical univariate and multivariate methods for comparing matrices, these are not suitable for comparing the layers and representations of different neural networks Kornblith et al. (2019). As for norms, Zhang et al. (2019) concluded that a layer's robustness to re-initialization is not correlated with the ℓ2-norm or ℓ∞-norm. This suggests that using these norms to regularize dissimilar updates, as in previous works, may be inaccurate.
Kornblith et al. (2019) concluded that similarity metrics for comparing the representations of different neural networks should be invariant to orthogonal transformations and isotropic scaling, but not to invertible linear transformations. The work introduced Centered Kernel Alignment (CKA) and showed that the metric is the most consistent in measuring the similarity between representations of neural networks. Thus, FedCKA regularizes local updates using CKA as its similarity measure.
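Linear CKA is straightforward to implement. Below is a minimal numpy sketch of the HSIC-based formulation from Kornblith et al. (2019); the function name is ours.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_examples, n_features). Columns are mean-centered first."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based formulation: ||Y^T X||_F^2 normalized by the
    # Frobenius norms of the two Gram-like matrices
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```

The value lies in [0, 1], and the desired invariances hold by construction: multiplying either input by an orthogonal matrix or a positive scalar leaves the score unchanged, while a general invertible linear transformation does not.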
3.3 Modifications to FedAvg
FedCKA adds a regularization term to the local training process of the default FedAvg algorithm, keeping the entire framework simple. Alg 1 and Fig 3 show the FedCKA framework in algorithm and figure form, respectively. More formally, we add ℓ_CKA as a regularization term to the FedAvg training objective. The local loss function is shown in Eq 1:

ℓ = ℓ_CE + μ · ℓ_CKA. (1)
Here, ℓ_CE is the cross-entropy loss, and μ is a hyper-parameter that controls the strength of the regularization term, ℓ_CKA, in proportion to ℓ_CE. ℓ_CKA is shown in more detail in Eq 2:

ℓ_CKA = -(1/N) Σ_{n=1}^{N} log [ exp(CKA(z_n, z_n^glob)) / (exp(CKA(z_n, z_n^glob)) + exp(CKA(z_n, z_n^prev))) ]. (2)
The formula of ℓ_CKA is a slight modification of the contrastive loss used in SimCLR Chen et al. (2020). There are four main differences. First, SimCLR uses the representations of one model on different samples in a batch to calculate its contrastive loss; FedCKA uses the representations of three models on the same samples in a batch to calculate ℓ_CKA. Here, z_n, z_n^prev, and z_n^glob are the layer-n representations of client i's current local model, client i's previous-round local model, and the current global model, respectively. Second, SimCLR uses a temperature parameter τ to increase performance on difficult samples; FedCKA excludes τ, as it was not seen to help performance. Third, SimCLR uses cosine similarity to measure the similarity between representations; FedCKA uses CKA. Fourth, SimCLR calculates its contrastive loss once per batch, using the representations of the projection head; FedCKA calculates ℓ_CKA N times per batch, using the representations of the first N naturally similar layers, indexed by n, and averages the loss over the number of regularized layers. N is set to two by default unless otherwise stated.
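The contrastive CKA regularizer described above can be sketched in numpy. Helper names and representation shapes are our own assumptions, not the paper's implementation; each argument is a list of per-layer activation matrices computed on the same batch.

```python
import numpy as np

def _linear_cka(X, Y):
    """Linear CKA between (n_examples, n_features) matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') *
                   np.linalg.norm(Y.T @ Y, 'fro'))

def fedcka_loss(cur, prev, glob):
    """Contrastive regularizer over the first N naturally similar
    layers. cur / prev / glob hold the representations of the current
    local, previous-round local, and global models on the same batch.
    Similarity to the global model is the positive pair; similarity
    to the past local model is the negative pair."""
    losses = []
    for c, p, g in zip(cur, prev, glob):
        pos = np.exp(_linear_cka(c, g))  # pull toward global model
        neg = np.exp(_linear_cka(c, p))  # push away from past local
        losses.append(-np.log(pos / (pos + neg)))
    return float(np.mean(losses))
```

The loss falls as the current local representations become more similar to the global model's than to the previous round's, matching the intuition that the global model is the less biased anchor.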
While Kornblith et al. (2019) also presented a method to use kernels with CKA, we use the linear variant, as it is more computationally efficient, while having minimal impact on accuracy.
4 Experimental Results and Analysis
4.1 Experiment Setup
We compare FedCKA with the current state-of-the-art, MOON Li et al. (2021), as well as FedAvg McMahan et al. (2017), FedProx Li et al. (2020), and SCAFFOLD Karimireddy et al. (2020). We purposefully use an experimental setup similar to MOON's, both because it is the most recent work and because it reports the highest performance. In particular, the CIFAR-10, CIFAR-100 Krizhevsky (2009), and Tiny ImageNet Li et al. (2014) datasets are used to test the performance of all methods.
For CIFAR-10, we use a small convolutional neural network. Two 5x5 convolutional layers form the base encoder, with 16 and 32 channels respectively, each followed by a 2x2 max-pooling layer. A projection head of four fully connected layers follows the encoder, with 120, 84, 84, and 256 neurons. The final layer is the output layer, with one neuron per class. Although FedCKA and other works can perform without this projection head, we include it because MOON reports a high discrepancy in performance without it. For CIFAR-100 and Tiny ImageNet, we use ResNet-50 He et al. (2016), again adding the projection head before the output layer, as in MOON.
We use the cross-entropy loss and SGD as our optimizer, with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.00001. Local epochs are set to 10. These are the same parameters used in MOON. The only small changes we made concern the batch size and communication rounds: we use a constant batch size of 128, and train for 100 communication rounds on CIFAR-10, 40 on CIFAR-100, and 20 on Tiny ImageNet. We use fewer communication rounds for the latter two datasets because the ResNet-50 model over-fit quite quickly.
As in many previous works, we use the Dirichlet distribution to simulate heterogeneous settings Hsu et al. (2019a); Lin et al. (2021); Li et al. (2021). The parameter β controls the strength of heterogeneity, with β → 0 being most heterogeneous and β → ∞ approaching the IID setting. We report results for β ∈ {5.0, 0.5, 0.1}, similar to MOON. Figure 4 shows the distribution of data across clients on the CIFAR-10 dataset under each β. All experiments were conducted using the PyTorch Paszke et al. (2019) library on a single GTX Titan V GPU and four Intel Xeon Gold 5115 processors.
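The Dirichlet partitioning scheme used to simulate heterogeneity can be sketched as follows; the function name and seeding convention are our own, and this is a simplification of the samplers used in the cited works.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta, seed=0):
    """Split sample indices across clients so that each class's share
    per client follows Dirichlet(beta). Smaller beta concentrates a
    class on few clients (more heterogeneous); large beta approaches
    an even, IID-like split."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # per-client proportions for this class
        props = rng.dirichlet([beta] * n_clients)
        # cumulative proportions -> split points within this class
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx
```

Every sample is assigned to exactly one client, so the union of the returned index lists reconstructs the full dataset.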
FedCKA adds a hyper-parameter μ to control the strength of ℓ_CKA. We tune μ over [3, 5, 10] and report the best results. MOON and FedProx also have a μ term, which we likewise tune: for MOON over [0.1, 1, 5, 10] and for FedProx over [0.001, 0.01, 0.1, 1], as used in each work. In addition, for MOON, we use the temperature τ reported in their work.
Table 1 shows the performance across CIFAR-10, CIFAR-100, and Tiny ImageNet with β = 0.5. For FedProx, MOON, and FedCKA, we report performance with the best μ. For FedCKA, the best μ is 3, 10, and 3 for CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. For MOON, it is 10, 5, and 0.1; for FedProx, 0.001, 0.1, and 0.1. Table 2 shows the performance under increasing heterogeneity on the CIFAR-10 dataset with β ∈ {5.0, 0.5, 0.1}. For FedCKA, the best μ is 5, 3, and 3 for each β, respectively. For MOON, it is 0.1, 10, and 10; for FedProx, 0.001, 0.1, and 0.001.
Method | β = 5.0 | β = 0.5 | β = 0.1
We observe that FedCKA consistently outperforms previous methods across different datasets and different β. FedCKA improves performance in heterogeneous settings by regularizing layers that are naturally similar, and not layers that are naturally dissimilar. It is also interesting that FedCKA wins by a larger margin when β is larger. This is likely because the global model is less biased as the data distribution approaches the IID setting, and can thus more effectively regularize updates. We also observe that the other works consistently improve performance, albeit by a smaller margin than FedCKA. FedProx and SCAFFOLD improve performance likely because they include the naturally similar layers in regularization; the gain is smaller because they also include naturally dissimilar layers. MOON improves on FedProx and SCAFFOLD likely owing to its contrastive loss: MOON shows that neural networks should be trained to be more similar to the global model than to past local models, rather than only blindly similar to the global model. By regularizing naturally similar layers with a contrastive loss based on CKA, FedCKA outperforms all methods.
Note that across most methods and settings, there are discrepancies from the accuracies reported by MOON Li et al. (2021). In particular, MOON reports higher accuracy across all methods, although the model architectures are similar, if not equivalent. We suspect that data augmentation was used to increase accuracy. We could not replicate these settings, as MOON did not report its augmentation parameters. We thus report results without data augmentation.
4.3 Regularizing Only Important Layers
We study the effect of regularizing different numbers of layers. Using the CIFAR-10 dataset with β = 0.5, we vary the number of layers to regularize; formally, we change N in Eq 2, and report the accuracy in Figure 5. Accuracy is highest when only the first two layers are regularized (N = 2). This verifies our claim that naturally similar, but not naturally dissimilar, layers should be regularized (Figure 2). In addition, note the dotted line representing the upper bound for Federated Learning: when the same model is trained on a centralized server with the whole CIFAR-10 dataset, accuracy is 70%. FedCKA with regularization on the first two naturally similar layers nearly reaches this upper bound.
4.4 Using the Best Similarity Metric
We study the effect of regularizing the first two naturally similar layers with different similarity metrics. Using the CIFAR-10 dataset with β = 0.5, we change the similarity metric used in ℓ_CKA. Formally, we replace CKA in Eq 2 with three other similarity metrics: first, the kernel CKA introduced in Kornblith et al. (2019); second, the squared Frobenius norm; third, the vectorized cosine similarity. We compare the results under these metrics as well as the baseline, FedAvg. The results are shown in Table 3.
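The two simple alternatives above can be written as drop-in similarity functions operating on the same representation matrices. This is a hedged sketch with our own names and sign conventions (the Frobenius distance is negated so that, like CKA and cosine similarity, larger values mean more similar).

```python
import numpy as np

def frobenius_similarity(X, Y):
    """Negated squared Frobenius distance between two
    representation matrices (larger = more similar)."""
    return -float(np.sum((X - Y) ** 2))

def cosine_similarity(X, Y):
    """Cosine similarity between the flattened (vectorized)
    representations."""
    x, y = X.ravel(), Y.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Note that, unlike CKA, neither of these is invariant to orthogonal transformations of the feature axes, which is one plausible reason they measure representational similarity less accurately.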
We observe that performance is highest when CKA is used. This is likely owing to the accuracy with which CKA measures similarity: only truly dissimilar updates are penalized, which improves performance. In addition, while kernel CKA slightly outperforms linear CKA, considering the computational overhead, we opt for linear CKA. We also observe that the squared Frobenius norm and vectorized cosine similarity decrease performance only slightly; these variants still outperform most previous works. This verifies that while it is important to use an accurate similarity measure, it is more important to regularize only the naturally similar layers.
4.5 Efficiency and Scalability
Efficient and scalable local training is an important engineering principle of Federated Learning. That is, for Federated Learning to be applied to real-world applications, we must assume that clients have limited computing resources. Thus, we analyze the local training time of all methods, as shown in Table 4. Note that FedAvg is the lower bound for training time, since all other methods add a regularization term.
For the 7-layer CNN trained on CIFAR-10, the training times of all methods are fairly similar. FedCKA extends training the most, as the matrix multiplications needed to compute CKA are expensive relative to the forward and backward passes of the small model. However, for ResNet-50 trained on Tiny ImageNet, the training times of FedProx, SCAFFOLD, and MOON increase sharply, and only FedCKA remains comparable to FedAvg. This is because FedProx and SCAFFOLD perform expensive operations on the weights of every layer, and MOON performs forward propagation through three models up to the penultimate layer; these costs grow with the number of layers. While FedCKA also performs forward propagation through three models, the number of regularized layers remains fixed, so it stays efficient on medium-sized models.
We emphasize that regularization must remain scalable for Federated Learning to be applied to state-of-the-art models. Even on ResNet-50, which is no longer considered a large model, other Federated Learning regularization methods lack scalability. This makes it difficult to test these methods on current state-of-the-art models such as ViT Dosovitskiy et al. (2021), with 1.843 billion parameters, or slightly older models such as EfficientNet-B7 Tan and Le (2019), with 813 layers.
5 Conclusion and Future Work
Improving the performance of Federated Learning on heterogeneous data is a widely researched topic. However, many previous works have incorrectly assumed that regularizing every layer of a neural network during local training is the best way to increase performance. We propose FedCKA, built on the most up-to-date understanding of neural networks. By regularizing naturally similar, but not naturally dissimilar, layers during local training, performance improves beyond previous works. We also show that FedCKA is the only existing regularization method with adequate scalability when training a moderate-sized model.
FedCKA shows that properly regularizing important layers improves the performance of Federated Learning on heterogeneous data. However, standardizing the comparison of neural networks remains an important step toward a deeper understanding of neural networks, and there are open questions about the accuracy of CKA for measuring similarity in models such as Transformers or Graph Neural Networks. We leave these topics for future work.
- Brisimi et al. (2018). Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics 112, pp. 59–67.
- Chen et al. (2020). A simple framework for contrastive learning of visual representations.
- Dosovitskiy et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
- He et al. (2016). Deep residual learning for image recognition. In CVPR, pp. 770–778.
- Hsu et al. (2019a). Measuring the effects of non-identical data distribution for federated visual classification.
- Hsu et al. (2019b). Measuring the effects of non-identical data distribution for federated visual classification.
- Kairouz et al. (2021). Advances and open problems in federated learning.
- Karimireddy et al. (2020). SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 5132–5143.
- Kornblith et al. (2019). Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 3519–3529.
- Krizhevsky (2009). Learning multiple layers of features from tiny images.
- LeCun et al. (2015). Deep learning. Nature 521 (7553), pp. 436–444.
- Li et al. (2014). Tiny ImageNet.
- Li et al. (2021). Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), pp. 10713–10722.
- Li et al. (2020). Federated optimization in heterogeneous networks.
- Lin et al. (2021). Ensemble distillation for robust model fusion in federated learning.
- McMahan et al. (2017). Communication-efficient learning of deep networks from decentralized data. In AISTATS.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- Samarakoon et al. (2020). Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications 68 (2), pp. 1146–1159.
- Sejnowski (2018). The deep learning revolution. The MIT Press, Cambridge, Massachusetts.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114.
- Verbraeken et al. (2019). A survey on distributed machine learning.
- Yang et al. (2020). Federated learning via over-the-air computation. IEEE Transactions on Wireless Communications 19 (3), pp. 2022–2035.
- Zhang et al. (2019). Are all layers created equal? In ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena, Long Beach, California, United States.