
Federated Learning with Position-Aware Neurons

Federated Learning (FL) fuses collaborative models from local nodes without centralizing users' data. However, the permutation invariance property of neural networks and the non-i.i.d. data across clients make the locally updated parameters imprecisely aligned, undermining coordinate-based parameter averaging. Traditional neurons do not explicitly consider position information. Hence, we propose Position-Aware Neurons (PANs) as an alternative, fusing position-related values (i.e., position encodings) into neuron outputs. PANs couple neurons to their positions and minimize the possibility of dislocation, even when updating on heterogeneous data. We turn PANs on/off to disable/enable the permutation invariance property of neural networks. When applied to FL, PANs are tightly coupled with positions, making parameters across clients pre-aligned and facilitating coordinate-based parameter averaging. PANs are algorithm-agnostic and could universally improve existing FL algorithms. Furthermore, "FL with PANs" is simple to implement and computationally friendly.





1 Introduction

Federated Learning (FL) [Fed-Advances, Fed-Concept] generates a global model via collaborating with isolated clients for privacy protection and efficient distributed training, generally following the parameter server architecture [ParameterServer, LargeScaleNet]. Clients update models on their devices using private data, and the server periodically averages these models for multiple communication rounds [FedAvg]. The whole process does not transmit users’ data and meets the basic privacy requirements.

Figure 1: Left: Position-Aware Neurons (PANs). When we fuse equal/varied position encodings into neurons' outputs, PANs are turned off/on, and the shuffled networks make the same/different predictions, i.e., the permutation invariance property is enabled/disabled. Right: applying PANs to FL. Neurons are coupled with their positions for pre-alignment.

Represented by FedAvg [FedAvg], many FL algorithms aggregate local parameters via simple coordinate-based averaging [FedProx, FedRS, MOON, FedPHP]. These algorithms face two kinds of drawbacks. First, as traditional neurons are unaware of their positions, neural networks have the permutation invariance property, implying that hidden neurons could be dislocated during training without affecting the local performance. Second, the samples across clients are non-independent and identically distributed (non-i.i.d.) [NonIID-Quag], which could exacerbate the permutation of neural networks during local training, making local models misaligned and leading to weight divergence [Fed-NonIID-Data]. These two factors degrade the performance of coordinate-based parameter averaging.

Recently, a series of works utilize various matching techniques to align neurons, such as Bayesian nonparametric learning [BFNM, SMAPM, FedMA] and optimal transport [Barycenter, OTFusion]. However, these methods have drawbacks. First, they are complex to implement. Second, they solve the misalignment problem after finishing local updates and hence belong to post-processing strategies that need additional computation budgets. Fed2 [Fed2] pioneers a novel aspect via designing feature-oriented model structures in a pre-aligned manner. However, it has to carefully customize the network architecture and only reaches a group-level pre-alignment. By contrast, we explore a more straightforward and general technique to pre-align neurons during local training procedures.

Our work mainly focuses on solving the non-i.i.d. challenge in FL, more specifically, seeking solutions via limiting the permutation invariance property of neural networks. We first summarize the above analysis: the permutation invariance property of neural networks leads to neuron misalignment across local models, and the more heterogeneous the data, the more serious the misalignment. Hence, our motivation is intuitive: could we design a switch to control the permutation invariance property of neural networks? We propose Position-Aware Neurons (PANs) as the solution, which couple neurons with their positions. Specifically, for each neuron (channel for ConvNet [AlexNet, VGG, ResNet]), we add or multiply a position-related value (i.e., position encoding) to its output. We introduce a hyper-parameter to turn PANs on/off and, correspondingly, to disable/enable the permutation invariance property of neural networks. PANs bind neurons to their positions, implicitly pre-aligning neurons across clients even when faced with non-i.i.d. data. From another aspect, PANs keep some consistent ingredients in the forward and backward passes across local models, which could reduce the weight divergence. Overall, appropriate PANs facilitate coordinate-based parameter averaging in FL. Replacing traditional neurons with PANs is simple to implement, computationally friendly, and universally applicable to various FL algorithms. Our contributions can be summarized as: (1) proposing PANs to disable/enable the permutation invariance property of deep networks; (2) applying PANs to FL, which binds neurons to positions and pre-aligns parameters for better coordinate-wise parameter averaging.

2 Related Works

FL with Non-I.I.D. Data: Existing works solve the non-i.i.d. data problem in FL from various aspects. [Fed-NonIID-Data] points out the weight divergence phenomenon in FL and uses shared data to decrease the divergence. FedProx [FedProx] takes a proximal term during local training as regularization. FedOpt [FedOpt] considers updating the global model via momentum or adaptive optimizers (e.g., Adam [Adam], Yogi [Yogi]) instead of simple parameter averaging. Scaffold [Scaffold] introduces control variates to rectify the local update directions and mitigates the influence of client drift. MOON [MOON] utilizes model contrastive learning to reduce the distance between local and global models. Some other works utilize similar techniques including dynamic regularization [FedDyn], ensemble distillation [FedDF, OnlineDistill], etc. We take several representative FL algorithms and use PANs to improve them.

FL with Permutation Invariance Property: The permutation invariance of neural networks could lead to neuron misalignment. PFNM [BFNM] matches local nodes' parameters via the Beta-Bernoulli process [BBP] and the Indian Buffet Process [IBP], formulating an optimal assignment problem and solving it via the Hungarian algorithm [Hungarian]. SPAHM [SMAPM] applies the same procedure to aggregate Gaussian topic models, hidden Markov models, and so on. FedMA [FedMA] points out that PFNM does not apply to large-scale networks and proposes a layer-wise matching method. [OTFusion] utilizes optimal transport [Barycenter] to fuse models with different initializations. These methods are all post-processing ones that need additional computation costs. Fed2 [Fed2] is recently proposed to align features during local training via separating features into different groups, but it needs carefully designed architectures. Differently, we take a more fine-grained alignment of neurons rather than network groups, and we will show our method is more general.

Position Encoding: Position encoding is popular in sequence learning architectures, e.g., ConvS2S [ConvS2S] and the transformer [Transformer]. These architectures take position encodings to consider the order information. Relative position encoding [RPE] is more applicable to sequences with various lengths. Some other studies are devoted to interpreting what position encodings learn [PEInBERT, WhatPELearn]. Another interesting work applies position encodings instead of zero-padding to GANs [PEGAN] as a spatial inductive bias. Differently, we resort to position encodings to bind neurons to their positions in FL. Furthermore, these works only consider position encodings at the input layer, while we couple them with neurons.

3 Position-Aware Neurons

In this section, we investigate the permutation invariance of neural networks and introduce PANs to control it.

3.1 Permutation Invariance Property

Assume an MLP network has $L$ layers (containing input and output layer), and each layer contains $d_l$ neurons, where $l \in \{0, 1, \ldots, L-1\}$ is the layer index. $d_0$ and $d_{L-1}$ are input and output dimensions. We denote the parameters of each layer as the weight matrix $W^l \in \mathbb{R}^{d_{l-1} \times d_l}$ and the bias vector $b^l \in \mathbb{R}^{d_l}$, $l \in \{1, \ldots, L-1\}$. The input layer does not have parameters. We use $h^l$ as the activations of the $l$th layer. We have $h^l = \sigma^l\big((W^l)^T h^{l-1} + b^l\big)$, where $\sigma^l$ is the element-wise activation function, e.g., ReLU [relu]. $\sigma^{L-1}$ denotes no activation function in the output layer. Sometimes, we use $f(x) = w^T \sigma(W^T x + b)$ to represent a network with only one hidden layer and the output dimension is one (called MLP0), where $x \in \mathbb{R}^{d_0}$, $W \in \mathbb{R}^{d_0 \times d_1}$, $b \in \mathbb{R}^{d_1}$, $w \in \mathbb{R}^{d_1}$. We use $\Pi$ as a permutation matrix that satisfies $\Pi \mathbf{1} = \mathbf{1}$ and $\Pi^T \Pi = I$. Easily, we have some properties: $\Pi^{-1} = \Pi^T$, $\Pi(a + b) = \Pi a + \Pi b$, $\Pi(a \odot b) = (\Pi a) \odot (\Pi b)$, where $I$ is the identity matrix and $\odot$ denotes the Hadamard product. If $\sigma$ is an element-wise function, $\sigma(\Pi a) = \Pi \sigma(a)$.

For MLP0, we have $f(x) = w^T \sigma(W^T x + b) = (\Pi w)^T \sigma\big((W \Pi^T)^T x + \Pi b\big)$, implying that if we permute the parameters properly, the output of a certain neural network does not change, i.e., the permutation invariance property. Extending it to MLP, the layer-wise permutation process is

$\hat{W}^l = \Pi^{l-1} W^l (\Pi^l)^T, \qquad \hat{b}^l = \Pi^l b^l, \qquad (1)$

where $\Pi^0 = I$ and $\Pi^{L-1} = I$, meaning that the input and output layers are not shuffled. For ConvNet [AlexNet, VGG], we take convolution kernels as basic units. The convolution parameters could be denoted as $W^l \in \mathbb{R}^{c_o \times c_i \times k_1 \times k_2}$, where the four dimensions denote the number of output/input channels ($c_o$, $c_i$) and the kernel size ($k_1$, $k_2$). The permutation could be similarly applied along the channel dimensions, permuting the output channels of the $l$th layer and the input channels of the $(l+1)$th layer with the same $\Pi^l$. For ResNet [ResNet], we use a single $\Pi$ to permute all parameters in a basic block including the shortcut (if a shortcut convolution is not used, the input and output permutations of the block must coincide).
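As a sanity check on the MLP0 identity above, the following numpy sketch (ours, not the paper's code) permutes the hidden neurons of a one-hidden-layer network and confirms the output is unchanged:

```python
import numpy as np

# Permutation invariance on MLP0: f(x) = w^T relu(W^T x + b).
# Permuting the hidden neurons via W -> W P^T, b -> P b, w -> P w
# leaves the output unchanged.
rng = np.random.default_rng(0)
d_in, d_h = 4, 6
W = rng.normal(size=(d_in, d_h))
b = rng.normal(size=d_h)
w = rng.normal(size=d_h)
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)
f = lambda W_, b_, w_: w_ @ relu(W_.T @ x + b_)

P = np.eye(d_h)[rng.permutation(d_h)]  # a random permutation matrix
out_orig = f(W, b, w)
out_perm = f(W @ P.T, P @ b, P @ w)
assert np.allclose(out_orig, out_perm)  # identical outputs after permutation
```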

3.2 Position-Aware Neurons

The essential reason for the permutation invariance of neural networks is that neurons have nothing to do with their positions. Hence, an intuitive improvement is fusing position-related values (position encodings) into neurons. We propose Position-Aware Neurons (PANs), adding or multiplying position encodings to neurons' outputs, i.e.,

$h^l = \sigma\big((W^l)^T h^{l-1} + b^l\big) + e^l \quad (\text{PAN}^{+}), \qquad (2)$
$h^l = \sigma\big((W^l)^T h^{l-1} + b^l\big) \odot e^l \quad (\text{PAN}^{\times}), \qquad (3)$

where $e^l$ denotes position encodings that are only related to positions and are not learnable. We use "$+$" and "$\times$" to represent additive and multiplicative PANs, respectively. We use sinusoidal functions to generate $e^l$ as commonly used in previous position encoding works [Transformer], i.e.,

$e^l_i = A \sin\big(\tfrac{2\pi}{T}\, i\big) \quad (\text{PAN}^{+}), \qquad (4)$
$e^l_i = 1 + A \sin\big(\tfrac{2\pi}{T}\, i\big) \quad (\text{PAN}^{\times}), \qquad (5)$

where $T$ and $A$ respectively denote the period and amplitude of the position encodings, and $i$ is the position index of a neuron. For ConvNet, we assign position encodings for each channel, and $i$ is the channel index. Notably, if we take $A = 0$, PANs degenerate into normal neurons. In practice, we only apply PANs to the hidden layers, while the input and output layers remain unchanged. With PANs, the permutation process in Eq. 1 could be reformulated (for the additive case) as

$h^l_{\mathrm{sf}} = \sigma\big(\big(\Pi^{l-1} W^l (\Pi^l)^T\big)^T h^{l-1}_{\mathrm{sf}} + \Pi^l b^l\big) + e^l, \qquad (6)$
where the subscript "sf" denotes "shuffled" (or permuted); note that the position encodings stay attached to their coordinates and are not shuffled along with the parameters. To measure the output change after shuffling, we define the shuffle error as

$\mathrm{Err} = \mathbb{E}_x\big[\,\| f_{\mathrm{sf}}(x) - f(x) \|\,\big], \qquad (8)$

and this error on MLP0 without considering the bias (i.e., $b = 0$) is

$\mathrm{Err} \approx \mathbb{E}_x\Big[\,\Big|\Big(\frac{\partial f}{\partial e}\Big)^T \big(\Pi^T e - e\big)\Big|\,\Big], \qquad (9)$

where we take $f$ as a function of the position encodings $e$ and take a first-order Taylor expansion as an approximation. Obviously, the shuffle error is closely related to the strength of permutation, i.e., to $\Pi^T e - e$. For example, if $\Pi = I$, the network is not shuffled and the outputs are kept unchanged. Then, if we take equal values as position encodings, i.e., $e = c\mathbf{1}$, the output also does not change because $\Pi^T (c\mathbf{1}) = c\mathbf{1}$; this can be obtained via taking $A = 0$. If we take a larger amplitude $A$ (e.g., 1) and a non-trivial permutation, Err is generally non-zero because $\Pi^T e \neq e$. The error of multiplicative PANs is similar. We abstract PANs as a switch: if we take equal/varied position encodings, PANs are turned off/on, and hence the network keeps/loses the permutation invariance property (i.e., the same/different outputs after permutation). As illustrated on the left of Fig. 1, the five neurons of a certain hidden layer are shuffled while the position encodings they are going to add/multiply are not shuffled, so the outputs change with PANs turned on.
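To make the switch concrete, here is a minimal numpy sketch (ours, not the paper's code) of an additive-PAN hidden layer with the sinusoidal encodings of Eq. 4: with $A = 0$ the shuffled layer is exactly a permuted copy of the original, while with $A > 0$ the fixed encodings break the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
W, b = rng.normal(size=(d_in, d_h)), rng.normal(size=d_h)
x = rng.normal(size=d_in)
relu = lambda z: np.maximum(z, 0.0)

def hidden_pan(W_, b_, A, T=4):
    # additive PAN: fuse fixed sinusoidal position encodings into the outputs
    e = A * np.sin(2 * np.pi * np.arange(d_h) / T)
    return relu(W_.T @ x + b_) + e

P = np.eye(d_h)[np.roll(np.arange(d_h), 1)]  # a fixed non-trivial permutation

# PANs off (A = 0): the shuffled layer equals the permuted original layer.
assert np.allclose(hidden_pan(W @ P.T, P @ b, 0.0), P @ hidden_pan(W, b, 0.0))
# PANs on (A > 0): encodings stay attached to positions, so equivalence breaks.
assert not np.allclose(hidden_pan(W @ P.T, P @ b, 0.1), P @ hidden_pan(W, b, 0.1))
```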

Furthermore, are there any essential differences between additive and multiplicative PANs, and how much influence do they have on the shuffle error? In Eq. 9, the shuffle error is partially determined by the gradient $g^l = \partial f / \partial e^l$, and we extend this gradient to an MLP with multiple layers. We assume all layers have the same number of neurons (i.e., $d_l = d$) and take the same position encodings (i.e., $e^l = e$). We obtain the recursive gradient expressions:

$g^l = W^{l+1}\,\mathrm{Diag}\big(\sigma'(z^{l+1})\big)\, g^{l+1} \quad (\text{PAN}^{+}), \qquad (10)$
$g^l = W^{l+1}\,\mathrm{Diag}\big(\sigma'(z^{l+1}) \odot e\big)\, g^{l+1} \quad (\text{PAN}^{\times}), \qquad (11)$

where $z^{l+1}$ is the pre-activation of the $(l+1)$th layer, $\mathrm{Diag}(\cdot)$ transforms a vector to a diagonal matrix, and $\sigma'$ denotes the gradient of the activation function, whose elements are 0 or 1 for ReLU. If we expand Eq. 10 and Eq. 11 correspondingly, we will find that the gradient of additive PANs does not explicitly rely on $e$. However, for the multiplicative one, $g^l$ is relevant to $e$ at every subsequent layer, which leads to a polynomial term in $e$ (resulting from the accumulated $\mathrm{Diag}(e)$ factors, informally). Hence, we conclude: taking PANs as a switch could control the permutation invariance property of neural networks, and the designed multiplicative PANs will make this switch more sensitive.

4 FL with PANs

In this section, we briefly introduce FedAvg [FedAvg] and analyze the effects of PANs when applied to FL.

4.1 FedAvg

Suppose we have a server and $K$ clients with various data distributions. FedAvg first initializes a global model $\theta$ on the server. Then, a small fraction $C$ of clients download the global model and update it on their local data for $E$ epochs, and then upload the updated models to the server. The server takes a coordinate-based parameter averaging, i.e., $\theta \leftarrow \sum_k \frac{n_k}{n} \theta_k$, where $n_k$ is the number of samples on the $k$th participating client and $n = \sum_k n_k$. Next, $\theta$ will be sent down for a new communication round, and this will be repeated for $R$ communication rounds. Because the parameters could be misaligned during local training, some works [BFNM, FedMA, SMAPM] are devoted to finding the correspondences between clients' uploaded neurons for better aggregation. For example, the parameters $\theta_1$ and $\theta_2$ may be misaligned, and we should search for proper permutation matrices to match them, i.e., averaging $\theta_1$ with $\Pi \theta_2$ rather than with $\theta_2$ directly [OTFusion]. However, searching for an appropriate $\Pi$ is challenging. Generally, these works require additional data to search for a proper alignment. In addition, the matching process typically has to solve complex optimization problems, such as optimal transport or optimal assignment, leading to additional computational overhead. An intuitive question is: could we pre-align the neurons during local training instead of post-matching?
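FedAvg's server step is just this weighted coordinate-wise average; a minimal sketch (ours, with models represented as hypothetical dicts of numpy arrays):

```python
import numpy as np

def fedavg_aggregate(client_models, client_sizes):
    """Coordinate-wise average of client parameters, weighted by sample counts."""
    total = sum(client_sizes)
    return {
        name: sum(n / total * m[name] for m, n in zip(client_models, client_sizes))
        for name in client_models[0]
    }

m1 = {"w": np.array([1.0, 2.0])}
m2 = {"w": np.array([3.0, 4.0])}
g = fedavg_aggregate([m1, m2], [50, 50])
assert np.allclose(g["w"], [2.0, 3.0])  # equal-sized clients -> plain mean
```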

4.2 Applying PANs to FL

Replacing traditional neurons with PANs in FL is straightforward to implement. Why does such a subtle improvement help? We heuristically expect PANs in FL to bring the following effects:

PANs could limit the dislocation of neurons, since disturbing them will bring significant changes to the outputs of the neural network and lead to higher training errors and fluctuations. Theoretically, the forward pass on the $k$th client with PANs is as follows:

$h^{l,k} = \sigma\big((W^{l,k})^T h^{l-1,k} + b^{l,k}\big) + e^l \quad (\text{PAN}^{+}), \qquad (12)$
$h^{l,k} = \sigma\big((W^{l,k})^T h^{l-1,k} + b^{l,k}\big) \odot e^l \quad (\text{PAN}^{\times}). \qquad (13)$

Notably, the position encodings $e^l$ are commonly utilized across clients, i.e., the forward passes across local clients share some consistent information. Then, the parameters' gradients of Eq. 12 and Eq. 13 can be calculated by:

$\frac{\partial \mathcal{L}_k}{\partial b^{l,k}} = \mathrm{Diag}\big(\sigma'(z^{l,k})\big)\, \frac{\partial \mathcal{L}_k}{\partial h^{l,k}} \quad (\text{PAN}^{+}), \qquad (14)$
$\frac{\partial \mathcal{L}_k}{\partial b^{l,k}} = \mathrm{Diag}\big(\sigma'(z^{l,k}) \odot e^l\big)\, \frac{\partial \mathcal{L}_k}{\partial h^{l,k}} \quad (\text{PAN}^{\times}), \qquad (15)$

where we only give the gradient of the bias for simplification. The gradients of multiplicative PANs directly contain the same position information (i.e., $e^l$) across clients in spite of various data distributions (reflected in $z^{l,k}$ and $h^{l,k}$). For the additive ones, the impact of $e^l$ is implicit because $h^{l,k}$ is related to $e^l$; nevertheless, the effect is not as significant as for multiplicative ones. Overall, $e^l$ could regularize and rectify local gradient directions, keeping some ingredients consistent during backward propagation. As an extreme case, if the amplitude of $e^l$ is very large, the gradients in Eq. 14 and Eq. 15 will tend to be the same across clients, mitigating the weight divergence completely. However, setting the amplitude too large will make the neural network difficult to train, and the data information is completely covered, so the strength of $e^l$ (i.e., $A$) is a tradeoff.

Figure 2: Left: how many neurons are not shuffled with various disorder levels $q$. Right: a permutation matrix demo.

5 Experiments

We study how much influence the proposed PANs have on both centralized training and decentralized training (i.e., FL). The datasets used are Mnist [mnist], FeMnist [LEAF], SVHN [Svhn], GTSRB [GTSRB], Cifar10/100 [cifar], and Cinic10 [Cinic10]. FeMnist is recommended by LEAF [LEAF] and FedScale [FedScale]. We use MLP for Mnist/FeMnist, VGG [VGG] for SVHN/GTSRB/Cifar10, and ResNet20 [ResNet] for Cifar100/Cinic10 by default unless declared otherwise. We sometimes take the VGG9 used in previous FL works [FedMA, FedDF, Fed2]. For centralized training, we use the provided training and test sets correspondingly. For FL, we split the training set according to Dirichlet distributions, where $\alpha$ controls the non-i.i.d. level; a smaller $\alpha$ leads to more non-i.i.d. cases. For each FL scene, we report several key hyper-parameters: the number of clients $K$, the client participation ratio $C$, the number of local training epochs $E$, the Dirichlet $\alpha$, and the number of communication rounds $R$. For PANs, we report $T$ and $A$. With $A = 0$, we turn off PANs, i.e., using traditional neurons or the baselines; with $A > 0$, we turn on PANs. We leave PANs turned on by default if the on/off state or the value of $A$ is not mentioned. Details of datasets, networks, and training are presented in Supp.
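The Dirichlet split mentioned above is commonly implemented per class: draw client proportions from $\mathrm{Dir}(\alpha)$ and deal each class's sample indices out accordingly. A sketch under our own naming (not the paper's code); smaller alpha yields more skewed partitions:

```python
import numpy as np

def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with per-class Dir(alpha) draws."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 100)          # toy 10-class label vector
parts = dirichlet_split(labels, num_clients=5, alpha=0.1)
assert sorted(i for p in parts for i in p) == list(range(1000))  # exact cover
```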

5.1 Centralized Training

Shuffle Test:

We first propose a procedure to measure the degree of permutation invariance of a certain neural network, that is, how large the shuffle error in Eq. 8 is after shuffling the neurons. We name this procedure the shuffle test. Given a neural network and a batch of data, we first obtain the outputs. Then, we shuffle the neurons of the hidden layers. The shuffle process is shown in Supp, where a parameter $q$ controls the disorder level of the constructed permutation matrices. We then get the outputs after shuffling and calculate the shuffle error. We vary $q$ and plot the ratio of ones on the permutation matrices' diagonals (i.e., how many neurons are not shuffled). We denote this ratio as $r$ and plot it in Fig. 2 (average of 10 experiments), where we also show a generated permutation matrix.
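One plausible construction of such a disorder-controlled permutation (the paper's exact procedure is in Supp; this is our own sketch) visits each position and swaps it with a random partner with probability $q$, after which the unshuffled ratio can be read off the diagonal:

```python
import numpy as np

def partial_permutation(n, q, rng):
    """Build a permutation matrix whose disorder grows with q in [0, 1]."""
    perm = np.arange(n)
    for i in range(n):
        if rng.random() < q:           # with probability q, displace position i
            j = rng.integers(n)
            perm[i], perm[j] = perm[j], perm[i]
    return np.eye(n)[perm]

rng = np.random.default_rng(0)
P0 = partial_permutation(100, q=0.0, rng=rng)
assert np.allclose(P0, np.eye(100))    # q = 0: nothing is shuffled
P1 = partial_permutation(100, q=1.0, rng=rng)
assert np.trace(P1) < 100              # q = 1: most positions are displaced
```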

Figure 3: Left: shuffle error (Eq. 8) with various $T$ and $A$. Right: the difference between additive and multiplicative PANs. (VGG13 is used; more networks are in Supp.)

Shuffle Error with Random Data:

With different hyper-parameters $T$ and $A$ in Eq. 4/Eq. 5, we use random data generated from a Gaussian distribution to calculate the shuffle error. The results based on VGG13 are shown in Fig. 3. The error is more related to $A$ and less sensitive to $T$. This is intuitive because $T$ controls local volatility while neuron permutation could happen globally, e.g., the first neuron could swap positions with the last one. A larger $A$ leads to a larger shuffle error, i.e., the more seriously the network loses the permutation invariance property. In addition, the shuffle error of additive PANs increases linearly with $A$, while that of multiplicative PANs increases much faster. This verifies the theoretical analysis in Sect. 3.2. However, in practice, a larger $A$ may cause training failure, and we only use small values of $A$ (smaller for multiplicative PANs than for additive ones; the bold part on the right side of Fig. 3).

Figure 4: The first: test accuracy of models trained with different PANs. The other three: test accuracy change after manual permutation with various disorder levels $q$.

Influence on Inference:

We study the influence of PANs on test accuracies. We use MLP on Mnist, VGG13 on SVHN, and ResNet20 on Cifar10. We first train models with various PANs until convergence, and the model performances are shown in the first figure of Fig. 4. The horizontal dotted lines show the accuracies of normal networks, and the solid segments show the results of networks with various PANs. We find that introducing PANs to neural networks does not improve performance but brings a slight degradation; that is, PANs could make the network somewhat harder to train. More studies of how PANs influence the network predictions could be found in Supp. Then, we investigate the shuffle error reflected by the change of test accuracies. Specifically, we shuffle the trained network to make predictions on the test set. We vary several groups of $T$ and $A$ for PANs. We show the results in the last three figures of Fig. 4. With larger $q$, i.e., more neurons shuffled, the test accuracy of the network with PANs off ($A = 0$) does not change (the permutation invariance property). However, a larger $A$ leads to more significant performance degradation, and multiplicative PANs make the network more sensitive to shuffling than additive ones (the curves with "$\times$" degrade significantly). With different $T$, the performance degradation is nearly the same, again showing that PANs are robust to $T$. These verify the conclusions in Sect. 3.2. Overall, PANs work as a tradeoff between model performance and control of permutation invariance.

Figure 5: Top: how many neurons are not shuffled with various $s$. Bottom: test accuracies of FL with various $\alpha$ (dotted lines) and accuracies after manual shuffling on i.i.d. data (red scatters).

5.2 Decentralized Training

Then we study the effects of introducing PANs to FL. We first present some empirical studies to verify the pre-alignment effects of PANs, and then show performances.

How many neurons are misaligned in FL?

Although some previous works [BFNM, FedMA, Fed2] declare that neurons could be dislocated when faced with non-i.i.d. data, they do not show this in evidence and do not show the degree of misalignment. We present a heuristic method: we manually shuffle the neurons during local training with i.i.d. data and study how much misalignment could cause the performance to drop to the same level as training with non-i.i.d. data. Specifically, during each client's training step (each batch as a step), we shuffle the neurons with probability $\frac{sB}{EN}$, where $B$, $E$, and $N$ are respectively the batch size, the number of local epochs, and the number of local data samples. In each shuffle process, we keep the disorder level $q$ fixed; the hyper-parameter $s$ determines the expected number of times the network is shuffled during local training. A larger $s$ means more neurons are shuffled upon finishing training, as shown in Fig. 5. The calculation of the unshuffled ratio in Fig. 5 is presented in Supp. Then, we show the test accuracies of FedAvg [FedAvg] under various levels of non-i.i.d. data, i.e., various $\alpha$. The results correspond to the three horizontal lines in the bottom three figures of Fig. 5. The scatters in red show the performances of shuffling neurons with various $s$. Obviously, even with i.i.d. data, the larger the $s$, the worse the performance. This implies that neuron misalignment could indeed lead to performance degradation. Compared with the non-i.i.d. performances, taking Cifar10 as an example, an appropriate $s$ could make the i.i.d. ($\alpha = 10.0$) performance degrade to the same level as the non-i.i.d. ($\alpha = 0.1$) one; the corresponding ratio of shuffled neurons then estimates how many neurons are misaligned on each client. This may provide some enlightenment for quantitatively measuring how many neurons are misaligned in FL with non-i.i.d. data.

Figure 6: Weight divergence with PANs off/on. (MLP on Mnist; more datasets' results are in Supp.)
Figure 7: Optimal assignment matrix with PANs off/on, left vs. right. (VGG9 Conv5 on Cifar10; more results are in Supp.)

Do PANs indeed reduce the possibility of neuron misalignment?

We propose several strategies from the aspects of parameters, activations, and preference vectors to compare the neuron correspondences in FL with PANs off/on. For PANs turned on, we use multiplicative PANs with the default $T$ and $A$.

I. Weight Divergence: Weight divergence [Fed-NonIID-Data] measures the variance of local parameters. Specifically, for each layer $l$, we calculate the averaged distance between the participating clients' parameters and the averaged parameters, normalized by the norm of the averaged parameters. The weight divergences of MLP on Mnist are shown in Fig. 6, where PANs could reduce the divergences a lot (the red bars). This corresponds to the explanation in Sect. 4.2 that clients' parameters are partially updated towards the same direction.
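One plausible formulation of this layer-wise metric (a sketch; the exact normalization is our assumption, not necessarily [Fed-NonIID-Data]'s):

```python
import numpy as np

def weight_divergence(client_weights):
    """Mean distance of clients' parameters from their coordinate-wise average,
    normalized by the norm of the average."""
    mean = np.mean(client_weights, axis=0)
    dist = np.mean([np.linalg.norm(w - mean) for w in client_weights])
    return float(dist / (np.linalg.norm(mean) + 1e-12))

aligned = [np.ones(4), np.ones(4)]                               # identical clients
diverged = [np.array([1.0, 0, 0, 0]), np.array([0, 0, 0, 1.0])]  # misaligned clients
assert weight_divergence(aligned) == 0.0
assert weight_divergence(diverged) > 0.5
```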

Figure 8: Preference vectors with PANs off/on, left vs. right. (VGG9 Conv6 on Cifar10; more results are shown in Supp.)

II. Matching via Optimal Assignment: We feed 500 test samples into the network and obtain the activations of each neuron as its representation. Neurons' representations of the global and a local model are denoted as $H_g$ and $H_k$, each row being one neuron's activations. Then we search for the optimal assignment matrix $M$ that minimizes $\|H_g - M H_k\|$ and satisfies $M \mathbf{1} = \mathbf{1}$, $M^T \mathbf{1} = \mathbf{1}$, $M_{ij} \in \{0, 1\}$. In fact, $M$ is a permutation matrix that could approximately reflect the disturbance of neurons, matching neurons with similar outputs. We plot the solved matching matrix in Fig. 7, where the number in "[]" shows the ratio of ones on the diagonal. Using PANs makes the diagonal denser, implying that neurons at the same coordinates output similarly.
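The matching step can be sketched with the Hungarian algorithm from scipy (our own minimal version; `match_neurons` and its cost are assumptions, not the paper's code). When the local model is an exactly shuffled copy of the global one, the solved assignment recovers the shuffle:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_neurons(H_global, H_local):
    """Permutation matrix minimizing total distance between neuron activations.
    Each H_* has one row of activations (over test samples) per neuron."""
    cost = np.linalg.norm(H_global[:, None, :] - H_local[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian / optimal assignment
    M = np.zeros_like(cost)
    M[rows, cols] = 1.0
    return M

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 500))     # 6 neurons, activations on 500 test samples
perm = rng.permutation(6)
M = match_neurons(H, H[perm])     # "local model" = shuffled global model
assert np.allclose(M[perm, np.arange(6)], 1.0)  # the shuffle is recovered
```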

III. Visualizing Neurons via Preference Vectors: Then, we correspond neurons to classes via calculating preference vectors as done in [Fed2]. Specifically, for each neuron we calculate the correlation between its activation value and the prediction score of the $c$th class for each class $c$, and then concatenate all classes as the preference vector; its largest entry implies which class the neuron contributes to most. The results are shown in Fig. 8, where each vertical line represents a neuron/channel. The number in "[]" shows how many neurons/channels correspond to the same class between the global and local models. With PANs, the coordinate matching results are better. These empirical results verify the pre-alignment effects brought by PANs.

Figure 9: Comparison results on non-i.i.d. data ($\alpha = 0.1$). Rows show datasets and columns show FL algorithms. PANs could universally improve these algorithms. (More datasets are shown in Supp.)
Settings  | FedAvg | FedProx | FedMA        | Fed2  | FedDF | FedAvg (ours) | FedAvg+PANs (ours)
Setting 1 | 86.29  | 85.32   | 84.0 (87.53) | 88.29 | -     | 86.83         | 88.49±0.07
Setting 2 | 78.34  | 78.60   | 65.0         | -     | 80.36 | 79.76         | 81.94±0.09
Table 1: Comparison results with other popular FL algorithms on Cifar10 with VGG9. The left shows settings. The middle shows the cited results from FedMA [FedMA], Fed2 [Fed2], and FedDF [FedDF]. The last two columns show the results we implement.
Settings                     | FedMA | Fed2  | FedAvg+PANs
More non-i.i.d. data         | 83.91 | 82.26 | 85.82±0.16
Partial client participation | 48.25 | 81.23 | 82.87±0.21
Table 2: Comparison results with SOTA on more scenes. The results are all produced by our reproduced code.

Do PANs bring performance improvement in FL?

We then compare the performances of FL with PANs off/on.

I. Universal Application of PANs: We first apply PANs to some popular FL algorithms introduced in Sect. 2, including FedAvg [FedAvg], FedProx [FedProx], FedOpt [FedOpt], Scaffold [Scaffold], and MOON [MOON]. These methods solve the non-i.i.d. problem from different aspects. Training details of these algorithms are provided in Supp. We add PANs to them and investigate the performance improvements on FeMnist, Cifar10, Cifar100, and Cinic10, with the key hyper-parameters $K$, $C$, $E$, $\alpha$, and $R$ fixed. We use PANs turned off ($A = 0$) as the baseline. Hyper-parameters of PANs are searched from three groups of the PAN type, $T$, and $A$, and the best result is reported in Fig. 9. PANs indeed improve these algorithms. We then vary the non-i.i.d. level of the decentralized data, i.e., $\alpha$, and report the averaged accuracy of the last five communication rounds in Fig. 10 (with other hyper-parameters the same). Obviously, more non-i.i.d. scenes (smaller $\alpha$) experience more significant improvements. This is related to the regularization effect analyzed in Sect. 4.2. We also investigate the results with various numbers of clients and local training epochs, i.e., various $K$ and $E$. The results of FedAvg on Cifar10 and Cifar100 are shown in Fig. 11. On average, introducing PANs leads to consistent improvements across various scenes. These studies verify that PANs could be universally and effectively applied to FL algorithms under various settings.

Figure 10: Comparisons under various levels of non-i.i.d. data on Cinic10. Smaller $\alpha$ implies more non-i.i.d. data. (More datasets are shown in Supp.)
Figure 11: Comparisons under different FL scenes (various $K$ and $E$) based on FedAvg. (Scaffold results are shown in Supp.)

II. Hyper-parameter Analysis: We first vary the amplitude $A$ on Cifar10 and plot the results on the left of Fig. 12. We fix $T$ and only report the results of multiplicative PANs. Setting $A$ around 0.1 could improve the performance a lot, while using a larger $A$ leads to degradation because the neural network becomes harder to train. This again shows that $A$ is a tradeoff between neuron pre-alignment and network performance. The proportions of the optimal hyper-parameters from the results of the above experiments are shown on the right of Fig. 12. Using $A$ around 0.1 in multiplicative PANs is a good choice. $A = 0$ means turning off PANs, and its ratio among the optimal choices is small, which means turning on PANs is useful in most cases.

III. Comparing with SOTA: FedMA [FedMA] and Fed2 [Fed2] are representative works that solve the parameter alignment problem in FL. We collect the reported settings and results in FedMA, Fed2, and FedDF [FedDF], and compare the performances under the same settings. We list the results on Cifar10 with VGG9 in Tab. 1, where the last two columns show our results. Although our reproduced FedAvg performs slightly better than the cited results, the performance gain from introducing PANs is remarkable. We then vary the settings from two aspects: (1) decreasing the Dirichlet $\alpha$, i.e., a more non-i.i.d. scene; (2) decreasing the client selection ratio $C$, i.e., partial client participation. Aside from the above changes, other hyper-parameters are kept the same. We run the code provided by FedMA and reproduce Fed2 via our own implementation. The results are listed in Tab. 2. FedMA performs especially badly under partial client participation, and Fed2 also performs unsatisfactorily. Our method clearly surpasses the compared methods in these cases. Furthermore, our method is more efficient: e.g., with four 10-core Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40GHz and one NVIDIA GeForce RTX 3090 GPU card, FedMA needs about 4 hours for a single communication round while ours only requires several minutes.

IV. More Studies: We study using optimal transport to fuse neural networks with PANs as done in [OTFusion]. We also investigate the BatchNorm [BN] and GroupNorm [GN] used in VGG or ResNet, where PANs are more compatible with BatchNorm. We finally investigate some variants of PANs for better personalization in FL [PersonalizeMAML]. These are provided in Supp.

Figure 12: Left: performance comparisons under various $A$. Right: the distributions of optimal hyper-parameters.

V. Disadvantages: Fusing varied position-encoding values makes the magnitudes of neuron activations/gradients differ across positions, which calls for a customized neuron-aware optimizer. In Supp, we try applying the adaptive optimizer Adam [Adam] to PANs, but we do not find much improvement. Hence, advanced optimizers should be explored in future work.

6 Conclusions

We propose position-aware neurons (PANs) to disable/enable the permutation invariance property of neural networks. PANs bind neurons to their positions, making parameters pre-aligned in FL even when faced with non-i.i.d. data and facilitating coordinate-based parameter averaging. PANs keep the same position encodings across clients, so local training contains consistent ingredients. Abundant experimental studies verify the role of PANs in parameter alignment. Future works include finding an optimization method specifically suitable for PANs and extending PANs to large-scale FL benchmarks or more scenarios that require parameter alignment.


This work is partially supported by the National Natural Science Foundation of China (Grant No. 41901270), the NSFC-NRF Joint Research Project under Grant 61861146001, and the Natural Science Foundation of Jiangsu Province (Grant No. BK20190296). Thanks to the Huawei Noah’s Ark Lab NetMIND Research Team and the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-014B). Thanks to Professor Yang Yang for his suggestions. Professor De-Chuan Zhan is the corresponding author.


Appendix A Dataset Details

The utilized datasets include Mnist [mnist], FeMnist [LEAF], SVHN [Svhn], GTSRB [GTSRB], Cifar10/100 [cifar], and Cinic10 [Cinic10]. We detail these datasets as follows.

  • Mnist [mnist] is a digit recognition dataset that contains 10 digits to classify. The raw set contains 60,000 samples for training and 10,000 samples for evaluation. The image size is .

  • SVHN [Svhn] is the Street View House Number dataset which contains 10 numbers to classify. The raw set contains 73,257 samples for training and 26,032 samples for evaluation. The image size is .

  • GTSRB [GTSRB] is the German Traffic Sign Recognition Benchmark with 43 traffic signs. The raw set contains 39,209 samples for training and 12,630 samples for evaluation. We resize the images to .

  • Cifar10 and Cifar100 [cifar] are subsets of the Tiny Images dataset and respectively have 10/100 classes to classify. They consist of 50,000 training images and 10,000 test images. The image size is .

  • Cinic10 [Cinic10] is a combination of Cifar10 and ImageNet [ImageNet], which contains 10 classes. It contains 90,000 samples each for training, validation, and test. We do not use the validation set. The image size is .

  • FeMnist [LEAF] is built by partitioning the data in Extended MNIST [EMNIST] based on the writer of the digit/character. There are 62 digits and characters in all. The total number of training samples is 805,263. There are 3,550 users, and each user owns 226.8 samples on average. We only use users (i.e., 355 users). For each user, we take of the samples to construct the global test set. We resize the images to .

For centralized training, we correspondingly use the training set and test set of the first six datasets. For FeMnist, we centralize users’ training samples as the training set. For decentralized training (i.e., FL), we split the training set of the first six datasets according to Dirichlet distributions, as done in previous FL works [FedDF, NonIID-Quag, FedMA]. Specifically, we split the training set onto clients, and each client’s label distribution is generated from . For FeMnist, we directly take the 355 users as clients. Some of these datasets are utilized in previous FL works; for example, Cifar10/Cifar100/Cinic10 are recommended by FedML [FedML], and FeMnist is recommended by LEAF [LEAF].
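The Dirichlet-based split described above is a standard recipe in FL experiments; the sketch below is one minimal way to implement it, assuming `alpha` as the concentration parameter and equal treatment of all classes (illustrative, not the authors' exact code).

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices so each client's label distribution is
    drawn from Dir(alpha); smaller alpha means more non-i.i.d."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # proportions of class c assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices
```

With a small `alpha` (e.g., 0.1), most clients end up dominated by a few classes, reproducing the heterogeneity studied in the paper.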

Figure 13: Network architectures with PANs. “PE” denotes position encoding; “SC” denotes shortcut. For ResNet, we only show one convolution layer in the basic block and omit the BatchNorm layers for simplification.
1:  Input: parameters ; shuffle probability
2:  Generate-Permutation-Matrix: ,
3:  for each layer  do
4:     ,
5:  end for


1:  Input: number of neurons ; shuffle probability
2:  Initialize:
3:  for  do
4:     sample from
5:     if then
6:  end for
Algorithm 1 Shuffle Process
1:  Input: shuffle probability ; expected shuffle times ; number of local epochs ; batch size ; number of local samples
2:  for each client  do
3:     Calculate the number of local update steps:
4:     for each local step in  do
5:        if run the ShuffleProcess with shuffle probability
6:     end for
7:  end for
Algorithm 2 Shuffle Process in FL

Appendix B Network Details

We utilize MLP, VGG [VGG], and ResNet [ResNet] in this paper. We detail their architectures as follows:

  • MLP denotes a multi-layer perceptron with four layers, including the input and output layers. For Mnist and FeMnist, the input size is . MLP has the architecture: FC1(784, 1024), ReLU(), FC2(1024, 1024), ReLU(), FC3(1024, 1024), ReLU(), FC4(1024, ). denotes the number of classes.

  • VGG contains a series of networks with various depths. The VGG paper [VGG] presents VGG11, VGG13, VGG16, and VGG19. We follow their architectures and report the configuration of VGG11 as an example: 64, M, 128, M, 256, 256, M, 512, 512, M, 512, 512, M. “M” denotes a max-pooling layer. VGG11 contains 8 convolution blocks and three fully-connected layers in [VGG]; however, we only use one fully-connected layer for classification in this paper. VGG9 is commonly utilized in previous FL works [FedMA, FedDF], whose configuration is: 32, 64, M, 128, 128, M, 256, 256, M. We keep all the fully-connected layers in VGG9 for a fair comparison with other works. The three fully-connected layers in VGG9 are: FC(4096, 512), ReLU(), FC(512, 512), ReLU(), FC(512, ). We name the th convolution layer in VGG “Conv”. We do not use BatchNorm [BN] in VGG by default.

  • ResNet introduces residual connections to plain neural networks. We take the Cifar versions used in the paper [ResNet], i.e., ResNet20 with the basic block. We set the initial channel number to 64 (i.e., the output channel of the first convolution layer), and take nine consecutive basic blocks with 64, 64, 64, 128, 128, 128, 256, 256, 256 channels, respectively. We add a fully-connected layer for classification. We use BatchNorm [BN] in ResNet20, added before the ReLU activation.
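Since the MLP above is fully specified, its forward pass can be sketched directly. The numpy version below is for illustration only (the experiments presumably use a standard deep learning framework), with `num_classes` standing in for the elided output size.

```python
import numpy as np

def build_mlp(num_classes, seed=0):
    """FC1(784,1024) -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4."""
    rng = np.random.default_rng(seed)
    sizes = [784, 1024, 1024, 1024, num_classes]
    return [(0.01 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # ReLU on all but the output layer
            x = np.maximum(x, 0)
    return x
```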

For these networks with PANs, we plot the demos in Fig. 13. We add PE before the ReLU activation layer and after the BatchNorm layer. We show the formulations of additive PANs and multiplicative PANs in the table of Fig. 13.
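As a rough illustration of how position encodings might be fused into pre-activations, the sketch below shows one plausible additive form and one multiplicative form. The sinusoidal encoding and the scaling factor `A` are assumptions made here for concreteness; the exact formulations are given in the table of Fig. 13.

```python
import numpy as np

def position_encoding(n_neurons):
    """Fixed per-position values shared across all clients; a
    sinusoidal choice here purely for illustration."""
    return np.sin(np.arange(n_neurons))

def pan_additive(pre_act, A):
    """Additive PAN: add scaled position encodings before ReLU."""
    return np.maximum(pre_act + A * position_encoding(pre_act.shape[-1]), 0)

def pan_multiplicative(pre_act, A):
    """Multiplicative PAN: position-dependent scaling before ReLU."""
    return np.maximum(pre_act * (1 + A * position_encoding(pre_act.shape[-1])), 0)
```

Because every client uses the same fixed encodings, a neuron's output depends on its position, which is what couples neurons to positions and discourages dislocation.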

Figure 14: Left: shuffle error with various and (PAN). Right: the difference between PAN and PAN (=1). (MLP)
Figure 15: Left: shuffle error with various and (PAN). Right: the difference between PAN and PAN (=1). (ResNet20)
Figure 16: Weight divergence with PANs off/on. (, MLP on Mnist.)
Figure 17: Weight divergence with PANs off/on. (, VGG9 on Cifar10.)

Appendix C Hyper-parameter Details

For both centralized training and decentralized training (i.e., FL), we take a constant learning rate without scheduling, although some works have pointed out that decaying the learning rate helps in FL [FL-Schedule]. We take SGD with momentum 0.9 as the optimizer by default unless otherwise stated. For MLP and VGG networks, we set the learning rate to 0.05; for ResNet, we use 0.1. We use a warm start with 100 training steps for centralized training and 10 training steps for decentralized training (during local training). We use batch size 10 for FeMnist and 64 for the other datasets.

We use FedAvg [FedAvg], FedProx [FedProx], FedOpt [FedOpt], Scaffold [Scaffold], and MOON [MOON] as base FL algorithms. For all of these algorithms, we take communication rounds and select clients during each round. Each client updates the global model on its private data for epochs. For FedProx, the regularization coefficient of the proximal term is tuned in , and the best one is reported. For FedOpt, we take SGD with momentum 0.9 as the global optimizer and tune the global learning rate in , which is similar to FedAvgM [FedAvgM]. We also try using Adam as the global optimizer and find the performances are not stable. For Scaffold, we use a publicly available implementation. For MOON, we set the coefficient of the contrastive loss to , as recommended by the authors. We then replace the normal neurons with the proposed PANs to improve these algorithms. We keep by default and tune hyper-parameters from: PAN with , PAN with , PAN with .
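For concreteness, the coordinate-based averaging underlying FedAvg and the proximal term of FedProx referenced above can be sketched as follows; the helper names are illustrative and the tuned coefficient is written `mu` here.

```python
import numpy as np

def fedavg_aggregate(client_params, weights):
    """Coordinate-based parameter averaging across clients.
    client_params: list (per client) of lists of parameter arrays."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return [sum(wi * p[l] for wi, p in zip(w, client_params))
            for l in range(len(client_params[0]))]

def fedprox_penalty(local_params, global_params, mu):
    """FedProx proximal term (mu/2) * ||w - w_global||^2, added to
    the local objective during client updates."""
    return 0.5 * mu * sum(np.sum((w - g) ** 2)
                          for w, g in zip(local_params, global_params))
```

Coordinate-based averaging is exactly the operation that PANs aim to make meaningful: it assumes neuron i on one client corresponds to neuron i on every other client.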

Figure 18: Optimal assignment matrix with PANs off/on, left vs. right. (, , VGG9 Conv6 on Cifar10.)
Figure 19: Optimal assignment matrix with PANs off/on, left vs. right. (, , MLP FC3 on Mnist.)
Figure 20: Preference vectors with PANs off/on, left vs. right. (, VGG9 Conv5 on Cifar10.)
Figure 21: Preference vectors with PANs off/on, left vs. right. (, VGG9 Conv4 on Cifar10.)
Figure 22: Preference vectors with PANs off/on, left vs. right. (, MLP FC4 on Mnist.)
Figure 23: Preference vectors with PANs off/on, left vs. right. (, MLP FC3 on Mnist.)
Figure 24: Comparison results on non-i.i.d. data (=0.1). Rows show datasets and columns show FL algorithms. PANs could universally improve these algorithms.
Figure 25: Comparisons under various levels of non-i.i.d. data on Cifar10. Smaller implies more non-i.i.d. data.
Figure 26: Comparisons under various levels of non-i.i.d. data on Cifar100. Smaller implies more non-i.i.d. data.
Figure 27: Comparisons under different FL scenes (, ) based on Scaffold.
Figure 28: Hyper-parameter analysis on Cifar10 with VGG11.
Figure 29: Hyper-parameter analysis on Cifar100 with ResNet20.
                                        FeMnist  GTSRB  SVHN   Cifar10  Cifar100  Cinic10
Network                                 MLP      VGG9   VGG9   VGG11    ResNet20  ResNet20
SGD + Momentum=0.9 (LR in {0.05, 0.1})  53.39    86.96  89.93  84.57    70.82     82.76
Adam (LR=3e-4)                          54.25    90.84  91.13  87.13    67.22     81.99
Table 3: The performances of centralized training with corresponding networks (without PANs), i.e., the upper bound of decentralized training (FL).
Figure 30: Performances of centralized training with PANs. The two parts respectively show the results of additive PANs and multiplicative PANs.

Appendix D Experimental Details

d.1 Shuffle Test and Shuffle Test in FL

We propose a procedure to measure the degree of permutation invariance of a given neural network, that is, how large the shuffle error is after shuffling the neurons. The shuffle process is shown in Alg. 1, where controls the disorder level of the constructed permutation matrices. Some additional notes: (1) the permutation matrix (PM) is randomly generated, and we do not need to solve for it; (2) PMs are introduced only to verify that PANs can disable the permutation invariance of neural networks; they are not used in our FedPAN algorithm; (3) the computational complexity is , requiring at most J swaps, which is very efficient during simulation.

We introduce the shuffle test in the body of the paper. Specifically, we manually shuffle the network and study the output change, i.e., the shuffle error defined in the body. A hyper-parameter is used to control the disorder of the permutation. Given a , we generate a permutation matrix and calculate how many neurons are not shuffled via “=np.mean(np.diag())”. We use the functions provided by the NumPy package. This is calculated, and its correspondence to is shown in the body. The shuffle process is also applied to FL; the pseudo-code is presented in Alg. 2. In expectation, the model is shuffled times during local training. Hence, we calculate the corresponding as the fraction of diagonal ones after several accumulated permutations, i.e., “=np.mean(np.diag())”, where , , and denote the permutation matrices generated at each local update step. We simulate the process for a single layer 10 times and report the averaged . We keep and show the relation between and in the body.
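The shuffle process and the computation of the unshuffled ratio can be sketched as follows, assuming the partial-shuffle scheme of Alg. 1 (each index swapped with a random partner with probability p); the helper names are illustrative.

```python
import numpy as np

def generate_permutation(n, p, seed=0):
    """Partial shuffle as in Alg. 1: each index is swapped with a
    random partner with probability p (at most n swaps, O(n))."""
    rng = np.random.default_rng(seed)
    perm = np.arange(n)
    for i in range(n):
        if rng.random() < p:
            j = int(rng.integers(n))
            perm[i], perm[j] = perm[j], perm[i]
    return perm

def unshuffled_ratio(perm):
    """r = np.mean(np.diag(P)) for P the matrix form of perm."""
    P = np.eye(len(perm))[perm]
    return float(np.mean(np.diag(P)))
```

With p = 0 the permutation is the identity (ratio 1), and the ratio shrinks toward zero as p grows, matching the role of the disorder hyper-parameter.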

Appendix E Additional Experimental Results

Shuffle Error on Random Data: We investigate the shuffle error by taking random data as input in the body, where we only present results based on VGG13. We report similar results on MLP and ResNet20 in Fig. 14 and Fig. 15. Multiplicative PANs with a larger A make the network more sensitive to neuron permutation.

Weight Divergence: Our proposed PANs can decrease the weight divergence during FL. Specifically, we split the training data onto clients with and select all clients in each round, i.e., . We take communication rounds and then calculate the local gradient variance as an approximation. We vary the number of local epochs . In the body, we only report the results on Mnist with . Additional results on Mnist with (Fig. 16) and Cifar10 with (Fig. 17) further verify that PANs decrease the local gradient variance.
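The local gradient variance used as an approximation above can be sketched as the per-coordinate variance of clients' local updates, averaged over coordinates; the exact normalization used in the paper may differ.

```python
import numpy as np

def local_update_variance(client_deltas):
    """Mean per-coordinate variance of local updates across clients;
    lower values indicate smaller weight divergence."""
    deltas = np.stack([np.ravel(d) for d in client_deltas])
    return float(np.mean(np.var(deltas, axis=0)))
```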

Matching via Optimal Assignment: We first train a global model via FL for communication rounds, where the scene contains clients with . Then, we randomly sample a local client and update the global model for epochs. Our goal is to search for a matrix that matches the neurons of the global model and the updated one, i.e., the local model of this client. We use 500 test samples to obtain the neurons’ activations as their representations. The optimal assignment problem can then be solved, and the assignment matrix is a permutation matrix. The results on various layers of VGG9 and MLP are shown in Fig. 18 and Fig. 19. Notably, the calculated matching ratio, i.e., the number in “[]”, is only an approximate value representing how many neurons are shuffled. The absolute value (e.g., 0.062) does not reflect the actual permutation during training.
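The optimal-assignment step can be sketched with `scipy.optimize.linear_sum_assignment`; the correlation-based similarity below is an assumption made for illustration — the paper may use a different activation similarity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_neurons(acts_global, acts_local):
    """Match neurons of the global and locally updated models via
    optimal assignment on activation similarity (samples x neurons)."""
    n = acts_global.shape[1]
    # cross-correlation between the two sets of activation vectors
    sim = np.corrcoef(acts_global.T, acts_local.T)[:n, n:]
    row, col = linear_sum_assignment(-sim)       # maximize total similarity
    matching_ratio = float(np.mean(col == row))  # neurons kept in place
    return col, matching_ratio
```

The returned ratio is the analogue of the number shown in "[]" in Figs. 18-19: the fraction of neurons the assignment keeps at their original positions.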

Visualizing Neurons via Preference Vectors: Similarly, more visualization results via neurons’ preference vectors are provided in Fig. 20, Fig. 21, Fig. 22, and Fig. 23. Notably, there are only 10 neurons in Fig. 22 because FC4 is the output layer with 10 classes. Using PANs encourages neurons at the same position to contribute to the same classes as much as possible.

Universal Application of PANs: We report the results of applying PANs to popular FL algorithms on FeMnist, Cifar10, Cifar100, and Cinic10 in the body, and show the results on SVHN and GTSRB in Fig. 24. Training on GTSRB is not stable, and some algorithms, e.g., FedAvg and FedOpt, converge slower. This could be improved with additional learning-rate tuning, which we omit in this paper. Comparison results on Cifar10 and Cifar100 under various levels of non-i.i.d. data are shown in Fig. 25 and Fig. 26, and the improvements under various scenes based on Scaffold are shown in Fig. 27. These additional results further verify that PANs universally improve the performance of FL.

Hyper-parameter Analysis: We present the performances of various with PAN when in the body and point out that setting is a good choice. Here, we present a more comprehensive analysis with both additive and multiplicative PANs. The FL scene used is: , , , , . We plot the results on Cifar10 with VGG11 and on Cifar100 with ResNet20 in Fig. 28 and Fig. 29. The leftmost point shows the baseline performance. The four parts in different colors show the results with various or , while the other is fixed. For example, the first part shows the performances with in PAN while is fixed to 0.05. Clearly, with fixed , a larger leads to degradation (the green and red parts). Setting around 0.1 for PAN is recommended. The results on Cifar100 are more invariant to , although the performances fluctuate a lot on Cifar10. Many of these hyper-parameter choices surpass the baseline.

Figure 31: Model fusion of MLP on Mnist (Left) and VGG9 on Cifar10 (Right) with direct parameter averaging, optimal transport, and PANs. The x-axis shows the interpolation coefficient.

Figure 32: Comparisons of different normalization techniques in ConvNet. The top is based on VGG11 and the bottom is based on ResNet20. We use datasets Cifar10 and Cifar100.

Appendix F More Studies

f.1 Centralized Training

We report the test accuracies of centralized training on FeMnist, GTSRB, SVHN, Cifar10, Cifar100, and Cinic10. The corresponding networks are MLP, VGG9, VGG9, VGG11, ResNet20, and ResNet20, and the numbers of training epochs are 30, 20, 30, 30, 100, and 100, respectively. We utilize both SGD with momentum 0.9 and Adam as optimizers. For SGD, we use a learning rate of 0.05 for MLP and VGG and 0.1 for ResNet20; for Adam, we use 0.0003 for all networks. The performances are listed in Tab. 3. We then add PANs on some datasets and find that the performances degrade slightly. We vary the hyper-parameter in PANs while keeping . The results are shown in Fig. 30. Using PANs slightly harms centralized training, and commonly, a larger makes the results worse. Even when utilizing the adaptive optimizer Adam, the results with PANs do not improve. Advanced optimizers should be proposed to mitigate this degradation, which we leave for future work.

f.2 Optimal Transport for Model Fusion

FL sends the global model down to local clients as the initialization at each communication round; without this, coordinate-based parameter averaging becomes worse. The work [OTFusion] studies model fusion with different initializations and utilizes optimal transport [Barycenter] to align model parameters. We split Mnist and Cifar10 into two parts uniformly and train independent models on the two sets. The models obtained after training epochs are denoted as and . Then, an interpolation is evaluated, i.e., , . Directly averaging these two models performs poorly, as shown in Fig. 31 (the line with legend “Avg”). If we align the models via optimal transport and then interpolate the aligned models, the results become better (the line with legend “OT+Avg” in Fig. 31). We further add PANs during model training, and the performances are slightly improved (the line with legend “PANs+OT+Avg” in Fig. 31). This shows that PANs may still be helpful with different initializations.
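The interpolation evaluated above is plain coordinate-wise averaging of parameters; a minimal sketch (any alignment via OT or PANs would be applied before this step):

```python
import numpy as np

def interpolate_models(params_a, params_b, t):
    """Coordinate-wise interpolation (1 - t) * theta_A + t * theta_B,
    the quantity swept along the x-axis of Fig. 31."""
    return [(1 - t) * wa + t * wb for wa, wb in zip(params_a, params_b)]
```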

f.3 BatchNorm vs. GroupNorm

We then investigate normalization techniques in deep neural networks. Previous FL works point out that GroupNorm may be more applicable to FL with non-i.i.d. data [NonIID-Quag]. Specifically, BatchNorm calculates the mean and variance of a data batch, which depends on the local training data; hence, the statistical information in BatchNorm diverges a lot across clients. One solution is to aggregate the statistical information during FL, i.e., to average the “running mean” and “running variance” in BatchNorm. We denote this as “BN-Y”. In contrast, we use “BN-N” to represent the method where “running mean” and “running variance” are not aggregated. We also vary the number of groups in GroupNorm, i.e., , denoted as “GN1”, “GN2”, “GN8”, and “GN32”. We show the convergence curves on Cifar10 and Cifar100 in Fig. 32, using VGG11 and ResNet20 as backbones. The numbers in the legends denote the final test accuracies. GroupNorm only improves the performance of Cifar10 with VGG11, and setting the number of groups to 1 is better. We also apply PANs to networks with “GN1” and find the performance does not improve. Combining PANs with various normalization techniques is also interesting, which is left for future work.
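The "BN-Y" aggregation of BatchNorm statistics can be sketched as a weighted average of clients' running statistics; the equal client weights below are an assumption for illustration.

```python
import numpy as np

def aggregate_bn_stats(client_means, client_vars, weights=None):
    """"BN-Y": average clients' running_mean / running_var alongside
    the model weights (equal client weights assumed by default)."""
    k = len(client_means)
    w = np.asarray(weights, dtype=float) if weights is not None \
        else np.full(k, 1.0 / k)
    w = w / w.sum()
    mean = sum(wi * m for wi, m in zip(w, client_means))
    var = sum(wi * v for wi, v in zip(w, client_vars))
    return mean, var
```

Note that simply averaging variances ignores the spread of the client means; this matches plain FedAvg-style aggregation of buffers rather than an exact pooled variance.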

f.4 Personalization in FL

Finally, we present some possible variants of PANs for personalization in FL. In the body of this paper, we use the same position encodings across clients and implicitly bind neurons to their positions. However, if we used different or partially shared position encodings across clients, we could let similar clients contribute more to each other. Some clients could own individual positions, which could be utilized for personalization. These ideas are also left for future work.