1 Introduction
Federated Learning (FL) [FedAdvances, FedConcept] builds a global model by collaborating with isolated clients, enabling privacy protection and efficient distributed training, and generally follows the parameter server architecture [ParameterServer, LargeScaleNet]. Clients update models on their devices using private data, and the server periodically averages these models over multiple communication rounds [FedAvg]. The whole process does not transmit users' data and meets basic privacy requirements.
Represented by FedAvg [FedAvg], many FL algorithms aggregate local parameters via simple coordinate-based averaging [FedProx, FedRS, MOON, FedPHP]. These algorithms have two kinds of drawbacks. First, as traditional neurons are unaware of their positions, neural networks have the permutation invariance property, implying that hidden neurons could be dislocated during training without affecting local performances. Second, the samples across clients are non-independent and identically distributed (non-i.i.d.) [NonIIDQuag], which could exacerbate the permutation of neural networks during local training, making local models misaligned and leading to weight divergence [FedNonIIDData]. These factors degrade the performance of coordinate-based parameter averaging.
Recently, a series of works utilize various matching techniques to align neurons, such as Bayesian nonparametric learning [BFNM, SMAPM, FedMA] and optimal transport [Barycenter, OTFusion]. First, these methods are too complex to implement. Second, they solve the misalignment problem after finishing local updates and hence belong to post-processing strategies that need additional computation budgets. Fed2 [Fed2] pioneers a novel aspect via designing feature-oriented model structures in a pre-aligned manner. However, it has to carefully customize the network architecture and only achieves group-level pre-alignment. By contrast, we explore a more straightforward and general technique to pre-align neurons during local training procedures.
Our work mainly focuses on solving the non-i.i.d. challenge in FL, more specifically, seeking solutions via limiting the permutation invariance property of neural networks. We first summarize the above analysis: the permutation invariance property of neural networks leads to neuron misalignment across local models, and the more heterogeneous the data, the more serious the misalignment. Hence, our motivation is intuitive: could we design a switch to control the permutation invariance property of neural networks? We propose Position-Aware Neurons (PANs) as the solution, which couple neurons with their positions. Specifically, for each neuron (channel for ConvNets [AlexNet, VGG, ResNet]), we add or multiply a position-related value (i.e., a position encoding) to its output. We introduce a hyperparameter to turn PANs on/off and, correspondingly, to disable/enable the permutation invariance property of neural networks. PANs bind neurons to their positions, implicitly pre-aligning neurons across clients even when faced with non-i.i.d. data. From another aspect, PANs keep some consistent ingredients in the forward and backward passes across local models, which could reduce weight divergence. Overall, appropriate PANs facilitate coordinate-based parameter averaging in FL. Replacing traditional neurons with PANs is simple to implement and computationally friendly, and it is universal to various FL algorithms. Contributions can be briefed as: (1) proposing PANs to disable/enable the permutation invariance property of deep networks; (2) applying PANs to FL, which binds neurons to positions and pre-aligns parameters for better coordinate-wise parameter averaging.
2 Related Works
FL with Non-I.I.D. Data: Existing works solve the non-i.i.d. data problem in FL from various aspects. [FedNonIIDData] points out the weight divergence phenomenon in FL and uses shared data to decrease the divergence. FedProx [FedProx] adds a proximal term during local training as regularization. FedOpt [FedOpt] updates the global model via momentum or adaptive optimizers (e.g., Adam [Adam], Yogi [Yogi]) instead of simple parameter averaging. Scaffold [Scaffold] introduces control variates to rectify local update directions and mitigate the influence of client drift. MOON [MOON] utilizes model contrastive learning to reduce the distance between local and global models. Other works utilize similar techniques, including dynamic regularization [FedDyn], ensemble distillation [FedDF, OnlineDistill], etc. We take several representative FL algorithms and use PANs to improve them.
FL with Permutation Invariance Property: The permutation invariance of neural networks could lead to neuron misalignment. PFNM [BFNM] matches local models' parameters via the Beta-Bernoulli Process [BBP] and the Indian Buffet Process [IBP], formulating an optimal assignment problem and solving it via the Hungarian algorithm [Hungarian]. SPAHM [SMAPM] applies the same procedure to aggregate Gaussian topic models, hidden Markov models, and so on. FedMA [FedMA] points out that PFNM does not apply to large-scale networks and proposes a layer-wise matching method. [OTFusion] utilizes optimal transport [Barycenter] to fuse models with different initializations. These methods are all post-processing ones that need additional computation costs. Fed2 [Fed2] is recently proposed to align features during local training via separating features into different groups. However, it needs carefully designed architectures. Differently, we take a more fine-grained alignment of neurons rather than network groups, and we will show our method is more general.
Position Encoding: Position encoding is popular in sequence learning architectures, e.g., ConvS2S [ConvS2S] and Transformer [Transformer]. These architectures take position encodings to consider order information. Relative position encoding [RPE] is more applicable to sequences with various lengths. Some other studies are devoted to interpreting what position encodings learn [PEInBERT, WhatPELearn]. Another interesting work applies position encodings instead of zero-padding to GANs [PEGAN] as a spatial inductive bias. Differently, we resort to position encodings to bind neurons to their positions in FL. Furthermore, these works only consider position encodings at the input layer, while we couple them with neurons.
3 Position-Aware Neurons
In this section, we investigate the permutation invariance of neural networks and introduce PANs to control it.
3.1 Permutation Invariance Property
Assume an MLP network has $L$ layers (containing the input and output layers), and each layer contains $d_l$ neurons, where $l \in \{0, 1, \ldots, L-1\}$ is the layer index. $d_0$ and $d_{L-1}$ are the input and output dimensions. We denote the parameters of each layer as the weight matrix $W^l \in \mathbb{R}^{d_l \times d_{l-1}}$ and the bias vector $b^l \in \mathbb{R}^{d_l}$, $l \in \{1, \ldots, L-1\}$. The input layer does not have parameters. We use $h^l$ as the activations of the $l$-th layer. We have $h^l = \sigma^l(W^l h^{l-1} + b^l)$, where $\sigma^l(\cdot)$ is the element-wise activation function, e.g., ReLU [relu]. $\sigma^{L-1}$ denotes no activation function in the output layer. Sometimes, we use $f(x) = w^\top \sigma(Wx + b)$ to represent a network with only one hidden layer and output dimension one (called MLP0), where $x \in \mathbb{R}^{d_0}$, $W \in \mathbb{R}^{d_1 \times d_0}$, $b \in \mathbb{R}^{d_1}$, $w \in \mathbb{R}^{d_1}$. We use $P \in \{0,1\}^{d \times d}$ as a permutation matrix that satisfies $P\mathbf{1} = \mathbf{1}$ and $P^\top \mathbf{1} = \mathbf{1}$. Easily, we have some properties: $P^\top P = P P^\top = I$, $P(a \odot b) = (Pa) \odot (Pb)$, where $I$ is the identity matrix and $\odot$ denotes the Hadamard product. If $\sigma(\cdot)$ is an element-wise function, $\sigma(Pa) = P\sigma(a)$. For MLP0, we have $(P^\top w)^\top \sigma((PW)x + Pb) = w^\top \sigma(Wx + b) = f(x)$, implying that if we permute the parameters properly, the output of a certain neural network does not change, i.e., the permutation invariance property. Extending it to MLP, the layer-wise permutation process is
$$W^l_{\mathrm{sf}} = P^l W^l (P^{l-1})^\top, \qquad b^l_{\mathrm{sf}} = P^l b^l, \qquad (1)$$
where $P^0 = I$ and $P^{L-1} = I$, meaning that the input and output layers are not shuffled. For ConvNets [AlexNet, VGG], we take convolution kernels as basic units. The convolution parameters could be denoted as $W^l \in \mathbb{R}^{c_o \times c_i \times k_1 \times k_2}$, where the four dimensions denote the number of output/input channels ($c_o$, $c_i$) and the kernel size ($k_1$, $k_2$). The permutation could be similarly applied to the channel dimensions as $W^l_{\mathrm{sf}} = P^l W^l (P^{l-1})^\top$. For ResNet [ResNet], we use the same $P^l$ to permute all parameters in a basic block including the shortcut (if the shortcut is not used, $P^l = I$).
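The invariance above can be checked numerically. Below is a minimal NumPy sketch (all sizes and values are hypothetical) that permutes the hidden layer of an MLP0-style network and confirms the output is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP0: f(x) = w^T relu(W x + b), with a hidden width of 5.
d_in, d_hid = 4, 5
W = rng.normal(size=(d_hid, d_in))
b = rng.normal(size=d_hid)
w = rng.normal(size=d_hid)
x = rng.normal(size=d_in)

relu = lambda z: np.maximum(z, 0.0)

def f(W, b, w, x):
    return w @ relu(W @ x + b)

# A random permutation matrix P (exactly one 1 per row and column).
P = np.eye(d_hid)[rng.permutation(d_hid)]

# Permute the hidden layer consistently: W -> P W, b -> P b, w -> P w.
out_orig = f(W, b, w, x)
out_perm = f(P @ W, P @ b, P @ w, x)

assert np.allclose(out_orig, out_perm)  # permutation invariance holds
```

Because the element-wise nonlinearity commutes with the permutation, the re-ordered hidden neurons produce an identical output for any input.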
3.2 Position-Aware Neurons
The essential reason for the permutation invariance of neural networks is that neurons have nothing to do with their positions. Hence, an intuitive improvement is fusing position-related values (position encodings) into neurons. We propose Position-Aware Neurons (PANs), adding or multiplying position encodings to neurons' outputs, i.e.,
$$h^l = \sigma^l\big(W^l h^{l-1} + b^l + \pi^l\big), \qquad (2)$$
$$h^l = \sigma^l\big((W^l h^{l-1} + b^l) \odot \pi^l\big), \qquad (3)$$
where $\pi^l$ denotes position encodings that are only related to positions and are not learnable. We use "$+$" and "$\times$" to represent additive and multiplicative PANs, respectively. We use sinusoidal functions to generate $\pi^l$ as commonly used in previous position encoding works [Transformer], i.e.,
$$\pi^{+}_i = A \sin(2\pi i / T), \qquad (4)$$
$$\pi^{\times}_i = 1 + A \sin(2\pi i / T), \qquad (5)$$
where $T$ and $A$ respectively denote the period and amplitude of the position encodings, and $i$ is the position index of a neuron. For ConvNets, we assign position encodings to each channel, and $i$ is the channel index. Notably, if we take $A = 0$, PANs degenerate into normal neurons ($\pi^{+} = 0$ or $\pi^{\times} = \mathbf{1}$). In practice, we only apply PANs to the hidden layers, while the input and output layers remain unchanged, i.e., no PANs for $l \in \{0, L-1\}$. With PANs, the permutation process in Eq. 1 could be reformulated as
$$h^l_{\mathrm{sf}} = \sigma^l\big(P^l W^l (P^{l-1})^\top h^{l-1}_{\mathrm{sf}} + P^l b^l + \pi^l\big), \qquad (6)$$
$$h^l_{\mathrm{sf}} = \sigma^l\big((P^l W^l (P^{l-1})^\top h^{l-1}_{\mathrm{sf}} + P^l b^l) \odot \pi^l\big), \qquad (7)$$
where the subscript "sf" denotes "shuffled" (or permuted). To measure the output change after shuffling, we define the shuffle error as
$$\mathrm{Err} = \mathbb{E}_x\big[\|f_{\mathrm{sf}}(x) - f(x)\|\big], \qquad (8)$$
and this error on MLP0 without considering the bias (i.e., $b = 0$) is
$$\mathrm{Err} \approx \mathbb{E}_x\Big[\big|\big(\tfrac{\partial f}{\partial \pi}\big)^\top (P^\top \pi - \pi)\big|\Big], \qquad \tfrac{\partial f}{\partial \pi} = w \odot \sigma'(Wx + \pi), \qquad (9)$$
where we take $f$ as a function of $\pi$ and take a Taylor expansion as an approximation. Obviously, the shuffle error is closely related to the strength of the permutation, i.e., $\|P^\top \pi - \pi\|$. For example, if $P = I$, the network is not shuffled and the outputs are kept unchanged. Then, if we take equal values as position encodings, i.e., $\pi = c\mathbf{1}$, the output also does not change because $P^\top \pi = \pi$; this can be obtained via taking $A = 0$. If we instead take a nonzero amplitude (e.g., 1 or 0.05), Err is generally nonzero because $P^\top \pi \neq \pi$. The error of multiplicative PANs is similar. We abstract PANs as a switch: if we take equal/varied position encodings, PANs are turned off/on, and hence the network keeps/loses the permutation invariance property (i.e., the same/different outputs after permutation). As illustrated on the left of Fig. 1, the five neurons of a certain hidden layer are shuffled while the position encodings they are going to add/multiply are not shuffled, and the outputs will change with PANs turned on.
Furthermore, are there any essential differences between additive and multiplicative PANs, and how much influence do they have on the shuffle error? In Eq. 9, the shuffle error is partially determined by $\partial f / \partial \pi$, and we extend this gradient to an MLP with multiple layers. We assume all layers have the same number of neurons (i.e., $d_l = d$) and take the same position encodings (i.e., $\pi^l = \pi$). We denote $z^l = W^l h^{l-1} + b^l$ and obtain the recursive gradient expressions:
$$\frac{\partial h^l}{\partial \pi} = \mathrm{diag}\big(\sigma'(z^l + \pi)\big)\Big(W^l \frac{\partial h^{l-1}}{\partial \pi} + I\Big), \qquad (10)$$
$$\frac{\partial h^l}{\partial \pi} = \mathrm{diag}\big(\sigma'(z^l \odot \pi)\big)\Big(\mathrm{diag}(\pi)\, W^l \frac{\partial h^{l-1}}{\partial \pi} + \mathrm{diag}(z^l)\Big), \qquad (11)$$
where $\mathrm{diag}(\cdot)$ transforms a vector into a diagonal matrix and $\sigma'(\cdot)$ denotes the gradient of the activation function, whose elements are 0 or 1 for ReLU. If we expand Eq. 10 and Eq. 11 correspondingly, we will find that the gradient of additive PANs does not explicitly rely on $\pi$. However, for the multiplicative one, $\partial h^l / \partial \pi$ is relevant to $\pi$ and $z^l$, which could lead to a polynomial term in $\pi$ (resulting, informally, from the product of $\mathrm{diag}(\pi)$ factors across layers). Hence, we conclude: taking PANs as a switch could control the permutation invariance property of neural networks, and the designed multiplicative PANs make this switch more sensitive.
4 FL with PANs
In this section, we briefly introduce FedAvg [FedAvg] and analyze the effects of PANs when applied to FL.
4.1 FedAvg
Suppose we have a server and $K$ clients with various data distributions. FedAvg first initializes a global model on the server. Then, a small fraction (i.e., a ratio $R$) of clients download the global model and update it on their local data for $E$ epochs, and then upload the updated models to the server. Then, the server takes a coordinate-based parameter averaging, i.e., $\bar{\theta} = \sum_k \frac{n_k}{n} \theta_k$, where $n_k$ is the number of samples on the $k$-th client and $n = \sum_k n_k$. Next, $\bar{\theta}$ will be sent down for a new communication round, and this process is repeated for multiple communication rounds. Because the parameters could be misaligned during local training, some works [BFNM, FedMA, SMAPM] are devoted to finding the correspondences between clients' uploaded neurons for better aggregation. For example, the parameters $W^l_i$ and $W^l_j$ of two clients may be misaligned, and we should search for proper permutation matrices to match them before averaging, i.e., averaging $P^l_i W^l_i$ and $P^l_j W^l_j$ rather than the raw parameters [OTFusion]. However, searching for appropriate permutation matrices is challenging. Generally, these works require additional data to search for a proper alignment. In addition, the matching process typically has to solve complex optimization problems, such as optimal transport or optimal assignment, leading to additional computational overhead. An intuitive question is: could we pre-align the neurons during local training instead of post-matching?
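The server-side averaging step can be sketched in a few lines (a toy illustration; the parameter names and dict layout are hypothetical):

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Coordinate-wise weighted average of client parameters.

    client_params: list of dicts {name: ndarray}, one per client.
    client_sizes: list of local sample counts n_k.
    """
    total = float(sum(client_sizes))
    global_params = {}
    for name in client_params[0]:
        # weight each client's tensor by n_k / n and sum coordinate-wise
        global_params[name] = sum(
            (n / total) * p[name] for p, n in zip(client_params, client_sizes)
        )
    return global_params

# toy check: two clients with sample counts in ratio 1:3
a = {"w": np.array([0.0, 0.0])}
b = {"w": np.array([4.0, 8.0])}
g = fedavg_aggregate([a, b], [1, 3])
# g["w"] = 0.25*[0, 0] + 0.75*[4, 8] = [3.0, 6.0]
```

Note that the sum runs over coordinates without any re-ordering, which is exactly why misaligned neurons hurt: parameters at the same coordinate may come from functionally different neurons.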
4.2 Applying PANs to FL
Replacing traditional neurons with PANs in FL is straightforward to implement. Why does such a subtle improvement help? We heuristically expect PANs in FL could bring such effects:
PANs could limit the dislocation of neurons, since disturbing them brings significant changes to the outputs of the neural network, leading to higher training errors and fluctuations. Theoretically, the forward pass on the $k$-th client with PANs is as follows:
$$h^{l,k} = \sigma^l\big(W^{l,k} h^{l-1,k} + b^{l,k} + \pi^l\big), \qquad (12)$$
$$h^{l,k} = \sigma^l\big((W^{l,k} h^{l-1,k} + b^{l,k}) \odot \pi^l\big). \qquad (13)$$
Notably, the position encodings are commonly utilized across clients, i.e., the forward passes across local clients share some consistent information. Then, the parameters' gradients of Eq. 12 and Eq. 13 can be calculated by:
$$\frac{\partial \mathcal{L}_k}{\partial b^{l,k}} = \sigma'(z^{l,k} + \pi^l) \odot u^{l,k}, \qquad (14)$$
$$\frac{\partial \mathcal{L}_k}{\partial b^{l,k}} = \pi^l \odot \sigma'(z^{l,k} \odot \pi^l) \odot u^{l,k}, \qquad (15)$$
where $z^{l,k} = W^{l,k} h^{l-1,k} + b^{l,k}$, $u^{l,k}$ denotes the error back-propagated from the upper layers, and we only give the gradient of the bias for simplification. The gradients of multiplicative PANs directly contain the same position information across clients (i.e., the shared factor $\pi^l$) in spite of various data distributions (e.g., various $z^{l,k}$ and $u^{l,k}$). For the additive ones, the impact of $\pi^l$ is implicit because $\sigma'(z^{l,k} + \pi^l)$ is related to $\pi^l$; nevertheless, the effect is not as significant as for multiplicative ones. Overall, $\pi^l$ could regularize and rectify local gradient directions, keeping some ingredients consistent during backward propagation. As an extreme case, if the amplitude $A$ of $\pi^l$ is very large, the gradients in Eq. 14 and Eq. 15 will tend to be dominated by the shared $\pi^l$ and become nearly the same across clients, mitigating the weight divergence completely. However, setting $A$ too large will make the neural network difficult to train and the data information is completely covered, so the strength of $\pi^l$ (i.e., $A$) is a tradeoff.
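The shared factor in the multiplicative case can be verified numerically. Below is a sketch (hypothetical sizes; $\pi$ follows the assumed $1 + A\sin$ form) checking that the bias gradient of a one-hidden-layer network with a multiplicative PAN carries the explicit factor $\pi$, in the spirit of Eq. 15:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
w = rng.normal(size=d)
x = rng.normal(size=d)
# assumed multiplicative encodings: pi_i = 1 + A*sin(2*pi*i/T)
pi = 1.0 + 0.2 * np.sin(2 * np.pi * np.arange(d) / 10.0)

relu = lambda z: np.maximum(z, 0.0)

def f(b):
    # multiplicative PAN (Eq. 13 style), scalar output as the "loss"
    return w @ relu((W @ x + b) * pi)

# analytic gradient: dL/db = pi * relu'(z) * w, with z = (Wx + b) * pi
z = (W @ x + b) * pi
grad_analytic = pi * (z > 0).astype(float) * w

# central finite-difference check
eps = 1e-6
grad_fd = np.array([
    (f(b + eps * np.eye(d)[i]) - f(b - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```

Since `grad_analytic` factors as `pi * (...)`, the same position-dependent multiplier appears in every client's bias gradient regardless of the local data.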
5 Experiments
We study how much influence the proposed PANs have on both centralized training and decentralized training (i.e., FL). The datasets used are Mnist [mnist], FeMnist [LEAF], SVHN [Svhn], GTSRB [GTSRB], Cifar10/100 [cifar], and Cinic10 [Cinic10]. FeMnist is recommended by LEAF [LEAF] and FedScale [FedScale]. By default, we use MLP for Mnist/FeMnist, VGG [VGG] for SVHN/GTSRB/Cifar10, and ResNet20 [ResNet] for Cifar100/Cinic10, unless otherwise declared. We sometimes take the VGG9 used in previous FL works [FedMA, FedDF, Fed2]. For centralized training, we use the provided training and test sets correspondingly. For FL, we split the training set according to Dirichlet distributions, where $\alpha$ controls the non-i.i.d. level; a smaller $\alpha$ leads to more non-i.i.d. cases. For each FL scene, we report several key hyperparameters: the number of clients $K$, the client participation ratio $R$, the number of local training epochs $E$, the Dirichlet $\alpha$, and the number of communication rounds. For PANs, we report $T$ and $A$. With $A = 0$, we turn off PANs, i.e., using traditional neurons as the baselines; with $A > 0$, we turn on PANs. We leave PANs turned on by default if the on/off state or the value of $A$ is not mentioned. Details of datasets, networks, and training are presented in Supp.
5.1 Centralized Training
Shuffle Test:
We first propose a procedure to measure the degree of permutation invariance of a certain neural network, that is, how large the shuffle error in Eq. 8 is after shuffling the neurons. We name this procedure the shuffle test. Given a neural network and a batch of data, we first obtain the outputs. Then, we shuffle the neurons of the hidden layers. The shuffle process is shown in Supp, where a parameter $p$ controls the disorder level of the constructed permutation matrices. We then get the outputs after shuffling and calculate the shuffle error. We vary $p$ and plot the ratio of the permutation matrices' diagonal ones (i.e., how many neurons are not shuffled). We denote this ratio as $r$ and plot it in Fig. 2 (average of 10 experiments), where we also show a generated permutation matrix under a given $p$.
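One possible construction of such disorder-controlled permutation matrices (the exact scheme in Supp may differ; this is an assumed variant) is to shuffle each index with probability $p$ and measure the ratio of diagonal ones:

```python
import numpy as np

def partial_permutation(d, p, rng):
    """Permutation matrix where roughly a fraction p of indices are shuffled.

    Hypothetical construction: sample ~p*d indices and cyclically rotate
    them among themselves; the remaining indices stay on the diagonal.
    """
    idx = np.arange(d)
    sel = idx[rng.random(d) < p]
    if len(sel) > 1:
        idx[sel] = np.roll(sel, 1)   # derange the chosen indices
    return np.eye(d)[idx]

def diagonal_ratio(P):
    # fraction of neurons left in place (the ratio r in the shuffle test)
    return np.trace(P) / P.shape[0]

rng = np.random.default_rng(0)
P0 = partial_permutation(100, 0.0, rng)   # p=0: identity, ratio 1
P1 = partial_permutation(100, 1.0, rng)   # p=1: every index is moved
assert diagonal_ratio(P0) == 1.0
assert diagonal_ratio(P1) == 0.0
```

Varying $p$ between these extremes interpolates smoothly between "no shuffle" and "fully shuffled", which is what the shuffle test sweeps over.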
Shuffle Error with Random Data:
With different hyperparameters $A$ and $T$ in Eq. 4/Eq. 5, we use random data generated from Gaussian distributions (i.e., $x \sim \mathcal{N}(0, I)$) to calculate the shuffle error. The results based on VGG13 are shown in Fig. 3. The error is more related to $A$ while less sensitive to $T$. This is intuitive because $T$ controls local volatility while neuron permutation could happen globally, e.g., the first neuron could swap positions with the last neuron. A larger $A$ leads to a larger shuffle error, i.e., the network loses the permutation invariance property more seriously. In addition, the shuffle error based on additive PANs increases linearly, while that based on multiplicative PANs increases more quickly. This verifies the theoretical analysis in Sect. 3.2. However, in practice, a larger $A$ may cause training failure, and we only use small amplitudes for additive and multiplicative PANs (the bold part on the right side of Fig. 3).
Influence on Inference:
We study the influence of PANs on test accuracies. We use MLP on Mnist, VGG13 on SVHN, and ResNet20 on Cifar10. We first train models with various PANs until convergence, and the model performances are shown in the first figure of Fig. 4. The horizontal dotted lines show the accuracies of normal networks, and the solid segments show the results of networks with various PANs. We find that introducing PANs to neural networks does not improve performances, but brings a slight degradation. That is, PANs could make the network somewhat harder to train. More studies of how PANs influence the network predictions could be found in Supp. Then, we investigate the shuffle error reflected by the change of test accuracies. Specifically, we shuffle the trained network to make predictions on the test set. We vary several groups of $A$ and $T$ for PANs. We show the results in the last three figures of Fig. 4. With a larger $p$, i.e., more neurons shuffled, the test accuracy of the network with $A = 0$ does not change (the permutation invariance property). However, a larger $A$ leads to more significant performance degradation for both PAN$^+$ and PAN$^\times$. PAN$^\times$ makes the network more sensitive to shuffling than PAN$^+$ (curves with "$\times$" degrade significantly). With different $T$, the performance degradation is nearly the same, again showing that PANs are robust to $T$. These results verify the conclusions in Sect. 3.2. Overall, PANs work as a tradeoff between model performances and control of permutation invariance.
5.2 Decentralized Training
Then we study the effects of introducing PANs to FL. We first present some empirical studies to verify the pre-alignment effects of PANs, and then show performances.
How many neurons are misaligned in FL?
Although some previous works [BFNM, FedMA, Fed2] declare that neurons could be dislocated when faced with non-i.i.d. data, they do not show this in evidence and do not quantify the degree of misalignment. We present a heuristic method: we manually shuffle the neurons during local training with i.i.d. data and study how much misalignment could cause the performance to drop to the same level as training with non-i.i.d. data. Specifically, during each client's training step (each batch as a step), we shuffle the neurons with a probability $p_s = C \cdot B / (E \cdot N)$, where $B$, $E$, $N$ are respectively the batch size, the number of local epochs, and the number of local data samples. In each shuffle process, we keep the disorder level fixed. $C$ determines how many times the network is expected to be shuffled during local training. Larger $C$ means more neurons are shuffled upon finishing training; the fraction of neurons left unshuffled for each $C$ is shown in Fig. 5, and its calculation is presented in Supp. Then, we show the test accuracies of FedAvg [FedAvg] under various levels of non-i.i.d. data, i.e., various $\alpha$. The results correspond to the three horizontal lines in the bottom three figures of Fig. 5. The scatters in red show the performances of shuffling neurons with various $C$. Obviously, even with i.i.d. data, the larger the $C$, the worse the performance. This implies that neuron misalignment could actually lead to performance degradation. Compared with non-i.i.d. performances, taking Cifar10 as an example, a proper $C$ could make the i.i.d. ($\alpha$=10.0) performance degrade to the same level as the non-i.i.d. ($\alpha$=0.1) one; the corresponding fraction of shuffled neurons then estimates how many neurons are misaligned on each client. This may provide some enlightenment for the quantitative measure of how many neurons are misaligned in FL with non-i.i.d. data.
Do PANs indeed reduce the possibility of neuron misalignment?
We propose several strategies from the aspects of parameters, activations, and preference vectors to compare the neuron correspondences in FL with PANs off/on. When PANs are turned on, we use multiplicative PANs with fixed $T$ and $A$ by default.
I. Weight Divergence: Weight divergence [FedNonIIDData] measures the variance of local parameters. Specifically, we calculate $\frac{1}{K}\sum_k \|W^{l,k} - \bar{W}^l\| / \|\bar{W}^l\|$ for each layer $l$, where $\bar{W}^l$ denotes the averaged parameters. The weight divergences of MLP on Mnist are shown in Fig. 6, where PANs reduce the divergences a lot (the red bars). This corresponds to the explanation in Sect. 4.2 that clients' parameters are partially updated towards the same direction.
II. Matching via Optimal Assignment: We feed 500 test samples into the network and obtain the activations of each neuron as its representation. The neurons' representations of the global and a local model are denoted as $H_g \in \mathbb{R}^{500 \times d}$ and $H_k \in \mathbb{R}^{500 \times d}$, where $d$ is the number of neurons. Then we search for the optimal assignment matrix $M \in \{0,1\}^{d \times d}$ that minimizes $\|H_g - H_k M\|$ and satisfies $M\mathbf{1} = \mathbf{1}$, $M^\top\mathbf{1} = \mathbf{1}$. In fact, $M$ is a permutation matrix that could approximately reflect the disturbance of neurons, and it matches neurons with similar outputs. We plot the solved matching matrix in Fig. 7, where the number in "[]" shows the ratio of the diagonal ones. Using PANs makes the diagonal denser, implying that neurons at the same coordinates output similarly.
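This matching step can be sketched with SciPy's Hungarian solver (`scipy.optimize.linear_sum_assignment`). Here the "local" activations are a synthetically permuted copy of the "global" ones, and the solver recovers the planted permutation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_samples, d = 500, 32

H_g = rng.normal(size=(n_samples, d))   # global neurons' activations
perm = rng.permutation(d)
H_k = H_g[:, perm]                      # local neurons: a shuffled copy

# cost[i, j] = squared distance between global neuron i and local neuron j
cost = ((H_g[:, :, None] - H_k[:, None, :]) ** 2).sum(axis=0)
row, col = linear_sum_assignment(cost)  # Hungarian algorithm

# col[i] is the local neuron matched to global neuron i;
# it should equal the inverse of the planted permutation
assert np.array_equal(col, np.argsort(perm))
```

In real FL runs the activations are not exact copies, so the solved matrix is only approximately diagonal; the fraction of diagonal ones is what Fig. 7 reports.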
III. Visualizing Neurons via Preference Vectors: Then, we correspond neurons to classes via calculating preference vectors as done in [Fed2]. Specifically, for each neuron we calculate $v_c = \mathbb{E}_x[a \cdot s_c]$ for each class $c$, and then concatenate all classes as the preference vector $v$, where $a$ denotes the activation value of the neuron and $s_c$ is the prediction score of the $c$-th class. Then, $\arg\max_c v_c$ implies which class the neuron contributes to most. The results are shown in Fig. 8, where each vertical line represents a neuron/channel. The number in "[]" shows how many neurons/channels correspond to the same class between global and local models. With PANs, the coordinate matching results are better. These empirical results verify the pre-alignment effects brought by PANs.
Settings () | FedAvg | FedProx | FedMA | Fed2 | FedDF | FedAvg (repr.) | FedAvg+PANs
() | 86.29 | 85.32 | 84.0 (87.53, ) | 88.29 | – | 86.83 | 88.49±0.07
() | 78.34 | 78.60 | 65.0 | – | 80.36 | 79.76 | 81.94±0.09
Settings () | FedMA | Fed2 | FedAvg+PANs
() | 83.91 | 82.26 | 85.82±0.16
() | 48.25 | 81.23 | 82.87±0.21
Do PANs bring performance improvement in FL?
We then compare the performances of FL with PANs off/on.
I. Universal Application of PANs: We first apply PANs to some popular FL algorithms as introduced in Sect. 2, including FedAvg [FedAvg], FedProx [FedProx], FedOpt [FedOpt], Scaffold [Scaffold], and MOON [MOON]. These methods solve the non-i.i.d. problem from different aspects. Training details of these algorithms are provided in Supp. We add PANs to them and investigate the performance improvements on FeMnist, Cifar10, Cifar100, and Cinic10, with the corresponding $K$, $R$, $E$, $\alpha$, and number of communication rounds for each scene. We use $A = 0$ as the baseline. Hyperparameters are searched from three groups of PAN$^+$ and PAN$^\times$ settings, and the best result is reported in Fig. 9. PANs indeed improve these algorithms. We also vary the non-i.i.d. level of the decentralized data, i.e., several values of $\alpha$, and report the averaged accuracy of the last five communication rounds in Fig. 10 (with other hyperparameters the same). Obviously, more non-i.i.d. scenes (smaller $\alpha$) experience more significant improvements. This is related to the regularization effect analyzed in Sect. 4.2. We also investigate the results with various numbers of clients and local training epochs, i.e., various $K$ and $E$. The results of FedAvg on Cifar10 and Cifar100 are shown in Fig. 11. On average, introducing PANs leads to consistent improvements in various scenes. These studies verify that PANs could be universally and effectively applied to FL algorithms under various settings.
II. Hyperparameter Analysis: We first vary $A$ on Cifar10 and plot the results on the left of Fig. 12. We fix $T$ and only report the results of multiplicative PANs. Setting $A$ around 0.1 could improve the performance a lot, while using a larger $A$ leads to degradation, because the neural network becomes harder to train. This again shows that $A$ is a tradeoff between neuron pre-alignment and network performance. The proportions of the optimal hyperparameters from the results of the above experiments are shown on the right of Fig. 12. Using $A$ around 0.1 in multiplicative PANs is a good choice. $A = 0$ means turning off PANs, and its ratio is small, which means turning on PANs is useful in most cases.
III. Comparing with SOTA: FedMA [FedMA] and Fed2 [Fed2] are representative works that solve the parameter alignment problem in FL. We collect the reported settings and results in FedMA, Fed2, and FedDF [FedDF], and compare the performances under the same settings. We list the results on Cifar10 with VGG9 in Tab. 1, where the rightmost columns show our results. Although our reproduced FedAvg performs slightly better than the cited results, the performance gain via introducing PANs is remarkable. We then vary the settings from two aspects: (1) decreasing $\alpha$, i.e., a more non-i.i.d. scene; (2) decreasing the client selection ratio $R$, i.e., partial client participation. Aside from the above changes, other hyperparameters are kept the same. We run the code provided by FedMA (https://github.com/IBM/FedMA) and reproduce Fed2 via our own implementation. The results are listed in Tab. 2. FedMA performs especially poorly under partial client participation, and Fed2 also does not perform well. Our method clearly surpasses the compared methods in these cases. Furthermore, our method is more efficient: e.g., with four 10-core Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40GHz and one NVIDIA GeForce RTX 3090 GPU card, FedMA needs about 4 hours for a single communication round while ours only requires several minutes.
IV. More Studies: We study using optimal transport to fuse neural networks with PANs as done in [OTFusion]. We also investigate BatchNorm [BN] and GroupNorm [GN] used in VGG or ResNet, where PANs are more applicable with BatchNorm. We finally investigate some variants of PANs for better personalization in FL [PersonalizeMAML]. These are provided in Supp.
V. Disadvantages: Fusing varied position encodings makes the magnitudes of neuron activations/gradients vary, which may require a customized neuron-aware optimizer. In Supp, we try applying the adaptive optimizer Adam [Adam] to PANs, but we do not find much improvement. Hence, advanced optimizers should be explored in future work.
6 Conclusions
We propose Position-Aware Neurons (PANs) to disable/enable the permutation invariance property of neural networks. PANs bind neurons to their positions, making parameters pre-aligned in FL even when faced with non-i.i.d. data and facilitating coordinate-based parameter averaging. PANs keep the same position encodings across clients, making local training contain consistent ingredients. Abundant experimental studies verify the role of PANs in parameter alignment. Future works are to find an optimization method specifically suitable for PANs and to extend PANs to large-scale FL benchmarks or more scenarios that require parameter alignment.
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (Grant No. 41901270), the NSFC-NRF Joint Research Project under Grant 61861146001, and the Natural Science Foundation of Jiangsu Province (Grant No. BK20190296). Thanks to the Huawei Noah's Ark Lab NetMIND Research Team and the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-014B). Thanks to Professor Yang Yang for the suggestions. Professor De-Chuan Zhan is the corresponding author.
References
Appendix A Dataset Details
The utilized datasets include Mnist [mnist], FeMnist [LEAF], SVHN [Svhn], GTSRB [GTSRB], Cifar10/100 [cifar], and Cinic10 [Cinic10]. We detail these datasets as follows.

Mnist [mnist] is a digit recognition dataset that contains 10 digits to classify. The raw set contains 60,000 samples for training and 10,000 samples for evaluation. The image size is 28×28.
SVHN [Svhn] is the Street View House Number dataset which contains 10 numbers to classify. The raw set contains 73,257 samples for training and 26,032 samples for evaluation. The image size is 32×32.

GTSRB [GTSRB] is the German Traffic Sign Recognition Benchmark with 43 traffic signs. The raw set contains 39,209 samples for training and 12,630 samples for evaluation. We resize the images to a fixed resolution.

Cifar10 and Cifar100 [cifar] are subsets of the Tiny Images dataset and respectively have 10/100 classes to classify. They consist of 50,000 training images and 10,000 test images. The image size is 32×32.

FeMnist [LEAF] is built by partitioning the data in Extended MNIST [EMNIST] based on the writer of the digit/character. There are 62 digits and characters in all. The total number of training samples is 805,263. There are 3,550 users, and each user owns 226.8 samples on average. We only use 10% of the users (i.e., 355 users). For each user, we take a fraction of the samples to construct the global test set. We resize the images to a fixed resolution.
For centralized training, we correspondingly use the training set and test set for the first six datasets. For FeMnist, we centralize users' training samples as the training set. For decentralized training (i.e., FL), we split the training set of the first six datasets according to Dirichlet distributions as done in previous FL works [FedDF, NonIIDQuag, FedMA]. Specifically, we split the training set onto $K$ clients, and each client's label distribution is generated from $\mathrm{Dir}(\alpha)$. For FeMnist, we directly take the 355 users as clients. Some of these datasets are utilized in previous FL works; for example, Cifar10/Cifar100/Cinic10 are recommended by FedML [FedML], and FeMnist is recommended by LEAF [LEAF].
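The Dirichlet split can be sketched as follows (a common construction in FL papers; the function and variable names are ours):

```python
import numpy as np

def dirichlet_split(labels, n_clients, alpha, rng):
    """Split sample indices onto clients with Dir(alpha) label skew.

    For each class, draw client proportions from Dir(alpha) and
    partition that class's samples accordingly; smaller alpha gives
    more skewed (more non-i.i.d.) per-client label distributions.
    """
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
parts = dirichlet_split(labels, n_clients=5, alpha=0.1, rng=rng)
# every sample is assigned to exactly one client
assert sorted(i for p in parts for i in p) == list(range(1000))
```

With `alpha=0.1` most clients end up dominated by a few classes, while large `alpha` approaches a uniform (i.i.d.) split.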
Appendix B Network Details
We utilize MLP, VGG [VGG], ResNet [ResNet] in this paper. We detail their architectures as follows:

MLP denotes a multi-layer perceptron with four layers, counting the input and output layers. For Mnist and FeMnist, the input size is 784. MLP has the architecture: FC1(784, 1024), ReLU(), FC2(1024, 1024), ReLU(), FC3(1024, 1024), ReLU(), FC4(1024, C), where C denotes the number of classes.
VGG contains a series of networks with various layers. The paper of VGG [VGG] presents VGG11, VGG13, VGG16, and VGG19. We follow their architectures and report the configuration of VGG11 as an example: 64, M, 128, M, 256, 256, M, 512, 512, M, 512, 512, M, where "M" denotes a max-pooling layer. VGG11 contains 8 convolution blocks and three fully-connected layers in [VGG]. However, we only use one fully-connected layer for classification in this paper. VGG9 is commonly utilized in previous FL works [FedMA, FedDF], whose configuration is: 32, 64, M, 128, 128, M, 256, 256, M. We keep all the fully-connected layers in VGG9 for a fair comparison with other works. The three fully-connected layers in VGG9 are: FC(4096, 512), ReLU(), FC(512, 512), ReLU(), FC(512, C). We name the $i$-th convolution layer in VGG as "Conv$i$". We do not use BatchNorm [BN] in VGG by default.
ResNet introduces residual connections to plain neural networks. We take the Cifar versions used in the paper [ResNet], i.e., ResNet20 with the basic block. We set the initial channel number to 64 (i.e., the output channels of the first convolution layer), and take nine consecutive basic blocks with 64, 64, 64, 128, 128, 128, 256, 256, 256 channels, respectively. We add a fully-connected layer for classification. We use BatchNorm [BN] in ResNet20 and add it before the ReLU activation.
Appendix C Hyperparameter Details
For both centralized training and decentralized training (i.e., FL), we use a constant learning rate without scheduling, although some works have pointed out that decaying the learning rate helps in FL [FLSchedule]. We use SGD with momentum 0.9 as the optimizer unless otherwise stated. For MLP and VGG networks, we set the learning rate to 0.05; for ResNet, we use 0.1. We use a warm start of 100 training steps for centralized training and 10 training steps for decentralized training (during local training). We use a batch size of 10 for FeMnist and 64 for the other datasets.
We use FedAvg [FedAvg], FedProx [FedProx], FedOpt [FedOpt], Scaffold [Scaffold], and MOON [MOON] as base FL algorithms. For all of these algorithms, we run a fixed number of communication rounds and select a subset of clients in each round; each selected client updates the global model on its private data for several local epochs. For FedProx, the regularization coefficient of the proximal term is tuned and the best result is reported. For FedOpt, we take SGD with momentum 0.9 as the global optimizer and tune the global learning rate, which is similar to FedAvgM [FedAvgM]; we also try Adam as the global optimizer and find the performance unstable. For Scaffold, we use the implementation from https://github.com/ramshi236/AcceleratedFederatedLearningOverMACinHeterogeneousNetworks. For MOON, we set the coefficient of the contrastive loss to the value recommended by the authors. We then replace the normal neurons with the proposed PANs to improve these algorithms, keeping the default setting and tuning the hyperparameters of both additive and multiplicative PANs over a small grid.
Table 3: Test accuracies (%) of centralized training with different optimizers.

Dataset:                                 FeMnist  GTSRB  SVHN   Cifar10  Cifar100  Cinic10
Network:                                 MLP      VGG9   VGG9   VGG11    ResNet20  ResNet20
SGD + Momentum=0.9 (LR in {0.05, 0.1})   53.39    86.96  89.93  84.57    70.82     82.76
Adam (LR=3e-4)                           54.25    90.84  91.13  87.13    67.22     81.99
Appendix D Experimental Details
D.1 Shuffle Test and Shuffle Test in FL
We propose a procedure to measure the degree of permutation invariance of a neural network, i.e., how large the shuffle error is after shuffling the neurons. The shuffle process is shown in Alg. 1, where a disorder parameter controls the disorder level of the constructed permutation matrices. Some additional remarks: (1) the permutation matrix (PM) is randomly generated, and we do not need to solve for it; (2) PMs are introduced only to verify that PANs can disable the permutation invariance of neural networks; they are not used in our FedPAN algorithm; (3) the computational complexity is O(J), requiring at most J swaps, which is very efficient to simulate.
We introduce the shuffle test in the body of this paper. Specifically, we manually shuffle the network and study the output change, i.e., the shuffle error defined in the body. A hyperparameter controls the disorder of the permutation. Given this hyperparameter, we generate a permutation matrix P and calculate the fraction of neurons that are not shuffled via "r = np.mean(np.diag(P))", using the functions provided by the Numpy package (https://numpy.org/). The correspondence between this fraction and the disorder level is shown in the body. The shuffle process is also applied to FL; the pseudo-code is presented in Alg. 2. In expectation, the model is shuffled multiple times during local training. Hence, we calculate the fraction of diagonal ones after several accumulated permutations, i.e., "r = np.mean(np.diag(P1 @ P2 @ ...))", where P1, P2, etc. denote the permutation matrices generated in each local update step. We simulate the process for a single layer 10 times and report the averaged fraction; the relation between the disorder level and this fraction is shown in the body.
Appendix E Additional Experimental Results
Shuffle Error on Random Data: We investigate the shuffle error by taking random data as input in the body, where we only present the results based on VGG13. We observe similar results on MLP and ResNet20, shown in Fig. 14 and Fig. 15. Multiplicative PANs with a larger amplitude make the network more sensitive to neuron permutation.
Weight Divergence: Our proposed PANs decrease the weight divergence during FL. Specifically, we split the training data onto clients and select all clients in each round. We run several communication rounds and then calculate the local gradient variance as an approximation of weight divergence, varying the number of local epochs. We only report the results on Mnist under one setting in the body; additional results on Mnist (Fig. 16) and Cifar10 (Fig. 17) further verify that PANs decrease the local gradient variance.
Matching via Optimal Assignment: We first train a global model via FL for a number of communication rounds in a scene with multiple clients. Then, we randomly sample a local client and update the global model for several epochs. Our goal is to find a matrix that matches the neurons of the global model with those of the updated one, i.e., the local model of this client. We use 500 test samples to obtain the neurons' activations as their representations. The resulting optimal assignment problem can then be solved, and the assignment matrix is a permutation matrix. The results on various layers of VGG9 and MLP are shown in Fig. 18 and Fig. 19. Notably, the calculated matching ratio, i.e., the number in "[]", is only an approximation of how many neurons are shuffled; the absolute value (e.g., 0.062) does not represent the actual permutation during training.
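As a toy illustration of this matching step, the sketch below recovers a simulated permutation from activation similarity. The setup (5 neurons, squared-distance cost, brute-force search, noise level 0.01) is our own; for realistic layer widths one would use a proper solver such as `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.

```python
import itertools
import numpy as np

def match_neurons(acts_global, acts_local):
    """Brute-force optimal assignment between two small neuron sets,
    minimizing squared distance between activation vectors."""
    n = acts_global.shape[1]
    # cost[i, j]: distance between global neuron i and local neuron j.
    cost = ((acts_global[:, :, None] - acts_local[:, None, :]) ** 2).sum(axis=0)
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):  # fine for tiny n
        c = cost[np.arange(n), np.array(perm)].sum()
        if c < best_cost:
            best_cost, best_perm = c, perm
    return np.array(best_perm)

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 5))       # 500 test samples, 5 neurons
true_perm = np.array([2, 0, 4, 1, 3])  # simulate a shuffled local model
acts_shuffled = acts[:, true_perm] + 0.01 * rng.normal(size=acts.shape)

perm = match_neurons(acts, acts_shuffled)
# Matching ratio: fraction of neurons assigned back to their own position.
matching_ratio = float(np.mean(perm == np.arange(5)))
```

Here every neuron has moved, so the matching ratio is 0 even though the assignment perfectly recovers the (inverse of the) simulated permutation.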
Visualizing Neurons via Preference Vectors: Similarly, more visualization results via the preference vectors of neurons are provided in Fig. 20, Fig. 21, Fig. 22, and Fig. 23. Notably, there are only 10 neurons in Fig. 22 because FC4 is the output layer with 10 classes. Using PANs encourages neurons at the same position to contribute to the same classes as much as possible.
Universal Application of PANs: We report the results of applying PANs to popular FL algorithms on FeMnist, Cifar10, Cifar100, and Cinic10 in the body. We show the results on SVHN and GTSRB in Fig. 24. Training on GTSRB is not stable, and some algorithms converge more slowly, e.g., FedAvg and FedOpt. This could be improved by additionally tuning learning rates, which we omit in this paper. Comparison results on Cifar10 and Cifar100 under various levels of non-i.i.d. data are shown in Fig. 25 and Fig. 26. The improvements under various scenes based on Scaffold are shown in Fig. 27. These additional results further verify that PANs universally improve the performance of FL.
Hyperparameter Analysis: We present the performances of various hyperparameters of PAN in the body and point out that a moderate setting is a good choice. Here, we present a more comprehensive analysis with both additive and multiplicative PANs under a fixed FL scene. We plot the results on Cifar10 with VGG11 and Cifar100 with ResNet20 in Fig. 28 and Fig. 29. The leftmost point shows the baseline performance. The four parts in different colors show the results of varying one hyperparameter while the other is fixed; for example, the first part varies one hyperparameter of the additive PAN while the other is fixed to 0.05. Clearly, with the other hyperparameter fixed, a larger value leads to degradation (the green and the red parts). Setting the value around 0.1 for the additive PAN is recommended. The results on Cifar100 are more invariant to these hyperparameters, although the performances fluctuate considerably on Cifar10. Many of these hyperparameter settings surpass the baseline.
Figure 31: Model fusion of MLP on Mnist (left) and VGG9 on Cifar10 (right) with direct parameter averaging, optimal transport, and PANs. The x-axis shows the interpolation coefficient.
Appendix F More Studies
F.1 Centralized Training
We report the test accuracies of centralized training on FeMnist, GTSRB, SVHN, Cifar10, Cifar100, and Cinic10, using MLP, VGG9, VGG9, VGG11, ResNet20, and ResNet20, respectively. The numbers of training epochs are 30, 20, 30, 30, 100, and 100, respectively. We use both SGD with momentum 0.9 and Adam as optimizers. For SGD, the learning rate is 0.05 for MLP and VGG and 0.1 for ResNet20; for Adam, it is 0.0003 for all networks. The performances are listed in Tab. 3. We then add PANs on some of these datasets and find that the performances degrade slightly. We vary one PAN hyperparameter while keeping the other fixed; the results are shown in Fig. 30. Using PANs slightly harms centralized training, and a larger hyperparameter commonly makes the results worse. Although we also try an adaptive optimizer (i.e., Adam), the results with PANs do not improve. Advanced optimizers may mitigate the degradation, which is left for future work.
F.2 Optimal Transport for Model Fusion
In FL, the server sends the global model down to local clients as the initialization in each communication round; without this shared initialization, coordinate-based parameter averaging performs worse. The work [OTFusion] studies model fusion with different initializations and utilizes optimal transport [Barycenter] to align model parameters. We split Mnist and Cifar10 into two parts uniformly and train independent models on the two subsets. The obtained models after training are denoted as theta_A and theta_B. Then, an interpolation is evaluated, i.e., theta(lambda) = (1 - lambda) * theta_A + lambda * theta_B with lambda in [0, 1]. Directly averaging these two models performs poorly, as shown in Fig. 31 (the line with legend "Avg"). If we align the models via optimal transport and then interpolate the aligned models, the results improve (the line with legend "OT+Avg" in Fig. 31). If we further add PANs during model training, the performances improve slightly more (the line with legend "PANs+OT+Avg" in Fig. 31). This shows that PANs may still be helpful with different initializations.
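The interpolation itself is a one-liner over flattened parameter vectors; a minimal sketch (alignment via optimal transport is omitted here, and the vector size is arbitrary):

```python
import numpy as np

def interpolate(theta_a, theta_b, lam):
    """Linearly interpolate two flat parameter vectors:
    theta(lam) = (1 - lam) * theta_a + lam * theta_b."""
    return (1.0 - lam) * theta_a + lam * theta_b

rng = np.random.default_rng(0)
theta_a = rng.normal(size=100)  # parameters of the first model
theta_b = rng.normal(size=100)  # parameters of the second model

# Evaluate the interpolation path; in the experiment, each theta(lam)
# would be loaded back into the network and evaluated on the test set.
path = [interpolate(theta_a, theta_b, lam) for lam in np.linspace(0, 1, 11)]
```

Without neuron alignment, midpoints of this path typically land in poor regions of the loss landscape, which is exactly what the "Avg" curve in Fig. 31 shows.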
F.3 BatchNorm vs. GroupNorm
We then investigate normalization techniques in deep neural networks. Previous FL works point out that GroupNorm may be more applicable than BatchNorm to FL with non-i.i.d. data [NonIIDQuag]. Specifically, BatchNorm calculates the mean and variance of a data batch, which depends on the local training data; hence, the statistics stored in BatchNorm diverge a lot across clients. One solution is to aggregate these statistics during FL, i.e., to average the "running mean" and "running variance" of BatchNorm; we denote this as "BN-Y". In contrast, we use "BN-N" to represent the variant in which the "running mean" and "running variance" are not aggregated. We also vary the number of groups in GroupNorm, i.e., {1, 2, 8, 32}, denoted as "GN1", "GN2", "GN8", and "GN32". We show the convergence curves on Cifar10 and Cifar100 in Fig. 32, using VGG11 and ResNet20 as backbones; the numbers in the legends denote the final test accuracies. GroupNorm only improves the performance of Cifar10 with VGG11, and setting the number of groups to 1 is better. We also apply PANs to networks with "GN1" and find that the performance does not improve. Combining PANs with various normalization techniques is an interesting direction, which we leave for future work.
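The aggregation of BatchNorm running statistics described above can be sketched as a weighted average across clients. The function name and the data-size weighting are our illustration:

```python
import numpy as np

def aggregate_bn_stats(client_stats, client_sizes):
    """Average BatchNorm running statistics across clients.
    client_stats: list of (running_mean, running_var) arrays per client."""
    w = np.asarray(client_sizes, dtype=float)
    w /= w.sum()  # normalize weights by local data sizes
    means = np.stack([m for m, _ in client_stats])
    variances = np.stack([v for _, v in client_stats])
    global_mean = (w[:, None] * means).sum(axis=0)
    global_var = (w[:, None] * variances).sum(axis=0)
    return global_mean, global_var

# Two clients with equal data sizes and 4-channel statistics.
stats = [(np.zeros(4), np.ones(4)), (np.ones(4), 3.0 * np.ones(4))]
mean, var = aggregate_bn_stats(stats, [100, 100])
```

Note that plainly averaging the variances ignores the between-client spread of the means; a statistically tighter combination would use the law of total variance, but the simple average matches the aggregation described in the text.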
F.4 Personalization in FL
Finally, we discuss some possible variants of PANs for personalization in FL. In the body of this paper, we use the same position encodings for all clients and implicitly bind neurons to their positions. However, if we use different or partially shared position encodings across clients, we could let similar clients contribute more to each other. Clients could also own individual positions, which could be utilized for personalization. These ideas are left for future work.